* [PATCH 0/6] Handle oom bypass more gracefully
@ 2016-05-26 12:40 Michal Hocko
  2016-05-26 12:40 ` [PATCH 1/6] mm, oom: do not loop over all tasks if there are no external tasks sharing mm Michal Hocko
                   ` (6 more replies)
  0 siblings, 7 replies; 36+ messages in thread
From: Michal Hocko @ 2016-05-26 12:40 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML

Hi,
the following 6 patches should put some order to the very rare cases of
an mm shared between processes and make the paths which bypass the oom
killer oom-reapable and thus much more reliable. Even though an mm
shared outside of the thread group is rare (either use_mm by kernel
threads or an exotic clone(CLONE_VM) without CLONE_THREAD resp.
CLONE_SIGHAND) it makes the current oom killer logic quite hard to
follow and evaluate. It is possible to select an oom victim which shares
its mm with an unkillable process, to bypass the oom killer even when
other processes sharing the mm are still alive, and to hit other weird
cases.
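
For illustration, a minimal userspace sketch of the exotic clone case (a
hypothetical example, not part of this series):

	#define _GNU_SOURCE
	#include <sched.h>
	#include <signal.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/wait.h>

	static int worker(void *arg)
	{
		/* separate thread group which shares the mm with the parent */
		pause();
		return 0;
	}

	int main(void)
	{
		char *stack = malloc(64 << 10);

		/* CLONE_VM without CLONE_THREAD (and without CLONE_SIGHAND) */
		clone(worker, stack + (64 << 10), CLONE_VM | SIGCHLD, NULL);
		wait(NULL);
		return 0;
	}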

Patch 1 optimizes oom_kill_process to skip the costly process iteration
when the current oom victim is not sharing its mm with other processes.
Patch 2 is a cleanup of the oom_score_adj handling and preparatory
work. Patch 3 enforces oom_score_adj to be consistent between processes
sharing the mm so that they behave like regular thread groups. Patch 4
tries to handle vforked tasks better in the oom path, patch 5 ensures
that all tasks sharing the mm are killed and finally patch 6 should
guarantee that task_will_free_mem will always imply a reapable bypass
of the oom killer.

The patchset is based on the current mmotm tree (mmotm-2016-05-23-16-51).
I would really appreciate a deep review as this area is full of land
mines, but I hope I've made the code much cleaner with fewer kludges.

I am CCing Oleg (sorry I know you hate this code) but I would feel much
better if you double checked my assumptions about locking and vfork
behavior.

Michal Hocko (6):
      mm, oom: do not loop over all tasks if there are no external tasks sharing mm
      proc, oom_adj: extract oom_score_adj setting into a helper
      mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
      mm, oom: skip over vforked tasks
      mm, oom: kill all tasks sharing the mm
      mm, oom: fortify task_will_free_mem

 fs/proc/base.c      | 168 +++++++++++++++++++++++++++++-----------------------
 include/linux/mm.h  |   2 +
 include/linux/oom.h |  72 ++++++++++++++++++++--
 mm/memcontrol.c     |   4 +-
 mm/oom_kill.c       |  96 ++++++++++--------------------
 5 files changed, 196 insertions(+), 146 deletions(-)


* [PATCH 1/6] mm, oom: do not loop over all tasks if there are no external tasks sharing mm
  2016-05-26 12:40 [PATCH 0/6] Handle oom bypass more gracefully Michal Hocko
@ 2016-05-26 12:40 ` Michal Hocko
  2016-05-26 14:30   ` Tetsuo Handa
  2016-05-26 12:40 ` [PATCH 2/6] proc, oom_adj: extract oom_score_adj setting into a helper Michal Hocko
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2016-05-26 12:40 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

oom_kill_process makes sure to kill all processes outside of the thread
group which are sharing the mm. This requires iterating over all tasks.
Sharing the mm outside of the thread group is not a common case, though,
so we can optimize a bit and take that path only when we know that there
are external users of the mm struct outside of the thread group.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5bb2f7698ad7..0e33e912f7e4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -820,6 +820,13 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	task_unlock(victim);
 
 	/*
+	 * skip expensive iterations over all tasks if we know that there
+	 * are no users outside of threads in the same thread group
+	 */
+	if (atomic_read(&mm->mm_users) <= get_nr_threads(victim))
+		goto oom_reap;
+
+	/*
 	 * Kill all user processes sharing victim->mm in other thread groups, if
 	 * any.  They don't get access to memory reserves, though, to avoid
 	 * depletion of all memory.  This prevents mm->mmap_sem livelock when an
@@ -848,6 +855,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	}
 	rcu_read_unlock();
 
+oom_reap:
 	if (can_oom_reap)
 		wake_oom_reaper(victim);
 
-- 
2.8.1


* [PATCH 2/6] proc, oom_adj: extract oom_score_adj setting into a helper
  2016-05-26 12:40 [PATCH 0/6] Handle oom bypass more gracefully Michal Hocko
  2016-05-26 12:40 ` [PATCH 1/6] mm, oom: do not loop over all tasks if there are no external tasks sharing mm Michal Hocko
@ 2016-05-26 12:40 ` Michal Hocko
  2016-05-26 12:40 ` [PATCH 3/6] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj Michal Hocko
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 36+ messages in thread
From: Michal Hocko @ 2016-05-26 12:40 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Currently we have two proc interfaces to set oom_score_adj: the legacy
/proc/<pid>/oom_adj and /proc/<pid>/oom_score_adj, which both have their
specific handlers. A big part of the logic is duplicated so extract the
common code into a __set_oom_adj helper. The legacy knob still handles
some details slightly differently so make sure those are preserved - e.g.
the legacy mode ignores oom_score_adj_min and it warns about the usage.
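
Both write handlers then reduce to a single call into the helper (taken
from the diff below):

	/* legacy /proc/<pid>/oom_adj, after scaling to the oom_score_adj range */
	err = __set_oom_adj(file, oom_adj, true);

	/* /proc/<pid>/oom_score_adj */
	err = __set_oom_adj(file, oom_score_adj, false);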

This patch shouldn't introduce any functional changes.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c | 133 +++++++++++++++++++++++++--------------------------------
 1 file changed, 59 insertions(+), 74 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index be73f4d0cb01..23679673bf5a 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1041,6 +1041,63 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
 	return simple_read_from_buffer(buf, count, ppos, buffer, len);
 }
 
+static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
+{
+	struct task_struct *task;
+	unsigned long flags;
+	int err = 0;
+
+	task = get_proc_task(file_inode(file));
+	if (!task) {
+		err = -ESRCH;
+		goto out;
+	}
+
+	task_lock(task);
+	if (!task->mm) {
+		err = -EINVAL;
+		goto err_task_lock;
+	}
+
+	if (!lock_task_sighand(task, &flags)) {
+		err = -ESRCH;
+		goto err_task_lock;
+	}
+
+	if (legacy) {
+		if (oom_adj < task->signal->oom_score_adj &&
+				!capable(CAP_SYS_RESOURCE)) {
+			err = -EACCES;
+			goto err_sighand;
+		}
+		/*
+		 * /proc/pid/oom_adj is provided for legacy purposes, ask users to use
+		 * /proc/pid/oom_score_adj instead.
+		 */
+		pr_warn_once("%s (%d): /proc/%d/oom_adj is deprecated, please use /proc/%d/oom_score_adj instead.\n",
+			  current->comm, task_pid_nr(current), task_pid_nr(task),
+			  task_pid_nr(task));
+	} else {
+		if ((short)oom_adj < task->signal->oom_score_adj_min &&
+				!capable(CAP_SYS_RESOURCE)) {
+			err = -EACCES;
+			goto err_sighand;
+		}
+	}
+
+	task->signal->oom_score_adj = oom_adj;
+	if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
+		task->signal->oom_score_adj_min = (short)oom_adj;
+	trace_oom_score_adj_update(task);
+err_sighand:
+	unlock_task_sighand(task, &flags);
+err_task_lock:
+	task_unlock(task);
+	put_task_struct(task);
+out:
+	return err;
+}
+
 /*
  * /proc/pid/oom_adj exists solely for backwards compatibility with previous
  * kernels.  The effective policy is defined by oom_score_adj, which has a
@@ -1054,10 +1111,8 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
 static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 			     size_t count, loff_t *ppos)
 {
-	struct task_struct *task;
 	char buffer[PROC_NUMBUF];
 	int oom_adj;
-	unsigned long flags;
 	int err;
 
 	memset(buffer, 0, sizeof(buffer));
@@ -1077,23 +1132,6 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	task = get_proc_task(file_inode(file));
-	if (!task) {
-		err = -ESRCH;
-		goto out;
-	}
-
-	task_lock(task);
-	if (!task->mm) {
-		err = -EINVAL;
-		goto err_task_lock;
-	}
-
-	if (!lock_task_sighand(task, &flags)) {
-		err = -ESRCH;
-		goto err_task_lock;
-	}
-
 	/*
 	 * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
 	 * value is always attainable.
@@ -1103,27 +1141,8 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 	else
 		oom_adj = (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;
 
-	if (oom_adj < task->signal->oom_score_adj &&
-	    !capable(CAP_SYS_RESOURCE)) {
-		err = -EACCES;
-		goto err_sighand;
-	}
-
-	/*
-	 * /proc/pid/oom_adj is provided for legacy purposes, ask users to use
-	 * /proc/pid/oom_score_adj instead.
-	 */
-	pr_warn_once("%s (%d): /proc/%d/oom_adj is deprecated, please use /proc/%d/oom_score_adj instead.\n",
-		  current->comm, task_pid_nr(current), task_pid_nr(task),
-		  task_pid_nr(task));
+	err = __set_oom_adj(file, oom_adj, true);
 
-	task->signal->oom_score_adj = oom_adj;
-	trace_oom_score_adj_update(task);
-err_sighand:
-	unlock_task_sighand(task, &flags);
-err_task_lock:
-	task_unlock(task);
-	put_task_struct(task);
 out:
 	return err < 0 ? err : count;
 }
@@ -1157,9 +1176,7 @@ static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
 static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 					size_t count, loff_t *ppos)
 {
-	struct task_struct *task;
 	char buffer[PROC_NUMBUF];
-	unsigned long flags;
 	int oom_score_adj;
 	int err;
 
@@ -1180,39 +1197,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	task = get_proc_task(file_inode(file));
-	if (!task) {
-		err = -ESRCH;
-		goto out;
-	}
-
-	task_lock(task);
-	if (!task->mm) {
-		err = -EINVAL;
-		goto err_task_lock;
-	}
-
-	if (!lock_task_sighand(task, &flags)) {
-		err = -ESRCH;
-		goto err_task_lock;
-	}
-
-	if ((short)oom_score_adj < task->signal->oom_score_adj_min &&
-			!capable(CAP_SYS_RESOURCE)) {
-		err = -EACCES;
-		goto err_sighand;
-	}
-
-	task->signal->oom_score_adj = (short)oom_score_adj;
-	if (has_capability_noaudit(current, CAP_SYS_RESOURCE))
-		task->signal->oom_score_adj_min = (short)oom_score_adj;
-	trace_oom_score_adj_update(task);
-
-err_sighand:
-	unlock_task_sighand(task, &flags);
-err_task_lock:
-	task_unlock(task);
-	put_task_struct(task);
+	err = __set_oom_adj(file, oom_score_adj, false);
 out:
 	return err < 0 ? err : count;
 }
-- 
2.8.1


* [PATCH 3/6] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
  2016-05-26 12:40 [PATCH 0/6] Handle oom bypass more gracefully Michal Hocko
  2016-05-26 12:40 ` [PATCH 1/6] mm, oom: do not loop over all tasks if there are no external tasks sharing mm Michal Hocko
  2016-05-26 12:40 ` [PATCH 2/6] proc, oom_adj: extract oom_score_adj setting into a helper Michal Hocko
@ 2016-05-26 12:40 ` Michal Hocko
  2016-05-27 11:18   ` Michal Hocko
  2016-05-26 12:40 ` [PATCH 4/6] mm, oom: skip over vforked tasks Michal Hocko
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2016-05-26 12:40 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

oom_score_adj is shared within a thread group (via struct signal) but
this is not sufficient to cover processes sharing the mm (CLONE_VM
without CLONE_THREAD resp. CLONE_SIGHAND) and so we can easily end up
in a situation where some processes update their oom_score_adj and
confuse the oom killer. In the worst case some of those processes might
hide from the oom killer altogether via OOM_SCORE_ADJ_MIN while others
are eligible. The OOM killer would then pick up those eligible but
wouldn't be allowed to kill the others sharing the same mm, so the
victim wouldn't release the mm and thus the memory.

It would be ideal to have the oom_score_adj per mm_struct because that
is the natural entity the OOM killer considers. But this will not work
because some programs are doing
	vfork()
	set_oom_adj()
	exec()
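
A slightly more concrete sketch of that pattern (hypothetical userspace
code, the binary name and the particular value are made up):

	pid_t pid = vfork();
	if (!pid) {
		/* still shares the mm with the parent here */
		int fd = open("/proc/self/oom_score_adj", O_WRONLY);

		write(fd, "-500", 4);
		close(fd);
		execl("/sbin/mydaemon", "mydaemon", (char *)NULL);
		_exit(1);
	}

so an oom_score_adj written between vfork() and exec() is meant for the
new program, not for the parent's mm which the child temporarily shares.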

We can achieve the same effect, though: the oom_score_adj write handler
can set the oom_score_adj for all processes sharing the same mm if the
task is not in the middle of vfork. As a result all the processes will
share the same oom_score_adj.

Note that we have to serialize all the oom_score_adj writers now to
guarantee they do not interleave and generate inconsistent results.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c     | 35 +++++++++++++++++++++++++++++++++++
 include/linux/mm.h |  2 ++
 mm/oom_kill.c      |  2 +-
 3 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 23679673bf5a..e3ee4fb1930c 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1043,10 +1043,13 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
 
 static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 {
+	static DEFINE_MUTEX(oom_adj_mutex);
+	struct mm_struct *mm = NULL;
 	struct task_struct *task;
 	unsigned long flags;
 	int err = 0;
 
+	mutex_lock(&oom_adj_mutex);
 	task = get_proc_task(file_inode(file));
 	if (!task) {
 		err = -ESRCH;
@@ -1085,6 +1088,20 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 		}
 	}
 
+	/*
+	 * If we are not in vfork and share the mm with other processes we
+	 * have to propagate the score, otherwise we would have schizophrenic
+	 * requirements for the same mm. We can use a racy check because we
+	 * only risk taking the slow path.
+	 */
+	if (!task->vfork_done &&
+			atomic_read(&task->mm->mm_users) > get_nr_threads(task)) {
+		mm = task->mm;
+
+		/* pin the mm so it doesn't go away and get reused */
+		atomic_inc(&mm->mm_count);
+	}
+
 	task->signal->oom_score_adj = oom_adj;
 	if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
 		task->signal->oom_score_adj_min = (short)oom_adj;
@@ -1094,7 +1111,25 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 err_task_lock:
 	task_unlock(task);
 	put_task_struct(task);
+
+	if (mm) {
+		struct task_struct *p;
+
+		rcu_read_lock();
+		for_each_process(p) {
+			task_lock(p);
+			if (!p->vfork_done && process_shares_mm(p, mm)) {
+				p->signal->oom_score_adj = oom_adj;
+				if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
+					p->signal->oom_score_adj_min = (short)oom_adj;
+			}
+			task_unlock(p);
+		}
+		rcu_read_unlock();
+		mmdrop(mm);
+	}
 out:
+	mutex_unlock(&oom_adj_mutex);
 	return err;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 05102822912c..b44d3d792a00 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2248,6 +2248,8 @@ static inline int in_gate_area(struct mm_struct *mm, unsigned long addr)
 }
 #endif	/* __HAVE_ARCH_GATE_AREA */
 
+extern bool process_shares_mm(struct task_struct *p, struct mm_struct *mm);
+
 #ifdef CONFIG_SYSCTL
 extern int sysctl_drop_caches;
 int drop_caches_sysctl_handler(struct ctl_table *, int,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0e33e912f7e4..eeccb4d7e7f5 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -416,7 +416,7 @@ bool oom_killer_disabled __read_mostly;
  * task's threads: if one of those is using this mm then this task was also
  * using it.
  */
-static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
+bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 {
 	struct task_struct *t;
 
-- 
2.8.1


* [PATCH 4/6] mm, oom: skip over vforked tasks
  2016-05-26 12:40 [PATCH 0/6] Handle oom bypass more gracefully Michal Hocko
                   ` (2 preceding siblings ...)
  2016-05-26 12:40 ` [PATCH 3/6] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj Michal Hocko
@ 2016-05-26 12:40 ` Michal Hocko
  2016-05-27 16:48   ` Vladimir Davydov
  2016-05-30 12:03   ` Michal Hocko
  2016-05-26 12:40 ` [PATCH 5/6] mm, oom: kill all tasks sharing the mm Michal Hocko
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 36+ messages in thread
From: Michal Hocko @ 2016-05-26 12:40 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

vforked tasks are not really sitting on any memory of their own so it
doesn't matter much to kill them. The parent is waiting for the vforked
task killably, so it is better to choose the parent, which is the real
mm owner. Teach oom_badness to ignore all tasks which haven't passed
mm_release. oom_kill_process should ignore them as well because they
will drop the mm soon and they will not block the oom_reaper because
they cannot touch any memory.
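
For context, a non-NULL vfork_done is used here as a proxy for "in the
middle of vfork". Roughly (heavily simplified from kernel/fork.c):

	/* _do_fork() */
	if (clone_flags & CLONE_VFORK) {
		p->vfork_done = &vfork;	/* parent sleeps on this completion */
		init_completion(&vfork);
		get_task_struct(p);
	}

	/* mm_release(), called from the exec and exit paths */
	if (tsk->vfork_done)
		complete_vfork_done(tsk);	/* resets vfork_done, wakes the parent */

so any task with vfork_done set hasn't passed mm_release yet.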

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index eeccb4d7e7f5..d1cbaaa1a666 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -176,11 +176,13 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 
 	/*
 	 * Do not even consider tasks which are explicitly marked oom
-	 * unkillable or have been already oom reaped.
+	 * unkillable or have been already oom reaped or they are in
+	 * the middle of vfork
 	 */
 	adj = (long)p->signal->oom_score_adj;
 	if (adj == OOM_SCORE_ADJ_MIN ||
-			test_bit(MMF_OOM_REAPED, &p->mm->flags)) {
+			test_bit(MMF_OOM_REAPED, &p->mm->flags) ||
+			p->vfork_done) {
 		task_unlock(p);
 		return 0;
 	}
@@ -839,6 +841,13 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	for_each_process(p) {
 		if (!process_shares_mm(p, mm))
 			continue;
+		/*
+		 * vforked tasks are ignored because they will drop the mm soon
+		 * hopefully and even if not they will not mind being oom
+		 * reaped because they cannot touch any memory.
+		 */
+		if (p->vfork_done)
+			continue;
 		if (same_thread_group(p, victim))
 			continue;
 		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
-- 
2.8.1


* [PATCH 5/6] mm, oom: kill all tasks sharing the mm
  2016-05-26 12:40 [PATCH 0/6] Handle oom bypass more gracefully Michal Hocko
                   ` (3 preceding siblings ...)
  2016-05-26 12:40 ` [PATCH 4/6] mm, oom: skip over vforked tasks Michal Hocko
@ 2016-05-26 12:40 ` Michal Hocko
  2016-05-26 12:40 ` [PATCH 6/6] mm, oom: fortify task_will_free_mem Michal Hocko
  2016-05-27 16:00 ` [PATCH 0/6] Handle oom bypass more gracefully Michal Hocko
  6 siblings, 0 replies; 36+ messages in thread
From: Michal Hocko @ 2016-05-26 12:40 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Currently oom_kill_process skips both the oom reaper and SIGKILL if a
process sharing the same mm is unkillable via OOM_SCORE_ADJ_MIN. After
"mm, oom_adj: make sure processes sharing mm have same view of
oom_score_adj" all such processes share the same value so we shouldn't
see such a task at all (oom_badness would rule them out).
Moreover after "mm, oom: skip over vforked tasks" we cannot even
encounter a vforked task, so we can allow both SIGKILL and the oom
reaper. A potential race is highly unlikely but possible. It would
happen if __set_oom_adj raced with select_bad_process, in which case it
is OK to consider the old value, or with fork, which should be
acceptable as well.
Let's add a little note to the log so that people can tell us if this
really happens in real life and matters.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d1cbaaa1a666..008c5b4732de 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -850,8 +850,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		if (same_thread_group(p, victim))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
-		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p)) {
 			/*
 			 * We cannot use oom_reaper for the mm shared by this
 			 * process because it wouldn't get killed and so the
@@ -860,6 +859,11 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			can_oom_reap = false;
 			continue;
 		}
+		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+			pr_warn("%s pid=%d shares mm with oom disabled %s pid=%d. Seems like misconfiguration, killing anyway!"
+					" Report at linux-mm@kvack.org\n",
+					victim->comm, task_pid_nr(victim),
+					p->comm, task_pid_nr(p));
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
-- 
2.8.1


* [PATCH 6/6] mm, oom: fortify task_will_free_mem
  2016-05-26 12:40 [PATCH 0/6] Handle oom bypass more gracefully Michal Hocko
                   ` (4 preceding siblings ...)
  2016-05-26 12:40 ` [PATCH 5/6] mm, oom: kill all tasks sharing the mm Michal Hocko
@ 2016-05-26 12:40 ` Michal Hocko
  2016-05-26 14:11   ` Tetsuo Handa
  2016-05-27 11:07   ` Michal Hocko
  2016-05-27 16:00 ` [PATCH 0/6] Handle oom bypass more gracefully Michal Hocko
  6 siblings, 2 replies; 36+ messages in thread
From: Michal Hocko @ 2016-05-26 12:40 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

task_will_free_mem is rather weak. It doesn't really tell whether
the task has a chance to drop its mm. 98748bd72200 ("oom: consider
multi-threaded tasks in task_will_free_mem") made a first step
towards making it more robust for multi-threaded applications so now
we know that the whole process is going down. This builds on top of
that for more complex scenarios where the mm is shared between
different processes (CLONE_VM without CLONE_THREAD resp. CLONE_SIGHAND).

Make sure that all processes sharing the mm are killed or exiting. This
will allow us to replace try_oom_reaper with wake_oom_reaper, so all
paths which bypass the oom killer are now reapable and shouldn't lock
up the oom killer.

Drop the mm checks for the bypass because they do not really guarantee
anything as the condition might change at any time after task_lock is
dropped.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/oom.h | 72 +++++++++++++++++++++++++++++++++++++++++++++++++----
 mm/memcontrol.c     |  4 +--
 mm/oom_kill.c       | 65 ++++-------------------------------------------
 3 files changed, 74 insertions(+), 67 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 83469522690a..412c4ecb42b1 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -70,9 +70,9 @@ static inline bool oom_task_origin(const struct task_struct *p)
 extern void mark_oom_victim(struct task_struct *tsk);
 
 #ifdef CONFIG_MMU
-extern void try_oom_reaper(struct task_struct *tsk);
+extern void wake_oom_reaper(struct task_struct *tsk);
 #else
-static inline void try_oom_reaper(struct task_struct *tsk)
+static inline void wake_oom_reaper(struct task_struct *tsk)
 {
 }
 #endif
@@ -105,7 +105,7 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
-static inline bool task_will_free_mem(struct task_struct *task)
+static inline bool __task_will_free_mem(struct task_struct *task)
 {
 	struct signal_struct *sig = task->signal;
 
@@ -117,13 +117,75 @@ static inline bool task_will_free_mem(struct task_struct *task)
 	if (sig->flags & SIGNAL_GROUP_COREDUMP)
 		return false;
 
-	if (!(task->flags & PF_EXITING))
+	if (!(task->flags & PF_EXITING || fatal_signal_pending(task)))
 		return false;
 
 	/* Make sure that the whole thread group is going down */
-	if (!thread_group_empty(task) && !(sig->flags & SIGNAL_GROUP_EXIT))
+	if (!thread_group_empty(task) &&
+		!(sig->flags & SIGNAL_GROUP_EXIT || fatal_signal_pending(task)))
+		return false;
+
+	return true;
+}
+
+/*
+ * Checks whether the given task is dying or exiting and likely to
+ * release its address space. This means that all threads and processes
+ * sharing the same mm have to be killed or exiting.
+ */
+static inline bool task_will_free_mem(struct task_struct *task)
+{
+	struct mm_struct *mm = NULL;
+	struct task_struct *p;
+
+	/*
+	 * If the process has passed exit_mm we have to skip it because
+	 * we have lost a link to other tasks sharing this mm, we do not
+	 * have anything to reap and the task might then get stuck waiting
+	 * for parent as zombie and we do not want it to hold TIF_MEMDIE
+	 */
+	p = find_lock_task_mm(task);
+	if (!p)
 		return false;
 
+	if (!__task_will_free_mem(p)) {
+		task_unlock(p);
+		return false;
+	}
+
+	/*
+	 * Check whether there are other processes sharing the mm - they all have
+	 * to be killed or exiting.
+	 */
+	if (atomic_read(&p->mm->mm_users) > get_nr_threads(p)) {
+		mm = p->mm;
+		/* pin the mm to not get freed and reused */
+		atomic_inc(&mm->mm_count);
+	}
+	task_unlock(p);
+
+	if (mm) {
+		rcu_read_lock();
+		for_each_process(p) {
+			bool vfork;
+
+			/*
+			 * skip over vforked tasks because they are mostly
+			 * independent and will drop the mm soon
+			 */
+			task_lock(p);
+			vfork = p->vfork_done;
+			task_unlock(p);
+			if (vfork)
+				continue;
+
+			if (!__task_will_free_mem(p))
+				break;
+		}
+		rcu_read_unlock();
+		mmdrop(mm);
+	}
+
 	return true;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f6477a9dbe7a..878a4308164c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1273,9 +1273,9 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
 	 */
-	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
+	if (task_will_free_mem(current)) {
 		mark_oom_victim(current);
-		try_oom_reaper(current);
+		wake_oom_reaper(current);
 		goto unlock;
 	}
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 008c5b4732de..428e34df9f49 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -573,7 +573,7 @@ static int oom_reaper(void *unused)
 	return 0;
 }
 
-static void wake_oom_reaper(struct task_struct *tsk)
+void wake_oom_reaper(struct task_struct *tsk)
 {
 	if (!oom_reaper_th)
 		return;
@@ -594,50 +594,6 @@ static void wake_oom_reaper(struct task_struct *tsk)
 /* Check if we can reap the given task. This has to be called with stable
  * tsk->mm
  */
-void try_oom_reaper(struct task_struct *tsk)
-{
-	struct mm_struct *mm = tsk->mm;
-	struct task_struct *p;
-
-	if (!mm)
-		return;
-
-	/*
-	 * There might be other threads/processes which are either not
-	 * dying or even not killable.
-	 */
-	if (atomic_read(&mm->mm_users) > 1) {
-		rcu_read_lock();
-		for_each_process(p) {
-			bool exiting;
-
-			if (!process_shares_mm(p, mm))
-				continue;
-			if (same_thread_group(p, tsk))
-				continue;
-			if (fatal_signal_pending(p))
-				continue;
-
-			/*
-			 * If the task is exiting make sure the whole thread group
-			 * is exiting and cannot acces mm anymore.
-			 */
-			spin_lock_irq(&p->sighand->siglock);
-			exiting = signal_group_exit(p->signal);
-			spin_unlock_irq(&p->sighand->siglock);
-			if (exiting)
-				continue;
-
-			/* Give up */
-			rcu_read_unlock();
-			return;
-		}
-		rcu_read_unlock();
-	}
-
-	wake_oom_reaper(tsk);
-}
-
 static int __init oom_init(void)
 {
 	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
@@ -649,10 +605,6 @@ static int __init oom_init(void)
 	return 0;
 }
 subsys_initcall(oom_init)
-#else
-static void wake_oom_reaper(struct task_struct *tsk)
-{
-}
 #endif
 
 /**
@@ -750,15 +702,12 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
-	task_lock(p);
-	if (p->mm && task_will_free_mem(p)) {
+	if (task_will_free_mem(p)) {
 		mark_oom_victim(p);
-		try_oom_reaper(p);
-		task_unlock(p);
+		wake_oom_reaper(p);
 		put_task_struct(p);
 		return;
 	}
-	task_unlock(p);
 
 	if (__ratelimit(&oom_rs))
 		dump_header(oc, p, memcg);
@@ -945,14 +894,10 @@ bool out_of_memory(struct oom_control *oc)
 	 * If current has a pending SIGKILL or is exiting, then automatically
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
-	 *
-	 * But don't select if current has already released its mm and cleared
-	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
 	 */
-	if (current->mm &&
-	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
+	if (task_will_free_mem(current)) {
 		mark_oom_victim(current);
-		try_oom_reaper(current);
+		wake_oom_reaper(current);
 		return true;
 	}
 
-- 
2.8.1


* Re: [PATCH 6/6] mm, oom: fortify task_will_free_mem
  2016-05-26 12:40 ` [PATCH 6/6] mm, oom: fortify task_will_free_mem Michal Hocko
@ 2016-05-26 14:11   ` Tetsuo Handa
  2016-05-26 14:23     ` Michal Hocko
  2016-05-27 11:07   ` Michal Hocko
  1 sibling, 1 reply; 36+ messages in thread
From: Tetsuo Handa @ 2016-05-26 14:11 UTC (permalink / raw)
  To: mhocko; +Cc: rientjes, oleg, vdavydov, akpm, linux-kernel, mhocko

Michal Hocko wrote:
> +/*
> + * Checks whether the given task is dying or exiting and likely to
> + * release its address space. This means that all threads and processes
> + * sharing the same mm have to be killed or exiting.
> + */
> +static inline bool task_will_free_mem(struct task_struct *task)
> +{
> +	struct mm_struct *mm = NULL;
> +	struct task_struct *p;
> +
> +	/*
> +	 * If the process has passed exit_mm we have to skip it because
> +	 * we have lost a link to other tasks sharing this mm, we do not
> +	 * have anything to reap and the task might then get stuck waiting
> +	 * for parent as zombie and we do not want it to hold TIF_MEMDIE
> +	 */
> +	p = find_lock_task_mm(task);
> +	if (!p)
>  		return false;
>  
> +	if (!__task_will_free_mem(p)) {
> +		task_unlock(p);
> +		return false;
> +	}
> +
> +	/*
> +	 * Check whether there are other processes sharing the mm - they all have
> +	 * to be killed or exiting.
> +	 */
> +	if (atomic_read(&p->mm->mm_users) > get_nr_threads(p)) {
> +		mm = p->mm;
> +		/* pin the mm to not get freed and reused */
> +		atomic_inc(&mm->mm_count);
> +	}
> +	task_unlock(p);
> +
> +	if (mm) {
> +		rcu_read_lock();
> +		for_each_process(p) {
> +			bool vfork;
> +
> +			/*
> +			 * skip over vforked tasks because they are mostly
> +			 * independent and will drop the mm soon
> +			 */
> +			task_lock(p);
> +			vfork = p->vfork_done;
> +			task_unlock(p);
> +			if (vfork)
> +				continue;
> +
> +			if (!__task_will_free_mem(p))
> +				break;
> +		}
> +		rcu_read_unlock();
> +		mmdrop(mm);

Did you forget "if (something) return false;" here?

> +	}
> +

If task_will_free_mem() == true is always followed by mark_oom_victim()
and wake_oom_reaper(), can we call them from here?

>  	return true;
>  }


* Re: [PATCH 6/6] mm, oom: fortify task_will_free_mem
  2016-05-26 14:11   ` Tetsuo Handa
@ 2016-05-26 14:23     ` Michal Hocko
  2016-05-26 14:41       ` Tetsuo Handa
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2016-05-26 14:23 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: rientjes, oleg, vdavydov, akpm, linux-kernel

On Thu 26-05-16 23:11:47, Tetsuo Handa wrote:
[...]
> > +	if (mm) {
> > +		rcu_read_lock();
> > +		for_each_process(p) {
> > +			bool vfork;
> > +
> > +			/*
> > +			 * skip over vforked tasks because they are mostly
> > +			 * independent and will drop the mm soon
> > +			 */
> > +			task_lock(p);
> > +			vfork = p->vfork_done;
> > +			task_unlock(p);
> > +			if (vfork)
> > +				continue;
> > +
> > +			if (!__task_will_free_mem(p))
> > +				break;
> > +		}
> > +		rcu_read_unlock();
> > +		mmdrop(mm);
> 
> Did you forget "if (something) return false;" here?

Dohh, I screwed up when cleaning up the code before submission. Thanks
for catching that.

> 
> > +	}
> > +
> 
> If task_will_free_mem() == true is always followed by mark_oom_victim()
> and wake_oom_reaper(), can we call them from here?

I would rather keep the function doing only the check. We never know
when this could be reused somewhere else. All the callsites are quite
trivial so we do not duplicate a lot of code.
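
All three call sites (mem_cgroup_out_of_memory, oom_kill_process and
out_of_memory) then have the same shape:

	if (task_will_free_mem(p)) {
		mark_oom_victim(p);
		wake_oom_reaper(p);
		/* bypass the oom killer, the reaper will free the memory */
	}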
---
>From 0b94b4d6f749f322279e911e00abf1390fea4b2b Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 26 May 2016 09:38:32 +0200
Subject: [PATCH] mm, oom: fortify task_will_free_mem

task_will_free_mem is rather weak. It doesn't really tell whether
the task has a chance to drop its mm. 98748bd72200 ("oom: consider
multi-threaded tasks in task_will_free_mem") made a first step
towards making it more robust for multi-threaded applications so now
we know that the whole process is going down. This builds on top of
that for more complex scenarios where the mm is shared between
different processes (CLONE_VM without CLONE_THREAD resp. CLONE_SIGHAND).

Make sure that all processes sharing the mm are killed or exiting. This
will allow us to replace try_oom_reaper with wake_oom_reaper, so all
paths which bypass the oom killer are now reapable and shouldn't lock
up the oom killer.

Drop the mm checks for the bypass because they do not really guarantee
anything as the condition might change at any time after task_lock is
dropped.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/oom.h | 74 +++++++++++++++++++++++++++++++++++++++++++++++++----
 mm/memcontrol.c     |  4 +--
 mm/oom_kill.c       | 65 ++++------------------------------------------
 3 files changed, 76 insertions(+), 67 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 83469522690a..91d90a5885dc 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -70,9 +70,9 @@ static inline bool oom_task_origin(const struct task_struct *p)
 extern void mark_oom_victim(struct task_struct *tsk);
 
 #ifdef CONFIG_MMU
-extern void try_oom_reaper(struct task_struct *tsk);
+extern void wake_oom_reaper(struct task_struct *tsk);
 #else
-static inline void try_oom_reaper(struct task_struct *tsk)
+static inline void wake_oom_reaper(struct task_struct *tsk)
 {
 }
 #endif
@@ -105,7 +105,7 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
-static inline bool task_will_free_mem(struct task_struct *task)
+static inline bool __task_will_free_mem(struct task_struct *task)
 {
 	struct signal_struct *sig = task->signal;
 
@@ -117,16 +117,80 @@ static inline bool task_will_free_mem(struct task_struct *task)
 	if (sig->flags & SIGNAL_GROUP_COREDUMP)
 		return false;
 
-	if (!(task->flags & PF_EXITING))
+	if (!(task->flags & PF_EXITING || fatal_signal_pending(task)))
 		return false;
 
 	/* Make sure that the whole thread group is going down */
-	if (!thread_group_empty(task) && !(sig->flags & SIGNAL_GROUP_EXIT))
+	if (!thread_group_empty(task) &&
+		!(sig->flags & SIGNAL_GROUP_EXIT || fatal_signal_pending(task)))
 		return false;
 
 	return true;
 }
 
+/*
+ * Checks whether the given task is dying or exiting and likely to
+ * release its address space. This means that all threads and processes
+ * sharing the same mm have to be killed or exiting.
+ */
+static inline bool task_will_free_mem(struct task_struct *task)
+{
+	struct mm_struct *mm = NULL;
+	struct task_struct *p;
+	bool ret = false;
+
+	/*
+	 * If the process has passed exit_mm we have to skip it because
+	 * we have lost a link to other tasks sharing this mm, we do not
+	 * have anything to reap and the task might then get stuck waiting
+	 * for parent as zombie and we do not want it to hold TIF_MEMDIE
+	 */
+	p = find_lock_task_mm(task);
+	if (!p)
+		return false;
+
+	if (!__task_will_free_mem(p)) {
+		task_unlock(p);
+		return false;
+	}
+
+	/*
+	 * Check whether there are other processes sharing the mm - they all have
+	 * to be killed or exiting.
+	 */
+	if (atomic_read(&p->mm->mm_users) > get_nr_threads(p)) {
+		mm = p->mm;
+		/* pin the mm to not get freed and reused */
+		atomic_inc(&mm->mm_count);
+	}
+	task_unlock(p);
+
+	if (mm) {
+		rcu_read_lock();
+		for_each_process(p) {
+			bool vfork;
+
+			/*
+			 * skip over vforked tasks because they are mostly
+			 * independent and will drop the mm soon
+			 */
+			task_lock(p);
+			vfork = p->vfork_done;
+			task_unlock(p);
+			if (vfork)
+				continue;
+
+			ret = __task_will_free_mem(p);
+			if (!ret)
+				break;
+		}
+		rcu_read_unlock();
+		mmdrop(mm);
+	}
+
+	return ret;
+}
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f6477a9dbe7a..878a4308164c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1273,9 +1273,9 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
 	 */
-	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
+	if (task_will_free_mem(current)) {
 		mark_oom_victim(current);
-		try_oom_reaper(current);
+		wake_oom_reaper(current);
 		goto unlock;
 	}
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 008c5b4732de..428e34df9f49 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -573,7 +573,7 @@ static int oom_reaper(void *unused)
 	return 0;
 }
 
-static void wake_oom_reaper(struct task_struct *tsk)
+void wake_oom_reaper(struct task_struct *tsk)
 {
 	if (!oom_reaper_th)
 		return;
@@ -594,50 +594,6 @@ static void wake_oom_reaper(struct task_struct *tsk)
 /* Check if we can reap the given task. This has to be called with stable
  * tsk->mm
  */
-void try_oom_reaper(struct task_struct *tsk)
-{
-	struct mm_struct *mm = tsk->mm;
-	struct task_struct *p;
-
-	if (!mm)
-		return;
-
-	/*
-	 * There might be other threads/processes which are either not
-	 * dying or even not killable.
-	 */
-	if (atomic_read(&mm->mm_users) > 1) {
-		rcu_read_lock();
-		for_each_process(p) {
-			bool exiting;
-
-			if (!process_shares_mm(p, mm))
-				continue;
-			if (same_thread_group(p, tsk))
-				continue;
-			if (fatal_signal_pending(p))
-				continue;
-
-			/*
-			 * If the task is exiting make sure the whole thread group
-			 * is exiting and cannot acces mm anymore.
-			 */
-			spin_lock_irq(&p->sighand->siglock);
-			exiting = signal_group_exit(p->signal);
-			spin_unlock_irq(&p->sighand->siglock);
-			if (exiting)
-				continue;
-
-			/* Give up */
-			rcu_read_unlock();
-			return;
-		}
-		rcu_read_unlock();
-	}
-
-	wake_oom_reaper(tsk);
-}
-
 static int __init oom_init(void)
 {
 	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
@@ -649,10 +605,6 @@ static int __init oom_init(void)
 	return 0;
 }
 subsys_initcall(oom_init)
-#else
-static void wake_oom_reaper(struct task_struct *tsk)
-{
-}
 #endif
 
 /**
@@ -750,15 +702,12 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
-	task_lock(p);
-	if (p->mm && task_will_free_mem(p)) {
+	if (task_will_free_mem(p)) {
 		mark_oom_victim(p);
-		try_oom_reaper(p);
-		task_unlock(p);
+		wake_oom_reaper(p);
 		put_task_struct(p);
 		return;
 	}
-	task_unlock(p);
 
 	if (__ratelimit(&oom_rs))
 		dump_header(oc, p, memcg);
@@ -945,14 +894,10 @@ bool out_of_memory(struct oom_control *oc)
 	 * If current has a pending SIGKILL or is exiting, then automatically
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
-	 *
-	 * But don't select if current has already released its mm and cleared
-	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
 	 */
-	if (current->mm &&
-	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
+	if (task_will_free_mem(current)) {
 		mark_oom_victim(current);
-		try_oom_reaper(current);
+		wake_oom_reaper(current);
 		return true;
 	}
 
-- 
2.8.1

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 1/6] mm, oom: do not loop over all tasks if there are no external tasks sharing mm
  2016-05-26 12:40 ` [PATCH 1/6] mm, oom: do not loop over all tasks if there are no external tasks sharing mm Michal Hocko
@ 2016-05-26 14:30   ` Tetsuo Handa
  2016-05-26 14:59     ` Michal Hocko
  0 siblings, 1 reply; 36+ messages in thread
From: Tetsuo Handa @ 2016-05-26 14:30 UTC (permalink / raw)
  To: mhocko, linux-mm; +Cc: rientjes, oleg, vdavydov, akpm, linux-kernel, mhocko

Michal Hocko wrote:
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 5bb2f7698ad7..0e33e912f7e4 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -820,6 +820,13 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  	task_unlock(victim);
>  
>  	/*
> +	 * skip expensive iterations over all tasks if we know that there
> +	 * are no users outside of threads in the same thread group
> +	 */
> +	if (atomic_read(&mm->mm_users) <= get_nr_threads(victim))
> +		goto oom_reap;

Is this really safe? Isn't it possible that the victim thread's thread
group has more than atomic_read(&mm->mm_users) threads, because some of
them are past exit_mm() and blocked at exit_task_work(), which is before
__exit_signal() from release_task() from exit_notify()?

> +
> +	/*
>  	 * Kill all user processes sharing victim->mm in other thread groups, if
>  	 * any.  They don't get access to memory reserves, though, to avoid
>  	 * depletion of all memory.  This prevents mm->mmap_sem livelock when an


* Re: [PATCH 6/6] mm, oom: fortify task_will_free_mem
  2016-05-26 14:23     ` Michal Hocko
@ 2016-05-26 14:41       ` Tetsuo Handa
  2016-05-26 14:56         ` Michal Hocko
  0 siblings, 1 reply; 36+ messages in thread
From: Tetsuo Handa @ 2016-05-26 14:41 UTC (permalink / raw)
  To: mhocko, linux-mm; +Cc: rientjes, oleg, vdavydov, akpm, linux-kernel

Michal Hocko wrote:
> +/*
> + * Checks whether the given task is dying or exiting and likely to
> + * release its address space. This means that all threads and processes
> + * sharing the same mm have to be killed or exiting.
> + */
> +static inline bool task_will_free_mem(struct task_struct *task)
> +{
> +	struct mm_struct *mm = NULL;
> +	struct task_struct *p;
> +	bool ret = false;

If atomic_read(&p->mm->mm_users) <= get_nr_threads(p), this returns "false".
According to the previous version, I think this should be "bool ret = true;".

> +
> +	/*
> +	 * If the process has passed exit_mm we have to skip it because
> +	 * we have lost a link to other tasks sharing this mm, we do not
> +	 * have anything to reap and the task might then get stuck waiting
> +	 * for parent as zombie and we do not want it to hold TIF_MEMDIE
> +	 */
> +	p = find_lock_task_mm(task);
> +	if (!p)
> +		return false;
> +
> +	if (!__task_will_free_mem(p)) {
> +		task_unlock(p);
> +		return false;
> +	}
> +
> +	/*
> +	 * Check whether there are other processes sharing the mm - they all have
> +	 * to be killed or exiting.
> +	 */
> +	if (atomic_read(&p->mm->mm_users) > get_nr_threads(p)) {
> +		mm = p->mm;
> +		/* pin the mm to not get freed and reused */
> +		atomic_inc(&mm->mm_count);
> +	}
> +	task_unlock(p);
> +
> +	if (mm) {
> +		rcu_read_lock();
> +		for_each_process(p) {
> +			bool vfork;
> +
> +			/*
> +			 * skip over vforked tasks because they are mostly
> +			 * independent and will drop the mm soon
> +			 */
> +			task_lock(p);
> +			vfork = p->vfork_done;
> +			task_unlock(p);
> +			if (vfork)
> +				continue;
> +
> +			ret = __task_will_free_mem(p);
> +			if (!ret)
> +				break;
> +		}
> +		rcu_read_unlock();
> +		mmdrop(mm);
> +	}
> +
> +	return ret;
> +}


* Re: [PATCH 6/6] mm, oom: fortify task_will_free_mem
  2016-05-26 14:41       ` Tetsuo Handa
@ 2016-05-26 14:56         ` Michal Hocko
  0 siblings, 0 replies; 36+ messages in thread
From: Michal Hocko @ 2016-05-26 14:56 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

On Thu 26-05-16 23:41:54, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > +/*
> > + * Checks whether the given task is dying or exiting and likely to
> > + * release its address space. This means that all threads and processes
> > + * sharing the same mm have to be killed or exiting.
> > + */
> > +static inline bool task_will_free_mem(struct task_struct *task)
> > +{
> > +	struct mm_struct *mm = NULL;
> > +	struct task_struct *p;
> > +	bool ret = false;
> 
> If atomic_read(&p->mm->mm_users) <= get_nr_threads(p), this returns "false".
> According to the previous version, I think this should be "bool ret = true;".

true. Thanks for catching this. Fixed locally.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 1/6] mm, oom: do not loop over all tasks if there are no external tasks sharing mm
  2016-05-26 14:30   ` Tetsuo Handa
@ 2016-05-26 14:59     ` Michal Hocko
  2016-05-26 15:25       ` [PATCH 1/6] mm, oom: do not loop over all tasks if there are noexternal " Tetsuo Handa
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2016-05-26 14:59 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

On Thu 26-05-16 23:30:06, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 5bb2f7698ad7..0e33e912f7e4 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -820,6 +820,13 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> >  	task_unlock(victim);
> >  
> >  	/*
> > +	 * skip expensive iterations over all tasks if we know that there
> > +	 * are no users outside of threads in the same thread group
> > +	 */
> > +	if (atomic_read(&mm->mm_users) <= get_nr_threads(victim))
> > +		goto oom_reap;
> 
> Is this really safe? Isn't it possible that the victim thread's thread
> group has more than atomic_read(&mm->mm_users) threads, because some of
> them are past exit_mm() and blocked at exit_task_work(), which is before
> __exit_signal() from release_task() from exit_notify()?

You are right. The race window between exit_mm and __exit_signal is
really large. I thought about an == check instead but that wouldn't
work for the same reason. Dang, it looked so promising.

Scratch this patch then.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 1/6] mm, oom: do not loop over all tasks if there are noexternal tasks sharing mm
  2016-05-26 14:59     ` Michal Hocko
@ 2016-05-26 15:25       ` Tetsuo Handa
  2016-05-26 15:35         ` Michal Hocko
  0 siblings, 1 reply; 36+ messages in thread
From: Tetsuo Handa @ 2016-05-26 15:25 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

Michal Hocko wrote:
> On Thu 26-05-16 23:30:06, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index 5bb2f7698ad7..0e33e912f7e4 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -820,6 +820,13 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> > >  	task_unlock(victim);
> > >  
> > >  	/*
> > > +	 * skip expensive iterations over all tasks if we know that there
> > > +	 * are no users outside of threads in the same thread group
> > > +	 */
> > > +	if (atomic_read(&mm->mm_users) <= get_nr_threads(victim))
> > > +		goto oom_reap;
> > 
> > Is this really safe? Isn't it possible that the victim thread's thread
> > group has more than atomic_read(&mm->mm_users) threads, because some of
> > them are past exit_mm() and blocked at exit_task_work(), which is before
> > __exit_signal() from release_task() from exit_notify()?
> 
> You are right. The race window between exit_mm and __exit_signal is
> really large. I thought about an == check instead but that wouldn't
> work for the same reason. Dang, it looked so promising.
> 
> Scratch this patch then.
> 

I think that remembering whether this mm might be shared between
multiple thread groups at clone() time (i.e. whether
clone(CLONE_VM without CLONE_SIGHAND) was ever requested on this mm)
is safe (given that that thread already got SIGKILL or is exiting).

By the way, in oom_kill_process(), how can the (p->flags & PF_KTHREAD)
case become true when process_shares_mm() is true? Even if it can become
true, why can't we reap that mm? Is the (p->flags & PF_KTHREAD) case only
about not sending SIGKILL rather than not reaping that mm?


* Re: [PATCH 1/6] mm, oom: do not loop over all tasks if there are noexternal tasks sharing mm
  2016-05-26 15:25       ` [PATCH 1/6] mm, oom: do not loop over all tasks if there are noexternal " Tetsuo Handa
@ 2016-05-26 15:35         ` Michal Hocko
  2016-05-26 16:14           ` [PATCH 1/6] mm, oom: do not loop over all tasks if there are no external " Tetsuo Handa
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2016-05-26 15:35 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

On Fri 27-05-16 00:25:23, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 26-05-16 23:30:06, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > index 5bb2f7698ad7..0e33e912f7e4 100644
> > > > --- a/mm/oom_kill.c
> > > > +++ b/mm/oom_kill.c
> > > > @@ -820,6 +820,13 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> > > >  	task_unlock(victim);
> > > >  
> > > >  	/*
> > > > +	 * skip expensive iterations over all tasks if we know that there
> > > > +	 * are no users outside of threads in the same thread group
> > > > +	 */
> > > > +	if (atomic_read(&mm->mm_users) <= get_nr_threads(victim))
> > > > +		goto oom_reap;
> > > 
> > > Is this really safe? Isn't it possible that the victim thread's thread
> > > group has more than atomic_read(&mm->mm_users) threads, because some of
> > > them are past exit_mm() and blocked at exit_task_work(), which is before
> > > __exit_signal() from release_task() from exit_notify()?
> > 
> > You are right. The race window between exit_mm and __exit_signal is
> > really large. I thought about an == check instead but that wouldn't
> > work for the same reason. Dang, it looked so promising.
> > 
> > Scratch this patch then.
> > 
> 
> I think that remembering whether this mm might be shared between
> multiple thread groups at clone() time (i.e. whether
> clone(CLONE_VM without CLONE_SIGHAND) was ever requested on this mm)
> is safe (given that that thread already got SIGKILL or is exiting).

I was already playing with that idea but I didn't want to add anything
to the fork path which is really hot. This patch is not really needed
for the rest. It just felt like a nice optimization. I do not think it
is worth deeper changes in the fast paths.

> By the way, in oom_kill_process(), how can the (p->flags & PF_KTHREAD)
> case become true when process_shares_mm() is true?

not sure I understand. But the PF_KTHREAD check is there to catch
use_mm() usage by kernel threads.
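
For reference, the use_mm() pattern in a kernel thread is roughly (a
simplified sketch):

	use_mm(mm);	/* temporarily adopt a userspace mm */
	/* ... access the user address space ... */
	unuse_mm(mm);

so a kernel thread can show up among the users of the mm even though
killing it or reaping the mm under it would be unsafe.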

> Even if it can become true,
> why can't we reap that mm? Is the (p->flags & PF_KTHREAD) case only
> about not sending SIGKILL rather than not reaping that mm?

If we reaped the mm then the kernel thread could blow up when accessing
the memory.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 1/6] mm, oom: do not loop over all tasks if there are no external tasks sharing mm
  2016-05-26 15:35         ` Michal Hocko
@ 2016-05-26 16:14           ` Tetsuo Handa
  2016-05-27  6:45             ` Michal Hocko
  0 siblings, 1 reply; 36+ messages in thread
From: Tetsuo Handa @ 2016-05-26 16:14 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

Michal Hocko wrote:
> On Fri 27-05-16 00:25:23, Tetsuo Handa wrote:
> > I think that remembering whether this mm might be shared between
> > multiple thread groups at clone() time (i.e. whether
> > clone(CLONE_VM without CLONE_SIGHAND) was ever requested on this mm)
> > is safe (given that that thread already got SIGKILL or is exiting).
> 
> I was already playing with that idea but I didn't want to add anything
> to the fork path which is really hot. This patch is not really needed
> for the rest. It just felt like a nice optimization. I do not think it
> is worth deeper changes in the fast paths.

"[PATCH 6/6] mm, oom: fortify task_will_free_mem" depends on [PATCH 1/6].
You will need to update [PATCH 6/6].

It seems to me that [PATCH 6/6] resembles
http://lkml.kernel.org/r/201605250005.GHH26082.JOtQOSLMFFOFVH@I-love.SAKURA.ne.jp .
I think we will be happy if we can speed up the mm_is_reapable() test
using a "whether this mm might be shared between multiple thread groups"
flag. I don't think updating such a flag at clone() is too heavy an
operation to add.
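
E.g., hypothetically, a one-way bit set in the clone path (the
MMF_EXTERNAL_USER name is made up):

	/* copy_mm(), sketch */
	if ((clone_flags & CLONE_VM) && !(clone_flags & CLONE_SIGHAND))
		set_bit(MMF_EXTERNAL_USER, &oldmm->flags);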


* Re: [PATCH 1/6] mm, oom: do not loop over all tasks if there are no external tasks sharing mm
  2016-05-26 16:14           ` [PATCH 1/6] mm, oom: do not loop over all tasks if there are no external " Tetsuo Handa
@ 2016-05-27  6:45             ` Michal Hocko
  2016-05-27  7:15               ` Michal Hocko
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2016-05-27  6:45 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

On Fri 27-05-16 01:14:35, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 27-05-16 00:25:23, Tetsuo Handa wrote:
> > > I think that remembering whether this mm might be shared between
> > > multiple thread groups at clone() time (i.e. whether
> > > clone(CLONE_VM without CLONE_SIGHAND) was ever requested on this mm)
> > > is safe (given that that thread already got SIGKILL or is exiting).
> > 
> > I was already playing with that idea but I didn't want to add anything
> > to the fork path which is really hot. This patch is not really needed
> > for the rest. It just felt like a nice optimization. I do not think it
> > is worth deeper changes in the fast paths.
> 
> "[PATCH 6/6] mm, oom: fortify task_will_free_mem" depends on [PATCH 1/6].
> You will need to update [PATCH 6/6].
> 
> It seems to me that [PATCH 6/6] resembles
> http://lkml.kernel.org/r/201605250005.GHH26082.JOtQOSLMFFOFVH@I-love.SAKURA.ne.jp .
> I think we will be happy if we can speed up the mm_is_reapable() test
> using a "whether this mm might be shared between multiple thread groups"
> flag. I don't think updating such a flag at clone() is too heavy an
> operation to add.

It is still an operation which is not needed for 99% of situations. So
if we do not need it for correctness then I do not think this is worth
bothering.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 1/6] mm, oom: do not loop over all tasks if there are no external tasks sharing mm
  2016-05-27  6:45             ` Michal Hocko
@ 2016-05-27  7:15               ` Michal Hocko
  2016-05-27  8:03                 ` Michal Hocko
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2016-05-27  7:15 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

On Fri 27-05-16 08:45:10, Michal Hocko wrote:
[...]
> It is still an operation which is not needed for 99% of situations. So
> if we do not need it for correctness then I do not think this is worth
> bothering.

Since you pointed out the exit_mm vs. __exit_signal race yesterday I
have been thinking about how to make the check reliable. Even
atomic_read(mm->mm_users) > get_nr_threads() is not reliable and we can
miss other tasks just because the current thread group is mostly past
exit_mm. So far I couldn't find a way to work around this though.
I will think about it more but I am afraid that a flag would really be
needed after all.
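
For reference the check I have in mind looks something like the
following (illustrative only, and this is exactly the thing I cannot
make race free):

/* assumes a stable p->mm, e.g. caller holds task_lock(p) */
static bool mm_maybe_has_external_users(struct task_struct *p)
{
	/*
	 * Racy: a thread which has passed exit_mm() has already dropped
	 * its mm_users reference while it is still counted by
	 * get_nr_threads(), so external users of the mm can be missed.
	 */
	return atomic_read(&p->mm->mm_users) > get_nr_threads(p);
}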
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/6] mm, oom: do not loop over all tasks if there are no external tasks sharing mm
  2016-05-27  7:15               ` Michal Hocko
@ 2016-05-27  8:03                 ` Michal Hocko
  0 siblings, 0 replies; 36+ messages in thread
From: Michal Hocko @ 2016-05-27  8:03 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

On Fri 27-05-16 09:15:07, Michal Hocko wrote:
> On Fri 27-05-16 08:45:10, Michal Hocko wrote:
> [...]
> > It is still an operation which is not needed for 99% of situations. So
> > if we do not need it for correctness then I do not think this is worth
> > bothering.
> 
> Since you pointed out the exit_mm vs. __exit_signal race yesterday I
> have been thinking about how to make the check reliable. Even
> atomic_read(mm->mm_users) > get_nr_threads() is not reliable and we can
> miss other tasks just because the current thread group is mostly past
> exit_mm. So far I couldn't find a way to work around this though.

Just for the record I was playing with the following yesterday but I
couldn't convince myself that this is safe and reasonable in the first
place (I do not like it to be honest).
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1685890d424e..db027eca8be5 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -123,6 +123,35 @@ struct task_struct *find_lock_task_mm(struct task_struct *p)
 	return t;
 }
 
+bool task_has_external_users(struct task_struct *p)
+{
+	struct mm_struct *mm = NULL;
+	struct task_struct *t;
+	int active_threads = 0;
+	bool ret = true;	/* be pessimistic */
+
+	rcu_read_lock();
+	for_each_thread(p, t) {
+		task_lock(t);
+		if (likely(t->mm)) {
+			active_threads++;
+			if (!mm) {
+				mm = t->mm;
+				atomic_inc(&mm->mm_count);
+			}
+		}
+		task_unlock(t);
+	}
+	rcu_read_unlock();
+
+	if (mm) {
+		if (atomic_read(&mm->mm_users) <= active_threads)
+			ret = false;
+		mmdrop(mm);
+	}
+	return ret;
+}
+
 /*
  * order == -1 means the oom kill is required by sysrq, otherwise only
  * for display purposes.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 6/6] mm, oom: fortify task_will_free_mem
  2016-05-26 12:40 ` [PATCH 6/6] mm, oom: fortify task_will_free_mem Michal Hocko
  2016-05-26 14:11   ` Tetsuo Handa
@ 2016-05-27 11:07   ` Michal Hocko
  1 sibling, 0 replies; 36+ messages in thread
From: Michal Hocko @ 2016-05-27 11:07 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML

I have updated the patch to not rely on the mm_users check because it is
not reliable, as pointed out by Tetsuo, and we really want this function
to be reliable. I do not have a good and reliable way to check for the
existence of external users sharing the mm so we check the whole process
list unconditionally if mm_users > 1. This should be acceptable because
we are doing this only in the slow path of task_will_free_mem. We can
surely optimize this later.
---
From 24149128304fcd90cb656abbfaef91e734db242a Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 26 May 2016 09:38:32 +0200
Subject: [PATCH] mm, oom: fortify task_will_free_mem

task_will_free_mem is rather weak. It doesn't really tell whether
the task has a chance to drop its mm. 98748bd72200 ("oom: consider
multi-threaded tasks in task_will_free_mem") made a first step
towards making it more robust for multi-threaded applications so now we
know that the whole process is going down and will probably drop the mm.

This patch builds on top of that for more complex scenarios where mm
is shared between different processes - CLONE_VM without CLONE_THREAD
resp. CLONE_SIGHAND, or in kernel use_mm().

Make sure that all processes sharing the mm are killed or exiting. This
allows us to replace try_oom_reaper by wake_oom_reaper because all
paths which bypass the oom killer are now reapable and so they
shouldn't lock up the oom killer.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/oom.h | 78 +++++++++++++++++++++++++++++++++++++++++++++++++----
 mm/memcontrol.c     |  4 +--
 mm/oom_kill.c       | 66 ++++------------------------------------------
 3 files changed, 80 insertions(+), 68 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 83469522690a..4acafc201e78 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -70,9 +70,9 @@ static inline bool oom_task_origin(const struct task_struct *p)
 extern void mark_oom_victim(struct task_struct *tsk);
 
 #ifdef CONFIG_MMU
-extern void try_oom_reaper(struct task_struct *tsk);
+extern void wake_oom_reaper(struct task_struct *tsk);
 #else
-static inline void try_oom_reaper(struct task_struct *tsk)
+static inline void wake_oom_reaper(struct task_struct *tsk)
 {
 }
 #endif
@@ -105,7 +105,7 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
-static inline bool task_will_free_mem(struct task_struct *task)
+static inline bool __task_will_free_mem(struct task_struct *task)
 {
 	struct signal_struct *sig = task->signal;
 
@@ -117,16 +117,84 @@ static inline bool task_will_free_mem(struct task_struct *task)
 	if (sig->flags & SIGNAL_GROUP_COREDUMP)
 		return false;
 
-	if (!(task->flags & PF_EXITING))
+	if (!(task->flags & PF_EXITING || fatal_signal_pending(task)))
 		return false;
 
 	/* Make sure that the whole thread group is going down */
-	if (!thread_group_empty(task) && !(sig->flags & SIGNAL_GROUP_EXIT))
+	if (!thread_group_empty(task) &&
+		!(sig->flags & SIGNAL_GROUP_EXIT || fatal_signal_pending(task)))
 		return false;
 
 	return true;
 }
 
+/*
+ * Checks whether the given task is dying or exiting and likely to
+ * release its address space. This means that all threads and processes
+ * sharing the same mm have to be killed or exiting.
+ */
+static inline bool task_will_free_mem(struct task_struct *task)
+{
+	struct mm_struct *mm = NULL;
+	struct task_struct *p;
+	bool ret = true;
+
+	/*
+	 * If the process has passed exit_mm we have to skip it because
+	 * we have lost a link to other tasks sharing this mm, we do not
+	 * have anything to reap and the task might then get stuck waiting
+	 * for parent as zombie and we do not want it to hold TIF_MEMDIE
+	 */
+	p = find_lock_task_mm(task);
+	if (!p)
+		return false;
+
+	if (!__task_will_free_mem(p)) {
+		task_unlock(p);
+		return false;
+	}
+
+	mm = p->mm;
+	if (atomic_read(&mm->mm_users) <= 1) {
+		task_unlock(p);
+		return true;
+	}
+
+	/* pin the mm to not get freed and reused */
+	atomic_inc(&mm->mm_count);
+	task_unlock(p);
+
+	/*
+	 * This is really pessimistic but we do not have a cheap way to
+	 * find external users of the mm so check the whole process list
+	 */
+	rcu_read_lock();
+	for_each_process(p) {
+		bool vfork;
+
+		if (!process_shares_mm(p, mm))
+			continue;
+
+		/*
+		 * skip over vforked tasks because they are mostly
+		 * independent and will drop the mm soon
+		 */
+		task_lock(p);
+		vfork = p->vfork_done;
+		task_unlock(p);
+		if (vfork)
+			continue;
+
+		ret = __task_will_free_mem(p);
+		if (!ret)
+			break;
+	}
+	rcu_read_unlock();
+	mmdrop(mm);
+
+	return ret;
+}
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 37ba604984c9..1353920e065e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1275,9 +1275,9 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
 	 */
-	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
+	if (task_will_free_mem(current)) {
 		mark_oom_victim(current);
-		try_oom_reaper(current);
+		wake_oom_reaper(current);
 		goto unlock;
 	}
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d48ec58cf4d2..bcb6d3b26c94 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -590,7 +590,7 @@ static int oom_reaper(void *unused)
 	return 0;
 }
 
-static void wake_oom_reaper(struct task_struct *tsk)
+void wake_oom_reaper(struct task_struct *tsk)
 {
 	if (!oom_reaper_th)
 		return;
@@ -608,51 +608,6 @@ static void wake_oom_reaper(struct task_struct *tsk)
 	wake_up(&oom_reaper_wait);
 }
 
-/* Check if we can reap the given task. This has to be called with stable
- * tsk->mm
- */
-void try_oom_reaper(struct task_struct *tsk)
-{
-	struct mm_struct *mm = tsk->mm;
-	struct task_struct *p;
-
-	if (!mm)
-		return;
-
-	/*
-	 * There might be other threads/processes which are either not
-	 * dying or even not killable.
-	 */
-	if (atomic_read(&mm->mm_users) > 1) {
-		rcu_read_lock();
-		for_each_process(p) {
-			bool exiting;
-
-			if (!process_shares_mm(p, mm))
-				continue;
-			if (fatal_signal_pending(p))
-				continue;
-
-			/*
-			 * If the task is exiting make sure the whole thread group
-			 * is exiting and cannot acces mm anymore.
-			 */
-			spin_lock_irq(&p->sighand->siglock);
-			exiting = signal_group_exit(p->signal);
-			spin_unlock_irq(&p->sighand->siglock);
-			if (exiting)
-				continue;
-
-			/* Give up */
-			rcu_read_unlock();
-			return;
-		}
-		rcu_read_unlock();
-	}
-
-	wake_oom_reaper(tsk);
-}
-
 static int __init oom_init(void)
 {
 	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
@@ -664,10 +619,6 @@ static int __init oom_init(void)
 	return 0;
 }
 subsys_initcall(oom_init)
-#else
-static void wake_oom_reaper(struct task_struct *tsk)
-{
-}
 #endif
 
 /**
@@ -765,15 +716,12 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
-	task_lock(p);
-	if (p->mm && task_will_free_mem(p)) {
+	if (task_will_free_mem(p)) {
 		mark_oom_victim(p);
-		try_oom_reaper(p);
-		task_unlock(p);
+		wake_oom_reaper(p);
 		put_task_struct(p);
 		return;
 	}
-	task_unlock(p);
 
 	if (__ratelimit(&oom_rs))
 		dump_header(oc, p, memcg);
@@ -952,14 +900,10 @@ bool out_of_memory(struct oom_control *oc)
 	 * If current has a pending SIGKILL or is exiting, then automatically
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
-	 *
-	 * But don't select if current has already released its mm and cleared
-	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
 	 */
-	if (current->mm &&
-	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
+	if (task_will_free_mem(current)) {
 		mark_oom_victim(current);
-		try_oom_reaper(current);
+		wake_oom_reaper(current);
 		return true;
 	}
 
-- 
2.8.1


-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/6] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
  2016-05-26 12:40 ` [PATCH 3/6] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj Michal Hocko
@ 2016-05-27 11:18   ` Michal Hocko
  2016-05-27 16:18     ` Vladimir Davydov
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2016-05-27 11:18 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML

And here again. Get rid of the mm_users check because it is not
reliable.
---
From 7681e91cba6bcd45f9ebc5d2dcee3df06c687296 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Wed, 25 May 2016 19:50:34 +0200
Subject: [PATCH] mm, oom_adj: make sure processes sharing mm have same view of
 oom_score_adj

oom_score_adj is shared within a thread group (via struct signal) but
this is not sufficient to cover processes sharing mm (CLONE_VM without
CLONE_THREAD resp. CLONE_SIGHAND) and so we can easily end up in a
situation where some processes update their oom_score_adj and confuse
the oom killer. In the worst case some of those processes might hide
from the oom killer altogether via OOM_SCORE_ADJ_MIN while others are
eligible. The OOM killer would then pick one of the eligible processes
but wouldn't be allowed to kill the others sharing the same mm, so the
mm, and thus the memory, wouldn't be released.

It would be ideal to have the oom_score_adj per mm_struct because that
is the natural entity the OOM killer considers. But this will not work
because some programs are doing
	vfork()
	set_oom_adj()
	exec()
We can achieve the same though. The oom_score_adj write handler can set
the oom_score_adj for all processes sharing the same mm if the task is
not in the middle of vfork. As a result all the processes will share the
same oom_score_adj. The current implementation is rather pessimistic and
checks all the existing processes by default if there is more than one
holder of the mm, because we do not have any reliable way to check for
external users yet.

Note that we have to serialize all the oom_score_adj writers now to
guarantee they do not interleave and generate inconsistent results.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c     | 38 ++++++++++++++++++++++++++++++++++++
 include/linux/mm.h |  2 ++
 mm/oom_kill.c      |  2 +-
 3 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 0afc77d4d84a..fa0b3ca94dfb 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1043,10 +1043,13 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
 
 static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 {
+	static DEFINE_MUTEX(oom_adj_mutex);
+	struct mm_struct *mm = NULL;
 	struct task_struct *task;
 	unsigned long flags;
 	int err = 0;
 
+	mutex_lock(&oom_adj_mutex);
 	task = get_proc_task(file_inode(file));
 	if (!task) {
 		err = -ESRCH;
@@ -1079,6 +1082,23 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 		}
 	}
 
+	/*
+	 * Make sure we will check other processes sharing the mm if this is
+	 * not a vfork which wants its own oom_score_adj.
+	 * Pin the mm so it doesn't go away and get reused.
+	 */
+	if (!task->vfork_done) {
+		struct task_struct *p = find_lock_task_mm(task);
+
+		if (p) {
+			if (atomic_read(&p->mm->mm_users) > 1) {
+				mm = p->mm;
+				atomic_inc(&mm->mm_count);
+			}
+			task_unlock(p);
+		}
+	}
+
 	task->signal->oom_score_adj = oom_adj;
 	if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
 		task->signal->oom_score_adj_min = (short)oom_adj;
@@ -1087,7 +1105,25 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 	unlock_task_sighand(task, &flags);
 err_put_task:
 	put_task_struct(task);
+
+	if (mm) {
+		struct task_struct *p;
+
+		rcu_read_lock();
+		for_each_process(p) {
+			task_lock(p);
+			if (!p->vfork_done && process_shares_mm(p, mm)) {
+				p->signal->oom_score_adj = oom_adj;
+				if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
+					p->signal->oom_score_adj_min = (short)oom_adj;
+			}
+			task_unlock(p);
+		}
+		rcu_read_unlock();
+		mmdrop(mm);
+	}
 out:
+	mutex_unlock(&oom_adj_mutex);
 	return err;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 05102822912c..b44d3d792a00 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2248,6 +2248,8 @@ static inline int in_gate_area(struct mm_struct *mm, unsigned long addr)
 }
 #endif	/* __HAVE_ARCH_GATE_AREA */
 
+extern bool process_shares_mm(struct task_struct *p, struct mm_struct *mm);
+
 #ifdef CONFIG_SYSCTL
 extern int sysctl_drop_caches;
 int drop_caches_sysctl_handler(struct ctl_table *, int,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1685890d424e..268b76b88220 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -416,7 +416,7 @@ bool oom_killer_disabled __read_mostly;
  * task's threads: if one of those is using this mm then this task was also
  * using it.
  */
-static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
+bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 {
 	struct task_struct *t;
 
-- 
2.8.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/5] Handle oom bypass more gracefully
  2016-05-26 12:40 [PATCH 0/5] Handle oom bypass more gracefully Michal Hocko
                   ` (5 preceding siblings ...)
  2016-05-26 12:40 ` [PATCH 6/6] mm, oom: fortify task_will_free_mem Michal Hocko
@ 2016-05-27 16:00 ` Michal Hocko
  6 siblings, 0 replies; 36+ messages in thread
From: Michal Hocko @ 2016-05-27 16:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML

JFYI, I plan to repost the series early next week after I review all the
pieces again properly with a clear head. If some parts are not sound or
completely unacceptable in principle then let me know of course.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/6] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
  2016-05-27 11:18   ` Michal Hocko
@ 2016-05-27 16:18     ` Vladimir Davydov
  2016-05-30  7:07       ` Michal Hocko
  0 siblings, 1 reply; 36+ messages in thread
From: Vladimir Davydov @ 2016-05-27 16:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Oleg Nesterov,
	Andrew Morton, LKML

On Fri, May 27, 2016 at 01:18:03PM +0200, Michal Hocko wrote:
...
> @@ -1087,7 +1105,25 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
>  	unlock_task_sighand(task, &flags);
>  err_put_task:
>  	put_task_struct(task);
> +
> +	if (mm) {
> +		struct task_struct *p;
> +
> +		rcu_read_lock();
> +		for_each_process(p) {
> +			task_lock(p);
> +			if (!p->vfork_done && process_shares_mm(p, mm)) {
> +				p->signal->oom_score_adj = oom_adj;
> +				if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
> +					p->signal->oom_score_adj_min = (short)oom_adj;
> +			}
> +			task_unlock(p);

I.e. you write to /proc/pid1/oom_score_adj and get
/proc/pid2/oom_score_adj updated if pid1 and pid2 share mm?
IMO that looks unexpected from userspace pov.

Maybe we'd better add mm->oom_score_adj and set it to the min
signal->oom_score_adj over all processes sharing it? This would
require iterating over all processes every time oom_score_adj gets
updated, but that's a slow path.
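
Something along these lines (a rough sketch; mm->oom_score_adj would be
a new field and the necessary locking is left out):

static void mm_update_oom_score_adj(struct mm_struct *mm)
{
	struct task_struct *p;
	short adj = OOM_SCORE_ADJ_MAX;

	rcu_read_lock();
	for_each_process(p) {
		if (!process_shares_mm(p, mm))
			continue;
		if (p->signal->oom_score_adj < adj)
			adj = p->signal->oom_score_adj;
	}
	rcu_read_unlock();

	mm->oom_score_adj = adj;	/* hypothetical new field */
}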

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/6] mm, oom: skip over vforked tasks
  2016-05-26 12:40 ` [PATCH 4/6] mm, oom: skip over vforked tasks Michal Hocko
@ 2016-05-27 16:48   ` Vladimir Davydov
  2016-05-30  7:13     ` Michal Hocko
  2016-05-30 12:03   ` Michal Hocko
  1 sibling, 1 reply; 36+ messages in thread
From: Vladimir Davydov @ 2016-05-27 16:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Oleg Nesterov,
	Andrew Morton, LKML, Michal Hocko

On Thu, May 26, 2016 at 02:40:13PM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> vforked tasks are not really sitting on memory so it doesn't matter much
> to kill them. Parents are waiting for the vforked task in a killable
> state so it is better to choose the parent which is the real mm owner.
> Teach oom_badness to ignore all tasks which haven't passed mm_release.
> oom_kill_process should ignore them as well because they will drop the
> mm soon and they will not block oom_reaper because they cannot touch
> any memory.

That is, if a process calls vfork->exec to spawn a child, and a newly
spawned child happens to invoke oom somewhere in exec, instead of
killing the child, which hasn't done anything yet, we'll kill the main
process while the child continues to run. Not sure if it's really bad
though.

...
> @@ -839,6 +841,13 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  	for_each_process(p) {
>  		if (!process_shares_mm(p, mm))
>  			continue;
> +		/*
> +		 * vforked tasks are ignored because they will drop the mm soon
> +		 * hopefully and even if not they will not mind being oom
> +		 * reaped because they cannot touch any memory.

They shouldn't modify memory, but they still can touch it AFAIK.

> +		 */
> +		if (p->vfork_done)
> +			continue;
>  		if (same_thread_group(p, victim))
>  			continue;
>  		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/6] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
  2016-05-27 16:18     ` Vladimir Davydov
@ 2016-05-30  7:07       ` Michal Hocko
  2016-05-30  8:47         ` Vladimir Davydov
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2016-05-30  7:07 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Oleg Nesterov,
	Andrew Morton, LKML

On Fri 27-05-16 19:18:21, Vladimir Davydov wrote:
> On Fri, May 27, 2016 at 01:18:03PM +0200, Michal Hocko wrote:
> ...
> > @@ -1087,7 +1105,25 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
> >  	unlock_task_sighand(task, &flags);
> >  err_put_task:
> >  	put_task_struct(task);
> > +
> > +	if (mm) {
> > +		struct task_struct *p;
> > +
> > +		rcu_read_lock();
> > +		for_each_process(p) {
> > +			task_lock(p);
> > +			if (!p->vfork_done && process_shares_mm(p, mm)) {
> > +				p->signal->oom_score_adj = oom_adj;
> > +				if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
> > +					p->signal->oom_score_adj_min = (short)oom_adj;
> > +			}
> > +			task_unlock(p);
> 
> I.e. you write to /proc/pid1/oom_score_adj and get
> /proc/pid2/oom_score_adj updated if pid1 and pid2 share mm?
> IMO that looks unexpected from userspace pov.

How much different it is from threads in the same thread group?
Processes sharing the mm without signals is a rather weird threading
model isn't it? Currently we just lie to users about their oom_score_adj
in this weird corner case. The only exception was OOM_SCORE_ADJ_MIN
where we really didn't kill the task but all other values are simply
ignored in practice.

> Maybe we'd better add mm->oom_score_adj and set it to the min
> signal->oom_score_adj over all processes sharing it? This would
> require iterating over all processes every time oom_score_adj gets
> updated, but that's a slow path.

Not sure I understand. So you would prefer that mm->oom_score_adj might
disagree with p->signal->oom_score_adj?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/6] mm, oom: skip over vforked tasks
  2016-05-27 16:48   ` Vladimir Davydov
@ 2016-05-30  7:13     ` Michal Hocko
  2016-05-30  9:52       ` Michal Hocko
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2016-05-30  7:13 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Oleg Nesterov,
	Andrew Morton, LKML

On Fri 27-05-16 19:48:30, Vladimir Davydov wrote:
> On Thu, May 26, 2016 at 02:40:13PM +0200, Michal Hocko wrote:
[...]
> > @@ -839,6 +841,13 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> >  	for_each_process(p) {
> >  		if (!process_shares_mm(p, mm))
> >  			continue;
> > +		/*
> > +		 * vforked tasks are ignored because they will drop the mm soon
> > +		 * hopefully and even if not they will not mind being oom
> > +		 * reaped because they cannot touch any memory.
> 
> They shouldn't modify memory, but they still can touch it AFAIK.

You are right. This means that the vforked child might see zero pages.
Let me think whether this is acceptable or not.

Thanks!

> 
> > +		 */
> > +		if (p->vfork_done)
> > +			continue;
> >  		if (same_thread_group(p, victim))
> >  			continue;
> >  		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/6] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
  2016-05-30  7:07       ` Michal Hocko
@ 2016-05-30  8:47         ` Vladimir Davydov
  2016-05-30  9:39           ` Michal Hocko
  0 siblings, 1 reply; 36+ messages in thread
From: Vladimir Davydov @ 2016-05-30  8:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Oleg Nesterov,
	Andrew Morton, LKML

On Mon, May 30, 2016 at 09:07:05AM +0200, Michal Hocko wrote:
> On Fri 27-05-16 19:18:21, Vladimir Davydov wrote:
> > On Fri, May 27, 2016 at 01:18:03PM +0200, Michal Hocko wrote:
> > ...
> > > @@ -1087,7 +1105,25 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
> > >  	unlock_task_sighand(task, &flags);
> > >  err_put_task:
> > >  	put_task_struct(task);
> > > +
> > > +	if (mm) {
> > > +		struct task_struct *p;
> > > +
> > > +		rcu_read_lock();
> > > +		for_each_process(p) {
> > > +			task_lock(p);
> > > +			if (!p->vfork_done && process_shares_mm(p, mm)) {
> > > +				p->signal->oom_score_adj = oom_adj;
> > > +				if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
> > > +					p->signal->oom_score_adj_min = (short)oom_adj;
> > > +			}
> > > +			task_unlock(p);
> > 
> > I.e. you write to /proc/pid1/oom_score_adj and get
> > /proc/pid2/oom_score_adj updated if pid1 and pid2 share mm?
> > IMO that looks unexpected from userspace pov.
> 
> How much different it is from threads in the same thread group?
> Processes sharing the mm without signals is a rather weird threading
> model isn't it?

I think so too. I wouldn't be surprised if it turned out that nobody had
ever used it. But maybe there's someone out there who does.

> Currently we just lie to users about their oom_score_adj
> in this weird corner case.

Hmm, looks like a bug, but nobody has ever complained about it.

> The only exception was OOM_SCORE_ADJ_MIN
> where we really didn't kill the task but all other values are simply
> ignored in practice.
> 
> > Maybe we'd better add mm->oom_score_adj and set it to the min
> > signal->oom_score_adj over all processes sharing it? This would
> > require iterating over all processes every time oom_score_adj gets
> > updated, but that's a slow path.
> 
> Not sure I understand. So you would prefer that mm->oom_score_adj might
> disagree with p->signal->oom_score_adj?

No, I wouldn't. I'd rather agree that oom_score_adj should be per mm,
because we choose the victim basing solely on mm stats.

What I mean is we don't touch p->signal->oom_score_adj of other tasks
sharing mm, but instead store minimal oom_score_adj over all tasks
sharing mm in the mm_struct whenever a task's oom_score_adj is modified.
And use mm->oom_score_adj instead of signal->oom_score_adj in oom killer
code. This would save us from any accusations of user API modifications
and it would also make the oom code a bit easier to follow IMHO.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/6] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
  2016-05-30  8:47         ` Vladimir Davydov
@ 2016-05-30  9:39           ` Michal Hocko
  2016-05-30 10:26             ` Vladimir Davydov
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2016-05-30  9:39 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Oleg Nesterov,
	Andrew Morton, LKML

On Mon 30-05-16 11:47:53, Vladimir Davydov wrote:
> On Mon, May 30, 2016 at 09:07:05AM +0200, Michal Hocko wrote:
> > On Fri 27-05-16 19:18:21, Vladimir Davydov wrote:
> > > On Fri, May 27, 2016 at 01:18:03PM +0200, Michal Hocko wrote:
> > > ...
> > > > @@ -1087,7 +1105,25 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
> > > >  	unlock_task_sighand(task, &flags);
> > > >  err_put_task:
> > > >  	put_task_struct(task);
> > > > +
> > > > +	if (mm) {
> > > > +		struct task_struct *p;
> > > > +
> > > > +		rcu_read_lock();
> > > > +		for_each_process(p) {
> > > > +			task_lock(p);
> > > > +			if (!p->vfork_done && process_shares_mm(p, mm)) {
> > > > +				p->signal->oom_score_adj = oom_adj;
> > > > +				if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
> > > > +					p->signal->oom_score_adj_min = (short)oom_adj;
> > > > +			}
> > > > +			task_unlock(p);
> > > 
> > > I.e. you write to /proc/pid1/oom_score_adj and get
> > > /proc/pid2/oom_score_adj updated if pid1 and pid2 share mm?
> > > IMO that looks unexpected from userspace pov.
> > 
> > How much different it is from threads in the same thread group?
> > Processes sharing the mm without signals is a rather weird threading
> > model isn't it?
> 
> I think so too. I wouldn't be surprised if it turned out that nobody had
> ever used it. But maybe there's someone out there who does.

I have heard some rumors about users. But I haven't heard anything about
their oom_score_adj usage patterns.

> > Currently we just lie to users about their oom_score_adj
> > in this weird corner case.
> 
> Hmm, looks like a bug, but nobody has ever complained about it.

Yes and that leads me to a suspicion that we can do that. Maybe I should
just add a note into the log that we are doing that so that people can
complain? Something like the following
diff --git a/fs/proc/base.c b/fs/proc/base.c
index fa0b3ca94dfb..7f3495415719 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1104,7 +1104,6 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 err_sighand:
 	unlock_task_sighand(task, &flags);
 err_put_task:
-	put_task_struct(task);
 
 	if (mm) {
 		struct task_struct *p;
@@ -1113,6 +1112,10 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 		for_each_process(p) {
 			task_lock(p);
 			if (!p->vfork_done && process_shares_mm(p, mm)) {
+				pr_info("updating oom_score_adj for %d (%s) from %d to %d because it shares mm with %d (%s). Report if this is unexpected.\n",
+						task_pid_nr(p), p->comm,
+						p->signal->oom_score_adj, oom_adj,
+						task_pid_nr(task), task->comm);
 				p->signal->oom_score_adj = oom_adj;
 				if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
 					p->signal->oom_score_adj_min = (short)oom_adj;
@@ -1122,6 +1125,7 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 		rcu_read_unlock();
 		mmdrop(mm);
 	}
+	put_task_struct(task);
 out:
 	mutex_unlock(&oom_adj_mutex);
 	return err;

> > The only exception was OOM_SCORE_ADJ_MIN
> > where we really didn't kill the task but all other values are simply
> > ignored in practice.
> > 
> > > Maybe we'd better add mm->oom_score_adj and set it to the min
> > > signal->oom_score_adj over all processes sharing it? This would
> > > require iterating over all processes every time oom_score_adj gets
> > > updated, but that's a slow path.
> > 
> > Not sure I understand. So you would prefer that mm->oom_score_adj might
> > disagree with p->signal->oom_score_adj?
> 
> No, I wouldn't. I'd rather agree that oom_score_adj should be per mm,
> because we choose the victim basing solely on mm stats.
> 
> What I mean is we don't touch p->signal->oom_score_adj of other tasks
> sharing mm, but instead store minimal oom_score_adj over all tasks
> sharing mm in the mm_struct whenever a task's oom_score_adj is modified.
> And use mm->oom_score_adj instead of signal->oom_score_adj in oom killer
> code. This would save us from any accusations of user API modifications
> and it would also make the oom code a bit easier to follow IMHO.

I understand your point but this is essentially lying because we
consider a different value than the user can observe in userspace.
Consider somebody doing insanity like

current->oom_score_adj = OOM_SCORE_ADJ_MIN
p = clone(CLONE_VM)
p->oom_score_adj = OOM_SCORE_ADJ_MAX

so one process would want to be always selected while the other one
doesn't want to get killed. All they can see is that everything is
put in place until the oom killer comes over and ignores that.

I think we should just be explicit. Maybe we want to treat
OOM_SCORE_ADJ_MIN specially - e.g. do not even try to set oom_score_adj if
one of the sharing tasks is oom disabled. But I would rather wait for
somebody to complain and explain why the usecase really makes sense than
be all silent with implicit behavior.
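
To be more concrete, such a special case could look roughly like the
following in __set_oom_adj (untested sketch; the error code is arbitrary
and p/mm come from the surrounding context):

/*
 * Untested: refuse the update rather than silently overriding an oom
 * disabled task which shares the mm.
 */
rcu_read_lock();
for_each_process(p) {
	if (process_shares_mm(p, mm) &&
	    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
		err = -EACCES;
		break;
	}
}
rcu_read_unlock();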

Btw. we already had a per mm oom_adj but we had to revert it due
to vfork behavior. See 0753ba01e126 ("mm: revert "oom: move oom_adj
value""). This patch gets us back there except it handles the vfork issue.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/6] mm, oom: skip over vforked tasks
  2016-05-30  7:13     ` Michal Hocko
@ 2016-05-30  9:52       ` Michal Hocko
  2016-05-30 10:40         ` Vladimir Davydov
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2016-05-30  9:52 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Oleg Nesterov,
	Andrew Morton, LKML

On Mon 30-05-16 09:13:57, Michal Hocko wrote:
> On Fri 27-05-16 19:48:30, Vladimir Davydov wrote:
> > On Thu, May 26, 2016 at 02:40:13PM +0200, Michal Hocko wrote:
> [...]
> > > @@ -839,6 +841,13 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> > >  	for_each_process(p) {
> > >  		if (!process_shares_mm(p, mm))
> > >  			continue;
> > > +		/*
> > > +		 * vforked tasks are ignored because they will drop the mm soon
> > > +		 * hopefully and even if not they will not mind being oom
> > > +		 * reaped because they cannot touch any memory.
> > 
> > They shouldn't modify memory, but they still can touch it AFAIK.
> 
> You are right. This means that the vforked child might see zero pages.
> Let me think whether this is acceptable or not.

OK, I was thinking about it some more and I think you have a good point
here. I can see two options here:
- keep vforked task alive and skip the oom reaper. If the victim exits
  normally and the oom wouldn't get resolved the vforked task will be
  selected in the next round because the victim would clean up
  vfork_done state in  wait_for_vfork_done. We are still risking that
  the victim gets stuck though
- kill vforked task and so it would be reapable.

The latter sounds more robust to me because we invoke the oom_reaper and
the side effect shouldn't be really a problem because the vforked task
couldn't have done a lot of useful work anyway. So I will drop this
patch and update "mm, oom: fortify task_will_free_mem" to skip the
vfork check as well.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/6] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
  2016-05-30  9:39           ` Michal Hocko
@ 2016-05-30 10:26             ` Vladimir Davydov
  2016-05-30 11:11               ` Michal Hocko
  0 siblings, 1 reply; 36+ messages in thread
From: Vladimir Davydov @ 2016-05-30 10:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Oleg Nesterov,
	Andrew Morton, LKML

On Mon, May 30, 2016 at 11:39:50AM +0200, Michal Hocko wrote:
> On Mon 30-05-16 11:47:53, Vladimir Davydov wrote:
> > On Mon, May 30, 2016 at 09:07:05AM +0200, Michal Hocko wrote:
> > > On Fri 27-05-16 19:18:21, Vladimir Davydov wrote:
> > > > On Fri, May 27, 2016 at 01:18:03PM +0200, Michal Hocko wrote:
> > > > ...
> > > > > @@ -1087,7 +1105,25 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
> > > > >  	unlock_task_sighand(task, &flags);
> > > > >  err_put_task:
> > > > >  	put_task_struct(task);
> > > > > +
> > > > > +	if (mm) {
> > > > > +		struct task_struct *p;
> > > > > +
> > > > > +		rcu_read_lock();
> > > > > +		for_each_process(p) {
> > > > > +			task_lock(p);
> > > > > +			if (!p->vfork_done && process_shares_mm(p, mm)) {
> > > > > +				p->signal->oom_score_adj = oom_adj;
> > > > > +				if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
> > > > > +					p->signal->oom_score_adj_min = (short)oom_adj;
> > > > > +			}
> > > > > +			task_unlock(p);
> > > > 
> > > > I.e. you write to /proc/pid1/oom_score_adj and get
> > > > /proc/pid2/oom_score_adj updated if pid1 and pid2 share mm?
> > > > IMO that looks unexpected from userspace pov.
> > > 
> > > How much different it is from threads in the same thread group?
> > > Processes sharing the mm without signals is a rather weird threading
> > > model isn't it?
> > 
> > I think so too. I wouldn't be surprised if it turned out that nobody had
> > ever used it. But maybe there's someone out there who does.
> 
> I have heard some rumors about users. But I haven't heard anything about
> their oom_score_adj usage patterns.
> 
> > > Currently we just lie to users about their oom_score_adj
> > > in this weird corner case.
> > 
> > Hmm, looks like a bug, but nobody has ever complained about it.
> 
> Yes and that leads me to a suspicion that we can do that. Maybe I should
> just add a note into the log that we are doing that so that people can
> complain? Something like the following
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index fa0b3ca94dfb..7f3495415719 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1104,7 +1104,6 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
>  err_sighand:
>  	unlock_task_sighand(task, &flags);
>  err_put_task:
> -	put_task_struct(task);
>  
>  	if (mm) {
>  		struct task_struct *p;
> @@ -1113,6 +1112,10 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
>  		for_each_process(p) {
>  			task_lock(p);
>  			if (!p->vfork_done && process_shares_mm(p, mm)) {
> +				pr_info("updating oom_score_adj for %d (%s) from %d to %d because it shares mm with %d (%s). Report if this is unexpected.\n",
> +						task_pid_nr(p), p->comm,
> +						p->signal->oom_score_adj, oom_adj,
> +						task_pid_nr(task), task->comm);

IMO this could be acceptable from userspace pov, but I don't very much
like how vfork is special-cased here and in oom killer code.

>  				p->signal->oom_score_adj = oom_adj;
>  				if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
>  					p->signal->oom_score_adj_min = (short)oom_adj;
> @@ -1122,6 +1125,7 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
>  		rcu_read_unlock();
>  		mmdrop(mm);
>  	}
> +	put_task_struct(task);
>  out:
>  	mutex_unlock(&oom_adj_mutex);
>  	return err;
> 
> > > The only exception was OOM_SCORE_ADJ_MIN
> > > where we really didn't kill the task but all other values are simply
> > > ignored in practice.
> > > 
> > > > Maybe we'd better add mm->oom_score_adj and set it to the min
> > > > signal->oom_score_adj over all processes sharing it? This would
> > > > require iterating over all processes every time oom_score_adj gets
> > > > updated, but that's a slow path.
> > > 
> > > Not sure I understand. So you would prefer that mm->oom_score_adj might
> > > disagree with p->signal->oom_score_adj?
> > 
> > No, I wouldn't. I'd rather agree that oom_score_adj should be per mm,
> > because we choose the victim basing solely on mm stats.
> > 
> > What I mean is we don't touch p->signal->oom_score_adj of other tasks
> > sharing mm, but instead store minimal oom_score_adj over all tasks
> > sharing mm in the mm_struct whenever a task's oom_score_adj is modified.
> > And use mm->oom_score_adj instead of signal->oom_score_adj in oom killer
> > code. This would save us from any accusations of user API modifications
> > and it would also make the oom code a bit easier to follow IMHO.
> 
> I understand your point but this is essentially lying because we
> consider a different value than the user can observe in userspace.
> Consider somebody doing insanity like
> 
> current->oom_score_adj = OOM_SCORE_ADJ_MIN
> p = clone(CLONE_VM)
> p->oom_score_adj = OOM_SCORE_ADJ_MAX
> 
> so one process would want to be always selected while the other one
> doesn't want to get killed. All they can see is that everything is
> put in place until the oom killer comes over and ignores that.

If we stored minimal oom_score_adj in mm struct, oom killer wouldn't
kill any of these processes, and it looks fine to me as long as we want
oom killer to be mm based, not task or signal_struct based.

Come to think of it, it'd be difficult to keep mm->oom_score_adj in sync
with p->signal->oom_score_adj, because we would need to update
mm->oom_score_adj not only on /proc write, but also on fork. May be, we
could keep all signal_structs sharing mm linked in per mm list so that
we could quickly update mm->oom_score_adj on fork? That way we wouldn't
need to special case vfork.
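
E.g. something like this (a pure sketch, all of the fields are made up
and the locking is ignored):

/* hypothetical: signal_structs sharing one mm hang off mm->signal_list */
static void mm_link_signal(struct mm_struct *mm, struct signal_struct *sig)
{
	list_add_tail(&sig->mm_node, &mm->signal_list);
	if (sig->oom_score_adj < mm->oom_score_adj)
		mm->oom_score_adj = sig->oom_score_adj;
}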

> 
> I think we should just be explicit. Maybe we want to treat
> OOM_SCORE_ADJ_MIN specially - e.g. do not even try to set oom_score_adj if
> one of the sharing tasks is oom disabled. But I would rather wait for
> somebody to complain and explain why the usecase really makes sense than
> be all silent with implicit behavior.
> 
> Btw. we already had a per mm oom_adj but we had to revert it due
> to vfork behavior. See 0753ba01e126 ("mm: revert "oom: move oom_adj
> value""). This patch gets us back there except it handles the vfork issue.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/6] mm, oom: skip over vforked tasks
  2016-05-30  9:52       ` Michal Hocko
@ 2016-05-30 10:40         ` Vladimir Davydov
  2016-05-30 10:53           ` Michal Hocko
  0 siblings, 1 reply; 36+ messages in thread
From: Vladimir Davydov @ 2016-05-30 10:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Oleg Nesterov,
	Andrew Morton, LKML

On Mon, May 30, 2016 at 11:52:12AM +0200, Michal Hocko wrote:
> On Mon 30-05-16 09:13:57, Michal Hocko wrote:
> > On Fri 27-05-16 19:48:30, Vladimir Davydov wrote:
> > > On Thu, May 26, 2016 at 02:40:13PM +0200, Michal Hocko wrote:
> > [...]
> > > > @@ -839,6 +841,13 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> > > >  	for_each_process(p) {
> > > >  		if (!process_shares_mm(p, mm))
> > > >  			continue;
> > > > +		/*
> > > > +		 * vforked tasks are ignored because they will drop the mm soon
> > > > +		 * hopefully and even if not they will not mind being oom
> > > > +		 * reaped because they cannot touch any memory.
> > > 
> > > They shouldn't modify memory, but they still can touch it AFAIK.
> > 
> > You are right. This means that the vforked child might see zero pages.
> > Let me think whether this is acceptable or not.
> 
> OK, I was thinking about it some more and I think you have a good point
> here. I can see two options here:
> - keep vforked task alive and skip the oom reaper. If the victim exits
>   normally and the oom wouldn't get resolved the vforked task will be
>   selected in the next round because the victim would clean up
>   vfork_done state in  wait_for_vfork_done. We are still risking that
>   the victim gets stuck though
> - kill vforked task and so it would be reapable.

IMHO it all depends on what we're trying to achieve. If we want per task
oom, which could make some sense since a task can consume a lot of mem
via e.g. pipe buffers, we would go with option #1. However, it's rather
difficult to find out how much of kmem a task consumes w/o using kmemcg,
so IMHO per-mm approach makes more sense in general. In this case I
think we should kill both vforked task and its parent if their mm was
selected provided their oom_score_adj allows that.

> 
> The latter sounds more robust to me because we invoke the oom_reaper and
> the side effect shouldn't be really a problem because the vforked task
> couldn't have done a lot of useful work anyway. So I will drop this
> patch and update "mm, oom: fortify task_will_free_mem" to skip the
> vfork check as well.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/6] mm, oom: skip over vforked tasks
  2016-05-30 10:40         ` Vladimir Davydov
@ 2016-05-30 10:53           ` Michal Hocko
  0 siblings, 0 replies; 36+ messages in thread
From: Michal Hocko @ 2016-05-30 10:53 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Oleg Nesterov,
	Andrew Morton, LKML

On Mon 30-05-16 13:40:17, Vladimir Davydov wrote:
> On Mon, May 30, 2016 at 11:52:12AM +0200, Michal Hocko wrote:
> > On Mon 30-05-16 09:13:57, Michal Hocko wrote:
> > > On Fri 27-05-16 19:48:30, Vladimir Davydov wrote:
> > > > On Thu, May 26, 2016 at 02:40:13PM +0200, Michal Hocko wrote:
> > > [...]
> > > > > @@ -839,6 +841,13 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> > > > >  	for_each_process(p) {
> > > > >  		if (!process_shares_mm(p, mm))
> > > > >  			continue;
> > > > > +		/*
> > > > > +		 * vforked tasks are ignored because they will drop the mm soon
> > > > > +		 * hopefully and even if not they will not mind being oom
> > > > > +		 * reaped because they cannot touch any memory.
> > > > 
> > > > They shouldn't modify memory, but they still can touch it AFAIK.
> > > 
> > > You are right. This means that the vforked child might see zero pages.
> > > Let me think whether this is acceptable or not.
> > 
> > OK, I was thinking about it some more and I think you have a good point
> > here. I can see two options here:
> > - keep vforked task alive and skip the oom reaper. If the victim exits
> >   normally and the oom wouldn't get resolved the vforked task will be
> >   selected in the next round because the victim would clean up
> >   vfork_done state in  wait_for_vfork_done. We are still risking that
> >   the victim gets stuck though
> > - kill vforked task and so it would be reapable.
> 
> IMHO it all depends on what we're trying to achieve. If we want per task
> oom, which could make some sense since a task can consume a lot of mem
> via e.g. pipe buffers, we would go with option #1. However, it's rather
> difficult to find out how much of kmem a task consumes w/o using kmemcg,
> so IMHO per-mm approach makes more sense in general. In this case I
> think we should kill both vforked task and its parent if their mm was
> selected provided their oom_score_adj allows that.

Yes, agreed. Going with per-mm is the safer behavior because the vast
majority of the consumed memory should be per mm, not per task_struct.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/6] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
  2016-05-30 10:26             ` Vladimir Davydov
@ 2016-05-30 11:11               ` Michal Hocko
  2016-05-30 12:19                 ` Vladimir Davydov
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2016-05-30 11:11 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Oleg Nesterov,
	Andrew Morton, LKML

On Mon 30-05-16 13:26:44, Vladimir Davydov wrote:
> On Mon, May 30, 2016 at 11:39:50AM +0200, Michal Hocko wrote:
[...]
> > Yes and that leads me to a suspicion that we can do that. Maybe I should
> > just add a note into the log that we are doing that so that people can
> > complain? Something like the following
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index fa0b3ca94dfb..7f3495415719 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -1104,7 +1104,6 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
> >  err_sighand:
> >  	unlock_task_sighand(task, &flags);
> >  err_put_task:
> > -	put_task_struct(task);
> >  
> >  	if (mm) {
> >  		struct task_struct *p;
> > @@ -1113,6 +1112,10 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
> >  		for_each_process(p) {
> >  			task_lock(p);
> >  			if (!p->vfork_done && process_shares_mm(p, mm)) {
> > +				pr_info("updating oom_score_adj for %d (%s) from %d to %d because it shares mm with %d (%s). Report if this is unexpected.\n",
> > +						task_pid_nr(p), p->comm,
> > +						p->signal->oom_score_adj, oom_adj,
> > +						task_pid_nr(task), task->comm);
> 
> IMO this could be acceptable from userspace pov, but I don't very much
> like how vfork is special-cased here and in oom killer code.

Well, the vfork has to be special cased here. We definitely have to
support
	vfork()
	set_oom_score_adj()
	exec()

use case. And I do not see another way without adding something to the
clone hot paths, which does not sound justifiable considering we are
talking about a really rare usecase that basically nobody cares about.
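
Just to spell the pattern out (a minimal userspace sketch, the exec'ed
binary is made up; strictly speaking POSIX only guarantees exec/_exit
after vfork but this is what programs do in practice):

#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

static void spawn_with_oom_adj(const char *adj)
{
	pid_t pid = vfork();

	if (pid == 0) {
		/* the child shares the parent's mm until execl() below */
		int fd = open("/proc/self/oom_score_adj", O_WRONLY);

		if (fd >= 0) {
			write(fd, adj, strlen(adj));
			close(fd);
		}
		execl("/usr/bin/some-daemon", "some-daemon", (char *)NULL);
		_exit(127);	/* only reached when exec fails */
	}
}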
 
[...]
> > so one process would want to be always selected while the other one
> > doesn't want to get killed. All they can see is that everything is
> > put in place until the oom killer comes over and ignores that.
> 
> If we stored minimal oom_score_adj in mm struct, oom killer wouldn't
> kill any of these processes, and it looks fine to me as long as we want
> oom killer to be mm based, not task or signal_struct based.
> 
> Come to think of it, it'd be difficult to keep mm->oom_score_adj in sync
> with p->signal->oom_score_adj, because we would need to update
> mm->oom_score_adj not only on /proc write, but also on fork. May be, we
> could keep all signal_structs sharing mm linked in per mm list so that
> we could quickly update mm->oom_score_adj on fork? That way we wouldn't
> need to special case vfork.

Yes the current approach is slightly racy but I do not see why that would
matter all that much. What you are suggesting might work but I am not
really sure we want to complicate the whole thing now. Sure if we see
that those races are real we can try to find a better solution, but I
would like to start as easy as possible and placing all the logic into
the oom_score_adj proc handler sounds like a good spot to me.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/6] mm, oom: skip over vforked tasks
  2016-05-26 12:40 ` [PATCH 4/6] mm, oom: skip over vforked tasks Michal Hocko
  2016-05-27 16:48   ` Vladimir Davydov
@ 2016-05-30 12:03   ` Michal Hocko
  1 sibling, 0 replies; 36+ messages in thread
From: Michal Hocko @ 2016-05-30 12:03 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML

So I've ended up with a replacement for this patch which does the
following:
---
From c40900923c78b51215794cc445d3f5a589b8f785 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Mon, 30 May 2016 13:53:28 +0200
Subject: [PATCH] mm, oom: skip vforked tasks from being selected

vforked tasks are not really sitting on any memory. They are sharing the
mm with the parent until they exec into new code. Until then they are
just pinning the address space. The OOM killer will kill the vforked
task along with its parent but we can still end up selecting a vforked
task when the parent wouldn't be selected. E.g. init doing vfork to
launch a task, or a vforked task being a child of an oom unkillable task
with an oom_score_adj updated to make it killable.

Make sure to not select vforked task as an oom victim by checking
vfork_done in oom_badness.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7ba6bdf9ae94..36c821403d0f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -176,11 +176,13 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 
 	/*
 	 * Do not even consider tasks which are explicitly marked oom
-	 * unkillable or have been already oom reaped.
+	 * unkillable or have been already oom reaped or they are in
+	 * the middle of vfork
 	 */
 	adj = (long)p->signal->oom_score_adj;
 	if (adj == OOM_SCORE_ADJ_MIN ||
-			test_bit(MMF_OOM_REAPED, &p->mm->flags)) {
+			test_bit(MMF_OOM_REAPED, &p->mm->flags) ||
+			p->vfork_done) {
 		task_unlock(p);
 		return 0;
 	}
-- 
2.8.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/6] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
  2016-05-30 11:11               ` Michal Hocko
@ 2016-05-30 12:19                 ` Vladimir Davydov
  2016-05-30 12:28                   ` Michal Hocko
  0 siblings, 1 reply; 36+ messages in thread
From: Vladimir Davydov @ 2016-05-30 12:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Oleg Nesterov,
	Andrew Morton, LKML

On Mon, May 30, 2016 at 01:11:48PM +0200, Michal Hocko wrote:
> On Mon 30-05-16 13:26:44, Vladimir Davydov wrote:
> > On Mon, May 30, 2016 at 11:39:50AM +0200, Michal Hocko wrote:
> [...]
> > > Yes and that leads me to a suspicion that we can do that. Maybe I should
> > > just add a note into the log that we are doing that so that people can
> > > complain? Something like the following
> > > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > > index fa0b3ca94dfb..7f3495415719 100644
> > > --- a/fs/proc/base.c
> > > +++ b/fs/proc/base.c
> > > @@ -1104,7 +1104,6 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
> > >  err_sighand:
> > >  	unlock_task_sighand(task, &flags);
> > >  err_put_task:
> > > -	put_task_struct(task);
> > >  
> > >  	if (mm) {
> > >  		struct task_struct *p;
> > > @@ -1113,6 +1112,10 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
> > >  		for_each_process(p) {
> > >  			task_lock(p);
> > >  			if (!p->vfork_done && process_shares_mm(p, mm)) {
> > > +				pr_info("updating oom_score_adj for %d (%s) from %d to %d because it shares mm with %d (%s). Report if this is unexpected.\n",
> > > +						task_pid_nr(p), p->comm,
> > > +						p->signal->oom_score_adj, oom_adj,
> > > +						task_pid_nr(task), task->comm);
> > 
> > IMO this could be acceptable from userspace pov, but I don't very much
> > like how vfork is special-cased here and in oom killer code.
> 
> Well, the vfork has to be special cased here. We definitely have to
> support
> 	vfork()
> 	set_oom_score_adj()
> 	exec()
> 
> use case. And I do not see another way without adding something to the
> clone hot paths, which does not sound justifiable considering we are
> talking about a really rare usecase that basically nobody cares about.

I don't think that vfork->exec use case is rare. Quite the contrary, I'm
pretty sure it's used often, because in contrast to fork->exec it avoids
copying page tables, which can be very expensive for fat processes.

Frankly, I don't understand why you are so determined not to add
anything to the fork path. Of course, if the overhead were that
dramatic, we would have to forget the idea, but if it were say <= 0.1 %
for a contrived test that calls fork in a loop, IMHO the modification
would be justified.
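
For example, a dumb userspace loop like the following would do for a
first estimate (a throwaway sketch):

#include <stdio.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	enum { ITERS = 100000 };
	struct timespec a, b;
	int i;

	clock_gettime(CLOCK_MONOTONIC, &a);
	for (i = 0; i < ITERS; i++) {
		pid_t pid = fork();

		if (pid == 0)
			_exit(0);	/* the child exits immediately */
		waitpid(pid, NULL, 0);
	}
	clock_gettime(CLOCK_MONOTONIC, &b);

	printf("%.2f us per fork+exit\n",
	       ((b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec)) /
	       (1e3 * ITERS));
	return 0;
}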

>  
> [...]
> > > so one process would want to be always selected while the other one
> > > doesn't want to get killed. All they can see is that everything is
> > > put in place until the oom killer comes over and ignores that.
> > 
> > If we stored minimal oom_score_adj in mm struct, oom killer wouldn't
> > kill any of these processes, and it looks fine to me as long as we want
> > oom killer to be mm based, not task or signal_struct based.
> > 
> > Come to think of it, it'd be difficult to keep mm->oom_score_adj in sync
> > with p->signal->oom_score_adj, because we would need to update
> > mm->oom_score_adj not only on /proc write, but also on fork. May be, we
> > could keep all signal_structs sharing mm linked in per mm list so that
> > we could quickly update mm->oom_score_adj on fork? That way we wouldn't
> > need to special case vfork.
> 
> Yes the current approach is slightly racy but I do not see why that would
> matter all that much. What you are suggesting might work but I am not
> really sure we want to complicate the whole thing now. Sure if we see
> that those races are real we can try to find a better solution, but I
> would like to start as easy as possible and placing all the logic into
> the oom_score_adj proc handler sounds like a good spot to me.

IMHO with all those p->vfork_done and p->signal->oom_score_adj
checks the oom code is becoming more difficult to understand. Linking
signal_structs into a per mm list would probably require more code, but
IMHO it would be easier to follow.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/6] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
  2016-05-30 12:19                 ` Vladimir Davydov
@ 2016-05-30 12:28                   ` Michal Hocko
  0 siblings, 0 replies; 36+ messages in thread
From: Michal Hocko @ 2016-05-30 12:28 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Oleg Nesterov,
	Andrew Morton, LKML

On Mon 30-05-16 15:19:32, Vladimir Davydov wrote:
> On Mon, May 30, 2016 at 01:11:48PM +0200, Michal Hocko wrote:
> > On Mon 30-05-16 13:26:44, Vladimir Davydov wrote:
> > > On Mon, May 30, 2016 at 11:39:50AM +0200, Michal Hocko wrote:
> > [...]
> > > > Yes and that leads me to a suspicion that we can do that. Maybe I should
> > > > just add a note into the log that we are doing that so that people can
> > > > complain? Something like the following
> > > > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > > > index fa0b3ca94dfb..7f3495415719 100644
> > > > --- a/fs/proc/base.c
> > > > +++ b/fs/proc/base.c
> > > > @@ -1104,7 +1104,6 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
> > > >  err_sighand:
> > > >  	unlock_task_sighand(task, &flags);
> > > >  err_put_task:
> > > > -	put_task_struct(task);
> > > >  
> > > >  	if (mm) {
> > > >  		struct task_struct *p;
> > > > @@ -1113,6 +1112,10 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
> > > >  		for_each_process(p) {
> > > >  			task_lock(p);
> > > >  			if (!p->vfork_done && process_shares_mm(p, mm)) {
> > > > +				pr_info("updating oom_score_adj for %d (%s) from %d to %d because it shares mm with %d (%s). Report if this is unexpected.\n",
> > > > +						task_pid_nr(p), p->comm,
> > > > +						p->signal->oom_score_adj, oom_adj,
> > > > +						task_pid_nr(task), task->comm);
> > > 
> > > IMO this could be acceptable from userspace pov, but I don't very much
> > > like how vfork is special-cased here and in oom killer code.
> > 
> > Well, the vfork has to be special cased here. We definitely have to
> > support
> > 	vfork()
> > 	set_oom_score_adj()
> > 	exec()
> > 
> > use case. And I do not see another way without adding something to the
> > clone hot paths, which does not seem justifiable considering we are
> > talking about a really rare use case that basically nobody cares about.
> 
> I don't think that vfork->exec use case is rare. Quite the contrary, I'm
> pretty sure it's used often, because in contrast to fork->exec it avoids
> copying page tables, which can be very expensive for fat processes.

Ohh, yes, the way I put it was ambiguous. What I wanted to say is that
the oom event itself is really unlikely, so it doesn't justify hot path
changes.

> Frankly, I don't understand why you are so determined not to add
> anything to the fork path.

It is not just the fork path. It would require touching the exit path
as well, and all that code is quite complex already. I would prefer the
oom related complexity to stay in the oom code proper.

> Of course, if the overhead were that
> dramatic, we would have to forget the idea, but if it were say <= 0.1 %
> for a contrived test that calls fork in a loop, IMHO the modification
> would be justified.

But why, when the proc handler and the oom_kill_process paths can
handle most cases, and the occasional races should be tolerable AFAICS?
-- 
Michal Hocko
SUSE Labs


end of thread

Thread overview: 36+ messages
2016-05-26 12:40 [PATCH 0/5] Handle oom bypass more gracefully Michal Hocko
2016-05-26 12:40 ` [PATCH 1/6] mm, oom: do not loop over all tasks if there are no external tasks sharing mm Michal Hocko
2016-05-26 14:30   ` Tetsuo Handa
2016-05-26 14:59     ` Michal Hocko
2016-05-26 15:25       ` [PATCH 1/6] mm, oom: do not loop over all tasks if there are noexternal " Tetsuo Handa
2016-05-26 15:35         ` Michal Hocko
2016-05-26 16:14           ` [PATCH 1/6] mm, oom: do not loop over all tasks if there are no external " Tetsuo Handa
2016-05-27  6:45             ` Michal Hocko
2016-05-27  7:15               ` Michal Hocko
2016-05-27  8:03                 ` Michal Hocko
2016-05-26 12:40 ` [PATCH 2/6] proc, oom_adj: extract oom_score_adj setting into a helper Michal Hocko
2016-05-26 12:40 ` [PATCH 3/6] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj Michal Hocko
2016-05-27 11:18   ` Michal Hocko
2016-05-27 16:18     ` Vladimir Davydov
2016-05-30  7:07       ` Michal Hocko
2016-05-30  8:47         ` Vladimir Davydov
2016-05-30  9:39           ` Michal Hocko
2016-05-30 10:26             ` Vladimir Davydov
2016-05-30 11:11               ` Michal Hocko
2016-05-30 12:19                 ` Vladimir Davydov
2016-05-30 12:28                   ` Michal Hocko
2016-05-26 12:40 ` [PATCH 4/6] mm, oom: skip over vforked tasks Michal Hocko
2016-05-27 16:48   ` Vladimir Davydov
2016-05-30  7:13     ` Michal Hocko
2016-05-30  9:52       ` Michal Hocko
2016-05-30 10:40         ` Vladimir Davydov
2016-05-30 10:53           ` Michal Hocko
2016-05-30 12:03   ` Michal Hocko
2016-05-26 12:40 ` [PATCH 5/6] mm, oom: kill all tasks sharing the mm Michal Hocko
2016-05-26 12:40 ` [PATCH 6/6] mm, oom: fortify task_will_free_mem Michal Hocko
2016-05-26 14:11   ` Tetsuo Handa
2016-05-26 14:23     ` Michal Hocko
2016-05-26 14:41       ` Tetsuo Handa
2016-05-26 14:56         ` Michal Hocko
2016-05-27 11:07   ` Michal Hocko
2016-05-27 16:00 ` [PATCH 0/5] Handle oom bypass more gracefully Michal Hocko
