All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/6 -v2] Handle oom bypass more gracefully
@ 2016-05-30 13:05 ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-05-30 13:05 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML

Hi,
based on the feedback from Tetsuo and Vladimir (thanks to you both) I
had to change some of my assumptions and rework some patches. I planned
to resend later this week but I guess it would help to argue about the
code after those changes if I resubmit earlier. The previous version was
posted here http://lkml.kernel.org/r/1464266415-15558-1-git-send-email-mhocko@kernel.org

The following 6 patches should put some order to very rare cases of
mm shared between processes and make the paths which bypass the oom
killer oom reapable and so much more reliable finally. Even though mm
shared outside of threadgroup is rare (either vforked tasks for a
short period, use_mm by kernel threads or exotic thread model of
clone(CLONE_VM) without CLONE_THREAD resp. CLONE_SIGHAND). Not only it
makes the current oom killer logic quite hard to follow and evaluate it
can lead to weird corner cases. E.g. it is possible to select an oom
victim which shares the mm with unkillable process or bypass the oom
killer even when other processes sharing the mm are still alive and
other weird cases.

Patch 1 drops a bogus task_lock and mm check from oom_adj_write. This
can be considered a bug fix with a low impact as nobody has noticed
for years.

Patch 2 is a clean up of oom_score_adj handling and a preparatory
work. Patch 3 enforces oom_adj_score to be consistent between processes
sharing the mm to behave consistently with the regular thread
groups. This can be considered a user visible behavior change because
one thread group oom_score_adj update will affect others which share
the same mm via clone(CLONE_VM). I argue that this should be acceptable
because we already have the same behavior for threads in the same thread
group and sharing the mm without signal struct is just a different model
of threading. This is probably the most controversial part of the series,
I would like to find some consensus here though. There were some
suggestions to hook some counter/oom_score_adj into the mm_struct
but I feel that this is not necessary right now and we can rely on
proc handler + oom_kill_process to DTRT. I can be convinced otherwise
but I strongly think that whatever we do the userspace has to have
a way to see the current oom priority as consistently as possible.

Patch 4 makes sure that no vforked task is selected if it is sharing
the mm with oom unkillable task.

Patch 5 ensures that all tasks sharing the mm are killed which in turn
makes sure that all oom victims are oom reapable.

Patch 6 guarantees that task_will_free_mem will always imply reapable
bypass of the oom killer.

The patchset is based on the current mmotm tree (mmotm-2016-05-27-15-19).
I would really appreciate a deep review as this area is full of land
mines but I hope I've made the code much cleaner with less kludges.

I am CCing Oleg (sorry I know you hate this code) but I would feel much
better if you double checked my assumptions about locking and vfork
behavior.

Michal Hocko (6):
      proc, oom: drop bogus task_lock and mm check
      proc, oom_adj: extract oom_score_adj setting into a helper
      mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
      mm, oom: skip vforked tasks from being selected
      mm, oom: kill all tasks sharing the mm
      mm, oom: fortify task_will_free_mem

 fs/proc/base.c      | 172 ++++++++++++++++++++++++++++++----------------------
 include/linux/mm.h  |   2 +
 include/linux/oom.h |  63 +++++++++++++++++--
 mm/memcontrol.c     |   4 +-
 mm/oom_kill.c       |  82 +++++--------------------
 5 files changed, 176 insertions(+), 147 deletions(-)

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH 0/6 -v2] Handle oom bypass more gracefully
@ 2016-05-30 13:05 ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-05-30 13:05 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML

Hi,
based on the feedback from Tetsuo and Vladimir (thanks to you both) I
had to change some of my assumptions and rework some patches. I planned
to resend later this week but I guess it would help to argue about the
code after those changes if I resubmit earlier. The previous version was
posted here http://lkml.kernel.org/r/1464266415-15558-1-git-send-email-mhocko@kernel.org

The following 6 patches should put some order to very rare cases of
mm shared between processes and make the paths which bypass the oom
killer oom reapable and so much more reliable finally. Even though mm
shared outside of threadgroup is rare (either vforked tasks for a
short period, use_mm by kernel threads or exotic thread model of
clone(CLONE_VM) without CLONE_THREAD resp. CLONE_SIGHAND). Not only it
makes the current oom killer logic quite hard to follow and evaluate it
can lead to weird corner cases. E.g. it is possible to select an oom
victim which shares the mm with unkillable process or bypass the oom
killer even when other processes sharing the mm are still alive and
other weird cases.

Patch 1 drops a bogus task_lock and mm check from oom_adj_write. This
can be considered a bug fix with a low impact as nobody has noticed
for years.

Patch 2 is a clean up of oom_score_adj handling and a preparatory
work. Patch 3 enforces oom_adj_score to be consistent between processes
sharing the mm to behave consistently with the regular thread
groups. This can be considered a user visible behavior change because
one thread group oom_score_adj update will affect others which share
the same mm via clone(CLONE_VM). I argue that this should be acceptable
because we already have the same behavior for threads in the same thread
group and sharing the mm without signal struct is just a different model
of threading. This is probably the most controversial part of the series,
I would like to find some consensus here though. There were some
suggestions to hook some counter/oom_score_adj into the mm_struct
but I feel that this is not necessary right now and we can rely on
proc handler + oom_kill_process to DTRT. I can be convinced otherwise
but I strongly think that whatever we do the userspace has to have
a way to see the current oom priority as consistently as possible.

Patch 4 makes sure that no vforked task is selected if it is sharing
the mm with oom unkillable task.

Patch 5 ensures that all tasks sharing the mm are killed which in turn
makes sure that all oom victims are oom reapable.

Patch 6 guarantees that task_will_free_mem will always imply reapable
bypass of the oom killer.

The patchset is based on the current mmotm tree (mmotm-2016-05-27-15-19).
I would really appreciate a deep review as this area is full of land
mines but I hope I've made the code much cleaner with less kludges.

I am CCing Oleg (sorry I know you hate this code) but I would feel much
better if you double checked my assumptions about locking and vfork
behavior.

Michal Hocko (6):
      proc, oom: drop bogus task_lock and mm check
      proc, oom_adj: extract oom_score_adj setting into a helper
      mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
      mm, oom: skip vforked tasks from being selected
      mm, oom: kill all tasks sharing the mm
      mm, oom: fortify task_will_free_mem

 fs/proc/base.c      | 172 ++++++++++++++++++++++++++++++----------------------
 include/linux/mm.h  |   2 +
 include/linux/oom.h |  63 +++++++++++++++++--
 mm/memcontrol.c     |   4 +-
 mm/oom_kill.c       |  82 +++++--------------------
 5 files changed, 176 insertions(+), 147 deletions(-)


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH 1/6] proc, oom: drop bogus task_lock and mm check
  2016-05-30 13:05 ` Michal Hocko
@ 2016-05-30 13:05   ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-05-30 13:05 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

both oom_adj_write and oom_score_adj_write are using task_lock,
check for task->mm and fail if it is NULL. This is not needed because
the oom_score_adj is per signal struct so we do not need mm at all.
The code has been introduced by 3d5992d2ac7d ("oom: add per-mm oom
disable count") but we do not do per-mm oom disable since c9f01245b6a7
("oom: remove oom_disable_count").

The task->mm check is even not correct because the current thread might
have exited but the thread group might be still alive - e.g. thread
group leader would lead that echo $VAL > /proc/pid/oom_score_adj would
always fail with EINVAL while /proc/pid/task/$other_tid/oom_score_adj
would succeed. This is unexpected at best.

Remove the lock along with the check to fix the unexpected behavior
and also because there is not real need for the lock in the first place.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c | 22 ++++------------------
 1 file changed, 4 insertions(+), 18 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index be73f4d0cb01..a6014e45c516 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1083,15 +1083,9 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	task_lock(task);
-	if (!task->mm) {
-		err = -EINVAL;
-		goto err_task_lock;
-	}
-
 	if (!lock_task_sighand(task, &flags)) {
 		err = -ESRCH;
-		goto err_task_lock;
+		goto err_put_task;
 	}
 
 	/*
@@ -1121,8 +1115,7 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 	trace_oom_score_adj_update(task);
 err_sighand:
 	unlock_task_sighand(task, &flags);
-err_task_lock:
-	task_unlock(task);
+err_put_task:
 	put_task_struct(task);
 out:
 	return err < 0 ? err : count;
@@ -1186,15 +1179,9 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	task_lock(task);
-	if (!task->mm) {
-		err = -EINVAL;
-		goto err_task_lock;
-	}
-
 	if (!lock_task_sighand(task, &flags)) {
 		err = -ESRCH;
-		goto err_task_lock;
+		goto err_put_task;
 	}
 
 	if ((short)oom_score_adj < task->signal->oom_score_adj_min &&
@@ -1210,8 +1197,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 
 err_sighand:
 	unlock_task_sighand(task, &flags);
-err_task_lock:
-	task_unlock(task);
+err_put_task:
 	put_task_struct(task);
 out:
 	return err < 0 ? err : count;
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 1/6] proc, oom: drop bogus task_lock and mm check
@ 2016-05-30 13:05   ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-05-30 13:05 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

both oom_adj_write and oom_score_adj_write are using task_lock,
check for task->mm and fail if it is NULL. This is not needed because
the oom_score_adj is per signal struct so we do not need mm at all.
The code has been introduced by 3d5992d2ac7d ("oom: add per-mm oom
disable count") but we do not do per-mm oom disable since c9f01245b6a7
("oom: remove oom_disable_count").

The task->mm check is even not correct because the current thread might
have exited but the thread group might be still alive - e.g. thread
group leader would lead that echo $VAL > /proc/pid/oom_score_adj would
always fail with EINVAL while /proc/pid/task/$other_tid/oom_score_adj
would succeed. This is unexpected at best.

Remove the lock along with the check to fix the unexpected behavior
and also because there is not real need for the lock in the first place.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c | 22 ++++------------------
 1 file changed, 4 insertions(+), 18 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index be73f4d0cb01..a6014e45c516 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1083,15 +1083,9 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	task_lock(task);
-	if (!task->mm) {
-		err = -EINVAL;
-		goto err_task_lock;
-	}
-
 	if (!lock_task_sighand(task, &flags)) {
 		err = -ESRCH;
-		goto err_task_lock;
+		goto err_put_task;
 	}
 
 	/*
@@ -1121,8 +1115,7 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 	trace_oom_score_adj_update(task);
 err_sighand:
 	unlock_task_sighand(task, &flags);
-err_task_lock:
-	task_unlock(task);
+err_put_task:
 	put_task_struct(task);
 out:
 	return err < 0 ? err : count;
@@ -1186,15 +1179,9 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	task_lock(task);
-	if (!task->mm) {
-		err = -EINVAL;
-		goto err_task_lock;
-	}
-
 	if (!lock_task_sighand(task, &flags)) {
 		err = -ESRCH;
-		goto err_task_lock;
+		goto err_put_task;
 	}
 
 	if ((short)oom_score_adj < task->signal->oom_score_adj_min &&
@@ -1210,8 +1197,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 
 err_sighand:
 	unlock_task_sighand(task, &flags);
-err_task_lock:
-	task_unlock(task);
+err_put_task:
 	put_task_struct(task);
 out:
 	return err < 0 ? err : count;
-- 
2.8.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 2/6] proc, oom_adj: extract oom_score_adj setting into a helper
  2016-05-30 13:05 ` Michal Hocko
@ 2016-05-30 13:05   ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-05-30 13:05 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Currently we have two proc interfaces to set oom_score_adj. The legacy
/proc/<pid>/oom_adj and /proc/<pid>/oom_score_adj which both have their
specific handlers. Big part of the logic is duplicated so extract the
common code into __set_oom_adj helper. Legacy knob still expects some
details slightly different so make sure those are handled same way - e.g.
the legacy mode ignores oom_score_adj_min and it warns about the usage.

This patch shouldn't introduce any functional changes.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c | 112 +++++++++++++++++++++++++++------------------------------
 1 file changed, 52 insertions(+), 60 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index a6014e45c516..0afc77d4d84a 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1041,6 +1041,56 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
 	return simple_read_from_buffer(buf, count, ppos, buffer, len);
 }
 
+static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
+{
+	struct task_struct *task;
+	unsigned long flags;
+	int err = 0;
+
+	task = get_proc_task(file_inode(file));
+	if (!task) {
+		err = -ESRCH;
+		goto out;
+	}
+
+	if (!lock_task_sighand(task, &flags)) {
+		err = -ESRCH;
+		goto err_put_task;
+	}
+
+	if (legacy) {
+		if (oom_adj < task->signal->oom_score_adj &&
+				!capable(CAP_SYS_RESOURCE)) {
+			err = -EACCES;
+			goto err_sighand;
+		}
+		/*
+		 * /proc/pid/oom_adj is provided for legacy purposes, ask users to use
+		 * /proc/pid/oom_score_adj instead.
+		 */
+		pr_warn_once("%s (%d): /proc/%d/oom_adj is deprecated, please use /proc/%d/oom_score_adj instead.\n",
+			  current->comm, task_pid_nr(current), task_pid_nr(task),
+			  task_pid_nr(task));
+	} else {
+		if ((short)oom_adj < task->signal->oom_score_adj_min &&
+				!capable(CAP_SYS_RESOURCE)) {
+			err = -EACCES;
+			goto err_sighand;
+		}
+	}
+
+	task->signal->oom_score_adj = oom_adj;
+	if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
+		task->signal->oom_score_adj_min = (short)oom_adj;
+	trace_oom_score_adj_update(task);
+err_sighand:
+	unlock_task_sighand(task, &flags);
+err_put_task:
+	put_task_struct(task);
+out:
+	return err;
+}
+
 /*
  * /proc/pid/oom_adj exists solely for backwards compatibility with previous
  * kernels.  The effective policy is defined by oom_score_adj, which has a
@@ -1054,10 +1104,8 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
 static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 			     size_t count, loff_t *ppos)
 {
-	struct task_struct *task;
 	char buffer[PROC_NUMBUF];
 	int oom_adj;
-	unsigned long flags;
 	int err;
 
 	memset(buffer, 0, sizeof(buffer));
@@ -1077,17 +1125,6 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	task = get_proc_task(file_inode(file));
-	if (!task) {
-		err = -ESRCH;
-		goto out;
-	}
-
-	if (!lock_task_sighand(task, &flags)) {
-		err = -ESRCH;
-		goto err_put_task;
-	}
-
 	/*
 	 * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
 	 * value is always attainable.
@@ -1097,26 +1134,8 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 	else
 		oom_adj = (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;
 
-	if (oom_adj < task->signal->oom_score_adj &&
-	    !capable(CAP_SYS_RESOURCE)) {
-		err = -EACCES;
-		goto err_sighand;
-	}
+	err = __set_oom_adj(file, oom_adj, true);
 
-	/*
-	 * /proc/pid/oom_adj is provided for legacy purposes, ask users to use
-	 * /proc/pid/oom_score_adj instead.
-	 */
-	pr_warn_once("%s (%d): /proc/%d/oom_adj is deprecated, please use /proc/%d/oom_score_adj instead.\n",
-		  current->comm, task_pid_nr(current), task_pid_nr(task),
-		  task_pid_nr(task));
-
-	task->signal->oom_score_adj = oom_adj;
-	trace_oom_score_adj_update(task);
-err_sighand:
-	unlock_task_sighand(task, &flags);
-err_put_task:
-	put_task_struct(task);
 out:
 	return err < 0 ? err : count;
 }
@@ -1150,9 +1169,7 @@ static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
 static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 					size_t count, loff_t *ppos)
 {
-	struct task_struct *task;
 	char buffer[PROC_NUMBUF];
-	unsigned long flags;
 	int oom_score_adj;
 	int err;
 
@@ -1173,32 +1190,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	task = get_proc_task(file_inode(file));
-	if (!task) {
-		err = -ESRCH;
-		goto out;
-	}
-
-	if (!lock_task_sighand(task, &flags)) {
-		err = -ESRCH;
-		goto err_put_task;
-	}
-
-	if ((short)oom_score_adj < task->signal->oom_score_adj_min &&
-			!capable(CAP_SYS_RESOURCE)) {
-		err = -EACCES;
-		goto err_sighand;
-	}
-
-	task->signal->oom_score_adj = (short)oom_score_adj;
-	if (has_capability_noaudit(current, CAP_SYS_RESOURCE))
-		task->signal->oom_score_adj_min = (short)oom_score_adj;
-	trace_oom_score_adj_update(task);
-
-err_sighand:
-	unlock_task_sighand(task, &flags);
-err_put_task:
-	put_task_struct(task);
+	err = __set_oom_adj(file, oom_score_adj, false);
 out:
 	return err < 0 ? err : count;
 }
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 2/6] proc, oom_adj: extract oom_score_adj setting into a helper
@ 2016-05-30 13:05   ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-05-30 13:05 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Currently we have two proc interfaces to set oom_score_adj. The legacy
/proc/<pid>/oom_adj and /proc/<pid>/oom_score_adj which both have their
specific handlers. Big part of the logic is duplicated so extract the
common code into __set_oom_adj helper. Legacy knob still expects some
details slightly different so make sure those are handled same way - e.g.
the legacy mode ignores oom_score_adj_min and it warns about the usage.

This patch shouldn't introduce any functional changes.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c | 112 +++++++++++++++++++++++++++------------------------------
 1 file changed, 52 insertions(+), 60 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index a6014e45c516..0afc77d4d84a 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1041,6 +1041,56 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
 	return simple_read_from_buffer(buf, count, ppos, buffer, len);
 }
 
+static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
+{
+	struct task_struct *task;
+	unsigned long flags;
+	int err = 0;
+
+	task = get_proc_task(file_inode(file));
+	if (!task) {
+		err = -ESRCH;
+		goto out;
+	}
+
+	if (!lock_task_sighand(task, &flags)) {
+		err = -ESRCH;
+		goto err_put_task;
+	}
+
+	if (legacy) {
+		if (oom_adj < task->signal->oom_score_adj &&
+				!capable(CAP_SYS_RESOURCE)) {
+			err = -EACCES;
+			goto err_sighand;
+		}
+		/*
+		 * /proc/pid/oom_adj is provided for legacy purposes, ask users to use
+		 * /proc/pid/oom_score_adj instead.
+		 */
+		pr_warn_once("%s (%d): /proc/%d/oom_adj is deprecated, please use /proc/%d/oom_score_adj instead.\n",
+			  current->comm, task_pid_nr(current), task_pid_nr(task),
+			  task_pid_nr(task));
+	} else {
+		if ((short)oom_adj < task->signal->oom_score_adj_min &&
+				!capable(CAP_SYS_RESOURCE)) {
+			err = -EACCES;
+			goto err_sighand;
+		}
+	}
+
+	task->signal->oom_score_adj = oom_adj;
+	if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
+		task->signal->oom_score_adj_min = (short)oom_adj;
+	trace_oom_score_adj_update(task);
+err_sighand:
+	unlock_task_sighand(task, &flags);
+err_put_task:
+	put_task_struct(task);
+out:
+	return err;
+}
+
 /*
  * /proc/pid/oom_adj exists solely for backwards compatibility with previous
  * kernels.  The effective policy is defined by oom_score_adj, which has a
@@ -1054,10 +1104,8 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
 static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 			     size_t count, loff_t *ppos)
 {
-	struct task_struct *task;
 	char buffer[PROC_NUMBUF];
 	int oom_adj;
-	unsigned long flags;
 	int err;
 
 	memset(buffer, 0, sizeof(buffer));
@@ -1077,17 +1125,6 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	task = get_proc_task(file_inode(file));
-	if (!task) {
-		err = -ESRCH;
-		goto out;
-	}
-
-	if (!lock_task_sighand(task, &flags)) {
-		err = -ESRCH;
-		goto err_put_task;
-	}
-
 	/*
 	 * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
 	 * value is always attainable.
@@ -1097,26 +1134,8 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 	else
 		oom_adj = (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;
 
-	if (oom_adj < task->signal->oom_score_adj &&
-	    !capable(CAP_SYS_RESOURCE)) {
-		err = -EACCES;
-		goto err_sighand;
-	}
+	err = __set_oom_adj(file, oom_adj, true);
 
-	/*
-	 * /proc/pid/oom_adj is provided for legacy purposes, ask users to use
-	 * /proc/pid/oom_score_adj instead.
-	 */
-	pr_warn_once("%s (%d): /proc/%d/oom_adj is deprecated, please use /proc/%d/oom_score_adj instead.\n",
-		  current->comm, task_pid_nr(current), task_pid_nr(task),
-		  task_pid_nr(task));
-
-	task->signal->oom_score_adj = oom_adj;
-	trace_oom_score_adj_update(task);
-err_sighand:
-	unlock_task_sighand(task, &flags);
-err_put_task:
-	put_task_struct(task);
 out:
 	return err < 0 ? err : count;
 }
@@ -1150,9 +1169,7 @@ static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
 static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 					size_t count, loff_t *ppos)
 {
-	struct task_struct *task;
 	char buffer[PROC_NUMBUF];
-	unsigned long flags;
 	int oom_score_adj;
 	int err;
 
@@ -1173,32 +1190,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	task = get_proc_task(file_inode(file));
-	if (!task) {
-		err = -ESRCH;
-		goto out;
-	}
-
-	if (!lock_task_sighand(task, &flags)) {
-		err = -ESRCH;
-		goto err_put_task;
-	}
-
-	if ((short)oom_score_adj < task->signal->oom_score_adj_min &&
-			!capable(CAP_SYS_RESOURCE)) {
-		err = -EACCES;
-		goto err_sighand;
-	}
-
-	task->signal->oom_score_adj = (short)oom_score_adj;
-	if (has_capability_noaudit(current, CAP_SYS_RESOURCE))
-		task->signal->oom_score_adj_min = (short)oom_score_adj;
-	trace_oom_score_adj_update(task);
-
-err_sighand:
-	unlock_task_sighand(task, &flags);
-err_put_task:
-	put_task_struct(task);
+	err = __set_oom_adj(file, oom_score_adj, false);
 out:
 	return err < 0 ? err : count;
 }
-- 
2.8.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 3/6] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
  2016-05-30 13:05 ` Michal Hocko
@ 2016-05-30 13:05   ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-05-30 13:05 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

oom_score_adj is shared for the thread groups (via struct signal) but
this is not sufficient to cover processes sharing mm (CLONE_VM without
CLONE_THREAD resp. CLONE_SIGHAND) and so we can easily end up in a
situation when some processes update their oom_score_adj and confuse
the oom killer. In the worst case some of those processes might hide
from oom killer altogether via OOM_SCORE_ADJ_MIN while others are
eligible. OOM killer would then pick up those eligible but won't be
allowed to kill others sharing the same mm so the mm wouldn't release
the mm and so the memory.

It would be ideal to have the oom_score_adj per mm_struct because that
is the natural entity OOM killer considers. But this will not work
because some programs are doing
	vfork()
	set_oom_adj()
	exec()

We can achieve the same though. oom_score_adj write handler can set the
oom_score_adj for all processes sharing the same mm if the task is not
in the middle of vfork. As a result all the processes will share the
same oom_score_adj. The current implementation is rather pessimistic
and checks all the existing processes by default if there are more than
1 holder of the mm but we do not have any reliable way to check for
external users yet.

Note that we have to serialize all the oom_score_adj writers now to
guarantee they do not interleave and generate inconsistent results.

Changes since v1
- note that we are changing oom_score_adj outside of the thread group
  to the log
- skip over kernel threads

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c     | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm.h |  2 ++
 mm/oom_kill.c      |  2 +-
 3 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 0afc77d4d84a..3947a0ea5465 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1043,10 +1043,13 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
 
 static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 {
+	static DEFINE_MUTEX(oom_adj_mutex);
+	struct mm_struct *mm = NULL;
 	struct task_struct *task;
 	unsigned long flags;
 	int err = 0;
 
+	mutex_lock(&oom_adj_mutex);
 	task = get_proc_task(file_inode(file));
 	if (!task) {
 		err = -ESRCH;
@@ -1079,6 +1082,23 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 		}
 	}
 
+	/*
+	 * Make sure we will check other processes sharing the mm if this is
+	 * not vfrok which wants its own oom_score_adj.
+	 * pin the mm so it doesn't go away and get reused after task_unlock
+	 */
+	if (!task->vfork_done) {
+		struct task_struct *p = find_lock_task_mm(task);
+
+		if (p) {
+			if (atomic_read(&p->mm->mm_users) > 1) {
+				mm = p->mm;
+				atomic_inc(&mm->mm_count);
+			}
+			task_unlock(p);
+		}
+	}
+
 	task->signal->oom_score_adj = oom_adj;
 	if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
 		task->signal->oom_score_adj_min = (short)oom_adj;
@@ -1086,8 +1106,34 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 err_sighand:
 	unlock_task_sighand(task, &flags);
 err_put_task:
+
+	if (mm) {
+		struct task_struct *p;
+
+		rcu_read_lock();
+		for_each_process(p) {
+			/* do not touch kernel threads */
+			if (p->flags & PF_KTHREAD)
+				continue;
+
+			task_lock(p);
+			if (!p->vfork_done && process_shares_mm(p, mm)) {
+				pr_info("updating oom_score_adj for %d (%s) from %d to %d because it shares mm with %d (%s). Report if this is unexpected.\n",
+						task_pid_nr(p), p->comm,
+						p->signal->oom_score_adj, oom_adj,
+						task_pid_nr(task), task->comm);
+				p->signal->oom_score_adj = oom_adj;
+				if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
+					p->signal->oom_score_adj_min = (short)oom_adj;
+			}
+			task_unlock(p);
+		}
+		rcu_read_unlock();
+		mmdrop(mm);
+	}
 	put_task_struct(task);
 out:
+	mutex_unlock(&oom_adj_mutex);
 	return err;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 79e5129a3277..9bb4331f92f1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2248,6 +2248,8 @@ static inline int in_gate_area(struct mm_struct *mm, unsigned long addr)
 }
 #endif	/* __HAVE_ARCH_GATE_AREA */
 
+extern bool process_shares_mm(struct task_struct *p, struct mm_struct *mm);
+
 #ifdef CONFIG_SYSCTL
 extern int sysctl_drop_caches;
 int drop_caches_sysctl_handler(struct ctl_table *, int,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e01cc3e2e755..6bb322577fc6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -415,7 +415,7 @@ bool oom_killer_disabled __read_mostly;
  * task's threads: if one of those is using this mm then this task was also
  * using it.
  */
-static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
+bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 {
 	struct task_struct *t;
 
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 3/6] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
@ 2016-05-30 13:05   ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-05-30 13:05 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

oom_score_adj is shared for the thread groups (via struct signal) but
this is not sufficient to cover processes sharing mm (CLONE_VM without
CLONE_THREAD resp. CLONE_SIGHAND) and so we can easily end up in a
situation when some processes update their oom_score_adj and confuse
the oom killer. In the worst case some of those processes might hide
from oom killer altogether via OOM_SCORE_ADJ_MIN while others are
eligible. OOM killer would then pick up those eligible but won't be
allowed to kill others sharing the same mm so the mm wouldn't release
the mm and so the memory.

It would be ideal to have the oom_score_adj per mm_struct because that
is the natural entity OOM killer considers. But this will not work
because some programs are doing
	vfork()
	set_oom_adj()
	exec()

We can achieve the same though. oom_score_adj write handler can set the
oom_score_adj for all processes sharing the same mm if the task is not
in the middle of vfork. As a result all the processes will share the
same oom_score_adj. The current implementation is rather pessimistic
and checks all the existing processes by default if there are more than
1 holder of the mm but we do not have any reliable way to check for
external users yet.

Note that we have to serialize all the oom_score_adj writers now to
guarantee they do not interleave and generate inconsistent results.

Changes since v1
- note that we are changing oom_score_adj outside of the thread group
  to the log
- skip over kernel threads

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c     | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm.h |  2 ++
 mm/oom_kill.c      |  2 +-
 3 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 0afc77d4d84a..3947a0ea5465 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1043,10 +1043,13 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
 
 static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 {
+	static DEFINE_MUTEX(oom_adj_mutex);
+	struct mm_struct *mm = NULL;
 	struct task_struct *task;
 	unsigned long flags;
 	int err = 0;
 
+	mutex_lock(&oom_adj_mutex);
 	task = get_proc_task(file_inode(file));
 	if (!task) {
 		err = -ESRCH;
@@ -1079,6 +1082,23 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 		}
 	}
 
+	/*
+	 * Make sure we will check other processes sharing the mm if this is
+	 * not vfrok which wants its own oom_score_adj.
+	 * pin the mm so it doesn't go away and get reused after task_unlock
+	 */
+	if (!task->vfork_done) {
+		struct task_struct *p = find_lock_task_mm(task);
+
+		if (p) {
+			if (atomic_read(&p->mm->mm_users) > 1) {
+				mm = p->mm;
+				atomic_inc(&mm->mm_count);
+			}
+			task_unlock(p);
+		}
+	}
+
 	task->signal->oom_score_adj = oom_adj;
 	if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
 		task->signal->oom_score_adj_min = (short)oom_adj;
@@ -1086,8 +1106,34 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 err_sighand:
 	unlock_task_sighand(task, &flags);
 err_put_task:
+
+	if (mm) {
+		struct task_struct *p;
+
+		rcu_read_lock();
+		for_each_process(p) {
+			/* do not touch kernel threads */
+			if (p->flags & PF_KTHREAD)
+				continue;
+
+			task_lock(p);
+			if (!p->vfork_done && process_shares_mm(p, mm)) {
+				pr_info("updating oom_score_adj for %d (%s) from %d to %d because it shares mm with %d (%s). Report if this is unexpected.\n",
+						task_pid_nr(p), p->comm,
+						p->signal->oom_score_adj, oom_adj,
+						task_pid_nr(task), task->comm);
+				p->signal->oom_score_adj = oom_adj;
+				if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
+					p->signal->oom_score_adj_min = (short)oom_adj;
+			}
+			task_unlock(p);
+		}
+		rcu_read_unlock();
+		mmdrop(mm);
+	}
 	put_task_struct(task);
 out:
+	mutex_unlock(&oom_adj_mutex);
 	return err;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 79e5129a3277..9bb4331f92f1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2248,6 +2248,8 @@ static inline int in_gate_area(struct mm_struct *mm, unsigned long addr)
 }
 #endif	/* __HAVE_ARCH_GATE_AREA */
 
+extern bool process_shares_mm(struct task_struct *p, struct mm_struct *mm);
+
 #ifdef CONFIG_SYSCTL
 extern int sysctl_drop_caches;
 int drop_caches_sysctl_handler(struct ctl_table *, int,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e01cc3e2e755..6bb322577fc6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -415,7 +415,7 @@ bool oom_killer_disabled __read_mostly;
  * task's threads: if one of those is using this mm then this task was also
  * using it.
  */
-static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
+bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 {
 	struct task_struct *t;
 
-- 
2.8.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 4/6] mm, oom: skip vforked tasks from being selected
  2016-05-30 13:05 ` Michal Hocko
@ 2016-05-30 13:05   ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-05-30 13:05 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

vforked tasks are not really sitting on any memory. They are sharing
the mm with parent until they exec into a new code. Until then it is
just pinning the address space. OOM killer will kill the vforked task
along with its parent but we still can end up selecting vforked task
when the parent wouldn't be selected. E.g. init doing vfork to launch
a task or vforked being a child of oom unkillable task with an updated
oom_score_adj to be killable.

Make sure to not select vforked task as an oom victim by checking
vfork_done in oom_badness.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6bb322577fc6..92bc8c3ec97b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -176,11 +176,13 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 
 	/*
 	 * Do not even consider tasks which are explicitly marked oom
-	 * unkillable or have been already oom reaped.
+	 * unkillable or have been already oom reaped or the are in
+	 * the middle of vfork
 	 */
 	adj = (long)p->signal->oom_score_adj;
 	if (adj == OOM_SCORE_ADJ_MIN ||
-			test_bit(MMF_OOM_REAPED, &p->mm->flags)) {
+			test_bit(MMF_OOM_REAPED, &p->mm->flags) ||
+			p->vfork_done) {
 		task_unlock(p);
 		return 0;
 	}
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 4/6] mm, oom: skip vforked tasks from being selected
@ 2016-05-30 13:05   ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-05-30 13:05 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

vforked tasks are not really sitting on any memory. They are sharing
the mm with parent until they exec into a new code. Until then it is
just pinning the address space. OOM killer will kill the vforked task
along with its parent but we still can end up selecting vforked task
when the parent wouldn't be selected. E.g. init doing vfork to launch
a task or vforked being a child of oom unkillable task with an updated
oom_score_adj to be killable.

Make sure to not select vforked task as an oom victim by checking
vfork_done in oom_badness.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6bb322577fc6..92bc8c3ec97b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -176,11 +176,13 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 
 	/*
 	 * Do not even consider tasks which are explicitly marked oom
-	 * unkillable or have been already oom reaped.
+	 * unkillable or have been already oom reaped or the are in
+	 * the middle of vfork
 	 */
 	adj = (long)p->signal->oom_score_adj;
 	if (adj == OOM_SCORE_ADJ_MIN ||
-			test_bit(MMF_OOM_REAPED, &p->mm->flags)) {
+			test_bit(MMF_OOM_REAPED, &p->mm->flags) ||
+			p->vfork_done) {
 		task_unlock(p);
 		return 0;
 	}
-- 
2.8.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 5/6] mm, oom: kill all tasks sharing the mm
  2016-05-30 13:05 ` Michal Hocko
@ 2016-05-30 13:05   ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-05-30 13:05 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Currently oom_kill_process skips both the oom reaper and SIG_KILL if a
process sharing the same mm is unkillable via OOM_ADJUST_MIN. After "mm,
oom_adj: make sure processes sharing mm have same view of oom_score_adj"
all such processes are sharing the same value so we shouldn't see such a
task at all (oom_badness would rule them out).

We can still encounter oom disabled vforked task which has to be killed
as well if we want to have other tasks sharing the mm reapable
because it can access the memory before doing exec. Killing such a task
should be acceptable because it is highly unlikely it has done anything
useful because it cannot modify any memory before it calls exec. An
alternative would be to keep the task alive and skip the oom reaper and
risk all the weird corner cases where the OOM killer cannot make forward
progress because the oom victim hung somewhere on the way to exit.

There is a potential race where we kill the oom disabled task which is
highly unlikely but possible. It would happen if __set_oom_adj raced
with select_bad_process and then it is OK to consider the old value or
with fork when it should be acceptable as well.
Let's add a little note to the log so that people would tell us that
this really happens in the real life and it matters.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 92bc8c3ec97b..d296f4467500 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -852,8 +852,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		if (same_thread_group(p, victim))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
-		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p)) {
 			/*
 			 * We cannot use oom_reaper for the mm shared by this
 			 * process because it wouldn't get killed and so the
@@ -862,6 +861,11 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			can_oom_reap = false;
 			continue;
 		}
+		if (p->signal->oom_score_adj == OOM_ADJUST_MIN)
+			pr_warn("%s pid=%d shares mm with oom disabled %s pid=%d. Seems like misconfiguration, killing anyway!"
+					" Report at linux-mm@kvack.org\n",
+					victim->comm, task_pid_nr(victim),
+					p->comm, task_pid_nr(p));
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 5/6] mm, oom: kill all tasks sharing the mm
@ 2016-05-30 13:05   ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-05-30 13:05 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Currently oom_kill_process skips both the oom reaper and SIG_KILL if a
process sharing the same mm is unkillable via OOM_ADJUST_MIN. After "mm,
oom_adj: make sure processes sharing mm have same view of oom_score_adj"
all such processes are sharing the same value so we shouldn't see such a
task at all (oom_badness would rule them out).

We can still encounter oom disabled vforked task which has to be killed
as well if we want to have other tasks sharing the mm reapable
because it can access the memory before doing exec. Killing such a task
should be acceptable because it is highly unlikely it has done anything
useful because it cannot modify any memory before it calls exec. An
alternative would be to keep the task alive and skip the oom reaper and
risk all the weird corner cases where the OOM killer cannot make forward
progress because the oom victim hung somewhere on the way to exit.

There is a potential race where we kill the oom disabled task which is
highly unlikely but possible. It would happen if __set_oom_adj raced
with select_bad_process and then it is OK to consider the old value or
with fork when it should be acceptable as well.
Let's add a little note to the log so that people would tell us that
this really happens in the real life and it matters.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 92bc8c3ec97b..d296f4467500 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -852,8 +852,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		if (same_thread_group(p, victim))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
-		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p)) {
 			/*
 			 * We cannot use oom_reaper for the mm shared by this
 			 * process because it wouldn't get killed and so the
@@ -862,6 +861,11 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			can_oom_reap = false;
 			continue;
 		}
+		if (p->signal->oom_score_adj == OOM_ADJUST_MIN)
+			pr_warn("%s pid=%d shares mm with oom disabled %s pid=%d. Seems like misconfiguration, killing anyway!"
+					" Report at linux-mm@kvack.org\n",
+					victim->comm, task_pid_nr(victim),
+					p->comm, task_pid_nr(p));
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
-- 
2.8.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 6/6] mm, oom: fortify task_will_free_mem
  2016-05-30 13:05 ` Michal Hocko
@ 2016-05-30 13:05   ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-05-30 13:05 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

task_will_free_mem is rather weak. It doesn't really tell whether
the task has chance to drop its mm. 98748bd72200 ("oom: consider
multi-threaded tasks in task_will_free_mem") made a first step
into making it more robust for multi-threaded applications so now we
know that the whole process is going down and probably drop the mm.

This patch builds on top for more complex scenarios where mm is shared
between different processes - CLONE_VM without CLONE_THREAD resp
CLONE_SIGHAND, or in kernel use_mm().

Make sure that all processes sharing the mm are killed or exiting. This
will allow us to replace try_oom_reaper by wake_oom_reaper. Therefore
all paths which bypass the oom killer are now reapable and so they
shouldn't lock up the oom killer.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/oom.h | 63 ++++++++++++++++++++++++++++++++++++++++++++++----
 mm/memcontrol.c     |  4 ++--
 mm/oom_kill.c       | 66 ++++-------------------------------------------------
 3 files changed, 65 insertions(+), 68 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index cbc24a5fe28d..c4cc0591d959 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -73,9 +73,9 @@ static inline bool oom_task_origin(const struct task_struct *p)
 extern void mark_oom_victim(struct task_struct *tsk);
 
 #ifdef CONFIG_MMU
-extern void try_oom_reaper(struct task_struct *tsk);
+extern void wake_oom_reaper(struct task_struct *tsk);
 #else
-static inline void try_oom_reaper(struct task_struct *tsk)
+static inline void wake_oom_reaper(struct task_struct *tsk)
 {
 }
 #endif
@@ -107,7 +107,7 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
-static inline bool task_will_free_mem(struct task_struct *task)
+static inline bool __task_will_free_mem(struct task_struct *task)
 {
 	struct signal_struct *sig = task->signal;
 
@@ -119,16 +119,69 @@ static inline bool task_will_free_mem(struct task_struct *task)
 	if (sig->flags & SIGNAL_GROUP_COREDUMP)
 		return false;
 
-	if (!(task->flags & PF_EXITING))
+	if (!(task->flags & PF_EXITING || fatal_signal_pending(task)))
 		return false;
 
 	/* Make sure that the whole thread group is going down */
-	if (!thread_group_empty(task) && !(sig->flags & SIGNAL_GROUP_EXIT))
+	if (!thread_group_empty(task) &&
+		!(sig->flags & SIGNAL_GROUP_EXIT || fatal_signal_pending(task)))
 		return false;
 
 	return true;
 }
 
+/*
+ * Checks whether the given task is dying or exiting and likely to
+ * release its address space. This means that all threads and processes
+ * sharing the same mm have to be killed or exiting.
+ */
+static inline bool task_will_free_mem(struct task_struct *task)
+{
+	struct mm_struct *mm = NULL;
+	struct task_struct *p;
+	bool ret;
+
+	/*
+	 * If the process has passed exit_mm we have to skip it because
+	 * we have lost a link to other tasks sharing this mm, we do not
+	 * have anything to reap and the task might then get stuck waiting
+	 * for parent as zombie and we do not want it to hold TIF_MEMDIE
+	 */
+	p = find_lock_task_mm(task);
+	if (!p)
+		return false;
+
+	if (!__task_will_free_mem(p)) {
+		task_unlock(p);
+		return false;
+	}
+
+	mm = p->mm;
+	if (atomic_read(&mm->mm_users) <= 1) {
+		task_unlock(p);
+		return true;
+	}
+
+	/* pin the mm to not get freed and reused */
+	atomic_inc(&mm->mm_count);
+	task_unlock(p);
+
+	/*
+	 * This is really pessimistic but we do not have any reliable way
+	 * to check that external processes share with our mm
+	 */
+	rcu_read_lock();
+	for_each_process(p) {
+		ret = __task_will_free_mem(p);
+		if (!ret)
+			break;
+	}
+	rcu_read_unlock();
+	mmdrop(mm);
+
+	return ret;
+}
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index eeb3b14de01a..0ae1abe6cd39 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1276,9 +1276,9 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
 	 */
-	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
+	if (task_will_free_mem(current)) {
 		mark_oom_victim(current);
-		try_oom_reaper(current);
+		wake_oom_reaper(current);
 		goto unlock;
 	}
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d296f4467500..0b7c02869bc0 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -591,7 +591,7 @@ static int oom_reaper(void *unused)
 	return 0;
 }
 
-static void wake_oom_reaper(struct task_struct *tsk)
+void wake_oom_reaper(struct task_struct *tsk)
 {
 	if (!oom_reaper_th)
 		return;
@@ -609,51 +609,6 @@ static void wake_oom_reaper(struct task_struct *tsk)
 	wake_up(&oom_reaper_wait);
 }
 
-/* Check if we can reap the given task. This has to be called with stable
- * tsk->mm
- */
-void try_oom_reaper(struct task_struct *tsk)
-{
-	struct mm_struct *mm = tsk->mm;
-	struct task_struct *p;
-
-	if (!mm)
-		return;
-
-	/*
-	 * There might be other threads/processes which are either not
-	 * dying or even not killable.
-	 */
-	if (atomic_read(&mm->mm_users) > 1) {
-		rcu_read_lock();
-		for_each_process(p) {
-			bool exiting;
-
-			if (!process_shares_mm(p, mm))
-				continue;
-			if (fatal_signal_pending(p))
-				continue;
-
-			/*
-			 * If the task is exiting make sure the whole thread group
-			 * is exiting and cannot acces mm anymore.
-			 */
-			spin_lock_irq(&p->sighand->siglock);
-			exiting = signal_group_exit(p->signal);
-			spin_unlock_irq(&p->sighand->siglock);
-			if (exiting)
-				continue;
-
-			/* Give up */
-			rcu_read_unlock();
-			return;
-		}
-		rcu_read_unlock();
-	}
-
-	wake_oom_reaper(tsk);
-}
-
 static int __init oom_init(void)
 {
 	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
@@ -665,10 +620,6 @@ static int __init oom_init(void)
 	return 0;
 }
 subsys_initcall(oom_init)
-#else
-static void wake_oom_reaper(struct task_struct *tsk)
-{
-}
 #endif
 
 /**
@@ -766,15 +717,12 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
-	task_lock(p);
-	if (p->mm && task_will_free_mem(p)) {
+	if (task_will_free_mem(p)) {
 		mark_oom_victim(p);
-		try_oom_reaper(p);
-		task_unlock(p);
+		wake_oom_reaper(p);
 		put_task_struct(p);
 		return;
 	}
-	task_unlock(p);
 
 	if (__ratelimit(&oom_rs))
 		dump_header(oc, p);
@@ -945,14 +893,10 @@ bool out_of_memory(struct oom_control *oc)
 	 * If current has a pending SIGKILL or is exiting, then automatically
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
-	 *
-	 * But don't select if current has already released its mm and cleared
-	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
 	 */
-	if (current->mm &&
-	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
+	if (task_will_free_mem(current)) {
 		mark_oom_victim(current);
-		try_oom_reaper(current);
+		wake_oom_reaper(current);
 		return true;
 	}
 
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 6/6] mm, oom: fortify task_will_free_mem
@ 2016-05-30 13:05   ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-05-30 13:05 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

task_will_free_mem is rather weak. It doesn't really tell whether
the task has chance to drop its mm. 98748bd72200 ("oom: consider
multi-threaded tasks in task_will_free_mem") made a first step
into making it more robust for multi-threaded applications so now we
know that the whole process is going down and probably drop the mm.

This patch builds on top for more complex scenarios where mm is shared
between different processes - CLONE_VM without CLONE_THREAD resp
CLONE_SIGHAND, or in kernel use_mm().

Make sure that all processes sharing the mm are killed or exiting. This
will allow us to replace try_oom_reaper by wake_oom_reaper. Therefore
all paths which bypass the oom killer are now reapable and so they
shouldn't lock up the oom killer.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/oom.h | 63 ++++++++++++++++++++++++++++++++++++++++++++++----
 mm/memcontrol.c     |  4 ++--
 mm/oom_kill.c       | 66 ++++-------------------------------------------------
 3 files changed, 65 insertions(+), 68 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index cbc24a5fe28d..c4cc0591d959 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -73,9 +73,9 @@ static inline bool oom_task_origin(const struct task_struct *p)
 extern void mark_oom_victim(struct task_struct *tsk);
 
 #ifdef CONFIG_MMU
-extern void try_oom_reaper(struct task_struct *tsk);
+extern void wake_oom_reaper(struct task_struct *tsk);
 #else
-static inline void try_oom_reaper(struct task_struct *tsk)
+static inline void wake_oom_reaper(struct task_struct *tsk)
 {
 }
 #endif
@@ -107,7 +107,7 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
-static inline bool task_will_free_mem(struct task_struct *task)
+static inline bool __task_will_free_mem(struct task_struct *task)
 {
 	struct signal_struct *sig = task->signal;
 
@@ -119,16 +119,69 @@ static inline bool task_will_free_mem(struct task_struct *task)
 	if (sig->flags & SIGNAL_GROUP_COREDUMP)
 		return false;
 
-	if (!(task->flags & PF_EXITING))
+	if (!(task->flags & PF_EXITING || fatal_signal_pending(task)))
 		return false;
 
 	/* Make sure that the whole thread group is going down */
-	if (!thread_group_empty(task) && !(sig->flags & SIGNAL_GROUP_EXIT))
+	if (!thread_group_empty(task) &&
+		!(sig->flags & SIGNAL_GROUP_EXIT || fatal_signal_pending(task)))
 		return false;
 
 	return true;
 }
 
+/*
+ * Checks whether the given task is dying or exiting and likely to
+ * release its address space. This means that all threads and processes
+ * sharing the same mm have to be killed or exiting.
+ */
+static inline bool task_will_free_mem(struct task_struct *task)
+{
+	struct mm_struct *mm = NULL;
+	struct task_struct *p;
+	bool ret;
+
+	/*
+	 * If the process has passed exit_mm we have to skip it because
+	 * we have lost a link to other tasks sharing this mm, we do not
+	 * have anything to reap and the task might then get stuck waiting
+	 * for parent as zombie and we do not want it to hold TIF_MEMDIE
+	 */
+	p = find_lock_task_mm(task);
+	if (!p)
+		return false;
+
+	if (!__task_will_free_mem(p)) {
+		task_unlock(p);
+		return false;
+	}
+
+	mm = p->mm;
+	if (atomic_read(&mm->mm_users) <= 1) {
+		task_unlock(p);
+		return true;
+	}
+
+	/* pin the mm to not get freed and reused */
+	atomic_inc(&mm->mm_count);
+	task_unlock(p);
+
+	/*
+	 * This is really pessimistic but we do not have any reliable way
+	 * to check that external processes share with our mm
+	 */
+	rcu_read_lock();
+	for_each_process(p) {
+		ret = __task_will_free_mem(p);
+		if (!ret)
+			break;
+	}
+	rcu_read_unlock();
+	mmdrop(mm);
+
+	return ret;
+}
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index eeb3b14de01a..0ae1abe6cd39 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1276,9 +1276,9 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
 	 */
-	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
+	if (task_will_free_mem(current)) {
 		mark_oom_victim(current);
-		try_oom_reaper(current);
+		wake_oom_reaper(current);
 		goto unlock;
 	}
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d296f4467500..0b7c02869bc0 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -591,7 +591,7 @@ static int oom_reaper(void *unused)
 	return 0;
 }
 
-static void wake_oom_reaper(struct task_struct *tsk)
+void wake_oom_reaper(struct task_struct *tsk)
 {
 	if (!oom_reaper_th)
 		return;
@@ -609,51 +609,6 @@ static void wake_oom_reaper(struct task_struct *tsk)
 	wake_up(&oom_reaper_wait);
 }
 
-/* Check if we can reap the given task. This has to be called with stable
- * tsk->mm
- */
-void try_oom_reaper(struct task_struct *tsk)
-{
-	struct mm_struct *mm = tsk->mm;
-	struct task_struct *p;
-
-	if (!mm)
-		return;
-
-	/*
-	 * There might be other threads/processes which are either not
-	 * dying or even not killable.
-	 */
-	if (atomic_read(&mm->mm_users) > 1) {
-		rcu_read_lock();
-		for_each_process(p) {
-			bool exiting;
-
-			if (!process_shares_mm(p, mm))
-				continue;
-			if (fatal_signal_pending(p))
-				continue;
-
-			/*
-			 * If the task is exiting make sure the whole thread group
-			 * is exiting and cannot acces mm anymore.
-			 */
-			spin_lock_irq(&p->sighand->siglock);
-			exiting = signal_group_exit(p->signal);
-			spin_unlock_irq(&p->sighand->siglock);
-			if (exiting)
-				continue;
-
-			/* Give up */
-			rcu_read_unlock();
-			return;
-		}
-		rcu_read_unlock();
-	}
-
-	wake_oom_reaper(tsk);
-}
-
 static int __init oom_init(void)
 {
 	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
@@ -665,10 +620,6 @@ static int __init oom_init(void)
 	return 0;
 }
 subsys_initcall(oom_init)
-#else
-static void wake_oom_reaper(struct task_struct *tsk)
-{
-}
 #endif
 
 /**
@@ -766,15 +717,12 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
-	task_lock(p);
-	if (p->mm && task_will_free_mem(p)) {
+	if (task_will_free_mem(p)) {
 		mark_oom_victim(p);
-		try_oom_reaper(p);
-		task_unlock(p);
+		wake_oom_reaper(p);
 		put_task_struct(p);
 		return;
 	}
-	task_unlock(p);
 
 	if (__ratelimit(&oom_rs))
 		dump_header(oc, p);
@@ -945,14 +893,10 @@ bool out_of_memory(struct oom_control *oc)
 	 * If current has a pending SIGKILL or is exiting, then automatically
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
-	 *
-	 * But don't select if current has already released its mm and cleared
-	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
 	 */
-	if (current->mm &&
-	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
+	if (task_will_free_mem(current)) {
 		mark_oom_victim(current);
-		try_oom_reaper(current);
+		wake_oom_reaper(current);
 		return true;
 	}
 
-- 
2.8.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/6] proc, oom: drop bogus task_lock and mm check
  2016-05-30 13:05   ` Michal Hocko
@ 2016-05-30 13:49     ` Vladimir Davydov
  -1 siblings, 0 replies; 68+ messages in thread
From: Vladimir Davydov @ 2016-05-30 13:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Oleg Nesterov,
	Andrew Morton, LKML, Michal Hocko

On Mon, May 30, 2016 at 03:05:51PM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> both oom_adj_write and oom_score_adj_write are using task_lock,
> check for task->mm and fail if it is NULL. This is not needed because
> the oom_score_adj is per signal struct so we do not need mm at all.
> The code has been introduced by 3d5992d2ac7d ("oom: add per-mm oom
> disable count") but we do not do per-mm oom disable since c9f01245b6a7
> ("oom: remove oom_disable_count").
> 
> The task->mm check is even not correct because the current thread might
> have exited but the thread group might be still alive - e.g. thread
> group leader would lead that echo $VAL > /proc/pid/oom_score_adj would
> always fail with EINVAL while /proc/pid/task/$other_tid/oom_score_adj
> would succeed. This is unexpected at best.
> 
> Remove the lock along with the check to fix the unexpected behavior
> and also because there is not real need for the lock in the first place.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/6] proc, oom: drop bogus task_lock and mm check
@ 2016-05-30 13:49     ` Vladimir Davydov
  0 siblings, 0 replies; 68+ messages in thread
From: Vladimir Davydov @ 2016-05-30 13:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Oleg Nesterov,
	Andrew Morton, LKML, Michal Hocko

On Mon, May 30, 2016 at 03:05:51PM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> both oom_adj_write and oom_score_adj_write are using task_lock,
> check for task->mm and fail if it is NULL. This is not needed because
> the oom_score_adj is per signal struct so we do not need mm at all.
> The code has been introduced by 3d5992d2ac7d ("oom: add per-mm oom
> disable count") but we do not do per-mm oom disable since c9f01245b6a7
> ("oom: remove oom_disable_count").
> 
> The task->mm check is even not correct because the current thread might
> have exited but the thread group might be still alive - e.g. thread
> group leader would lead that echo $VAL > /proc/pid/oom_score_adj would
> always fail with EINVAL while /proc/pid/task/$other_tid/oom_score_adj
> would succeed. This is unexpected at best.
> 
> Remove the lock along with the check to fix the unexpected behavior
> and also because there is not real need for the lock in the first place.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 6/6] mm, oom: fortify task_will_free_mem
  2016-05-30 13:05   ` Michal Hocko
@ 2016-05-30 17:35     ` Oleg Nesterov
  -1 siblings, 0 replies; 68+ messages in thread
From: Oleg Nesterov @ 2016-05-30 17:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

On 05/30, Michal Hocko wrote:
>
> task_will_free_mem is rather weak.

I was thinking about the similar change because I noticed that try_oom_reaper()
is very, very wrong.

To the point I think that we need another change for stable which simply removes
spin_lock_irq(sighand->siglock) from try_oom_reaper(). It buys nothing, we can
check signal_group_exit() (which is wrong too ;) lockless, and at the same time
the kernel can crash because we can hit ->siglock == NULL.

So I do think this change is good in general.

I think that task_will_free_mem() should be un-inlined, and __task_will_free_mem()
should go into mm/oom-kill.c... but this is minor.

> -static inline bool task_will_free_mem(struct task_struct *task)
> +static inline bool __task_will_free_mem(struct task_struct *task)
>  {
>  	struct signal_struct *sig = task->signal;
>  
> @@ -119,16 +119,69 @@ static inline bool task_will_free_mem(struct task_struct *task)
>  	if (sig->flags & SIGNAL_GROUP_COREDUMP)
>  		return false;
>  
> -	if (!(task->flags & PF_EXITING))
> +	if (!(task->flags & PF_EXITING || fatal_signal_pending(task)))
>  		return false;
>  
>  	/* Make sure that the whole thread group is going down */
> -	if (!thread_group_empty(task) && !(sig->flags & SIGNAL_GROUP_EXIT))
> +	if (!thread_group_empty(task) &&
> +		!(sig->flags & SIGNAL_GROUP_EXIT || fatal_signal_pending(task)))
>  		return false;
>  
>  	return true;
>  }

Well, let me suggest this again. I think it should do


	if (SIGNAL_GROUP_COREDUMP)
		return false;

	if (SIGNAL_GROUP_EXIT)
		return true;

	if (thread_group_empty() && PF_EXITING)
		return true;

	return false;

we do not need fatal_signal_pending(), in this case SIGNAL_GROUP_EXIT should
be set (ignoring some bugs with sub-namespaces which we need to fix anyway).

At the same time, we do not want to return false if PF_EXITING is not set
if SIGNAL_GROUP_EXIT is set.

> +static inline bool task_will_free_mem(struct task_struct *task)
> +{
> +	struct mm_struct *mm = NULL;
> +	struct task_struct *p;
> +	bool ret;
> +
> +	/*
> +	 * If the process has passed exit_mm we have to skip it because
> +	 * we have lost a link to other tasks sharing this mm, we do not
> +	 * have anything to reap and the task might then get stuck waiting
> +	 * for parent as zombie and we do not want it to hold TIF_MEMDIE
> +	 */
> +	p = find_lock_task_mm(task);
> +	if (!p)
> +		return false;
> +
> +	if (!__task_will_free_mem(p)) {
> +		task_unlock(p);
> +		return false;
> +	}
> +
> +	mm = p->mm;
> +	if (atomic_read(&mm->mm_users) <= 1) {

this is sub-optimal, we should probably take signal->live or ->nr_threads
into account... but OK, we can do this later.

> +	rcu_read_lock();
> +	for_each_process(p) {
> +		ret = __task_will_free_mem(p);
> +		if (!ret)
> +			break;
> +	}
> +	rcu_read_unlock();

Yes, I agree very much.

But it seems you forgot to add the process_shares_mm() check into this loop?

and perhaps it also makes sense to add

	if (same_thread_group(tsk, p))
		continue;

This should not really matter, we know that __task_will_free_mem(p) should return
true. Just to make it more clear.

And. I think this needs smp_rmb() at the end of the loop (assuming we have the
process_shares_mm() check here). We need it to ensure that we read p->mm before
we read next_task(), to avoid the race with exit() + clone(CLONE_VM).

Oleg.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 6/6] mm, oom: fortify task_will_free_mem
@ 2016-05-30 17:35     ` Oleg Nesterov
  0 siblings, 0 replies; 68+ messages in thread
From: Oleg Nesterov @ 2016-05-30 17:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

On 05/30, Michal Hocko wrote:
>
> task_will_free_mem is rather weak.

I was thinking about the similar change because I noticed that try_oom_reaper()
is very, very wrong.

To the point I think that we need another change for stable which simply removes
spin_lock_irq(sighand->siglock) from try_oom_reaper(). It buys nothing, we can
check signal_group_exit() (which is wrong too ;) lockless, and at the same time
the kernel can crash because we can hit ->siglock == NULL.

So I do think this change is good in general.

I think that task_will_free_mem() should be un-inlined, and __task_will_free_mem()
should go into mm/oom-kill.c... but this is minor.

> -static inline bool task_will_free_mem(struct task_struct *task)
> +static inline bool __task_will_free_mem(struct task_struct *task)
>  {
>  	struct signal_struct *sig = task->signal;
>  
> @@ -119,16 +119,69 @@ static inline bool task_will_free_mem(struct task_struct *task)
>  	if (sig->flags & SIGNAL_GROUP_COREDUMP)
>  		return false;
>  
> -	if (!(task->flags & PF_EXITING))
> +	if (!(task->flags & PF_EXITING || fatal_signal_pending(task)))
>  		return false;
>  
>  	/* Make sure that the whole thread group is going down */
> -	if (!thread_group_empty(task) && !(sig->flags & SIGNAL_GROUP_EXIT))
> +	if (!thread_group_empty(task) &&
> +		!(sig->flags & SIGNAL_GROUP_EXIT || fatal_signal_pending(task)))
>  		return false;
>  
>  	return true;
>  }

Well, let me suggest this again. I think it should do


	if (SIGNAL_GROUP_COREDUMP)
		return false;

	if (SIGNAL_GROUP_EXIT)
		return true;

	if (thread_group_empty() && PF_EXITING)
		return true;

	return false;

we do not need fatal_signal_pending(), in this case SIGNAL_GROUP_EXIT should
be set (ignoring some bugs with sub-namespaces which we need to fix anyway).

At the same time, we do not want to return false if PF_EXITING is not set
if SIGNAL_GROUP_EXIT is set.

> +static inline bool task_will_free_mem(struct task_struct *task)
> +{
> +	struct mm_struct *mm = NULL;
> +	struct task_struct *p;
> +	bool ret;
> +
> +	/*
> +	 * If the process has passed exit_mm we have to skip it because
> +	 * we have lost a link to other tasks sharing this mm, we do not
> +	 * have anything to reap and the task might then get stuck waiting
> +	 * for parent as zombie and we do not want it to hold TIF_MEMDIE
> +	 */
> +	p = find_lock_task_mm(task);
> +	if (!p)
> +		return false;
> +
> +	if (!__task_will_free_mem(p)) {
> +		task_unlock(p);
> +		return false;
> +	}
> +
> +	mm = p->mm;
> +	if (atomic_read(&mm->mm_users) <= 1) {

this is sub-optimal, we should probably take signal->live or ->nr_threads
into account... but OK, we can do this later.

> +	rcu_read_lock();
> +	for_each_process(p) {
> +		ret = __task_will_free_mem(p);
> +		if (!ret)
> +			break;
> +	}
> +	rcu_read_unlock();

Yes, I agree very much.

But it seems you forgot to add the process_shares_mm() check into this loop?

and perhaps it also makes sense to add

	if (same_thread_group(tsk, p))
		continue;

This should not really matter, we know that __task_will_free_mem(p) should return
true. Just to make it more clear.

And. I think this needs smp_rmb() at the end of the loop (assuming we have the
process_shares_mm() check here). We need it to ensure that we read p->mm before
we read next_task(), to avoid the race with exit() + clone(CLONE_VM).

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/6] proc, oom: drop bogus task_lock and mm check
  2016-05-30 13:05   ` Michal Hocko
@ 2016-05-30 17:43     ` Oleg Nesterov
  -1 siblings, 0 replies; 68+ messages in thread
From: Oleg Nesterov @ 2016-05-30 17:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

On 05/30, Michal Hocko wrote:
>
> both oom_adj_write and oom_score_adj_write are using task_lock,
> check for task->mm and fail if it is NULL. This is not needed because
> the oom_score_adj is per signal struct so we do not need mm at all.
> The code has been introduced by 3d5992d2ac7d ("oom: add per-mm oom
> disable count") but we do not do per-mm oom disable since c9f01245b6a7
> ("oom: remove oom_disable_count").
>
> The task->mm check is even not correct because the current thread might
> have exited but the thread group might be still alive - e.g. thread
> group leader would lead that echo $VAL > /proc/pid/oom_score_adj would
> always fail with EINVAL while /proc/pid/task/$other_tid/oom_score_adj
> would succeed. This is unexpected at best.
>
> Remove the lock along with the check to fix the unexpected behavior
> and also because there is not real need for the lock in the first place.

ACK

and we should also remove lock_task_sighand(). as for oom_adj_read() and
oom_score_adj_read() we can just remove it right now; it was previously
needed to ensure the task->signal != NULL, today this is always true.

Oleg.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/6] proc, oom: drop bogus task_lock and mm check
@ 2016-05-30 17:43     ` Oleg Nesterov
  0 siblings, 0 replies; 68+ messages in thread
From: Oleg Nesterov @ 2016-05-30 17:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

On 05/30, Michal Hocko wrote:
>
> both oom_adj_write and oom_score_adj_write are using task_lock,
> check for task->mm and fail if it is NULL. This is not needed because
> the oom_score_adj is per signal struct so we do not need mm at all.
> The code has been introduced by 3d5992d2ac7d ("oom: add per-mm oom
> disable count") but we do not do per-mm oom disable since c9f01245b6a7
> ("oom: remove oom_disable_count").
>
> The task->mm check is even not correct because the current thread might
> have exited but the thread group might be still alive - e.g. thread
> group leader would lead that echo $VAL > /proc/pid/oom_score_adj would
> always fail with EINVAL while /proc/pid/task/$other_tid/oom_score_adj
> would succeed. This is unexpected at best.
>
> Remove the lock along with the check to fix the unexpected behavior
> and also because there is not real need for the lock in the first place.

ACK

and we should also remove lock_task_sighand(). as for oom_adj_read() and
oom_score_adj_read() we can just remove it right now; it was previously
needed to ensure the task->signal != NULL, today this is always true.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 5/6] mm, oom: kill all tasks sharing the mm
  2016-05-30 13:05   ` Michal Hocko
@ 2016-05-30 18:18     ` Oleg Nesterov
  -1 siblings, 0 replies; 68+ messages in thread
From: Oleg Nesterov @ 2016-05-30 18:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

On 05/30, Michal Hocko wrote:
>
> @@ -852,8 +852,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  			continue;
>  		if (same_thread_group(p, victim))
>  			continue;
> -		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
> -		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> +		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p)) {
>  			/*
>  			 * We cannot use oom_reaper for the mm shared by this
>  			 * process because it wouldn't get killed and so the
> @@ -862,6 +861,11 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  			can_oom_reap = false;
>  			continue;
>  		}
> +		if (p->signal->oom_score_adj == OOM_ADJUST_MIN)
> +			pr_warn("%s pid=%d shares mm with oom disabled %s pid=%d. Seems like misconfiguration, killing anyway!"
> +					" Report at linux-mm@kvack.org\n",
> +					victim->comm, task_pid_nr(victim),
> +					p->comm, task_pid_nr(p));

Oh, yes, I personally do agree ;)

perhaps the is_global_init() == T case needs a warning too? the previous changes
take care about vfork() from /sbin/init, so the only reason we can see it true
is that /sbin/init shares the memory with a memory hog... Nevermind, forget.

This is a bit off-topic, but perhaps we can also change the PF_KTHREAD check later.
Of course we should not try to kill this kthread, but can_oom_reap can be true in
this case. A kernel thread which does use_mm() should handle the errors correctly
if (say) get_user() fails because we unmap the memory.

Oleg.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 5/6] mm, oom: kill all tasks sharing the mm
@ 2016-05-30 18:18     ` Oleg Nesterov
  0 siblings, 0 replies; 68+ messages in thread
From: Oleg Nesterov @ 2016-05-30 18:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

On 05/30, Michal Hocko wrote:
>
> @@ -852,8 +852,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  			continue;
>  		if (same_thread_group(p, victim))
>  			continue;
> -		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
> -		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> +		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p)) {
>  			/*
>  			 * We cannot use oom_reaper for the mm shared by this
>  			 * process because it wouldn't get killed and so the
> @@ -862,6 +861,11 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  			can_oom_reap = false;
>  			continue;
>  		}
> +		if (p->signal->oom_score_adj == OOM_ADJUST_MIN)
> +			pr_warn("%s pid=%d shares mm with oom disabled %s pid=%d. Seems like misconfiguration, killing anyway!"
> +					" Report at linux-mm@kvack.org\n",
> +					victim->comm, task_pid_nr(victim),
> +					p->comm, task_pid_nr(p));

Oh, yes, I personally do agree ;)

perhaps the is_global_init() == T case needs a warning too? the previous changes
take care about vfork() from /sbin/init, so the only reason we can see it true
is that /sbin/init shares the memory with a memory hog... Nevermind, forget.

This is a bit off-topic, but perhaps we can also change the PF_KTHREAD check later.
Of course we should not try to kill this kthread, but can_oom_reap can be true in
this case. A kernel thread which does use_mm() should handle the errors correctly
if (say) get_user() fails because we unmap the memory.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 4/6] mm, oom: skip vforked tasks from being selected
  2016-05-30 13:05   ` Michal Hocko
@ 2016-05-30 19:28     ` Oleg Nesterov
  -1 siblings, 0 replies; 68+ messages in thread
From: Oleg Nesterov @ 2016-05-30 19:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

On 05/30, Michal Hocko wrote:
>
> Make sure to not select vforked task as an oom victim by checking
> vfork_done in oom_badness.

I agree, this look like a good change to me... But.

> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -176,11 +176,13 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
>  
>  	/*
>  	 * Do not even consider tasks which are explicitly marked oom
> -	 * unkillable or have been already oom reaped.
> +	 * unkillable or have been already oom reaped or the are in
> +	 * the middle of vfork
>  	 */
>  	adj = (long)p->signal->oom_score_adj;
>  	if (adj == OOM_SCORE_ADJ_MIN ||
> -			test_bit(MMF_OOM_REAPED, &p->mm->flags)) {
> +			test_bit(MMF_OOM_REAPED, &p->mm->flags) ||
> +			p->vfork_done) {

I don't think we can trust vfork_done != NULL.

copy_process() doesn't disallow CLONE_VFORK without CLONE_VM, so with this patch
it would be trivial to make the exploit which hides a memory hog from oom-killer.

So perhaps we need something like

		bool in_vfork(p)
		{
			return	p->vfork_done &&
				p->real_parent->mm == mm;

			
		}

task_lock() is not enough if CLONE_VM was used along with CLONE_PARENT... so this
also needs rcu_read_lock() to access ->real_parent.

Or I am totally confused?

Oleg.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 4/6] mm, oom: skip vforked tasks from being selected
@ 2016-05-30 19:28     ` Oleg Nesterov
  0 siblings, 0 replies; 68+ messages in thread
From: Oleg Nesterov @ 2016-05-30 19:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

On 05/30, Michal Hocko wrote:
>
> Make sure to not select vforked task as an oom victim by checking
> vfork_done in oom_badness.

I agree, this look like a good change to me... But.

> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -176,11 +176,13 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
>  
>  	/*
>  	 * Do not even consider tasks which are explicitly marked oom
> -	 * unkillable or have been already oom reaped.
> +	 * unkillable or have been already oom reaped or the are in
> +	 * the middle of vfork
>  	 */
>  	adj = (long)p->signal->oom_score_adj;
>  	if (adj == OOM_SCORE_ADJ_MIN ||
> -			test_bit(MMF_OOM_REAPED, &p->mm->flags)) {
> +			test_bit(MMF_OOM_REAPED, &p->mm->flags) ||
> +			p->vfork_done) {

I don't think we can trust vfork_done != NULL.

copy_process() doesn't disallow CLONE_VFORK without CLONE_VM, so with this patch
it would be trivial to make the exploit which hides a memory hog from oom-killer.

So perhaps we need something like

		bool in_vfork(p)
		{
			return	p->vfork_done &&
				p->real_parent->mm == mm;

			
		}

task_lock() is not enough if CLONE_VM was used along with CLONE_PARENT... so this
also needs rcu_read_lock() to access ->real_parent.

Or I am totally confused?

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/6] proc, oom: drop bogus task_lock and mm check
  2016-05-30 17:43     ` Oleg Nesterov
@ 2016-05-31  7:32       ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-05-31  7:32 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On Mon 30-05-16 19:43:24, Oleg Nesterov wrote:
> On 05/30, Michal Hocko wrote:
> >
> > both oom_adj_write and oom_score_adj_write are using task_lock,
> > check for task->mm and fail if it is NULL. This is not needed because
> > the oom_score_adj is per signal struct so we do not need mm at all.
> > The code has been introduced by 3d5992d2ac7d ("oom: add per-mm oom
> > disable count") but we do not do per-mm oom disable since c9f01245b6a7
> > ("oom: remove oom_disable_count").
> >
> > The task->mm check is even not correct because the current thread might
> > have exited but the thread group might be still alive - e.g. thread
> > group leader would lead that echo $VAL > /proc/pid/oom_score_adj would
> > always fail with EINVAL while /proc/pid/task/$other_tid/oom_score_adj
> > would succeed. This is unexpected at best.
> >
> > Remove the lock along with the check to fix the unexpected behavior
> > and also because there is not real need for the lock in the first place.
> 
> ACK

thanks!

> and we should also remove lock_task_sighand(). as for oom_adj_read() and
> oom_score_adj_read() we can just remove it right now; it was previously
> needed to ensure the task->signal != NULL, today this is always true.

OK, I will add the following patch to the series.
---
>From 952c464a31ffbe158233c4cc05f4b8a64384635c Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Tue, 31 May 2016 09:28:36 +0200
Subject: [PATCH] proc, oom: drop bogus sighand lock

Oleg has pointed out that can simplify both oom_adj_write and
oom_score_adj_write even further and drop the sighand lock. The only
purpose of the lock was to protect p->signal from going away but this
will not happen since ea6d290ca34c ("signals: make task_struct->signal
immutable/refcountable").

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c | 20 ++------------------
 1 file changed, 2 insertions(+), 18 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index a6014e45c516..3761f107615a 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1057,7 +1057,6 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 	struct task_struct *task;
 	char buffer[PROC_NUMBUF];
 	int oom_adj;
-	unsigned long flags;
 	int err;
 
 	memset(buffer, 0, sizeof(buffer));
@@ -1083,11 +1082,6 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	if (!lock_task_sighand(task, &flags)) {
-		err = -ESRCH;
-		goto err_put_task;
-	}
-
 	/*
 	 * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
 	 * value is always attainable.
@@ -1100,7 +1094,7 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 	if (oom_adj < task->signal->oom_score_adj &&
 	    !capable(CAP_SYS_RESOURCE)) {
 		err = -EACCES;
-		goto err_sighand;
+		goto err_put_task;
 	}
 
 	/*
@@ -1113,8 +1107,6 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 
 	task->signal->oom_score_adj = oom_adj;
 	trace_oom_score_adj_update(task);
-err_sighand:
-	unlock_task_sighand(task, &flags);
 err_put_task:
 	put_task_struct(task);
 out:
@@ -1152,7 +1144,6 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 {
 	struct task_struct *task;
 	char buffer[PROC_NUMBUF];
-	unsigned long flags;
 	int oom_score_adj;
 	int err;
 
@@ -1179,15 +1170,10 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	if (!lock_task_sighand(task, &flags)) {
-		err = -ESRCH;
-		goto err_put_task;
-	}
-
 	if ((short)oom_score_adj < task->signal->oom_score_adj_min &&
 			!capable(CAP_SYS_RESOURCE)) {
 		err = -EACCES;
-		goto err_sighand;
+		goto err_put_task;
 	}
 
 	task->signal->oom_score_adj = (short)oom_score_adj;
@@ -1195,8 +1181,6 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 		task->signal->oom_score_adj_min = (short)oom_score_adj;
 	trace_oom_score_adj_update(task);
 
-err_sighand:
-	unlock_task_sighand(task, &flags);
 err_put_task:
 	put_task_struct(task);
 out:
-- 
2.8.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/6] proc, oom: drop bogus task_lock and mm check
@ 2016-05-31  7:32       ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-05-31  7:32 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On Mon 30-05-16 19:43:24, Oleg Nesterov wrote:
> On 05/30, Michal Hocko wrote:
> >
> > both oom_adj_write and oom_score_adj_write are using task_lock,
> > check for task->mm and fail if it is NULL. This is not needed because
> > the oom_score_adj is per signal struct so we do not need mm at all.
> > The code has been introduced by 3d5992d2ac7d ("oom: add per-mm oom
> > disable count") but we do not do per-mm oom disable since c9f01245b6a7
> > ("oom: remove oom_disable_count").
> >
> > The task->mm check is even not correct because the current thread might
> > have exited but the thread group might be still alive - e.g. thread
> > group leader would lead that echo $VAL > /proc/pid/oom_score_adj would
> > always fail with EINVAL while /proc/pid/task/$other_tid/oom_score_adj
> > would succeed. This is unexpected at best.
> >
> > Remove the lock along with the check to fix the unexpected behavior
> > and also because there is not real need for the lock in the first place.
> 
> ACK

thanks!

> and we should also remove lock_task_sighand(). as for oom_adj_read() and
> oom_score_adj_read() we can just remove it right now; it was previously
> needed to ensure the task->signal != NULL, today this is always true.

OK, I will add the following patch to the series.
---

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 3/6] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
  2016-05-30 13:05   ` Michal Hocko
@ 2016-05-31  7:41     ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-05-31  7:41 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML

This got lost during the rebase.
---
>From cb16205a71c29d80425922cfc584373eb14b018e Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Tue, 31 May 2016 07:16:12 +0200
Subject: [PATCH] fold me "mm, oom_adj: make sure processes sharing mm have
 same view of oom_score_adj"

- skip over same thread group
- skip over kernel threads and global init

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index f4fcd2954f42..164fe38022d2 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1104,8 +1104,11 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 
 		rcu_read_lock();
 		for_each_process(p) {
-			/* do not touch kernel threads */
-			if (p->flags & PF_KTHREAD)
+			if (same_thread_group(task,p))
+				continue;
+
+			/* do not touch kernel threads or the global init */
+			if (p->flags & PF_KTHREAD || is_global_init(p))
 				continue;
 
 			task_lock(p);
-- 
2.8.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH 3/6] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
@ 2016-05-31  7:41     ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-05-31  7:41 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML

This got lost during the rebase.
---

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 4/6] mm, oom: skip vforked tasks from being selected
  2016-05-30 19:28     ` Oleg Nesterov
@ 2016-05-31  7:42       ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-05-31  7:42 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On Mon 30-05-16 21:28:57, Oleg Nesterov wrote:
> On 05/30, Michal Hocko wrote:
> >
> > Make sure to not select vforked task as an oom victim by checking
> > vfork_done in oom_badness.
> 
> I agree, this look like a good change to me... But.
> 
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -176,11 +176,13 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
> >  
> >  	/*
> >  	 * Do not even consider tasks which are explicitly marked oom
> > -	 * unkillable or have been already oom reaped.
> > +	 * unkillable or have been already oom reaped or the are in
> > +	 * the middle of vfork
> >  	 */
> >  	adj = (long)p->signal->oom_score_adj;
> >  	if (adj == OOM_SCORE_ADJ_MIN ||
> > -			test_bit(MMF_OOM_REAPED, &p->mm->flags)) {
> > +			test_bit(MMF_OOM_REAPED, &p->mm->flags) ||
> > +			p->vfork_done) {
> 
> I don't think we can trust vfork_done != NULL.
> 
> copy_process() doesn't disallow CLONE_VFORK without CLONE_VM, so with this patch
> it would be trivial to make the exploit which hides a memory hog from oom-killer.

OK, I wasn't aware of this possibility. It sounds really weird because I
thought that the whole point of vfork is to prevent from MM copy
overhead for quick exec.

> So perhaps we need something like
> 
> 		bool in_vfork(p)
> 		{
> 			return	p->vfork_done &&
> 				p->real_parent->mm == mm;
> 
> 			
> 		}
> 
> task_lock() is not enough if CLONE_VM was used along with CLONE_PARENT... so this
> also needs rcu_read_lock() to access ->real_parent.
> 
> Or I am totally confused?

I cannot judge I am afraid. You are definitely much more familiar with
all these subtle details than me.

So what do you think about this follow up:
---
>From 9300b4f59b0b108f0013ff44a5da93e8c6b12b46 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Tue, 31 May 2016 07:45:45 +0200
Subject: [PATCH 3/3] fold me "mm, oom: skip vforked tasks from being selected"

per Oleg:
- copy_process() doesn't disallow CLONE_VFORK without CLONE_VM, so with
  this patch it would be trivial to make the exploit which hides a
  memory hog from oom-killer.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/sched.h | 16 ++++++++++++++++
 mm/oom_kill.c         |  2 +-
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ec636400669f..e1877f4da4cc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1883,6 +1883,22 @@ extern int arch_task_struct_size __read_mostly;
 #define TNF_FAULT_LOCAL	0x08
 #define TNF_MIGRATE_FAIL 0x10
 
+/* expects to be called with task_lock held */
+static inline bool in_vfork(struct task_struct *tsk)
+{
+	bool ret;
+
+	/*
+	 * need RCU to access ->real_parent if CLONE_VM was used along with
+	 * CLONE_PARENT
+	 */
+	rcu_read_lock();
+	ret = tsk->vfork_done && tsk->real_parent->mm == tsk->mm;
+	rcu_read_unlock();
+
+	return ret;
+}
+
 #ifdef CONFIG_NUMA_BALANCING
 extern void task_numa_fault(int last_node, int node, int pages, int flags);
 extern pid_t task_numa_group_id(struct task_struct *p);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index aa28315ac310..d1e24deb79b9 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -182,7 +182,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 	adj = (long)p->signal->oom_score_adj;
 	if (adj == OOM_SCORE_ADJ_MIN ||
 			test_bit(MMF_OOM_REAPED, &p->mm->flags) ||
-			p->vfork_done) {
+			in_vfork(p)) {
 		task_unlock(p);
 		return 0;
 	}
-- 
2.8.1


-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH 4/6] mm, oom: skip vforked tasks from being selected
@ 2016-05-31  7:42       ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-05-31  7:42 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On Mon 30-05-16 21:28:57, Oleg Nesterov wrote:
> On 05/30, Michal Hocko wrote:
> >
> > Make sure to not select vforked task as an oom victim by checking
> > vfork_done in oom_badness.
> 
> I agree, this look like a good change to me... But.
> 
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -176,11 +176,13 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
> >  
> >  	/*
> >  	 * Do not even consider tasks which are explicitly marked oom
> > -	 * unkillable or have been already oom reaped.
> > +	 * unkillable or have been already oom reaped or the are in
> > +	 * the middle of vfork
> >  	 */
> >  	adj = (long)p->signal->oom_score_adj;
> >  	if (adj == OOM_SCORE_ADJ_MIN ||
> > -			test_bit(MMF_OOM_REAPED, &p->mm->flags)) {
> > +			test_bit(MMF_OOM_REAPED, &p->mm->flags) ||
> > +			p->vfork_done) {
> 
> I don't think we can trust vfork_done != NULL.
> 
> copy_process() doesn't disallow CLONE_VFORK without CLONE_VM, so with this patch
> it would be trivial to make the exploit which hides a memory hog from oom-killer.

OK, I wasn't aware of this possibility. It sounds really weird because I
thought that the whole point of vfork is to prevent from MM copy
overhead for quick exec.

> So perhaps we need something like
> 
> 		bool in_vfork(p)
> 		{
> 			return	p->vfork_done &&
> 				p->real_parent->mm == mm;
> 
> 			
> 		}
> 
> task_lock() is not enough if CLONE_VM was used along with CLONE_PARENT... so this
> also needs rcu_read_lock() to access ->real_parent.
> 
> Or I am totally confused?

I cannot judge I am afraid. You are definitely much more familiar with
all these subtle details than me.

So what do you think about this follow up:
---

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 5/6] mm, oom: kill all tasks sharing the mm
  2016-05-30 18:18     ` Oleg Nesterov
@ 2016-05-31  7:43       ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-05-31  7:43 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On Mon 30-05-16 20:18:16, Oleg Nesterov wrote:
> On 05/30, Michal Hocko wrote:
> >
> > @@ -852,8 +852,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> >  			continue;
> >  		if (same_thread_group(p, victim))
> >  			continue;
> > -		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
> > -		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> > +		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p)) {
> >  			/*
> >  			 * We cannot use oom_reaper for the mm shared by this
> >  			 * process because it wouldn't get killed and so the
> > @@ -862,6 +861,11 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> >  			can_oom_reap = false;
> >  			continue;
> >  		}
> > +		if (p->signal->oom_score_adj == OOM_ADJUST_MIN)
> > +			pr_warn("%s pid=%d shares mm with oom disabled %s pid=%d. Seems like misconfiguration, killing anyway!"
> > +					" Report at linux-mm@kvack.org\n",
> > +					victim->comm, task_pid_nr(victim),
> > +					p->comm, task_pid_nr(p));
> 
> Oh, yes, I personally do agree ;)
> 
> perhaps the is_global_init() == T case needs a warning too? the previous changes
> take care about vfork() from /sbin/init, so the only reason we can see it true
> is that /sbin/init shares the memory with a memory hog... Nevermind, forget.

I have another two patches waiting for this to settle and one of them
adds a warning to that path.

> This is a bit off-topic, but perhaps we can also change the PF_KTHREAD check later.
> Of course we should not try to kill this kthread, but can_oom_reap can be true in
> this case. A kernel thread which does use_mm() should handle the errors correctly
> if (say) get_user() fails because we unmap the memory.

I was worried that the kernel thread would see a zero page so this could
lead to a data corruption.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 5/6] mm, oom: kill all tasks sharing the mm
@ 2016-05-31  7:43       ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-05-31  7:43 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On Mon 30-05-16 20:18:16, Oleg Nesterov wrote:
> On 05/30, Michal Hocko wrote:
> >
> > @@ -852,8 +852,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> >  			continue;
> >  		if (same_thread_group(p, victim))
> >  			continue;
> > -		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
> > -		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> > +		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p)) {
> >  			/*
> >  			 * We cannot use oom_reaper for the mm shared by this
> >  			 * process because it wouldn't get killed and so the
> > @@ -862,6 +861,11 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> >  			can_oom_reap = false;
> >  			continue;
> >  		}
> > +		if (p->signal->oom_score_adj == OOM_ADJUST_MIN)
> > +			pr_warn("%s pid=%d shares mm with oom disabled %s pid=%d. Seems like misconfiguration, killing anyway!"
> > +					" Report at linux-mm@kvack.org\n",
> > +					victim->comm, task_pid_nr(victim),
> > +					p->comm, task_pid_nr(p));
> 
> Oh, yes, I personally do agree ;)
> 
> perhaps the is_global_init() == T case needs a warning too? the previous changes
> take care about vfork() from /sbin/init, so the only reason we can see it true
> is that /sbin/init shares the memory with a memory hog... Nevermind, forget.

I have another two patches waiting for this to settle and one of them
adds a warning to that path.

> This is a bit off-topic, but perhaps we can also change the PF_KTHREAD check later.
> Of course we should not try to kill this kthread, but can_oom_reap can be true in
> this case. A kernel thread which does use_mm() should handle the errors correctly
> if (say) get_user() fails because we unmap the memory.

I was worried that the kernel thread would see a zero page so this could
lead to a data corruption.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 6/6] mm, oom: fortify task_will_free_mem
  2016-05-30 17:35     ` Oleg Nesterov
@ 2016-05-31  7:46       ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-05-31  7:46 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On Mon 30-05-16 19:35:05, Oleg Nesterov wrote:
> On 05/30, Michal Hocko wrote:
> >
> > task_will_free_mem is rather weak.
> 
> I was thinking about the similar change because I noticed that try_oom_reaper()
> is very, very wrong.
> 
> To the point I think that we need another change for stable which simply removes
> spin_lock_irq(sighand->siglock) from try_oom_reaper(). It buys nothing, we can
> check signal_group_exit() (which is wrong too ;) lockless, and at the same time
> the kernel can crash because we can hit ->siglock == NULL.

OK, I have sent a separate patch
http://lkml.kernel.org/r/1464679423-30218-1-git-send-email-mhocko@kernel.org
and rebase the series on top. This would be 4.7 material. Thanks for
catching that!

> So I do think this change is good in general.
> 
> I think that task_will_free_mem() should be un-inlined, and __task_will_free_mem()
> should go into mm/oom-kill.c... but this is minor.

I was thinking about it as well but then thought that this would be
harder to review. But OK, I will do that.
 
> > -static inline bool task_will_free_mem(struct task_struct *task)
> > +static inline bool __task_will_free_mem(struct task_struct *task)
> >  {
> >  	struct signal_struct *sig = task->signal;
> >  
> > @@ -119,16 +119,69 @@ static inline bool task_will_free_mem(struct task_struct *task)
> >  	if (sig->flags & SIGNAL_GROUP_COREDUMP)
> >  		return false;
> >  
> > -	if (!(task->flags & PF_EXITING))
> > +	if (!(task->flags & PF_EXITING || fatal_signal_pending(task)))
> >  		return false;
> >  
> >  	/* Make sure that the whole thread group is going down */
> > -	if (!thread_group_empty(task) && !(sig->flags & SIGNAL_GROUP_EXIT))
> > +	if (!thread_group_empty(task) &&
> > +		!(sig->flags & SIGNAL_GROUP_EXIT || fatal_signal_pending(task)))
> >  		return false;
> >  
> >  	return true;
> >  }
> 
> Well, let me suggest this again. I think it should do
> 
> 
> 	if (SIGNAL_GROUP_COREDUMP)
> 		return false;
> 
> 	if (SIGNAL_GROUP_EXIT)
> 		return true;
> 
> 	if (thread_group_empty() && PF_EXITING)
> 		return true;
> 
> 	return false;
> 
> we do not need fatal_signal_pending(), in this case SIGNAL_GROUP_EXIT should
> be set (ignoring some bugs with sub-namespaces which we need to fix anyway).

OK, so we shouldn't care about race when the fatal_signal is set on the
task until it reaches do_group_exit?

> At the same time, we do not want to return false if PF_EXITING is not set
> if SIGNAL_GROUP_EXIT is set.

makes sense.

> > +static inline bool task_will_free_mem(struct task_struct *task)
> > +{
> > +	struct mm_struct *mm = NULL;
> > +	struct task_struct *p;
> > +	bool ret;
> > +
> > +	/*
> > +	 * If the process has passed exit_mm we have to skip it because
> > +	 * we have lost a link to other tasks sharing this mm, we do not
> > +	 * have anything to reap and the task might then get stuck waiting
> > +	 * for parent as zombie and we do not want it to hold TIF_MEMDIE
> > +	 */
> > +	p = find_lock_task_mm(task);
> > +	if (!p)
> > +		return false;
> > +
> > +	if (!__task_will_free_mem(p)) {
> > +		task_unlock(p);
> > +		return false;
> > +	}
> > +
> > +	mm = p->mm;
> > +	if (atomic_read(&mm->mm_users) <= 1) {
> 
> this is sub-optimal, we should probably take signal->live or ->nr_threads
> into account... but OK, we can do this later.

Yes I would prefer to add a more complex checks later. We want
mm_has_external_refs for other purposes as well.
 
> > +	rcu_read_lock();
> > +	for_each_process(p) {
> > +		ret = __task_will_free_mem(p);
> > +		if (!ret)
> > +			break;
> > +	}
> > +	rcu_read_unlock();
> 
> Yes, I agree very much.
> 
> But it seems you forgot to add the process_shares_mm() check into this loop?

Yes. Dunno where it got lost but it surely wasn't in the previous
version either. I definitely screwed somewhere...

> and perhaps it also makes sense to add
> 
> 	if (same_thread_group(tsk, p))
> 		continue;
> 
> This should not really matter, we know that __task_will_free_mem(p) should return
> true. Just to make it more clear.

ok

> And. I think this needs smp_rmb() at the end of the loop (assuming we have the
> process_shares_mm() check here). We need it to ensure that we read p->mm before
> we read next_task(), to avoid the race with exit() + clone(CLONE_VM).

Why don't we need the same barrier in oom_kill_process? Which barrier it
would pair with? Anyway I think this would deserve it's own patch.
Barriers are always tricky and it is better to have them in a small
patch with a full explanation.

Thanks for your review. It was really helpful!

The whole pile is currently in my k.org git tree in
attempts/process-share-mm-oom-sanitization branch if somebody wants to
see the full series.

My current diff on top of the patch
---
>From eb2755127e53f9f3cbc3cab757fb46bfb61c2a10 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Tue, 31 May 2016 07:33:06 +0200
Subject: [PATCH] fold me "mm, oom: fortify task_will_free_mem"

As per Oleg
- uninline task_will_free_mem
- reorganize checks and simplify __task_will_free_mem
- add missing process_shares_mm in task_will_free_mem
- add same_thread_group to task_will_free_mem for clarity

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/oom.h | 64 +++++------------------------------------------------
 mm/oom_kill.c       | 56 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 62 insertions(+), 58 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index c4cc0591d959..f3ac9d088645 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -119,69 +119,17 @@ static inline bool __task_will_free_mem(struct task_struct *task)
 	if (sig->flags & SIGNAL_GROUP_COREDUMP)
 		return false;
 
-	if (!(task->flags & PF_EXITING || fatal_signal_pending(task)))
-		return false;
-
-	/* Make sure that the whole thread group is going down */
-	if (!thread_group_empty(task) &&
-		!(sig->flags & SIGNAL_GROUP_EXIT || fatal_signal_pending(task)))
-		return false;
-
-	return true;
-}
-
-/*
- * Checks whether the given task is dying or exiting and likely to
- * release its address space. This means that all threads and processes
- * sharing the same mm have to be killed or exiting.
- */
-static inline bool task_will_free_mem(struct task_struct *task)
-{
-	struct mm_struct *mm = NULL;
-	struct task_struct *p;
-	bool ret;
-
-	/*
-	 * If the process has passed exit_mm we have to skip it because
-	 * we have lost a link to other tasks sharing this mm, we do not
-	 * have anything to reap and the task might then get stuck waiting
-	 * for parent as zombie and we do not want it to hold TIF_MEMDIE
-	 */
-	p = find_lock_task_mm(task);
-	if (!p)
-		return false;
-
-	if (!__task_will_free_mem(p)) {
-		task_unlock(p);
-		return false;
-	}
-
-	mm = p->mm;
-	if (atomic_read(&mm->mm_users) <= 1) {
-		task_unlock(p);
+	if (sig->flags & SIGNAL_GROUP_EXIT)
 		return true;
-	}
 
-	/* pin the mm to not get freed and reused */
-	atomic_inc(&mm->mm_count);
-	task_unlock(p);
+	if (thread_group_empty(task) && PF_EXITING)
+		return true;
 
-	/*
-	 * This is really pessimistic but we do not have any reliable way
-	 * to check that external processes share with our mm
-	 */
-	rcu_read_lock();
-	for_each_process(p) {
-		ret = __task_will_free_mem(p);
-		if (!ret)
-			break;
-	}
-	rcu_read_unlock();
-	mmdrop(mm);
-
-	return ret;
+	return false;
 }
 
+bool task_will_free_mem(struct task_struct *task);
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0b7c02869bc0..aa28315ac310 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -697,6 +697,62 @@ void oom_killer_enable(void)
 }
 
 /*
+ * Checks whether the given task is dying or exiting and likely to
+ * release its address space. This means that all threads and processes
+ * sharing the same mm have to be killed or exiting.
+ */
+bool task_will_free_mem(struct task_struct *task)
+{
+	struct mm_struct *mm = NULL;
+	struct task_struct *p;
+	bool ret;
+
+	/*
+	 * If the process has passed exit_mm we have to skip it because
+	 * we have lost a link to other tasks sharing this mm, we do not
+	 * have anything to reap and the task might then get stuck waiting
+	 * for parent as zombie and we do not want it to hold TIF_MEMDIE
+	 */
+	p = find_lock_task_mm(task);
+	if (!p)
+		return false;
+
+	if (!__task_will_free_mem(p)) {
+		task_unlock(p);
+		return false;
+	}
+
+	mm = p->mm;
+	if (atomic_read(&mm->mm_users) <= 1) {
+		task_unlock(p);
+		return true;
+	}
+
+	/* pin the mm to not get freed and reused */
+	atomic_inc(&mm->mm_count);
+	task_unlock(p);
+
+	/*
+	 * This is really pessimistic but we do not have any reliable way
+	 * to check that external processes share with our mm
+	 */
+	rcu_read_lock();
+	for_each_process(p) {
+		if (!process_shares_mm(p, mm))
+			continue;
+		if (same_thread_group(task, p))
+			continue;
+		ret = __task_will_free_mem(p);
+		if (!ret)
+			break;
+	}
+	rcu_read_unlock();
+	mmdrop(mm);
+
+	return ret;
+}
+
+/*
  * Must be called while holding a reference to p, which will be released upon
  * returning.
  */
-- 
2.8.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH 6/6] mm, oom: fortify task_will_free_mem
@ 2016-05-31  7:46       ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-05-31  7:46 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On Mon 30-05-16 19:35:05, Oleg Nesterov wrote:
> On 05/30, Michal Hocko wrote:
> >
> > task_will_free_mem is rather weak.
> 
> I was thinking about the similar change because I noticed that try_oom_reaper()
> is very, very wrong.
> 
> To the point I think that we need another change for stable which simply removes
> spin_lock_irq(sighand->siglock) from try_oom_reaper(). It buys nothing, we can
> check signal_group_exit() (which is wrong too ;) lockless, and at the same time
> the kernel can crash because we can hit ->siglock == NULL.

OK, I have sent a separate patch
http://lkml.kernel.org/r/1464679423-30218-1-git-send-email-mhocko@kernel.org
and rebase the series on top. This would be 4.7 material. Thanks for
catching that!

> So I do think this change is good in general.
> 
> I think that task_will_free_mem() should be un-inlined, and __task_will_free_mem()
> should go into mm/oom-kill.c... but this is minor.

I was thinking about it as well but then thought that this would be
harder to review. But OK, I will do that.
 
> > -static inline bool task_will_free_mem(struct task_struct *task)
> > +static inline bool __task_will_free_mem(struct task_struct *task)
> >  {
> >  	struct signal_struct *sig = task->signal;
> >  
> > @@ -119,16 +119,69 @@ static inline bool task_will_free_mem(struct task_struct *task)
> >  	if (sig->flags & SIGNAL_GROUP_COREDUMP)
> >  		return false;
> >  
> > -	if (!(task->flags & PF_EXITING))
> > +	if (!(task->flags & PF_EXITING || fatal_signal_pending(task)))
> >  		return false;
> >  
> >  	/* Make sure that the whole thread group is going down */
> > -	if (!thread_group_empty(task) && !(sig->flags & SIGNAL_GROUP_EXIT))
> > +	if (!thread_group_empty(task) &&
> > +		!(sig->flags & SIGNAL_GROUP_EXIT || fatal_signal_pending(task)))
> >  		return false;
> >  
> >  	return true;
> >  }
> 
> Well, let me suggest this again. I think it should do
> 
> 
> 	if (SIGNAL_GROUP_COREDUMP)
> 		return false;
> 
> 	if (SIGNAL_GROUP_EXIT)
> 		return true;
> 
> 	if (thread_group_empty() && PF_EXITING)
> 		return true;
> 
> 	return false;
> 
> we do not need fatal_signal_pending(), in this case SIGNAL_GROUP_EXIT should
> be set (ignoring some bugs with sub-namespaces which we need to fix anyway).

OK, so we shouldn't care about race when the fatal_signal is set on the
task until it reaches do_group_exit?

> At the same time, we do not want to return false if PF_EXITING is not set
> if SIGNAL_GROUP_EXIT is set.

makes sense.

> > +static inline bool task_will_free_mem(struct task_struct *task)
> > +{
> > +	struct mm_struct *mm = NULL;
> > +	struct task_struct *p;
> > +	bool ret;
> > +
> > +	/*
> > +	 * If the process has passed exit_mm we have to skip it because
> > +	 * we have lost a link to other tasks sharing this mm, we do not
> > +	 * have anything to reap and the task might then get stuck waiting
> > +	 * for parent as zombie and we do not want it to hold TIF_MEMDIE
> > +	 */
> > +	p = find_lock_task_mm(task);
> > +	if (!p)
> > +		return false;
> > +
> > +	if (!__task_will_free_mem(p)) {
> > +		task_unlock(p);
> > +		return false;
> > +	}
> > +
> > +	mm = p->mm;
> > +	if (atomic_read(&mm->mm_users) <= 1) {
> 
> this is sub-optimal, we should probably take signal->live or ->nr_threads
> into account... but OK, we can do this later.

Yes I would prefer to add a more complex checks later. We want
mm_has_external_refs for other purposes as well.
 
> > +	rcu_read_lock();
> > +	for_each_process(p) {
> > +		ret = __task_will_free_mem(p);
> > +		if (!ret)
> > +			break;
> > +	}
> > +	rcu_read_unlock();
> 
> Yes, I agree very much.
> 
> But it seems you forgot to add the process_shares_mm() check into this loop?

Yes. Dunno where it got lost but it surely wasn't in the previous
version either. I definitely screwed somewhere...

> and perhaps it also makes sense to add
> 
> 	if (same_thread_group(tsk, p))
> 		continue;
> 
> This should not really matter, we know that __task_will_free_mem(p) should return
> true. Just to make it more clear.

ok

> And. I think this needs smp_rmb() at the end of the loop (assuming we have the
> process_shares_mm() check here). We need it to ensure that we read p->mm before
> we read next_task(), to avoid the race with exit() + clone(CLONE_VM).

Why don't we need the same barrier in oom_kill_process? Which barrier it
would pair with? Anyway I think this would deserve it's own patch.
Barriers are always tricky and it is better to have them in a small
patch with a full explanation.

Thanks for your review. It was really helpful!

The whole pile is currently in my k.org git tree in
attempts/process-share-mm-oom-sanitization branch if somebody wants to
see the full series.

My current diff on top of the patch
---

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 6/6] mm, oom: fortify task_will_free_mem
  2016-05-30 13:05   ` Michal Hocko
  (?)
  (?)
@ 2016-05-31 15:03   ` Tetsuo Handa
  2016-05-31 15:10     ` Michal Hocko
  -1 siblings, 1 reply; 68+ messages in thread
From: Tetsuo Handa @ 2016-05-31 15:03 UTC (permalink / raw)
  To: mhocko, linux-mm; +Cc: rientjes, oleg, vdavydov, akpm, mhocko

Michal Hocko wrote:
> task_will_free_mem is rather weak. It doesn't really tell whether
> the task has chance to drop its mm. 98748bd72200 ("oom: consider
> multi-threaded tasks in task_will_free_mem") made a first step
> into making it more robust for multi-threaded applications so now we
> know that the whole process is going down and probably drop the mm.
> 
> This patch builds on top for more complex scenarios where mm is shared
> between different processes - CLONE_VM without CLONE_THREAD resp
> CLONE_SIGHAND, or in kernel use_mm().
> 
> Make sure that all processes sharing the mm are killed or exiting. This
> will allow us to replace try_oom_reaper by wake_oom_reaper. Therefore
> all paths which bypass the oom killer are now reapable and so they
> shouldn't lock up the oom killer.

Really? The can_oom_reap variable was not removed before this patch.
It means that oom_kill_process() might fail to call wake_oom_reaper()
while setting TIF_MEMDIE to one of threads using that mm_struct.
If use_mm() or global init keeps that mm_struct not OOM reapable, other
threads sharing that mm_struct will get task_will_free_mem() == false,
won't it?

How is it guaranteed that task_will_free_mem() == false && oom_victims > 0
shall not lock up the OOM killer?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 6/6] mm, oom: fortify task_will_free_mem
  2016-05-31 15:03   ` Tetsuo Handa
@ 2016-05-31 15:10     ` Michal Hocko
  2016-05-31 15:29       ` Tetsuo Handa
  0 siblings, 1 reply; 68+ messages in thread
From: Michal Hocko @ 2016-05-31 15:10 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm

On Wed 01-06-16 00:03:53, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > task_will_free_mem is rather weak. It doesn't really tell whether
> > the task has chance to drop its mm. 98748bd72200 ("oom: consider
> > multi-threaded tasks in task_will_free_mem") made a first step
> > into making it more robust for multi-threaded applications so now we
> > know that the whole process is going down and probably drop the mm.
> > 
> > This patch builds on top for more complex scenarios where mm is shared
> > between different processes - CLONE_VM without CLONE_THREAD resp
> > CLONE_SIGHAND, or in kernel use_mm().
> > 
> > Make sure that all processes sharing the mm are killed or exiting. This
> > will allow us to replace try_oom_reaper by wake_oom_reaper. Therefore
> > all paths which bypass the oom killer are now reapable and so they
> > shouldn't lock up the oom killer.
> 
> Really? The can_oom_reap variable was not removed before this patch.
> It means that oom_kill_process() might fail to call wake_oom_reaper()
> while setting TIF_MEMDIE to one of threads using that mm_struct.
> If use_mm() or global init keeps that mm_struct not OOM reapable, other
> threads sharing that mm_struct will get task_will_free_mem() == false,
> won't it?
> 
> How is it guaranteed that task_will_free_mem() == false && oom_victims > 0
> shall not lock up the OOM killer?

But this patch is talking about task_will_free_mem == true. Is the
description confusing? Should I reword the changelog?
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 6/6] mm, oom: fortify task_will_free_mem
  2016-05-31 15:10     ` Michal Hocko
@ 2016-05-31 15:29       ` Tetsuo Handa
  2016-06-01  7:25         ` Michal Hocko
  0 siblings, 1 reply; 68+ messages in thread
From: Tetsuo Handa @ 2016-05-31 15:29 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm

Michal Hocko wrote:
> On Wed 01-06-16 00:03:53, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > task_will_free_mem is rather weak. It doesn't really tell whether
> > > the task has chance to drop its mm. 98748bd72200 ("oom: consider
> > > multi-threaded tasks in task_will_free_mem") made a first step
> > > into making it more robust for multi-threaded applications so now we
> > > know that the whole process is going down and probably drop the mm.
> > > 
> > > This patch builds on top for more complex scenarios where mm is shared
> > > between different processes - CLONE_VM without CLONE_THREAD resp
> > > CLONE_SIGHAND, or in kernel use_mm().
> > > 
> > > Make sure that all processes sharing the mm are killed or exiting. This
> > > will allow us to replace try_oom_reaper by wake_oom_reaper. Therefore
> > > all paths which bypass the oom killer are now reapable and so they
> > > shouldn't lock up the oom killer.
> > 
> > Really? The can_oom_reap variable was not removed before this patch.
> > It means that oom_kill_process() might fail to call wake_oom_reaper()
> > while setting TIF_MEMDIE to one of threads using that mm_struct.
> > If use_mm() or global init keeps that mm_struct not OOM reapable, other
> > threads sharing that mm_struct will get task_will_free_mem() == false,
> > won't it?
> > 
> > How is it guaranteed that task_will_free_mem() == false && oom_victims > 0
> > shall not lock up the OOM killer?
> 
> But this patch is talking about task_will_free_mem == true. Is the
> description confusing? Should I reword the changelog?

The situation I'm talking about is

  (1) out_of_memory() is called.
  (2) select_bad_process() is called because task_will_free_mem(current) == false.
  (3) oom_kill_process() is called because select_bad_process() chose a victim.
  (4) oom_kill_process() sets TIF_MEMDIE on that victim.
  (5) oom_kill_process() fails to call wake_oom_reaper() because that victim's
      memory was shared by use_mm() or global init.
  (6) other !TIF_MEMDIE threads sharing that victim's memory call out_of_memory().
  (7) select_bad_process() is called because task_will_free_mem(current) == false.
  (8) oom_scan_process_thread() returns OOM_SCAN_ABORT because it finds TIF_MEMDIE
      set at (4).
  (9) other !TIF_MEMDIE threads sharing that victim's memory fail to get TIF_MEMDIE.
  (10) How other !TIF_MEMDIE threads sharing that victim's memory will release
       that memory?

I'm fine with task_will_free_mem(current) == true case. My question is that
"doesn't this patch break task_will_free_mem(current) == false case when there is
already TIF_MEMDIE thread" ?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 4/6] mm, oom: skip vforked tasks from being selected
  2016-05-31  7:42       ` Michal Hocko
@ 2016-05-31 21:43         ` Oleg Nesterov
  -1 siblings, 0 replies; 68+ messages in thread
From: Oleg Nesterov @ 2016-05-31 21:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On 05/31, Michal Hocko wrote:
>
> On Mon 30-05-16 21:28:57, Oleg Nesterov wrote:
> >
> > I don't think we can trust vfork_done != NULL.
> >
> > copy_process() doesn't disallow CLONE_VFORK without CLONE_VM, so with this patch
> > it would be trivial to make the exploit which hides a memory hog from oom-killer.
>
> OK, I wasn't aware of this possibility.

Neither was me ;) I noticed this during this review.

> > Or I am totally confused?
>
> I cannot judge I am afraid. You are definitely much more familiar with
> all these subtle details than me.

OK, I just verified that clone(CLONE_VFORK|SIGCHLD) really works to be sure.

> +/* expects to be called with task_lock held */
> +static inline bool in_vfork(struct task_struct *tsk)
> +{
> +	bool ret;
> +
> +	/*
> +	 * need RCU to access ->real_parent if CLONE_VM was used along with
> +	 * CLONE_PARENT
> +	 */
> +	rcu_read_lock();
> +	ret = tsk->vfork_done && tsk->real_parent->mm == tsk->mm;
> +	rcu_read_unlock();
> +
> +	return ret;
> +}

Yes, but may I ask to add a comment? And note that "expects to be called with
task_lock held" looks misleading, we do not need the "stable" tsk->vfork_done
since we only need to check if it is NULL or not.

It would be nice to explain that

	1. we check real_parent->mm == tsk->mm because CLONE_VFORK does not
	   imply CLONE_VM

	2. CLONE_VFORK can be used with CLONE_PARENT/CLONE_THREAD and thus
	   ->real_parent is not necessarily the task doing vfork(), so in
	   theory we can't rely on task_lock() if we want to dereference it.

	   And in this case we can't trust the real_parent->mm == tsk->mm
	   check, it can be false negative. But we do not care, if init or
	   another oom-unkillable task does this it should blame itself.

Oleg.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 4/6] mm, oom: skip vforked tasks from being selected
@ 2016-05-31 21:43         ` Oleg Nesterov
  0 siblings, 0 replies; 68+ messages in thread
From: Oleg Nesterov @ 2016-05-31 21:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On 05/31, Michal Hocko wrote:
>
> On Mon 30-05-16 21:28:57, Oleg Nesterov wrote:
> >
> > I don't think we can trust vfork_done != NULL.
> >
> > copy_process() doesn't disallow CLONE_VFORK without CLONE_VM, so with this patch
> > it would be trivial to make the exploit which hides a memory hog from oom-killer.
>
> OK, I wasn't aware of this possibility.

Neither was me ;) I noticed this during this review.

> > Or I am totally confused?
>
> I cannot judge I am afraid. You are definitely much more familiar with
> all these subtle details than me.

OK, I just verified that clone(CLONE_VFORK|SIGCHLD) really works to be sure.

> +/* expects to be called with task_lock held */
> +static inline bool in_vfork(struct task_struct *tsk)
> +{
> +	bool ret;
> +
> +	/*
> +	 * need RCU to access ->real_parent if CLONE_VM was used along with
> +	 * CLONE_PARENT
> +	 */
> +	rcu_read_lock();
> +	ret = tsk->vfork_done && tsk->real_parent->mm == tsk->mm;
> +	rcu_read_unlock();
> +
> +	return ret;
> +}

Yes, but may I ask to add a comment? And note that "expects to be called with
task_lock held" looks misleading, we do not need the "stable" tsk->vfork_done
since we only need to check if it is NULL or not.

It would be nice to explain that

	1. we check real_parent->mm == tsk->mm because CLONE_VFORK does not
	   imply CLONE_VM

	2. CLONE_VFORK can be used with CLONE_PARENT/CLONE_THREAD and thus
	   ->real_parent is not necessarily the task doing vfork(), so in
	   theory we can't rely on task_lock() if we want to dereference it.

	   And in this case we can't trust the real_parent->mm == tsk->mm
	   check, it can be false negative. But we do not care, if init or
	   another oom-unkillable task does this it should blame itself.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 5/6] mm, oom: kill all tasks sharing the mm
  2016-05-31  7:43       ` Michal Hocko
@ 2016-05-31 21:48         ` Oleg Nesterov
  -1 siblings, 0 replies; 68+ messages in thread
From: Oleg Nesterov @ 2016-05-31 21:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On 05/31, Michal Hocko wrote:
>
> On Mon 30-05-16 20:18:16, Oleg Nesterov wrote:
> >
> > perhaps the is_global_init() == T case needs a warning too? the previous changes
> > take care about vfork() from /sbin/init, so the only reason we can see it true
> > is that /sbin/init shares the memory with a memory hog... Nevermind, forget.
>
> I have another two patches waiting for this to settle and one of them
> adds a warning to that path.

Good,

> > This is a bit off-topic, but perhaps we can also change the PF_KTHREAD check later.
> > Of course we should not try to kill this kthread, but can_oom_reap can be true in
> > this case. A kernel thread which does use_mm() should handle the errors correctly
> > if (say) get_user() fails because we unmap the memory.
>
> I was worried that the kernel thread would see a zero page so this could
> lead to a data corruption.

We can't avoid this anyway. use_mm(victim->mm) can be called after we decide to kill
the victim.

So I think that we should always ignore kthreads, and in task_will_free_mem() too.

But let me repeat, I agree we should discuss this later, I am not trying to suggest
this change right now.

Oleg.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 5/6] mm, oom: kill all tasks sharing the mm
@ 2016-05-31 21:48         ` Oleg Nesterov
  0 siblings, 0 replies; 68+ messages in thread
From: Oleg Nesterov @ 2016-05-31 21:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On 05/31, Michal Hocko wrote:
>
> On Mon 30-05-16 20:18:16, Oleg Nesterov wrote:
> >
> > perhaps the is_global_init() == T case needs a warning too? the previous changes
> > take care about vfork() from /sbin/init, so the only reason we can see it true
> > is that /sbin/init shares the memory with a memory hog... Nevermind, forget.
>
> I have another two patches waiting for this to settle and one of them
> adds a warning to that path.

Good,

> > This is a bit off-topic, but perhaps we can also change the PF_KTHREAD check later.
> > Of course we should not try to kill this kthread, but can_oom_reap can be true in
> > this case. A kernel thread which does use_mm() should handle the errors correctly
> > if (say) get_user() fails because we unmap the memory.
>
> I was worried that the kernel thread would see a zero page so this could
> lead to a data corruption.

We can't avoid this anyway. use_mm(victim->mm) can be called after we decide to kill
the victim.

So I think that we should always ignore kthreads, and in task_will_free_mem() too.

But let me repeat, I agree we should discuss this later, I am not trying to suggest
this change right now.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 6/6] mm, oom: fortify task_will_free_mem
  2016-05-31  7:46       ` Michal Hocko
@ 2016-05-31 22:29         ` Oleg Nesterov
  -1 siblings, 0 replies; 68+ messages in thread
From: Oleg Nesterov @ 2016-05-31 22:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On 05/31, Michal Hocko wrote:
>
> On Mon 30-05-16 19:35:05, Oleg Nesterov wrote:
> >
> > Well, let me suggest this again. I think it should do
> >
> >
> > 	if (SIGNAL_GROUP_COREDUMP)
> > 		return false;
> >
> > 	if (SIGNAL_GROUP_EXIT)
> > 		return true;
> >
> > 	if (thread_group_empty() && PF_EXITING)
> > 		return true;
> >
> > 	return false;
> >
> > we do not need fatal_signal_pending(), in this case SIGNAL_GROUP_EXIT should
> > be set (ignoring some bugs with sub-namespaces which we need to fix anyway).
>
> OK, so we shouldn't care about race when the fatal_signal is set on the
> task until it reaches do_group_exit?

if fatal_signal() is true then (ignoring exec and coredump) SIGNAL_GROUP_EXIT
is already set (again, ignoring the bugs with sub-namespace inits).

At the same time, SIGKILL can be already dequeued when the task exits, so
fatal_signal_pending() can be "false negative".

> > And. I think this needs smp_rmb() at the end of the loop (assuming we have the
> > process_shares_mm() check here). We need it to ensure that we read p->mm before
> > we read next_task(), to avoid the race with exit() + clone(CLONE_VM).
>
> Why don't we need the same barrier in oom_kill_process?

Because it calls do_send_sig_info() which takes ->siglock and copy_process()
takes the same lock. Not a barrier, but acts the same way.

> Which barrier it
> would pair with?

With the barrier implied by list_add_tail_rcu(&p->tasks, &init_task.tasks).

> Anyway I think this would deserve it's own patch.
> Barriers are always tricky and it is better to have them in a small
> patch with a full explanation.

OK, agreed.


I am not sure I can read the new patch correctly, it depends on the previous
changes... but afaics it looks good.

Cosmetic/subjective nit, feel free to ignore,

> +bool task_will_free_mem(struct task_struct *task)
> +{
> +	struct mm_struct *mm = NULL;

unnecessary initialization ;)

> +	struct task_struct *p;
> +	bool ret;
> +
> +	/*
> +	 * If the process has passed exit_mm we have to skip it because
> +	 * we have lost a link to other tasks sharing this mm, we do not
> +	 * have anything to reap and the task might then get stuck waiting
> +	 * for parent as zombie and we do not want it to hold TIF_MEMDIE
> +	 */
> +	p = find_lock_task_mm(task);
> +	if (!p)
> +		return false;
> +
> +	if (!__task_will_free_mem(p)) {
> +		task_unlock(p);
> +		return false;
> +	}

We can call the 1st __task_will_free_mem(p) before find_lock_task_mm(). In the
likely case (I think) it should return false.

And since __task_will_free_mem() has no other callers perhaps it should go into
oom_kill.c too.

Oleg.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 6/6] mm, oom: fortify task_will_free_mem
@ 2016-05-31 22:29         ` Oleg Nesterov
  0 siblings, 0 replies; 68+ messages in thread
From: Oleg Nesterov @ 2016-05-31 22:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On 05/31, Michal Hocko wrote:
>
> On Mon 30-05-16 19:35:05, Oleg Nesterov wrote:
> >
> > Well, let me suggest this again. I think it should do
> >
> >
> > 	if (SIGNAL_GROUP_COREDUMP)
> > 		return false;
> >
> > 	if (SIGNAL_GROUP_EXIT)
> > 		return true;
> >
> > 	if (thread_group_empty() && PF_EXITING)
> > 		return true;
> >
> > 	return false;
> >
> > we do not need fatal_signal_pending(), in this case SIGNAL_GROUP_EXIT should
> > be set (ignoring some bugs with sub-namespaces which we need to fix anyway).
>
> OK, so we shouldn't care about race when the fatal_signal is set on the
> task until it reaches do_group_exit?

if fatal_signal() is true then (ignoring exec and coredump) SIGNAL_GROUP_EXIT
is already set (again, ignoring the bugs with sub-namespace inits).

At the same time, SIGKILL can be already dequeued when the task exits, so
fatal_signal_pending() can be "false negative".

> > And. I think this needs smp_rmb() at the end of the loop (assuming we have the
> > process_shares_mm() check here). We need it to ensure that we read p->mm before
> > we read next_task(), to avoid the race with exit() + clone(CLONE_VM).
>
> Why don't we need the same barrier in oom_kill_process?

Because it calls do_send_sig_info() which takes ->siglock and copy_process()
takes the same lock. Not a barrier, but acts the same way.

> Which barrier it
> would pair with?

With the barrier implied by list_add_tail_rcu(&p->tasks, &init_task.tasks).

> Anyway I think this would deserve it's own patch.
> Barriers are always tricky and it is better to have them in a small
> patch with a full explanation.

OK, agreed.


I am not sure I can read the new patch correctly, it depends on the previous
changes... but afaics it looks good.

Cosmetic/subjective nit, feel free to ignore,

> +bool task_will_free_mem(struct task_struct *task)
> +{
> +	struct mm_struct *mm = NULL;

unnecessary initialization ;)

> +	struct task_struct *p;
> +	bool ret;
> +
> +	/*
> +	 * If the process has passed exit_mm we have to skip it because
> +	 * we have lost a link to other tasks sharing this mm, we do not
> +	 * have anything to reap and the task might then get stuck waiting
> +	 * for parent as zombie and we do not want it to hold TIF_MEMDIE
> +	 */
> +	p = find_lock_task_mm(task);
> +	if (!p)
> +		return false;
> +
> +	if (!__task_will_free_mem(p)) {
> +		task_unlock(p);
> +		return false;
> +	}

We can call the 1st __task_will_free_mem(p) before find_lock_task_mm(). In the
likely case (I think) it should return false.

And since __task_will_free_mem() has no other callers perhaps it should go into
oom_kill.c too.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/6] proc, oom: drop bogus task_lock and mm check
  2016-05-31  7:32       ` Michal Hocko
@ 2016-05-31 22:53         ` Oleg Nesterov
  -1 siblings, 0 replies; 68+ messages in thread
From: Oleg Nesterov @ 2016-05-31 22:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On 05/31, Michal Hocko wrote:
>
> Oleg has pointed out that can simplify both oom_adj_write and
> oom_score_adj_write even further and drop the sighand lock. The only
> purpose of the lock was to protect p->signal from going away but this
> will not happen since ea6d290ca34c ("signals: make task_struct->signal
> immutable/refcountable").

Sorry for confusion, I meant oom_adj_read() and oom_score_adj_read().

As for oom_adj_write/oom_score_adj_write we can remove it too, but then
we need to ensure (say, using cmpxchg) that unpriviliged user can not
not decrease signal->oom_score_adj_min if its oom_score_adj_write()
races with someone else (say, admin) which tries to increase the same
oom_score_adj_min.

If you think this is not a problem - I am fine with this change. But
please also update oom_adj_read/oom_score_adj_read ;)

Oleg.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/6] proc, oom: drop bogus task_lock and mm check
@ 2016-05-31 22:53         ` Oleg Nesterov
  0 siblings, 0 replies; 68+ messages in thread
From: Oleg Nesterov @ 2016-05-31 22:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On 05/31, Michal Hocko wrote:
>
> Oleg has pointed out that can simplify both oom_adj_write and
> oom_score_adj_write even further and drop the sighand lock. The only
> purpose of the lock was to protect p->signal from going away but this
> will not happen since ea6d290ca34c ("signals: make task_struct->signal
> immutable/refcountable").

Sorry for confusion, I meant oom_adj_read() and oom_score_adj_read().

As for oom_adj_write/oom_score_adj_write we can remove it too, but then
we need to ensure (say, using cmpxchg) that unpriviliged user can not
not decrease signal->oom_score_adj_min if its oom_score_adj_write()
races with someone else (say, admin) which tries to increase the same
oom_score_adj_min.

If you think this is not a problem - I am fine with this change. But
please also update oom_adj_read/oom_score_adj_read ;)

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/6] proc, oom: drop bogus task_lock and mm check
  2016-05-31 22:53         ` Oleg Nesterov
@ 2016-06-01  6:53           ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-06-01  6:53 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On Wed 01-06-16 00:53:03, Oleg Nesterov wrote:
> On 05/31, Michal Hocko wrote:
> >
> > Oleg has pointed out that can simplify both oom_adj_write and
> > oom_score_adj_write even further and drop the sighand lock. The only
> > purpose of the lock was to protect p->signal from going away but this
> > will not happen since ea6d290ca34c ("signals: make task_struct->signal
> > immutable/refcountable").
> 
> Sorry for confusion, I meant oom_adj_read() and oom_score_adj_read().
> 
> As for oom_adj_write/oom_score_adj_write we can remove it too, but then
> we need to ensure (say, using cmpxchg) that unpriviliged user can not
> not decrease signal->oom_score_adj_min if its oom_score_adj_write()
> races with someone else (say, admin) which tries to increase the same
> oom_score_adj_min.

I am introducing oom_adj_mutex in a later patch so I will move it here.

> If you think this is not a problem - I am fine with this change. But
> please also update oom_adj_read/oom_score_adj_read ;)

will do. It stayed in the blind spot... Thanks for pointing that out

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/6] proc, oom: drop bogus task_lock and mm check
@ 2016-06-01  6:53           ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-06-01  6:53 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On Wed 01-06-16 00:53:03, Oleg Nesterov wrote:
> On 05/31, Michal Hocko wrote:
> >
> > Oleg has pointed out that can simplify both oom_adj_write and
> > oom_score_adj_write even further and drop the sighand lock. The only
> > purpose of the lock was to protect p->signal from going away but this
> > will not happen since ea6d290ca34c ("signals: make task_struct->signal
> > immutable/refcountable").
> 
> Sorry for confusion, I meant oom_adj_read() and oom_score_adj_read().
> 
> As for oom_adj_write/oom_score_adj_write we can remove it too, but then
> we need to ensure (say, using cmpxchg) that unpriviliged user can not
> not decrease signal->oom_score_adj_min if its oom_score_adj_write()
> races with someone else (say, admin) which tries to increase the same
> oom_score_adj_min.

I am introducing oom_adj_mutex in a later patch so I will move it here.

> If you think this is not a problem - I am fine with this change. But
> please also update oom_adj_read/oom_score_adj_read ;)

will do. It stayed in the blind spot... Thanks for pointing that out

Thanks!
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 6/6] mm, oom: fortify task_will_free_mem
  2016-05-31 22:29         ` Oleg Nesterov
@ 2016-06-01  7:03           ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-06-01  7:03 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On Wed 01-06-16 00:29:33, Oleg Nesterov wrote:
> On 05/31, Michal Hocko wrote:
> >
> > On Mon 30-05-16 19:35:05, Oleg Nesterov wrote:
> > >
> > > Well, let me suggest this again. I think it should do
> > >
> > >
> > > 	if (SIGNAL_GROUP_COREDUMP)
> > > 		return false;
> > >
> > > 	if (SIGNAL_GROUP_EXIT)
> > > 		return true;
> > >
> > > 	if (thread_group_empty() && PF_EXITING)
> > > 		return true;
> > >
> > > 	return false;
> > >
> > > we do not need fatal_signal_pending(), in this case SIGNAL_GROUP_EXIT should
> > > be set (ignoring some bugs with sub-namespaces which we need to fix anyway).
> >
> > OK, so we shouldn't care about race when the fatal_signal is set on the
> > task until it reaches do_group_exit?
> 
> if fatal_signal() is true then (ignoring exec and coredump) SIGNAL_GROUP_EXIT
> is already set (again, ignoring the bugs with sub-namespace inits).
> 
> At the same time, SIGKILL can be already dequeued when the task exits, so
> fatal_signal_pending() can be "false negative".

Thanks for the clarification. I guess I got the point but this is a land
of surprises so one can never be sure...

> > > And. I think this needs smp_rmb() at the end of the loop (assuming we have the
> > > process_shares_mm() check here). We need it to ensure that we read p->mm before
> > > we read next_task(), to avoid the race with exit() + clone(CLONE_VM).
> >
> > Why don't we need the same barrier in oom_kill_process?
> 
> Because it calls do_send_sig_info() which takes ->siglock and copy_process()
> takes the same lock. Not a barrier, but acts the same way.

Ahh ok, so an implicit barrier.

> > Which barrier it
> > would pair with?
> 
> With the barrier implied by list_add_tail_rcu(&p->tasks, &init_task.tasks).

Ahh I see. rcu_assign_pointer that is, right?

> > Anyway I think this would deserve it's own patch.
> > Barriers are always tricky and it is better to have them in a small
> > patch with a full explanation.
> 
> OK, agreed.

cool

> I am not sure I can read the new patch correctly, it depends on the previous
> changes... but afaics it looks good.
> 
> Cosmetic/subjective nit, feel free to ignore,
> 
> > +bool task_will_free_mem(struct task_struct *task)
> > +{
> > +	struct mm_struct *mm = NULL;
> 
> unnecessary initialization ;)

fixed

> > +	struct task_struct *p;
> > +	bool ret;
> > +
> > +	/*
> > +	 * If the process has passed exit_mm we have to skip it because
> > +	 * we have lost a link to other tasks sharing this mm, we do not
> > +	 * have anything to reap and the task might then get stuck waiting
> > +	 * for parent as zombie and we do not want it to hold TIF_MEMDIE
> > +	 */
> > +	p = find_lock_task_mm(task);
> > +	if (!p)
> > +		return false;
> > +
> > +	if (!__task_will_free_mem(p)) {
> > +		task_unlock(p);
> > +		return false;
> > +	}
> 
> We can call the 1st __task_will_free_mem(p) before find_lock_task_mm(). In the
> likely case (I think) it should return false.

OK

> 
> And since __task_will_free_mem() has no other callers perhaps it should go into
> oom_kill.c too.

ok

I will resend the whole series with the fixups later during this week.
Thanks again for your review.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 6/6] mm, oom: fortify task_will_free_mem
@ 2016-06-01  7:03           ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-06-01  7:03 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On Wed 01-06-16 00:29:33, Oleg Nesterov wrote:
> On 05/31, Michal Hocko wrote:
> >
> > On Mon 30-05-16 19:35:05, Oleg Nesterov wrote:
> > >
> > > Well, let me suggest this again. I think it should do
> > >
> > >
> > > 	if (SIGNAL_GROUP_COREDUMP)
> > > 		return false;
> > >
> > > 	if (SIGNAL_GROUP_EXIT)
> > > 		return true;
> > >
> > > 	if (thread_group_empty() && PF_EXITING)
> > > 		return true;
> > >
> > > 	return false;
> > >
> > > we do not need fatal_signal_pending(), in this case SIGNAL_GROUP_EXIT should
> > > be set (ignoring some bugs with sub-namespaces which we need to fix anyway).
> >
> > OK, so we shouldn't care about race when the fatal_signal is set on the
> > task until it reaches do_group_exit?
> 
> if fatal_signal() is true then (ignoring exec and coredump) SIGNAL_GROUP_EXIT
> is already set (again, ignoring the bugs with sub-namespace inits).
> 
> At the same time, SIGKILL can be already dequeued when the task exits, so
> fatal_signal_pending() can be "false negative".

Thanks for the clarification. I guess I got the point but this is a land
of surprises so one can never be sure...

> > > And. I think this needs smp_rmb() at the end of the loop (assuming we have the
> > > process_shares_mm() check here). We need it to ensure that we read p->mm before
> > > we read next_task(), to avoid the race with exit() + clone(CLONE_VM).
> >
> > Why don't we need the same barrier in oom_kill_process?
> 
> Because it calls do_send_sig_info() which takes ->siglock and copy_process()
> takes the same lock. Not a barrier, but acts the same way.

Ahh ok, so an implicit barrier.

> > Which barrier it
> > would pair with?
> 
> With the barrier implied by list_add_tail_rcu(&p->tasks, &init_task.tasks).

Ahh I see. rcu_assign_pointer that is, right?

> > Anyway I think this would deserve it's own patch.
> > Barriers are always tricky and it is better to have them in a small
> > patch with a full explanation.
> 
> OK, agreed.

cool

> I am not sure I can read the new patch correctly, it depends on the previous
> changes... but afaics it looks good.
> 
> Cosmetic/subjective nit, feel free to ignore,
> 
> > +bool task_will_free_mem(struct task_struct *task)
> > +{
> > +	struct mm_struct *mm = NULL;
> 
> unnecessary initialization ;)

fixed

> > +	struct task_struct *p;
> > +	bool ret;
> > +
> > +	/*
> > +	 * If the process has passed exit_mm we have to skip it because
> > +	 * we have lost a link to other tasks sharing this mm, we do not
> > +	 * have anything to reap and the task might then get stuck waiting
> > +	 * for parent as zombie and we do not want it to hold TIF_MEMDIE
> > +	 */
> > +	p = find_lock_task_mm(task);
> > +	if (!p)
> > +		return false;
> > +
> > +	if (!__task_will_free_mem(p)) {
> > +		task_unlock(p);
> > +		return false;
> > +	}
> 
> We can call the 1st __task_will_free_mem(p) before find_lock_task_mm(). In the
> likely case (I think) it should return false.

OK

> 
> And since __task_will_free_mem() has no other callers perhaps it should go into
> oom_kill.c too.

ok

I will resend the whole series with the fixups later during this week.
Thanks again for your review.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 4/6] mm, oom: skip vforked tasks from being selected
  2016-05-31 21:43         ` Oleg Nesterov
@ 2016-06-01  7:09           ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-06-01  7:09 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On Tue 31-05-16 23:43:38, Oleg Nesterov wrote:
> On 05/31, Michal Hocko wrote:
> >
> > On Mon 30-05-16 21:28:57, Oleg Nesterov wrote:
> > >
> > > I don't think we can trust vfork_done != NULL.
> > >
> > > copy_process() doesn't disallow CLONE_VFORK without CLONE_VM, so with this patch
> > > it would be trivial to make the exploit which hides a memory hog from oom-killer.
> >
> > OK, I wasn't aware of this possibility.
> 
> Neither was me ;) I noticed this during this review.

Heh, as I've said in other email, this is a land of dragons^Wsurprises.
 
> > > Or I am totally confused?
> >
> > I cannot judge I am afraid. You are definitely much more familiar with
> > all these subtle details than me.
> 
> OK, I just verified that clone(CLONE_VFORK|SIGCHLD) really works to be sure.

great, thanks

> > +/* expects to be called with task_lock held */
> > +static inline bool in_vfork(struct task_struct *tsk)
> > +{
> > +	bool ret;
> > +
> > +	/*
> > +	 * need RCU to access ->real_parent if CLONE_VM was used along with
> > +	 * CLONE_PARENT
> > +	 */
> > +	rcu_read_lock();
> > +	ret = tsk->vfork_done && tsk->real_parent->mm == tsk->mm;
> > +	rcu_read_unlock();
> > +
> > +	return ret;
> > +}
> 
> Yes, but may I ask to add a comment? And note that "expects to be called with
> task_lock held" looks misleading, we do not need the "stable" tsk->vfork_done
> since we only need to check if it is NULL or not.

OK, I thought it was needed for the stability but as you explain below
this is not really true...

> It would be nice to explain that
> 
> 	1. we check real_parent->mm == tsk->mm because CLONE_VFORK does not
> 	   imply CLONE_VM
> 
> 	2. CLONE_VFORK can be used with CLONE_PARENT/CLONE_THREAD and thus
> 	   ->real_parent is not necessarily the task doing vfork(), so in
> 	   theory we can't rely on task_lock() if we want to dereference it.
> 
> 	   And in this case we can't trust the real_parent->mm == tsk->mm
> 	   check, it can be false negative. But we do not care, if init or
> 	   another oom-unkillable task does this it should blame itself.

I've stolen this explanation and put it right there.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 4/6] mm, oom: skip vforked tasks from being selected
@ 2016-06-01  7:09           ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-06-01  7:09 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On Tue 31-05-16 23:43:38, Oleg Nesterov wrote:
> On 05/31, Michal Hocko wrote:
> >
> > On Mon 30-05-16 21:28:57, Oleg Nesterov wrote:
> > >
> > > I don't think we can trust vfork_done != NULL.
> > >
> > > copy_process() doesn't disallow CLONE_VFORK without CLONE_VM, so with this patch
> > > it would be trivial to make the exploit which hides a memory hog from oom-killer.
> >
> > OK, I wasn't aware of this possibility.
> 
> Neither was me ;) I noticed this during this review.

Heh, as I've said in other email, this is a land of dragons^Wsurprises.
 
> > > Or I am totally confused?
> >
> > I cannot judge I am afraid. You are definitely much more familiar with
> > all these subtle details than me.
> 
> OK, I just verified that clone(CLONE_VFORK|SIGCHLD) really works to be sure.

great, thanks

> > +/* expects to be called with task_lock held */
> > +static inline bool in_vfork(struct task_struct *tsk)
> > +{
> > +	bool ret;
> > +
> > +	/*
> > +	 * need RCU to access ->real_parent if CLONE_VM was used along with
> > +	 * CLONE_PARENT
> > +	 */
> > +	rcu_read_lock();
> > +	ret = tsk->vfork_done && tsk->real_parent->mm == tsk->mm;
> > +	rcu_read_unlock();
> > +
> > +	return ret;
> > +}
> 
> Yes, but may I ask to add a comment? And note that "expects to be called with
> task_lock held" looks misleading, we do not need the "stable" tsk->vfork_done
> since we only need to check if it is NULL or not.

OK, I thought it was needed for the stability but as you explain below
this is not really true...

> It would be nice to explain that
> 
> 	1. we check real_parent->mm == tsk->mm because CLONE_VFORK does not
> 	   imply CLONE_VM
> 
> 	2. CLONE_VFORK can be used with CLONE_PARENT/CLONE_THREAD and thus
> 	   ->real_parent is not necessarily the task doing vfork(), so in
> 	   theory we can't rely on task_lock() if we want to dereference it.
> 
> 	   And in this case we can't trust the real_parent->mm == tsk->mm
> 	   check, it can be false negative. But we do not care, if init or
> 	   another oom-unkillable task does this it should blame itself.

I've stolen this explanation and put it right there.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 6/6] mm, oom: fortify task_will_free_mem
  2016-05-31 15:29       ` Tetsuo Handa
@ 2016-06-01  7:25         ` Michal Hocko
  2016-06-01 12:04           ` Tetsuo Handa
  0 siblings, 1 reply; 68+ messages in thread
From: Michal Hocko @ 2016-06-01  7:25 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm

On Wed 01-06-16 00:29:45, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 01-06-16 00:03:53, Tetsuo Handa wrote:
[...]
> > > How is it guaranteed that task_will_free_mem() == false && oom_victims > 0
> > > shall not lock up the OOM killer?
> > 
> > But this patch is talking about task_will_free_mem == true. Is the
> > description confusing? Should I reword the changelog?
> 
> The situation I'm talking about is
> 
>   (1) out_of_memory() is called.
>   (2) select_bad_process() is called because task_will_free_mem(current) == false.
>   (3) oom_kill_process() is called because select_bad_process() chose a victim.
>   (4) oom_kill_process() sets TIF_MEMDIE on that victim.
>   (5) oom_kill_process() fails to call wake_oom_reaper() because that victim's
>       memory was shared by use_mm() or global init.
>   (6) other !TIF_MEMDIE threads sharing that victim's memory call out_of_memory().
>   (7) select_bad_process() is called because task_will_free_mem(current) == false.
>   (8) oom_scan_process_thread() returns OOM_SCAN_ABORT because it finds TIF_MEMDIE
>       set at (4).
>   (9) other !TIF_MEMDIE threads sharing that victim's memory fail to get TIF_MEMDIE.
>   (10) How other !TIF_MEMDIE threads sharing that victim's memory will release
>        that memory?
> 
> I'm fine with task_will_free_mem(current) == true case. My question is that
> "doesn't this patch break task_will_free_mem(current) == false case when there is
> already TIF_MEMDIE thread" ?

OK, I see your point now. This is certainly possible, albeit unlikely. I
think calling this a regression would be a bit an overstatement. We are
basically replacing one unreliable heuristic by another one which is
more likely to lead to a deterministic behavior.

If you are worried about locking up the oom killer I have another 2
patches on top of this series which should deal with that (one of them
was already posted [1] and another one was drafted in [2]. Both of them
on top of this series should remove the concern of the lockup. I just
wait to post them until this thread settles down.

[1] http://lkml.kernel.org/r/1464276476-25136-1-git-send-email-mhocko@kernel.org
[2] http://lkml.kernel.org/r/20160527133502.GN27686@dhcp22.suse.cz
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/6] proc, oom: drop bogus task_lock and mm check
  2016-06-01  6:53           ` Michal Hocko
@ 2016-06-01 10:41             ` Tetsuo Handa
  -1 siblings, 0 replies; 68+ messages in thread
From: Tetsuo Handa @ 2016-06-01 10:41 UTC (permalink / raw)
  To: mhocko, oleg; +Cc: linux-mm, rientjes, vdavydov, akpm, linux-kernel

Michal Hocko wrote:
> On Wed 01-06-16 00:53:03, Oleg Nesterov wrote:
> > On 05/31, Michal Hocko wrote:
> > >
> > > Oleg has pointed out that can simplify both oom_adj_write and
> > > oom_score_adj_write even further and drop the sighand lock. The only
> > > purpose of the lock was to protect p->signal from going away but this
> > > will not happen since ea6d290ca34c ("signals: make task_struct->signal
> > > immutable/refcountable").
> > 
> > Sorry for confusion, I meant oom_adj_read() and oom_score_adj_read().
> > 
> > As for oom_adj_write/oom_score_adj_write we can remove it too, but then
> > we need to ensure (say, using cmpxchg) that unpriviliged user can not
> > not decrease signal->oom_score_adj_min if its oom_score_adj_write()
> > races with someone else (say, admin) which tries to increase the same
> > oom_score_adj_min.
> 
> I am introducing oom_adj_mutex in a later patch so I will move it here.

Can't we reuse oom_lock like

	if (mutex_lock_killable(&oom_lock))
		return -EINTR;

? I think that updating oom_score_adj unlikely races with OOM killer
invocation, and updating oom_score_adj should be a killable operation.

> 
> > If you think this is not a problem - I am fine with this change. But
> > please also update oom_adj_read/oom_score_adj_read ;)
> 
> will do. It stayed in the blind spot... Thanks for pointing that out
> 
> Thanks!
> -- 
> Michal Hocko
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/6] proc, oom: drop bogus task_lock and mm check
@ 2016-06-01 10:41             ` Tetsuo Handa
  0 siblings, 0 replies; 68+ messages in thread
From: Tetsuo Handa @ 2016-06-01 10:41 UTC (permalink / raw)
  To: mhocko, oleg; +Cc: linux-mm, rientjes, vdavydov, akpm, linux-kernel

Michal Hocko wrote:
> On Wed 01-06-16 00:53:03, Oleg Nesterov wrote:
> > On 05/31, Michal Hocko wrote:
> > >
> > > Oleg has pointed out that can simplify both oom_adj_write and
> > > oom_score_adj_write even further and drop the sighand lock. The only
> > > purpose of the lock was to protect p->signal from going away but this
> > > will not happen since ea6d290ca34c ("signals: make task_struct->signal
> > > immutable/refcountable").
> > 
> > Sorry for confusion, I meant oom_adj_read() and oom_score_adj_read().
> > 
> > As for oom_adj_write/oom_score_adj_write we can remove it too, but then
> > we need to ensure (say, using cmpxchg) that unpriviliged user can not
> > not decrease signal->oom_score_adj_min if its oom_score_adj_write()
> > races with someone else (say, admin) which tries to increase the same
> > oom_score_adj_min.
> 
> I am introducing oom_adj_mutex in a later patch so I will move it here.

Can't we reuse oom_lock like

	if (mutex_lock_killable(&oom_lock))
		return -EINTR;

? I think that updating oom_score_adj unlikely races with OOM killer
invocation, and updating oom_score_adj should be a killable operation.

> 
> > If you think this is not a problem - I am fine with this change. But
> > please also update oom_adj_read/oom_score_adj_read ;)
> 
> will do. It stayed in the blind spot... Thanks for pointing that out
> 
> Thanks!
> -- 
> Michal Hocko
> SUSE Labs
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/6] proc, oom: drop bogus task_lock and mm check
  2016-06-01 10:41             ` Tetsuo Handa
@ 2016-06-01 10:48               ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-06-01 10:48 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: oleg, linux-mm, rientjes, vdavydov, akpm, linux-kernel

On Wed 01-06-16 19:41:09, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 01-06-16 00:53:03, Oleg Nesterov wrote:
> > > On 05/31, Michal Hocko wrote:
> > > >
> > > > Oleg has pointed out that can simplify both oom_adj_write and
> > > > oom_score_adj_write even further and drop the sighand lock. The only
> > > > purpose of the lock was to protect p->signal from going away but this
> > > > will not happen since ea6d290ca34c ("signals: make task_struct->signal
> > > > immutable/refcountable").
> > > 
> > > Sorry for confusion, I meant oom_adj_read() and oom_score_adj_read().
> > > 
> > > As for oom_adj_write/oom_score_adj_write we can remove it too, but then
> > > we need to ensure (say, using cmpxchg) that unpriviliged user can not
> > > not decrease signal->oom_score_adj_min if its oom_score_adj_write()
> > > races with someone else (say, admin) which tries to increase the same
> > > oom_score_adj_min.
> > 
> > I am introducing oom_adj_mutex in a later patch so I will move it here.
> 
> Can't we reuse oom_lock like
> 
> 	if (mutex_lock_killable(&oom_lock))
> 		return -EINTR;
> 
> ? I think that updating oom_score_adj unlikely races with OOM killer
> invocation, and updating oom_score_adj should be a killable operation.

We could but what would be an advantage? Do we really need a full oom
exclusion?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/6] proc, oom: drop bogus task_lock and mm check
@ 2016-06-01 10:48               ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-06-01 10:48 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: oleg, linux-mm, rientjes, vdavydov, akpm, linux-kernel

On Wed 01-06-16 19:41:09, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 01-06-16 00:53:03, Oleg Nesterov wrote:
> > > On 05/31, Michal Hocko wrote:
> > > >
> > > > Oleg has pointed out that can simplify both oom_adj_write and
> > > > oom_score_adj_write even further and drop the sighand lock. The only
> > > > purpose of the lock was to protect p->signal from going away but this
> > > > will not happen since ea6d290ca34c ("signals: make task_struct->signal
> > > > immutable/refcountable").
> > > 
> > > Sorry for confusion, I meant oom_adj_read() and oom_score_adj_read().
> > > 
> > > As for oom_adj_write/oom_score_adj_write we can remove it too, but then
> > > we need to ensure (say, using cmpxchg) that unpriviliged user can not
> > > not decrease signal->oom_score_adj_min if its oom_score_adj_write()
> > > races with someone else (say, admin) which tries to increase the same
> > > oom_score_adj_min.
> > 
> > I am introducing oom_adj_mutex in a later patch so I will move it here.
> 
> Can't we reuse oom_lock like
> 
> 	if (mutex_lock_killable(&oom_lock))
> 		return -EINTR;
> 
> ? I think that updating oom_score_adj unlikely races with OOM killer
> invocation, and updating oom_score_adj should be a killable operation.

We could but what would be an advantage? Do we really need a full oom
exclusion?

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 6/6] mm, oom: fortify task_will_free_mem
  2016-06-01  7:25         ` Michal Hocko
@ 2016-06-01 12:04           ` Tetsuo Handa
  2016-06-01 12:43             ` Michal Hocko
  0 siblings, 1 reply; 68+ messages in thread
From: Tetsuo Handa @ 2016-06-01 12:04 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm

Michal Hocko wrote:
> On Wed 01-06-16 00:29:45, Tetsuo Handa wrote:
> > I'm fine with task_will_free_mem(current) == true case. My question is that
> > "doesn't this patch break task_will_free_mem(current) == false case when there is
> > already TIF_MEMDIE thread" ?
> 
> OK, I see your point now. This is certainly possible, albeit unlikely. I
> think calling this a regression would be a bit an overstatement. We are
> basically replacing one unreliable heuristic by another one which is
> more likely to lead to a deterministic behavior.
> 
> If you are worried about locking up the oom killer I have another 2
> patches on top of this series which should deal with that (one of them
> was already posted [1] and another one was drafted in [2]. Both of them
> on top of this series should remove the concern of the lockup. I just
> wait to post them until this thread settles down.
> 
> [1] http://lkml.kernel.org/r/1464276476-25136-1-git-send-email-mhocko@kernel.org
> [2] http://lkml.kernel.org/r/20160527133502.GN27686@dhcp22.suse.cz

I want [1] though I don't know we should try twice. But I still can't
understand why [2] works. Setting MMF_OOM_REAPED on victim's mm_struct
allows oom_scan_process_thread() to call oom_badness() only after TIF_MEMDIE
was removed from that victim. Since oom_reap_task() is not called due to
can_oom_reap == false, nobody will be able to remove TIF_MEMDIE from that
victim if that victim got stuck at unkillable wait.
down_write_killable(&mm->mmap_sem) might not reduce possibility of lockup
when oom_reap_task() is not called.

Quoting from http://lkml.kernel.org/r/20160530115551.GU22928@dhcp22.suse.cz :
> But your bottom half would just decide to back off the same way I do
> here. And as for the bonus your bottom half would have to do the rather
> costly process iteration to find that out.

Doing the rather costly process iteration for up to one second (though most
of duration is schedule_timeout_idle(HZ/10)) gives that victim some reasonable
grace period for termination before the OOM killer selects next OOM victim.

If we set MMF_OOM_REAPED as of oom_kill_process() while also setting
TIF_MEMDIE, the OOM killer can lock up like described above.
If we set MMF_OOM_REAPED as of oom_kill_process() while not setting
TIF_MEMDIE, the OOM killer will immediately select next OOM victim
which is almost

 enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 			struct task_struct *task, unsigned long totalpages)
 {
 	if (oom_unkillable_task(task, NULL, oc->nodemask))
 		return OOM_SCAN_CONTINUE;
 
-	/*
-	 * This task already has access to memory reserves and is being killed.
-	 * Don't allow any other task to have access to the reserves.
-	 */
-	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims))
-		return OOM_SCAN_ABORT;
 
 	/*
 	 * If task is allocating a lot of memory and has been marked to be
 	 * killed first if it triggers an oom, then select it.
 	 */
 	if (oom_task_origin(task))
 		return OOM_SCAN_SELECT;
 
 	return OOM_SCAN_OK;
 }

situation when we hit can_oom_reap == false. Since not many thread groups
will hit can_oom_reap == false condition, above situation won't kill all
thread groups. But I think that waiting for one second before backing off
is acceptable from the point of view of least killing. This resembles

Quoting from http://lkml.kernel.org/r/20160601073441.GE26601@dhcp22.suse.cz :
> > > > Third is that oom_kill_process() chooses a thread group which already
> > > > has a TIF_MEMDIE thread when the candidate select_bad_process() chose
> > > > has children because oom_badness() does not take TIF_MEMDIE into account.
> > > > This patch checks child->signal->oom_victims before calling oom_badness()
> > > > if oom_kill_process() was called by SysRq-f case. This resembles making
> > > > sure that oom_badness() is skipped by returning OOM_SCAN_CONTINUE.
> > > 
> > > This makes sense to me as well but why should be limit this to sysrq case?
> > > Does it make any sense to select a child which already got killed for
> > > normal OOM killer? Anyway I think it would be better to split this into
> > > its own patch as well.
> > 
> > The reason is described in next paragraph.
> > Do we prefer immediately killing all children of the allocating task?
> 
> I do not think we want to select them _all_. We haven't been doing that
> and I do not see a reason we should start now. But it surely doesn't
> make any sense to select a task which has already TIF_MEMDIE set.

"although selecting a TIF_MEMDIE thread group forever does not make any
sense, we haven't tried selecting next thread group as soon as some thread
group got TIF_MEMDIE".

Setting MMF_OOM_REAPED and clearing TIF_MEMDIE after some period is the key.

My bottom half does not require user visible changes. If some programs use
clone(CLONE_VM without CLONE_SIGHAND) and mix OOM_SCORE_ADJ_MIN /
OOM_SCORE_ADJ_MAX, I think they have reason they want to do so (e.g.
shrink memory usage when OOM_SCORE_ADJ_MAX thread group was OOM-killed).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 6/6] mm, oom: fortify task_will_free_mem
  2016-06-01 12:04           ` Tetsuo Handa
@ 2016-06-01 12:43             ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-06-01 12:43 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm

On Wed 01-06-16 21:04:18, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 01-06-16 00:29:45, Tetsuo Handa wrote:
> > > I'm fine with task_will_free_mem(current) == true case. My question is that
> > > "doesn't this patch break task_will_free_mem(current) == false case when there is
> > > already TIF_MEMDIE thread" ?
> > 
> > OK, I see your point now. This is certainly possible, albeit unlikely. I
> > think calling this a regression would be a bit an overstatement. We are
> > basically replacing one unreliable heuristic by another one which is
> > more likely to lead to a deterministic behavior.
> > 
> > If you are worried about locking up the oom killer I have another 2
> > patches on top of this series which should deal with that (one of them
> > was already posted [1] and another one was drafted in [2]. Both of them
> > on top of this series should remove the concern of the lockup. I just
> > wait to post them until this thread settles down.
> > 
> > [1] http://lkml.kernel.org/r/1464276476-25136-1-git-send-email-mhocko@kernel.org
> > [2] http://lkml.kernel.org/r/20160527133502.GN27686@dhcp22.suse.cz
> 
> I want [1] though I don't know we should try twice. But I still can't
> understand why [2] works. Setting MMF_OOM_REAPED on victim's mm_struct
> allows oom_scan_process_thread() to call oom_badness() only after TIF_MEMDIE
> was removed from that victim. Since oom_reap_task() is not called due to
> can_oom_reap == false, nobody will be able to remove TIF_MEMDIE from that
> victim if that victim got stuck at unkillable wait.
> down_write_killable(&mm->mmap_sem) might not reduce possibility of lockup
> when oom_reap_task() is not called.

You are right of course. I didn't say [2] is complete. That's why I've said
"was drafted in". Let's discuss this further when I send the actual
patch, ok?

> Quoting from http://lkml.kernel.org/r/20160530115551.GU22928@dhcp22.suse.cz :

And we are yet again conflating different threads which will end up in a
mess. :/ Please not I won't follow up to any comments which are not
related to the discussed patch.

> > But your bottom half would just decide to back off the same way I do
> > here. And as for the bonus your bottom half would have to do the rather
> > costly process iteration to find that out.
> 
> Doing the rather costly process iteration for up to one second (though most
> of duration is schedule_timeout_idle(HZ/10)) gives that victim some reasonable
> grace period for termination before the OOM killer selects next OOM victim.

we have that short sleep in out_of_memory path already. Btw. we do not
do costly operations just to emulate a short sleep...
 
> If we set MMF_OOM_REAPED as of oom_kill_process() while also setting
> TIF_MEMDIE, the OOM killer can lock up like described above.
> If we set MMF_OOM_REAPED as of oom_kill_process() while not setting
> TIF_MEMDIE, the OOM killer will immediately select next OOM victim
> which is almost

I would rather risk a next OOM victim than risk a lockup in a highly
unlikely scenario. This is simply a trade off.

[...]

> situation when we hit can_oom_reap == false. Since not many thread groups
> will hit can_oom_reap == false condition, above situation won't kill all
> thread groups. But I think that waiting for one second before backing off
> is acceptable from the point of view of least killing. This resembles
> 
> Quoting from http://lkml.kernel.org/r/20160601073441.GE26601@dhcp22.suse.cz :
> > > > > Third is that oom_kill_process() chooses a thread group which already
> > > > > has a TIF_MEMDIE thread when the candidate select_bad_process() chose
> > > > > has children because oom_badness() does not take TIF_MEMDIE into account.
> > > > > This patch checks child->signal->oom_victims before calling oom_badness()
> > > > > if oom_kill_process() was called by SysRq-f case. This resembles making
> > > > > sure that oom_badness() is skipped by returning OOM_SCAN_CONTINUE.
> > > > 
> > > > This makes sense to me as well but why should be limit this to sysrq case?
> > > > Does it make any sense to select a child which already got killed for
> > > > normal OOM killer? Anyway I think it would be better to split this into
> > > > its own patch as well.
> > > 
> > > The reason is described in next paragraph.
> > > Do we prefer immediately killing all children of the allocating task?
> > 
> > I do not think we want to select them _all_. We haven't been doing that
> > and I do not see a reason we should start now. But it surely doesn't
> > make any sense to select a task which has already TIF_MEMDIE set.
> 
> "although selecting a TIF_MEMDIE thread group forever does not make any
> sense, we haven't tried selecting next thread group as soon as some thread
> group got TIF_MEMDIE".

Note that we are talking about sysctl_oom_kill_allocating_task and that
kills tasks randomly without trying to wait for other victims to release
their memory. I do not _really_ there is anything to optimize for here
with some timeouts.
 
> Setting MMF_OOM_REAPED and clearing TIF_MEMDIE after some period is the key.
> 
> My bottom half does not require user visible changes. If some programs use
> clone(CLONE_VM without CLONE_SIGHAND) and mix OOM_SCORE_ADJ_MIN /
> OOM_SCORE_ADJ_MAX, I think they have reason they want to do so (e.g.
> shrink memory usage when OOM_SCORE_ADJ_MAX thread group was OOM-killed).

If there is such a use case then I would really like to hear about that.
I do not want to make the already obscure code more complicated just for
some usecase that even might not exist.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 4/6] mm, oom: skip vforked tasks from being selected
  2016-05-30 13:05   ` Michal Hocko
  (?)
  (?)
@ 2016-06-01 14:12   ` Tetsuo Handa
  2016-06-01 14:25     ` Michal Hocko
  -1 siblings, 1 reply; 68+ messages in thread
From: Tetsuo Handa @ 2016-06-01 14:12 UTC (permalink / raw)
  To: mhocko, linux-mm; +Cc: rientjes, oleg, vdavydov, akpm, mhocko

Michal Hocko wrote:
> vforked tasks are not really sitting on any memory. They are sharing
> the mm with parent until they exec into a new code. Until then it is
> just pinning the address space. OOM killer will kill the vforked task
> along with its parent but we still can end up selecting vforked task
> when the parent wouldn't be selected. E.g. init doing vfork to launch
> a task or vforked being a child of oom unkillable task with an updated
> oom_score_adj to be killable.
> 
> Make sure to not select vforked task as an oom victim by checking
> vfork_done in oom_badness.

While vfork()ed task cannot modify userspace memory, can't such task
allocate significant amount of kernel memory inside execve() operation
(as demonstrated by CVE-2010-4243 64bit_dos.c )?

It is possible that killing vfork()ed task releases a lot of memory,
isn't it?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 4/6] mm, oom: skip vforked tasks from being selected
  2016-06-01 14:12   ` Tetsuo Handa
@ 2016-06-01 14:25     ` Michal Hocko
  2016-06-02 10:45       ` Tetsuo Handa
  0 siblings, 1 reply; 68+ messages in thread
From: Michal Hocko @ 2016-06-01 14:25 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm

On Wed 01-06-16 23:12:20, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > vforked tasks are not really sitting on any memory. They are sharing
> > the mm with parent until they exec into a new code. Until then it is
> > just pinning the address space. OOM killer will kill the vforked task
> > along with its parent but we still can end up selecting vforked task
> > when the parent wouldn't be selected. E.g. init doing vfork to launch
> > a task or vforked being a child of oom unkillable task with an updated
> > oom_score_adj to be killable.
> > 
> > Make sure to not select vforked task as an oom victim by checking
> > vfork_done in oom_badness.
> 
> While vfork()ed task cannot modify userspace memory, can't such task
> allocate significant amount of kernel memory inside execve() operation
> (as demonstrated by CVE-2010-4243 64bit_dos.c )?
> 
> It is possible that killing vfork()ed task releases a lot of memory,
> isn't it?

I am not familiar with the above CVE but doesn't that allocated memory
come after flush_old_exec (and so mm_release)?
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 4/6] mm, oom: skip vforked tasks from being selected
  2016-06-01 14:25     ` Michal Hocko
@ 2016-06-02 10:45       ` Tetsuo Handa
  2016-06-02 11:20         ` Michal Hocko
  0 siblings, 1 reply; 68+ messages in thread
From: Tetsuo Handa @ 2016-06-02 10:45 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm

Michal Hocko wrote:
> On Wed 01-06-16 23:12:20, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > vforked tasks are not really sitting on any memory. They are sharing
> > > the mm with parent until they exec into a new code. Until then it is
> > > just pinning the address space. OOM killer will kill the vforked task
> > > along with its parent but we still can end up selecting vforked task
> > > when the parent wouldn't be selected. E.g. init doing vfork to launch
> > > a task or vforked being a child of oom unkillable task with an updated
> > > oom_score_adj to be killable.
> > > 
> > > Make sure to not select vforked task as an oom victim by checking
> > > vfork_done in oom_badness.
> > 
> > While vfork()ed task cannot modify userspace memory, can't such task
> > allocate significant amount of kernel memory inside execve() operation
> > (as demonstrated by CVE-2010-4243 64bit_dos.c )?
> > 
> > It is possible that killing vfork()ed task releases a lot of memory,
> > isn't it?
> 
> I am not familiar with the above CVE but doesn't that allocated memory
> come after flush_old_exec (and so mm_release)?

That memory is allocated as of copy_strings() in do_execveat_common().

An example shown below (based on https://grsecurity.net/~spender/exploits/64bit_dos.c )
can consume nearly 50% of 2GB RAM while execve() from vfork(). That is, selecting
vfork()ed task as an OOM victim might release nearly 50% of 2GB RAM.

----------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NUM_ARGS 8000 /* Nearly 50% of 2GB RAM */

int main(void)
{
        /* Be sure to do "ulimit -s unlimited" before run. */
        char **args;
        char *str;
        int i;
        str = malloc(128 * 1024);
        memset(str, ' ', 128 * 1024 - 1);
        str[128 * 1024 - 1] = '\0';
        args = malloc(NUM_ARGS * sizeof(char *));
        for (i = 0; i < (NUM_ARGS - 1); i++)
                args[i] = str;
        args[i] = NULL;
        if (vfork() == 0) {
                execve("/bin/true", args, NULL);
                _exit(1);
        }
        return 0;
}
----------

# strace -f ./a.out
execve("./a.out", ["./a.out"], [/* 22 vars */]) = 0
brk(0)                                  = 0x2283000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b2bdbc81000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=44165, ...}) = 0
mmap(NULL, 44165, PROT_READ, MAP_PRIVATE, 3, 0) = 0x2b2bdbc82000
close(3)                                = 0
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 \34\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=2112384, ...}) = 0
mmap(NULL, 3936832, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x2b2bdbe84000
mprotect(0x2b2bdc03b000, 2097152, PROT_NONE) = 0
mmap(0x2b2bdc23b000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b7000) = 0x2b2bdc23b000
mmap(0x2b2bdc241000, 16960, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x2b2bdc241000
close(3)                                = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b2bdbc8d000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b2bdbc8e000
arch_prctl(ARCH_SET_FS, 0x2b2bdbc8db80) = 0
mprotect(0x2b2bdc23b000, 16384, PROT_READ) = 0
mprotect(0x600000, 4096, PROT_READ)     = 0
mprotect(0x2b2bdbe81000, 4096, PROT_READ) = 0
munmap(0x2b2bdbc82000, 44165)           = 0
mmap(NULL, 135168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b2bdbc90000
brk(0)                                  = 0x2283000
brk(0x22b3000)                          = 0x22b3000
brk(0)                                  = 0x22b3000
vfork(Process 9787 attached
 <unfinished ...>
[pid  9787] execve("/bin/true", ["                                "..., (...snipped...), ...], [/* 0 vars */] <unfinished ...>
[pid  9786] <... vfork resumed> )       = 9787
[pid  9786] exit_group(0)               = ?
[pid  9786] +++ exited with 0 +++
<... execve resumed> )                  = 0
brk(0)                                  = 0x13e2000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b2e71a6f000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=44165, ...}) = 0
mmap(NULL, 44165, PROT_READ, MAP_PRIVATE, 3, 0) = 0x2b2e71a70000
close(3)                                = 0
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 \34\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=2112384, ...}) = 0
mmap(NULL, 3936832, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x2b2e71c6e000
mprotect(0x2b2e71e25000, 2097152, PROT_NONE) = 0
mmap(0x2b2e72025000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b7000) = 0x2b2e72025000
mmap(0x2b2e7202b000, 16960, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x2b2e7202b000
close(3)                                = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b2e71a7b000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b2e71a7c000
arch_prctl(ARCH_SET_FS, 0x2b2e71a7bb80) = 0
mprotect(0x2b2e72025000, 16384, PROT_READ) = 0
mprotect(0x605000, 4096, PROT_READ)     = 0
mprotect(0x2b2e71c6b000, 4096, PROT_READ) = 0
munmap(0x2b2e71a70000, 44165)           = 0
exit_group(0)                           = ?
+++ exited with 0 +++

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 4/6] mm, oom: skip vforked tasks from being selected
  2016-06-02 10:45       ` Tetsuo Handa
@ 2016-06-02 11:20         ` Michal Hocko
  2016-06-02 11:31           ` Tetsuo Handa
  0 siblings, 1 reply; 68+ messages in thread
From: Michal Hocko @ 2016-06-02 11:20 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm

On Thu 02-06-16 19:45:00, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 01-06-16 23:12:20, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > vforked tasks are not really sitting on any memory. They are sharing
> > > > the mm with parent until they exec into a new code. Until then it is
> > > > just pinning the address space. OOM killer will kill the vforked task
> > > > along with its parent but we still can end up selecting vforked task
> > > > when the parent wouldn't be selected. E.g. init doing vfork to launch
> > > > a task or vforked being a child of oom unkillable task with an updated
> > > > oom_score_adj to be killable.
> > > > 
> > > > Make sure to not select vforked task as an oom victim by checking
> > > > vfork_done in oom_badness.
> > > 
> > > While vfork()ed task cannot modify userspace memory, can't such task
> > > allocate significant amount of kernel memory inside execve() operation
> > > (as demonstrated by CVE-2010-4243 64bit_dos.c )?
> > > 
> > > It is possible that killing vfork()ed task releases a lot of memory,
> > > isn't it?
> > 
> > I am not familiar with the above CVE but doesn't that allocated memory
> > come after flush_old_exec (and so mm_release)?
> 
> That memory is allocated as of copy_strings() in do_execveat_common().
> 
> An example shown below (based on https://grsecurity.net/~spender/exploits/64bit_dos.c )
> can consume nearly 50% of 2GB RAM while execve() from vfork(). That is, selecting
> vfork()ed task as an OOM victim might release nearly 50% of 2GB RAM.
> 
> ----------
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> 
> #define NUM_ARGS 8000 /* Nearly 50% of 2GB RAM */
> 
> int main(void)
> {
>         /* Be sure to do "ulimit -s unlimited" before run. */
>         char **args;
>         char *str;
>         int i;
>         str = malloc(128 * 1024);
>         memset(str, ' ', 128 * 1024 - 1);
>         str[128 * 1024 - 1] = '\0';
>         args = malloc(NUM_ARGS * sizeof(char *));
>         for (i = 0; i < (NUM_ARGS - 1); i++)
>                 args[i] = str;
>         args[i] = NULL;
>         if (vfork() == 0) {
>                 execve("/bin/true", args, NULL);
>                 _exit(1);
>         }
>         return 0;
> }

OK, but the memory is allocated on behalf of the parent already, right?
And the patch doesn't prevent parent from being selected and the vfroked
child being killed along the way as sharing the mm with it. So what
exactly this patch changes for this test case? What am I missing?
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 4/6] mm, oom: skip vforked tasks from being selected
  2016-06-02 11:20         ` Michal Hocko
@ 2016-06-02 11:31           ` Tetsuo Handa
  2016-06-02 12:55             ` Michal Hocko
  0 siblings, 1 reply; 68+ messages in thread
From: Tetsuo Handa @ 2016-06-02 11:31 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm

Michal Hocko wrote:
> OK, but the memory is allocated on behalf of the parent already, right?

What does "the memory is allocated on behalf of the parent already" mean?
The memory used for argv[]/envp[] may not yet be visible from mm_struct when
the OOM killer is invoked.

> And the patch doesn't prevent parent from being selected and the vfroked
> child being killed along the way as sharing the mm with it. So what
> exactly this patch changes for this test case? What am I missing?

If the parent is OOM_SCORE_ADJ_MIN and vfork()ed child doing execve()
with large argv[]/envp[] is not OOM_SCORE_ADJ_MIN, we should not hesitate
to OOM-kill vfork()ed child even if the parent is not OOM-killable.

	vfork()
	set_oom_adj()
	exec()

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 4/6] mm, oom: skip vforked tasks from being selected
  2016-06-02 11:31           ` Tetsuo Handa
@ 2016-06-02 12:55             ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-06-02 12:55 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm

On Thu 02-06-16 20:31:57, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > OK, but the memory is allocated on behalf of the parent already, right?
> 
> What does "the memory is allocated on behalf of the parent already" mean?

It means that vforked task cannot allocate a new memory directly. Sure
it can get a copy of what parent already has allocated during execve but
that is under control of the parent. If the parent is OOM_SCORE_ADJ_MIN
then it should better be careful how it spawns new tasks.

> The memory used for argv[]/envp[] may not yet be visible from mm_struct when
> the OOM killer is invoked.
>
> > And the patch doesn't prevent parent from being selected and the vfroked
> > child being killed along the way as sharing the mm with it. So what
> > exactly this patch changes for this test case? What am I missing?
> 
> If the parent is OOM_SCORE_ADJ_MIN and vfork()ed child doing execve()
> with large argv[]/envp[] is not OOM_SCORE_ADJ_MIN, we should not hesitate
> to OOM-kill vfork()ed child even if the parent is not OOM-killable.
> 
> 	vfork()
> 	set_oom_adj()
> 	exec()

Well the whole point of this patch is to not select such a task because
it makes only very limitted sense. It cannot really free much memory -
well except when parent is doing something realy stupid which I am not
really sure we should care about.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH 7/6] mm, oom: task_will_free_mem should skip oom_reaped tasks
  2016-05-30 13:05 ` Michal Hocko
@ 2016-06-02 14:03   ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-06-02 14:03 UTC (permalink / raw)
  To: linux-mm; +Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

0-day robot has encountered the following:
[   82.694232] Out of memory: Kill process 3914 (trinity-c0) score 167 or sacrifice child
[   82.695110] Killed process 3914 (trinity-c0) total-vm:55864kB, anon-rss:1512kB, file-rss:1088kB, shmem-rss:25616kB
[   82.706724] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26488kB
[   82.715540] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
[   82.717662] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
[   82.725804] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:27296kB
[   82.739091] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:28148kB

oom_reaper is trying to reap the same task again and again. This
is possible only when the oom killer is bypassed because of
task_will_free_mem because we skip over tasks with MMF_OOM_REAPED
already set during select_bad_process. Teach task_will_free_mem to skip
over MMF_OOM_REAPED tasks as well because they will be unlikely to free
anything more.

Analyzed-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dacfb6ab7b04..d6e121decb1a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -766,6 +766,15 @@ bool task_will_free_mem(struct task_struct *task)
 		return true;
 	}
 
+	/*
+	 * This task has already been drained by the oom reaper so there are
+	 * only small chances it will free some more
+	 */
+	if (test_bit(MMF_OOM_REAPED, &mm->flags)) {
+		task_unlock(p);
+		return false;
+	}
+
 	/* pin the mm to not get freed and reused */
 	atomic_inc(&mm->mm_count);
 	task_unlock(p);
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 7/6] mm, oom: task_will_free_mem should skip oom_reaped tasks
@ 2016-06-02 14:03   ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-06-02 14:03 UTC (permalink / raw)
  To: linux-mm; +Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

0-day robot has encountered the following:
[   82.694232] Out of memory: Kill process 3914 (trinity-c0) score 167 or sacrifice child
[   82.695110] Killed process 3914 (trinity-c0) total-vm:55864kB, anon-rss:1512kB, file-rss:1088kB, shmem-rss:25616kB
[   82.706724] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26488kB
[   82.715540] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
[   82.717662] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
[   82.725804] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:27296kB
[   82.739091] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:28148kB

oom_reaper is trying to reap the same task again and again. This
is possible only when the oom killer is bypassed because of
task_will_free_mem because we skip over tasks with MMF_OOM_REAPED
already set during select_bad_process. Teach task_will_free_mem to skip
over MMF_OOM_REAPED tasks as well because they will be unlikely to free
anything more.

Analyzed-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dacfb6ab7b04..d6e121decb1a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -766,6 +766,15 @@ bool task_will_free_mem(struct task_struct *task)
 		return true;
 	}
 
+	/*
+	 * This task has already been drained by the oom reaper so there are
+	 * only small chances it will free some more
+	 */
+	if (test_bit(MMF_OOM_REAPED, &mm->flags)) {
+		task_unlock(p);
+		return false;
+	}
+
 	/* pin the mm to not get freed and reused */
 	atomic_inc(&mm->mm_count);
 	task_unlock(p);
-- 
2.8.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH 7/6] mm, oom: task_will_free_mem should skip oom_reaped tasks
  2016-06-02 14:03   ` Michal Hocko
  (?)
@ 2016-06-02 15:24   ` Tetsuo Handa
  2016-06-02 15:50     ` Michal Hocko
  -1 siblings, 1 reply; 68+ messages in thread
From: Tetsuo Handa @ 2016-06-02 15:24 UTC (permalink / raw)
  To: mhocko, linux-mm; +Cc: rientjes, oleg, mhocko

Michal Hocko wrote:
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index dacfb6ab7b04..d6e121decb1a 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -766,6 +766,15 @@ bool task_will_free_mem(struct task_struct *task)
>  		return true;
>  	}
>  
> +	/*
> +	 * This task has already been drained by the oom reaper so there are
> +	 * only small chances it will free some more
> +	 */
> +	if (test_bit(MMF_OOM_REAPED, &mm->flags)) {
> +		task_unlock(p);
> +		return false;
> +	}
> +

I think this check should be done before

	if (atomic_read(&mm->mm_users) <= 1) {
		task_unlock(p);
		return true;
	}

because it is possible that task_will_free_mem(task) is the only thread
using task->mm (i.e. atomic_read(&mm->mm_users) == 1).

>  	/* pin the mm to not get freed and reused */
>  	atomic_inc(&mm->mm_count);
>  	task_unlock(p);
> -- 
> 2.8.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 7/6] mm, oom: task_will_free_mem should skip oom_reaped tasks
  2016-06-02 15:24   ` Tetsuo Handa
@ 2016-06-02 15:50     ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2016-06-02 15:50 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg

On Fri 03-06-16 00:24:31, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index dacfb6ab7b04..d6e121decb1a 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -766,6 +766,15 @@ bool task_will_free_mem(struct task_struct *task)
> >  		return true;
> >  	}
> >  
> > +	/*
> > +	 * This task has already been drained by the oom reaper so there are
> > +	 * only small chances it will free some more
> > +	 */
> > +	if (test_bit(MMF_OOM_REAPED, &mm->flags)) {
> > +		task_unlock(p);
> > +		return false;
> > +	}
> > +
> 
> I think this check should be done before
> 
> 	if (atomic_read(&mm->mm_users) <= 1) {
> 		task_unlock(p);
> 		return true;
> 	}
> 
> because it is possible that task_will_free_mem(task) is the only thread
> using task->mm (i.e. atomic_read(&mm->mm_users) == 1).

definitely true. I am growing blind for this code.

Thanks!

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

end of thread, other threads:[~2016-06-02 15:50 UTC | newest]

Thread overview: 68+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-30 13:05 [PATCH 0/6 -v2] Handle oom bypass more gracefully Michal Hocko
2016-05-30 13:05 ` Michal Hocko
2016-05-30 13:05 ` [PATCH 1/6] proc, oom: drop bogus task_lock and mm check Michal Hocko
2016-05-30 13:05   ` Michal Hocko
2016-05-30 13:49   ` Vladimir Davydov
2016-05-30 13:49     ` Vladimir Davydov
2016-05-30 17:43   ` Oleg Nesterov
2016-05-30 17:43     ` Oleg Nesterov
2016-05-31  7:32     ` Michal Hocko
2016-05-31  7:32       ` Michal Hocko
2016-05-31 22:53       ` Oleg Nesterov
2016-05-31 22:53         ` Oleg Nesterov
2016-06-01  6:53         ` Michal Hocko
2016-06-01  6:53           ` Michal Hocko
2016-06-01 10:41           ` Tetsuo Handa
2016-06-01 10:41             ` Tetsuo Handa
2016-06-01 10:48             ` Michal Hocko
2016-06-01 10:48               ` Michal Hocko
2016-05-30 13:05 ` [PATCH 2/6] proc, oom_adj: extract oom_score_adj setting into a helper Michal Hocko
2016-05-30 13:05   ` Michal Hocko
2016-05-30 13:05 ` [PATCH 3/6] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj Michal Hocko
2016-05-30 13:05   ` Michal Hocko
2016-05-31  7:41   ` Michal Hocko
2016-05-31  7:41     ` Michal Hocko
2016-05-30 13:05 ` [PATCH 4/6] mm, oom: skip vforked tasks from being selected Michal Hocko
2016-05-30 13:05   ` Michal Hocko
2016-05-30 19:28   ` Oleg Nesterov
2016-05-30 19:28     ` Oleg Nesterov
2016-05-31  7:42     ` Michal Hocko
2016-05-31  7:42       ` Michal Hocko
2016-05-31 21:43       ` Oleg Nesterov
2016-05-31 21:43         ` Oleg Nesterov
2016-06-01  7:09         ` Michal Hocko
2016-06-01  7:09           ` Michal Hocko
2016-06-01 14:12   ` Tetsuo Handa
2016-06-01 14:25     ` Michal Hocko
2016-06-02 10:45       ` Tetsuo Handa
2016-06-02 11:20         ` Michal Hocko
2016-06-02 11:31           ` Tetsuo Handa
2016-06-02 12:55             ` Michal Hocko
2016-05-30 13:05 ` [PATCH 5/6] mm, oom: kill all tasks sharing the mm Michal Hocko
2016-05-30 13:05   ` Michal Hocko
2016-05-30 18:18   ` Oleg Nesterov
2016-05-30 18:18     ` Oleg Nesterov
2016-05-31  7:43     ` Michal Hocko
2016-05-31  7:43       ` Michal Hocko
2016-05-31 21:48       ` Oleg Nesterov
2016-05-31 21:48         ` Oleg Nesterov
2016-05-30 13:05 ` [PATCH 6/6] mm, oom: fortify task_will_free_mem Michal Hocko
2016-05-30 13:05   ` Michal Hocko
2016-05-30 17:35   ` Oleg Nesterov
2016-05-30 17:35     ` Oleg Nesterov
2016-05-31  7:46     ` Michal Hocko
2016-05-31  7:46       ` Michal Hocko
2016-05-31 22:29       ` Oleg Nesterov
2016-05-31 22:29         ` Oleg Nesterov
2016-06-01  7:03         ` Michal Hocko
2016-06-01  7:03           ` Michal Hocko
2016-05-31 15:03   ` Tetsuo Handa
2016-05-31 15:10     ` Michal Hocko
2016-05-31 15:29       ` Tetsuo Handa
2016-06-01  7:25         ` Michal Hocko
2016-06-01 12:04           ` Tetsuo Handa
2016-06-01 12:43             ` Michal Hocko
2016-06-02 14:03 ` [PATCH 7/6] mm, oom: task_will_free_mem should skip oom_reaped tasks Michal Hocko
2016-06-02 14:03   ` Michal Hocko
2016-06-02 15:24   ` Tetsuo Handa
2016-06-02 15:50     ` Michal Hocko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.