All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/10 -v5] Handle oom bypass more gracefully
@ 2016-06-20 12:43 ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML

Hi,
this is the v5 version of the patchse. Previous version was posted
http://lkml.kernel.org/r/1465473137-22531-1-git-send-email-mhocko@kernel.org
There was one issue fixed in patch 7. Other than that we have discussed
nommu situation after patch 7 which might reintroduce a theoretical
race condition when the oom victim passes exit_mm->exit_mm_victim after
oom_kill_process checks task_will_free_mem and before mark_oom_victim is
called which was reviously fixed by 83363b917a2982dd ("oom: make sure
that TIF_MEMDIE is set under task_lock").  The issue was never seen in
the real life AFAIK and it would only affect nommu. On the other hand
the patch fixes a class of potential lockups when we mark oom victim
on a thread while other threads will stay alive. So I think it is a
reasonable trade off.

Tetsuo still thinks that a timeout based solution is better way to
address all the issues:
"
To me, timeout based one is sufficient for handling any traps that hit
nommu kernels after the OOM killer is invoked. 

Anyway, I don't like this series because this series ignores theoretical cases.
I can't make progress as long as you repeat "does it really matter/occur".
Please go ahead without Reviewed-by: or Acked-by: from me.
"

I have added Acked-by from Oleg.

What remains is to make mm reapable even when it is shared with
kthreads. I have already started working on that and hopefully will
have something ready soon but that should go independently on this
series. After that we will have MMU basically covered and lockup free.
I consider this a noticeable improvement especially because changes to
achieve that are much less intrusive than I expected.

Nommu still suffers from potential lockups theoretically but a complete
lack of reports suggests that this is not the case in the real life. If
we ever see any reports we should implement the reaper for nommu as
well. Shouldn't be too hard...

The original description:
-------------------------
The following 10 patches should put some order to very rare cases of
mm shared between processes and make the paths which bypass the oom
killer oom reapable and therefore much more reliable finally. Even
though mm shared outside of thread group is rare (either vforked tasks
for a short period, use_mm by kernel threads or exotic thread model of
clone(CLONE_VM) without CLONE_SIGHAND) it is better to cover them. Not
only it makes the current oom killer logic quite hard to follow and
reason about it can lead to weird corner cases. E.g. it is possible to
select an oom victim which shares the mm with unkillable process or
bypass the oom killer even when other processes sharing the mm are still
alive and other weird cases.

Patch 1 drops bogus task_lock and mm check from oom_{score_}adj_write.
This can be considered a bug fix with a low impact as nobody has noticed
for years.

Patch 2 drops sighand lock because it is not needed anymore as pointed
by Oleg.

Patch 3 is a clean up of oom_score_adj handling and a preparatory
work for later patches.

Patch 4 enforces oom_adj_score to be consistent between processes
sharing the mm to behave consistently with the regular thread
groups. This can be considered a user visible behavior change because
one thread group updating oom_score_adj will affect others which share
the same mm via clone(CLONE_VM). I argue that this should be acceptable
because we already have the same behavior for threads in the same thread
group and sharing the mm without signal struct is just a different model
of threading. This is probably the most controversial part of the series,
I would like to find some consensus here. There were some suggestions
to hook some counter/oom_score_adj into the mm_struct but I feel that
this is not necessary right now and we can rely on proc handler +
oom_kill_process to DTRT. I can be convinced otherwise but I strongly
think that whatever we do the userspace has to have a way to see the
current oom priority as consistently as possible.

Patch 5 makes sure that no vforked task is selected if it is sharing
the mm with oom unkillable task.

Patch 6 ensures that all user tasks sharing the mm are killed which in
turn makes sure that all oom victims are oom reapable.

Patch 7 guarantees that task_will_free_mem will always imply reapable
bypass of the oom killer.

Patch 8 is new in this version and it addresses an issue pointed out
by 0-day OOM report where an oom victim was reaped several times.

Patch 9 puts an upper bound on how many times oom_reaper tries to reap a
task and hides it from the oom killer to move on when no progress can be
made. This will give an upper bound to how long an oom_reapable task can
block the oom killer from selecting another victim if the oom_reaper is
not able to reap the victim.

Patch 10 tries to plug the (hopefully) last hole when we can still lock
up when the oom victim is shared with oom unkillable tasks (kthreads
and global init). We just try to be best effort in that case and rather
fallback to kill something else than risk a lockup.

The patchset is based on the current mmotm tree (mmotm-2016-06-15-16-18).

I have pushed the patchset to my git tree for an easier review
git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git to branch
attempts/process-share-mm-oom-sanitization

Michal Hocko (10):
      proc, oom: drop bogus task_lock and mm check
      proc, oom: drop bogus sighand lock
      proc, oom_adj: extract oom_score_adj setting into a helper
      mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
      mm, oom: skip vforked tasks from being selected
      mm, oom: kill all tasks sharing the mm
      mm, oom: fortify task_will_free_mem
      mm, oom: task_will_free_mem should skip oom_reaped tasks
      mm, oom_reaper: do not attempt to reap a task more than twice
      mm, oom: hide mm which is shared with kthread or global init

 fs/proc/base.c        | 185 +++++++++++++++++++++++-----------------------
 include/linux/mm.h    |   2 +
 include/linux/oom.h   |  26 +------
 include/linux/sched.h |  27 +++++++
 mm/memcontrol.c       |   4 +-
 mm/oom_kill.c         | 198 ++++++++++++++++++++++++++++++++++----------------
 6 files changed, 266 insertions(+), 176 deletions(-)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 0/10 -v5] Handle oom bypass more gracefully
@ 2016-06-20 12:43 ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML

Hi,
this is the v5 version of the patchse. Previous version was posted
http://lkml.kernel.org/r/1465473137-22531-1-git-send-email-mhocko@kernel.org
There was one issue fixed in patch 7. Other than that we have discussed
nommu situation after patch 7 which might reintroduce a theoretical
race condition when the oom victim passes exit_mm->exit_mm_victim after
oom_kill_process checks task_will_free_mem and before mark_oom_victim is
called which was reviously fixed by 83363b917a2982dd ("oom: make sure
that TIF_MEMDIE is set under task_lock").  The issue was never seen in
the real life AFAIK and it would only affect nommu. On the other hand
the patch fixes a class of potential lockups when we mark oom victim
on a thread while other threads will stay alive. So I think it is a
reasonable trade off.

Tetsuo still thinks that a timeout based solution is better way to
address all the issues:
"
To me, timeout based one is sufficient for handling any traps that hit
nommu kernels after the OOM killer is invoked. 

Anyway, I don't like this series because this series ignores theoretical cases.
I can't make progress as long as you repeat "does it really matter/occur".
Please go ahead without Reviewed-by: or Acked-by: from me.
"

I have added Acked-by from Oleg.

What remains is to make mm reapable even when it is shared with
kthreads. I have already started working on that and hopefully will
have something ready soon but that should go independently on this
series. After that we will have MMU basically covered and lockup free.
I consider this a noticeable improvement especially because changes to
achieve that are much less intrusive than I expected.

Nommu still suffers from potential lockups theoretically but a complete
lack of reports suggests that this is not the case in the real life. If
we ever see any reports we should implement the reaper for nommu as
well. Shouldn't be too hard...

The original description:
-------------------------
The following 10 patches should put some order to very rare cases of
mm shared between processes and make the paths which bypass the oom
killer oom reapable and therefore much more reliable finally. Even
though mm shared outside of thread group is rare (either vforked tasks
for a short period, use_mm by kernel threads or exotic thread model of
clone(CLONE_VM) without CLONE_SIGHAND) it is better to cover them. Not
only it makes the current oom killer logic quite hard to follow and
reason about it can lead to weird corner cases. E.g. it is possible to
select an oom victim which shares the mm with unkillable process or
bypass the oom killer even when other processes sharing the mm are still
alive and other weird cases.

Patch 1 drops bogus task_lock and mm check from oom_{score_}adj_write.
This can be considered a bug fix with a low impact as nobody has noticed
for years.

Patch 2 drops sighand lock because it is not needed anymore as pointed
by Oleg.

Patch 3 is a clean up of oom_score_adj handling and a preparatory
work for later patches.

Patch 4 enforces oom_adj_score to be consistent between processes
sharing the mm to behave consistently with the regular thread
groups. This can be considered a user visible behavior change because
one thread group updating oom_score_adj will affect others which share
the same mm via clone(CLONE_VM). I argue that this should be acceptable
because we already have the same behavior for threads in the same thread
group and sharing the mm without signal struct is just a different model
of threading. This is probably the most controversial part of the series,
I would like to find some consensus here. There were some suggestions
to hook some counter/oom_score_adj into the mm_struct but I feel that
this is not necessary right now and we can rely on proc handler +
oom_kill_process to DTRT. I can be convinced otherwise but I strongly
think that whatever we do the userspace has to have a way to see the
current oom priority as consistently as possible.

Patch 5 makes sure that no vforked task is selected if it is sharing
the mm with oom unkillable task.

Patch 6 ensures that all user tasks sharing the mm are killed which in
turn makes sure that all oom victims are oom reapable.

Patch 7 guarantees that task_will_free_mem will always imply reapable
bypass of the oom killer.

Patch 8 is new in this version and it addresses an issue pointed out
by 0-day OOM report where an oom victim was reaped several times.

Patch 9 puts an upper bound on how many times oom_reaper tries to reap a
task and hides it from the oom killer to move on when no progress can be
made. This will give an upper bound to how long an oom_reapable task can
block the oom killer from selecting another victim if the oom_reaper is
not able to reap the victim.

Patch 10 tries to plug the (hopefully) last hole when we can still lock
up when the oom victim is shared with oom unkillable tasks (kthreads
and global init). We just try to be best effort in that case and rather
fallback to kill something else than risk a lockup.

The patchset is based on the current mmotm tree (mmotm-2016-06-15-16-18).

I have pushed the patchset to my git tree for an easier review
git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git to branch
attempts/process-share-mm-oom-sanitization

Michal Hocko (10):
      proc, oom: drop bogus task_lock and mm check
      proc, oom: drop bogus sighand lock
      proc, oom_adj: extract oom_score_adj setting into a helper
      mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
      mm, oom: skip vforked tasks from being selected
      mm, oom: kill all tasks sharing the mm
      mm, oom: fortify task_will_free_mem
      mm, oom: task_will_free_mem should skip oom_reaped tasks
      mm, oom_reaper: do not attempt to reap a task more than twice
      mm, oom: hide mm which is shared with kthread or global init

 fs/proc/base.c        | 185 +++++++++++++++++++++++-----------------------
 include/linux/mm.h    |   2 +
 include/linux/oom.h   |  26 +------
 include/linux/sched.h |  27 +++++++
 mm/memcontrol.c       |   4 +-
 mm/oom_kill.c         | 198 ++++++++++++++++++++++++++++++++++----------------
 6 files changed, 266 insertions(+), 176 deletions(-)


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 01/10] proc, oom: drop bogus task_lock and mm check
  2016-06-20 12:43 ` Michal Hocko
@ 2016-06-20 12:43   ` Michal Hocko
  -1 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

both oom_adj_write and oom_score_adj_write are using task_lock,
check for task->mm and fail if it is NULL. This is not needed because
the oom_score_adj is per signal struct so we do not need mm at all.
The code has been introduced by 3d5992d2ac7d ("oom: add per-mm oom
disable count") but we do not do per-mm oom disable since c9f01245b6a7
("oom: remove oom_disable_count").

The task->mm check is even not correct because the current thread might
have exited but the thread group might be still alive - e.g. thread
group leader would lead that echo $VAL > /proc/pid/oom_score_adj would
always fail with EINVAL while /proc/pid/task/$other_tid/oom_score_adj
would succeed. This is unexpected at best.

Remove the lock along with the check to fix the unexpected behavior
and also because there is not real need for the lock in the first place.

Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c | 22 ++++------------------
 1 file changed, 4 insertions(+), 18 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index be73f4d0cb01..a6014e45c516 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1083,15 +1083,9 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	task_lock(task);
-	if (!task->mm) {
-		err = -EINVAL;
-		goto err_task_lock;
-	}
-
 	if (!lock_task_sighand(task, &flags)) {
 		err = -ESRCH;
-		goto err_task_lock;
+		goto err_put_task;
 	}
 
 	/*
@@ -1121,8 +1115,7 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 	trace_oom_score_adj_update(task);
 err_sighand:
 	unlock_task_sighand(task, &flags);
-err_task_lock:
-	task_unlock(task);
+err_put_task:
 	put_task_struct(task);
 out:
 	return err < 0 ? err : count;
@@ -1186,15 +1179,9 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	task_lock(task);
-	if (!task->mm) {
-		err = -EINVAL;
-		goto err_task_lock;
-	}
-
 	if (!lock_task_sighand(task, &flags)) {
 		err = -ESRCH;
-		goto err_task_lock;
+		goto err_put_task;
 	}
 
 	if ((short)oom_score_adj < task->signal->oom_score_adj_min &&
@@ -1210,8 +1197,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 
 err_sighand:
 	unlock_task_sighand(task, &flags);
-err_task_lock:
-	task_unlock(task);
+err_put_task:
 	put_task_struct(task);
 out:
 	return err < 0 ? err : count;
-- 
2.8.1

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 01/10] proc, oom: drop bogus task_lock and mm check
@ 2016-06-20 12:43   ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

both oom_adj_write and oom_score_adj_write are using task_lock,
check for task->mm and fail if it is NULL. This is not needed because
the oom_score_adj is per signal struct so we do not need mm at all.
The code has been introduced by 3d5992d2ac7d ("oom: add per-mm oom
disable count") but we do not do per-mm oom disable since c9f01245b6a7
("oom: remove oom_disable_count").

The task->mm check is even not correct because the current thread might
have exited but the thread group might be still alive - e.g. thread
group leader would lead that echo $VAL > /proc/pid/oom_score_adj would
always fail with EINVAL while /proc/pid/task/$other_tid/oom_score_adj
would succeed. This is unexpected at best.

Remove the lock along with the check to fix the unexpected behavior
and also because there is not real need for the lock in the first place.

Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c | 22 ++++------------------
 1 file changed, 4 insertions(+), 18 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index be73f4d0cb01..a6014e45c516 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1083,15 +1083,9 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	task_lock(task);
-	if (!task->mm) {
-		err = -EINVAL;
-		goto err_task_lock;
-	}
-
 	if (!lock_task_sighand(task, &flags)) {
 		err = -ESRCH;
-		goto err_task_lock;
+		goto err_put_task;
 	}
 
 	/*
@@ -1121,8 +1115,7 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 	trace_oom_score_adj_update(task);
 err_sighand:
 	unlock_task_sighand(task, &flags);
-err_task_lock:
-	task_unlock(task);
+err_put_task:
 	put_task_struct(task);
 out:
 	return err < 0 ? err : count;
@@ -1186,15 +1179,9 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	task_lock(task);
-	if (!task->mm) {
-		err = -EINVAL;
-		goto err_task_lock;
-	}
-
 	if (!lock_task_sighand(task, &flags)) {
 		err = -ESRCH;
-		goto err_task_lock;
+		goto err_put_task;
 	}
 
 	if ((short)oom_score_adj < task->signal->oom_score_adj_min &&
@@ -1210,8 +1197,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 
 err_sighand:
 	unlock_task_sighand(task, &flags);
-err_task_lock:
-	task_unlock(task);
+err_put_task:
 	put_task_struct(task);
 out:
 	return err < 0 ? err : count;
-- 
2.8.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 02/10] proc, oom: drop bogus sighand lock
  2016-06-20 12:43 ` Michal Hocko
@ 2016-06-20 12:43   ` Michal Hocko
  -1 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Oleg has pointed out that can simplify both oom_adj_{read,write}
and oom_score_adj_{read,write} even further and drop the sighand
lock. The main purpose of the lock was to protect p->signal from
going away but this will not happen since ea6d290ca34c ("signals:
make task_struct->signal immutable/refcountable").

The other role of the lock was to synchronize different writers,
especially those with CAP_SYS_RESOURCE. Introduce a mutex for this
purpose. Later patches will need this lock anyway.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c | 51 +++++++++++++++++----------------------------------
 1 file changed, 17 insertions(+), 34 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index a6014e45c516..968d5ea06e62 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1024,23 +1024,21 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
 	char buffer[PROC_NUMBUF];
 	int oom_adj = OOM_ADJUST_MIN;
 	size_t len;
-	unsigned long flags;
 
 	if (!task)
 		return -ESRCH;
-	if (lock_task_sighand(task, &flags)) {
-		if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MAX)
-			oom_adj = OOM_ADJUST_MAX;
-		else
-			oom_adj = (task->signal->oom_score_adj * -OOM_DISABLE) /
-				  OOM_SCORE_ADJ_MAX;
-		unlock_task_sighand(task, &flags);
-	}
+	if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MAX)
+		oom_adj = OOM_ADJUST_MAX;
+	else
+		oom_adj = (task->signal->oom_score_adj * -OOM_DISABLE) /
+			  OOM_SCORE_ADJ_MAX;
 	put_task_struct(task);
 	len = snprintf(buffer, sizeof(buffer), "%d\n", oom_adj);
 	return simple_read_from_buffer(buf, count, ppos, buffer, len);
 }
 
+static DEFINE_MUTEX(oom_adj_mutex);
+
 /*
  * /proc/pid/oom_adj exists solely for backwards compatibility with previous
  * kernels.  The effective policy is defined by oom_score_adj, which has a
@@ -1057,7 +1055,6 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 	struct task_struct *task;
 	char buffer[PROC_NUMBUF];
 	int oom_adj;
-	unsigned long flags;
 	int err;
 
 	memset(buffer, 0, sizeof(buffer));
@@ -1083,11 +1080,6 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	if (!lock_task_sighand(task, &flags)) {
-		err = -ESRCH;
-		goto err_put_task;
-	}
-
 	/*
 	 * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
 	 * value is always attainable.
@@ -1097,10 +1089,11 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 	else
 		oom_adj = (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;
 
+	mutex_lock(&oom_adj_mutex);
 	if (oom_adj < task->signal->oom_score_adj &&
 	    !capable(CAP_SYS_RESOURCE)) {
 		err = -EACCES;
-		goto err_sighand;
+		goto err_unlock;
 	}
 
 	/*
@@ -1113,9 +1106,8 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 
 	task->signal->oom_score_adj = oom_adj;
 	trace_oom_score_adj_update(task);
-err_sighand:
-	unlock_task_sighand(task, &flags);
-err_put_task:
+err_unlock:
+	mutex_unlock(&oom_adj_mutex);
 	put_task_struct(task);
 out:
 	return err < 0 ? err : count;
@@ -1133,15 +1125,11 @@ static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
 	struct task_struct *task = get_proc_task(file_inode(file));
 	char buffer[PROC_NUMBUF];
 	short oom_score_adj = OOM_SCORE_ADJ_MIN;
-	unsigned long flags;
 	size_t len;
 
 	if (!task)
 		return -ESRCH;
-	if (lock_task_sighand(task, &flags)) {
-		oom_score_adj = task->signal->oom_score_adj;
-		unlock_task_sighand(task, &flags);
-	}
+	oom_score_adj = task->signal->oom_score_adj;
 	put_task_struct(task);
 	len = snprintf(buffer, sizeof(buffer), "%hd\n", oom_score_adj);
 	return simple_read_from_buffer(buf, count, ppos, buffer, len);
@@ -1152,7 +1140,6 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 {
 	struct task_struct *task;
 	char buffer[PROC_NUMBUF];
-	unsigned long flags;
 	int oom_score_adj;
 	int err;
 
@@ -1179,25 +1166,21 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	if (!lock_task_sighand(task, &flags)) {
-		err = -ESRCH;
-		goto err_put_task;
-	}
-
+	mutex_lock(&oom_adj_mutex);
 	if ((short)oom_score_adj < task->signal->oom_score_adj_min &&
 			!capable(CAP_SYS_RESOURCE)) {
 		err = -EACCES;
-		goto err_sighand;
+		goto err_unlock;
 	}
 
 	task->signal->oom_score_adj = (short)oom_score_adj;
 	if (has_capability_noaudit(current, CAP_SYS_RESOURCE))
 		task->signal->oom_score_adj_min = (short)oom_score_adj;
+
 	trace_oom_score_adj_update(task);
 
-err_sighand:
-	unlock_task_sighand(task, &flags);
-err_put_task:
+err_unlock:
+	mutex_unlock(&oom_adj_mutex);
 	put_task_struct(task);
 out:
 	return err < 0 ? err : count;
-- 
2.8.1

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 02/10] proc, oom: drop bogus sighand lock
@ 2016-06-20 12:43   ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Oleg has pointed out that can simplify both oom_adj_{read,write}
and oom_score_adj_{read,write} even further and drop the sighand
lock. The main purpose of the lock was to protect p->signal from
going away but this will not happen since ea6d290ca34c ("signals:
make task_struct->signal immutable/refcountable").

The other role of the lock was to synchronize different writers,
especially those with CAP_SYS_RESOURCE. Introduce a mutex for this
purpose. Later patches will need this lock anyway.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c | 51 +++++++++++++++++----------------------------------
 1 file changed, 17 insertions(+), 34 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index a6014e45c516..968d5ea06e62 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1024,23 +1024,21 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
 	char buffer[PROC_NUMBUF];
 	int oom_adj = OOM_ADJUST_MIN;
 	size_t len;
-	unsigned long flags;
 
 	if (!task)
 		return -ESRCH;
-	if (lock_task_sighand(task, &flags)) {
-		if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MAX)
-			oom_adj = OOM_ADJUST_MAX;
-		else
-			oom_adj = (task->signal->oom_score_adj * -OOM_DISABLE) /
-				  OOM_SCORE_ADJ_MAX;
-		unlock_task_sighand(task, &flags);
-	}
+	if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MAX)
+		oom_adj = OOM_ADJUST_MAX;
+	else
+		oom_adj = (task->signal->oom_score_adj * -OOM_DISABLE) /
+			  OOM_SCORE_ADJ_MAX;
 	put_task_struct(task);
 	len = snprintf(buffer, sizeof(buffer), "%d\n", oom_adj);
 	return simple_read_from_buffer(buf, count, ppos, buffer, len);
 }
 
+static DEFINE_MUTEX(oom_adj_mutex);
+
 /*
  * /proc/pid/oom_adj exists solely for backwards compatibility with previous
  * kernels.  The effective policy is defined by oom_score_adj, which has a
@@ -1057,7 +1055,6 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 	struct task_struct *task;
 	char buffer[PROC_NUMBUF];
 	int oom_adj;
-	unsigned long flags;
 	int err;
 
 	memset(buffer, 0, sizeof(buffer));
@@ -1083,11 +1080,6 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	if (!lock_task_sighand(task, &flags)) {
-		err = -ESRCH;
-		goto err_put_task;
-	}
-
 	/*
 	 * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
 	 * value is always attainable.
@@ -1097,10 +1089,11 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 	else
 		oom_adj = (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;
 
+	mutex_lock(&oom_adj_mutex);
 	if (oom_adj < task->signal->oom_score_adj &&
 	    !capable(CAP_SYS_RESOURCE)) {
 		err = -EACCES;
-		goto err_sighand;
+		goto err_unlock;
 	}
 
 	/*
@@ -1113,9 +1106,8 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 
 	task->signal->oom_score_adj = oom_adj;
 	trace_oom_score_adj_update(task);
-err_sighand:
-	unlock_task_sighand(task, &flags);
-err_put_task:
+err_unlock:
+	mutex_unlock(&oom_adj_mutex);
 	put_task_struct(task);
 out:
 	return err < 0 ? err : count;
@@ -1133,15 +1125,11 @@ static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
 	struct task_struct *task = get_proc_task(file_inode(file));
 	char buffer[PROC_NUMBUF];
 	short oom_score_adj = OOM_SCORE_ADJ_MIN;
-	unsigned long flags;
 	size_t len;
 
 	if (!task)
 		return -ESRCH;
-	if (lock_task_sighand(task, &flags)) {
-		oom_score_adj = task->signal->oom_score_adj;
-		unlock_task_sighand(task, &flags);
-	}
+	oom_score_adj = task->signal->oom_score_adj;
 	put_task_struct(task);
 	len = snprintf(buffer, sizeof(buffer), "%hd\n", oom_score_adj);
 	return simple_read_from_buffer(buf, count, ppos, buffer, len);
@@ -1152,7 +1140,6 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 {
 	struct task_struct *task;
 	char buffer[PROC_NUMBUF];
-	unsigned long flags;
 	int oom_score_adj;
 	int err;
 
@@ -1179,25 +1166,21 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	if (!lock_task_sighand(task, &flags)) {
-		err = -ESRCH;
-		goto err_put_task;
-	}
-
+	mutex_lock(&oom_adj_mutex);
 	if ((short)oom_score_adj < task->signal->oom_score_adj_min &&
 			!capable(CAP_SYS_RESOURCE)) {
 		err = -EACCES;
-		goto err_sighand;
+		goto err_unlock;
 	}
 
 	task->signal->oom_score_adj = (short)oom_score_adj;
 	if (has_capability_noaudit(current, CAP_SYS_RESOURCE))
 		task->signal->oom_score_adj_min = (short)oom_score_adj;
+
 	trace_oom_score_adj_update(task);
 
-err_sighand:
-	unlock_task_sighand(task, &flags);
-err_put_task:
+err_unlock:
+	mutex_unlock(&oom_adj_mutex);
 	put_task_struct(task);
 out:
 	return err < 0 ? err : count;
-- 
2.8.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 03/10] proc, oom_adj: extract oom_score_adj setting into a helper
  2016-06-20 12:43 ` Michal Hocko
@ 2016-06-20 12:43   ` Michal Hocko
  -1 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Currently we have two proc interfaces to set oom_score_adj. The legacy
/proc/<pid>/oom_adj and /proc/<pid>/oom_score_adj which both have their
specific handlers. Big part of the logic is duplicated so extract the
common code into __set_oom_adj helper. Legacy knob still expects some
details slightly different so make sure those are handled same way - e.g.
the legacy mode ignores oom_score_adj_min and it warns about the usage.

This patch shouldn't introduce any functional changes.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c | 94 +++++++++++++++++++++++++++-------------------------------
 1 file changed, 43 insertions(+), 51 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 968d5ea06e62..a6a8fbdd5a1b 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1037,7 +1037,47 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
 	return simple_read_from_buffer(buf, count, ppos, buffer, len);
 }
 
-static DEFINE_MUTEX(oom_adj_mutex);
+static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
+{
+	static DEFINE_MUTEX(oom_adj_mutex);
+	struct task_struct *task;
+	int err = 0;
+
+	task = get_proc_task(file_inode(file));
+	if (!task)
+		return -ESRCH;
+
+	mutex_lock(&oom_adj_mutex);
+	if (legacy) {
+		if (oom_adj < task->signal->oom_score_adj &&
+				!capable(CAP_SYS_RESOURCE)) {
+			err = -EACCES;
+			goto err_unlock;
+		}
+		/*
+		 * /proc/pid/oom_adj is provided for legacy purposes, ask users to use
+		 * /proc/pid/oom_score_adj instead.
+		 */
+		pr_warn_once("%s (%d): /proc/%d/oom_adj is deprecated, please use /proc/%d/oom_score_adj instead.\n",
+			  current->comm, task_pid_nr(current), task_pid_nr(task),
+			  task_pid_nr(task));
+	} else {
+		if ((short)oom_adj < task->signal->oom_score_adj_min &&
+				!capable(CAP_SYS_RESOURCE)) {
+			err = -EACCES;
+			goto err_unlock;
+		}
+	}
+
+	task->signal->oom_score_adj = oom_adj;
+	if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
+		task->signal->oom_score_adj_min = (short)oom_adj;
+	trace_oom_score_adj_update(task);
+err_unlock:
+	mutex_unlock(&oom_adj_mutex);
+	put_task_struct(task);
+	return err;
+}
 
 /*
  * /proc/pid/oom_adj exists solely for backwards compatibility with previous
@@ -1052,7 +1092,6 @@ static DEFINE_MUTEX(oom_adj_mutex);
 static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 			     size_t count, loff_t *ppos)
 {
-	struct task_struct *task;
 	char buffer[PROC_NUMBUF];
 	int oom_adj;
 	int err;
@@ -1074,12 +1113,6 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	task = get_proc_task(file_inode(file));
-	if (!task) {
-		err = -ESRCH;
-		goto out;
-	}
-
 	/*
 	 * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
 	 * value is always attainable.
@@ -1089,26 +1122,7 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 	else
 		oom_adj = (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;
 
-	mutex_lock(&oom_adj_mutex);
-	if (oom_adj < task->signal->oom_score_adj &&
-	    !capable(CAP_SYS_RESOURCE)) {
-		err = -EACCES;
-		goto err_unlock;
-	}
-
-	/*
-	 * /proc/pid/oom_adj is provided for legacy purposes, ask users to use
-	 * /proc/pid/oom_score_adj instead.
-	 */
-	pr_warn_once("%s (%d): /proc/%d/oom_adj is deprecated, please use /proc/%d/oom_score_adj instead.\n",
-		  current->comm, task_pid_nr(current), task_pid_nr(task),
-		  task_pid_nr(task));
-
-	task->signal->oom_score_adj = oom_adj;
-	trace_oom_score_adj_update(task);
-err_unlock:
-	mutex_unlock(&oom_adj_mutex);
-	put_task_struct(task);
+	err = __set_oom_adj(file, oom_adj, true);
 out:
 	return err < 0 ? err : count;
 }
@@ -1138,7 +1152,6 @@ static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
 static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 					size_t count, loff_t *ppos)
 {
-	struct task_struct *task;
 	char buffer[PROC_NUMBUF];
 	int oom_score_adj;
 	int err;
@@ -1160,28 +1173,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	task = get_proc_task(file_inode(file));
-	if (!task) {
-		err = -ESRCH;
-		goto out;
-	}
-
-	mutex_lock(&oom_adj_mutex);
-	if ((short)oom_score_adj < task->signal->oom_score_adj_min &&
-			!capable(CAP_SYS_RESOURCE)) {
-		err = -EACCES;
-		goto err_unlock;
-	}
-
-	task->signal->oom_score_adj = (short)oom_score_adj;
-	if (has_capability_noaudit(current, CAP_SYS_RESOURCE))
-		task->signal->oom_score_adj_min = (short)oom_score_adj;
-
-	trace_oom_score_adj_update(task);
-
-err_unlock:
-	mutex_unlock(&oom_adj_mutex);
-	put_task_struct(task);
+	err = __set_oom_adj(file, oom_score_adj, false);
 out:
 	return err < 0 ? err : count;
 }
-- 
2.8.1

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 03/10] proc, oom_adj: extract oom_score_adj setting into a helper
@ 2016-06-20 12:43   ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Currently we have two proc interfaces to set oom_score_adj. The legacy
/proc/<pid>/oom_adj and /proc/<pid>/oom_score_adj which both have their
specific handlers. Big part of the logic is duplicated so extract the
common code into __set_oom_adj helper. Legacy knob still expects some
details slightly different so make sure those are handled same way - e.g.
the legacy mode ignores oom_score_adj_min and it warns about the usage.

This patch shouldn't introduce any functional changes.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c | 94 +++++++++++++++++++++++++++-------------------------------
 1 file changed, 43 insertions(+), 51 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 968d5ea06e62..a6a8fbdd5a1b 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1037,7 +1037,47 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
 	return simple_read_from_buffer(buf, count, ppos, buffer, len);
 }
 
-static DEFINE_MUTEX(oom_adj_mutex);
+static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
+{
+	static DEFINE_MUTEX(oom_adj_mutex);
+	struct task_struct *task;
+	int err = 0;
+
+	task = get_proc_task(file_inode(file));
+	if (!task)
+		return -ESRCH;
+
+	mutex_lock(&oom_adj_mutex);
+	if (legacy) {
+		if (oom_adj < task->signal->oom_score_adj &&
+				!capable(CAP_SYS_RESOURCE)) {
+			err = -EACCES;
+			goto err_unlock;
+		}
+		/*
+		 * /proc/pid/oom_adj is provided for legacy purposes, ask users to use
+		 * /proc/pid/oom_score_adj instead.
+		 */
+		pr_warn_once("%s (%d): /proc/%d/oom_adj is deprecated, please use /proc/%d/oom_score_adj instead.\n",
+			  current->comm, task_pid_nr(current), task_pid_nr(task),
+			  task_pid_nr(task));
+	} else {
+		if ((short)oom_adj < task->signal->oom_score_adj_min &&
+				!capable(CAP_SYS_RESOURCE)) {
+			err = -EACCES;
+			goto err_unlock;
+		}
+	}
+
+	task->signal->oom_score_adj = oom_adj;
+	if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
+		task->signal->oom_score_adj_min = (short)oom_adj;
+	trace_oom_score_adj_update(task);
+err_unlock:
+	mutex_unlock(&oom_adj_mutex);
+	put_task_struct(task);
+	return err;
+}
 
 /*
  * /proc/pid/oom_adj exists solely for backwards compatibility with previous
@@ -1052,7 +1092,6 @@ static DEFINE_MUTEX(oom_adj_mutex);
 static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 			     size_t count, loff_t *ppos)
 {
-	struct task_struct *task;
 	char buffer[PROC_NUMBUF];
 	int oom_adj;
 	int err;
@@ -1074,12 +1113,6 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	task = get_proc_task(file_inode(file));
-	if (!task) {
-		err = -ESRCH;
-		goto out;
-	}
-
 	/*
 	 * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
 	 * value is always attainable.
@@ -1089,26 +1122,7 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 	else
 		oom_adj = (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;
 
-	mutex_lock(&oom_adj_mutex);
-	if (oom_adj < task->signal->oom_score_adj &&
-	    !capable(CAP_SYS_RESOURCE)) {
-		err = -EACCES;
-		goto err_unlock;
-	}
-
-	/*
-	 * /proc/pid/oom_adj is provided for legacy purposes, ask users to use
-	 * /proc/pid/oom_score_adj instead.
-	 */
-	pr_warn_once("%s (%d): /proc/%d/oom_adj is deprecated, please use /proc/%d/oom_score_adj instead.\n",
-		  current->comm, task_pid_nr(current), task_pid_nr(task),
-		  task_pid_nr(task));
-
-	task->signal->oom_score_adj = oom_adj;
-	trace_oom_score_adj_update(task);
-err_unlock:
-	mutex_unlock(&oom_adj_mutex);
-	put_task_struct(task);
+	err = __set_oom_adj(file, oom_adj, true);
 out:
 	return err < 0 ? err : count;
 }
@@ -1138,7 +1152,6 @@ static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
 static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 					size_t count, loff_t *ppos)
 {
-	struct task_struct *task;
 	char buffer[PROC_NUMBUF];
 	int oom_score_adj;
 	int err;
@@ -1160,28 +1173,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	task = get_proc_task(file_inode(file));
-	if (!task) {
-		err = -ESRCH;
-		goto out;
-	}
-
-	mutex_lock(&oom_adj_mutex);
-	if ((short)oom_score_adj < task->signal->oom_score_adj_min &&
-			!capable(CAP_SYS_RESOURCE)) {
-		err = -EACCES;
-		goto err_unlock;
-	}
-
-	task->signal->oom_score_adj = (short)oom_score_adj;
-	if (has_capability_noaudit(current, CAP_SYS_RESOURCE))
-		task->signal->oom_score_adj_min = (short)oom_score_adj;
-
-	trace_oom_score_adj_update(task);
-
-err_unlock:
-	mutex_unlock(&oom_adj_mutex);
-	put_task_struct(task);
+	err = __set_oom_adj(file, oom_score_adj, false);
 out:
 	return err < 0 ? err : count;
 }
-- 
2.8.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 04/10] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
  2016-06-20 12:43 ` Michal Hocko
@ 2016-06-20 12:43   ` Michal Hocko
  -1 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

oom_score_adj is shared for the thread groups (via struct signal) but
this is not sufficient to cover processes sharing mm (CLONE_VM without
CLONE_SIGHAND) and so we can easily end up in a situation when some
processes update their oom_score_adj and confuse the oom killer. In the
worst case some of those processes might hide from the oom killer altogether
via OOM_SCORE_ADJ_MIN while others are eligible. OOM killer would then
pick up those eligible but won't be allowed to kill others sharing the
same mm so the mm wouldn't release the mm and so the memory.

It would be ideal to have the oom_score_adj per mm_struct because that
is the natural entity OOM killer considers. But this will not work
because some programs are doing
	vfork()
	set_oom_adj()
	exec()

We can achieve the same though. oom_score_adj write handler can set the
oom_score_adj for all processes sharing the same mm if the task is not
in the middle of vfork. As a result all the processes will share the
same oom_score_adj. The current implementation is rather pessimistic
and checks all the existing processes by default if there is more than
1 holder of the mm but we do not have any reliable way to check for
external users yet.

Changes since v2
- skip over same thread group
- skip over kernel threads and global init

Changes since v1
- note that we are changing oom_score_adj outside of the thread group
  to the log

Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c     | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm.h |  2 ++
 mm/oom_kill.c      |  2 +-
 3 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index a6a8fbdd5a1b..c986e92680e1 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1040,6 +1040,7 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
 static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 {
 	static DEFINE_MUTEX(oom_adj_mutex);
+	struct mm_struct *mm = NULL;
 	struct task_struct *task;
 	int err = 0;
 
@@ -1069,10 +1070,55 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 		}
 	}
 
+	/*
+	 * Make sure we will check other processes sharing the mm if this is
+	 * not vfrok which wants its own oom_score_adj.
+	 * pin the mm so it doesn't go away and get reused after task_unlock
+	 */
+	if (!task->vfork_done) {
+		struct task_struct *p = find_lock_task_mm(task);
+
+		if (p) {
+			if (atomic_read(&p->mm->mm_users) > 1) {
+				mm = p->mm;
+				atomic_inc(&mm->mm_count);
+			}
+			task_unlock(p);
+		}
+	}
+
 	task->signal->oom_score_adj = oom_adj;
 	if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
 		task->signal->oom_score_adj_min = (short)oom_adj;
 	trace_oom_score_adj_update(task);
+
+	if (mm) {
+		struct task_struct *p;
+
+		rcu_read_lock();
+		for_each_process(p) {
+			if (same_thread_group(task, p))
+				continue;
+
+			/* do not touch kernel threads or the global init */
+			if (p->flags & PF_KTHREAD || is_global_init(p))
+				continue;
+
+			task_lock(p);
+			if (!p->vfork_done && process_shares_mm(p, mm)) {
+				pr_info("updating oom_score_adj for %d (%s) from %d to %d because it shares mm with %d (%s). Report if this is unexpected.\n",
+						task_pid_nr(p), p->comm,
+						p->signal->oom_score_adj, oom_adj,
+						task_pid_nr(task), task->comm);
+				p->signal->oom_score_adj = oom_adj;
+				if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
+					p->signal->oom_score_adj_min = (short)oom_adj;
+			}
+			task_unlock(p);
+		}
+		rcu_read_unlock();
+		mmdrop(mm);
+	}
 err_unlock:
 	mutex_unlock(&oom_adj_mutex);
 	put_task_struct(task);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2e500ce06387..0c468aa2ea61 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2255,6 +2255,8 @@ static inline int in_gate_area(struct mm_struct *mm, unsigned long addr)
 }
 #endif	/* __HAVE_ARCH_GATE_AREA */
 
+extern bool process_shares_mm(struct task_struct *p, struct mm_struct *mm);
+
 #ifdef CONFIG_SYSCTL
 extern int sysctl_drop_caches;
 int drop_caches_sysctl_handler(struct ctl_table *, int,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d4a929d79470..d8220c5603a5 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -415,7 +415,7 @@ bool oom_killer_disabled __read_mostly;
  * task's threads: if one of those is using this mm then this task was also
  * using it.
  */
-static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
+bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 {
 	struct task_struct *t;
 
-- 
2.8.1

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 04/10] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
@ 2016-06-20 12:43   ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

oom_score_adj is shared for the thread groups (via struct signal) but
this is not sufficient to cover processes sharing mm (CLONE_VM without
CLONE_SIGHAND) and so we can easily end up in a situation when some
processes update their oom_score_adj and confuse the oom killer. In the
worst case some of those processes might hide from the oom killer altogether
via OOM_SCORE_ADJ_MIN while others are eligible. OOM killer would then
pick up those eligible but won't be allowed to kill others sharing the
same mm so the mm wouldn't release the mm and so the memory.

It would be ideal to have the oom_score_adj per mm_struct because that
is the natural entity OOM killer considers. But this will not work
because some programs are doing
	vfork()
	set_oom_adj()
	exec()

We can achieve the same though. oom_score_adj write handler can set the
oom_score_adj for all processes sharing the same mm if the task is not
in the middle of vfork. As a result all the processes will share the
same oom_score_adj. The current implementation is rather pessimistic
and checks all the existing processes by default if there is more than
1 holder of the mm but we do not have any reliable way to check for
external users yet.

Changes since v2
- skip over same thread group
- skip over kernel threads and global init

Changes since v1
- note that we are changing oom_score_adj outside of the thread group
  to the log

Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c     | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm.h |  2 ++
 mm/oom_kill.c      |  2 +-
 3 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index a6a8fbdd5a1b..c986e92680e1 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1040,6 +1040,7 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
 static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 {
 	static DEFINE_MUTEX(oom_adj_mutex);
+	struct mm_struct *mm = NULL;
 	struct task_struct *task;
 	int err = 0;
 
@@ -1069,10 +1070,55 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 		}
 	}
 
+	/*
+	 * Make sure we will check other processes sharing the mm if this is
+	 * not vfrok which wants its own oom_score_adj.
+	 * pin the mm so it doesn't go away and get reused after task_unlock
+	 */
+	if (!task->vfork_done) {
+		struct task_struct *p = find_lock_task_mm(task);
+
+		if (p) {
+			if (atomic_read(&p->mm->mm_users) > 1) {
+				mm = p->mm;
+				atomic_inc(&mm->mm_count);
+			}
+			task_unlock(p);
+		}
+	}
+
 	task->signal->oom_score_adj = oom_adj;
 	if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
 		task->signal->oom_score_adj_min = (short)oom_adj;
 	trace_oom_score_adj_update(task);
+
+	if (mm) {
+		struct task_struct *p;
+
+		rcu_read_lock();
+		for_each_process(p) {
+			if (same_thread_group(task, p))
+				continue;
+
+			/* do not touch kernel threads or the global init */
+			if (p->flags & PF_KTHREAD || is_global_init(p))
+				continue;
+
+			task_lock(p);
+			if (!p->vfork_done && process_shares_mm(p, mm)) {
+				pr_info("updating oom_score_adj for %d (%s) from %d to %d because it shares mm with %d (%s). Report if this is unexpected.\n",
+						task_pid_nr(p), p->comm,
+						p->signal->oom_score_adj, oom_adj,
+						task_pid_nr(task), task->comm);
+				p->signal->oom_score_adj = oom_adj;
+				if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
+					p->signal->oom_score_adj_min = (short)oom_adj;
+			}
+			task_unlock(p);
+		}
+		rcu_read_unlock();
+		mmdrop(mm);
+	}
 err_unlock:
 	mutex_unlock(&oom_adj_mutex);
 	put_task_struct(task);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2e500ce06387..0c468aa2ea61 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2255,6 +2255,8 @@ static inline int in_gate_area(struct mm_struct *mm, unsigned long addr)
 }
 #endif	/* __HAVE_ARCH_GATE_AREA */
 
+extern bool process_shares_mm(struct task_struct *p, struct mm_struct *mm);
+
 #ifdef CONFIG_SYSCTL
 extern int sysctl_drop_caches;
 int drop_caches_sysctl_handler(struct ctl_table *, int,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d4a929d79470..d8220c5603a5 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -415,7 +415,7 @@ bool oom_killer_disabled __read_mostly;
  * task's threads: if one of those is using this mm then this task was also
  * using it.
  */
-static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
+bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 {
 	struct task_struct *t;
 
-- 
2.8.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 05/10] mm, oom: skip vforked tasks from being selected
  2016-06-20 12:43 ` Michal Hocko
@ 2016-06-20 12:43   ` Michal Hocko
  -1 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

vforked tasks are not really sitting on any memory. They are sharing
the mm with parent until they exec into a new code. Until then it is
just pinning the address space. OOM killer will kill the vforked task
along with its parent but we still can end up selecting vforked task
when the parent wouldn't be selected. E.g. init doing vfork to launch
a task or vforked being a child of oom unkillable task with an updated
oom_score_adj to be killable.

Add a new helper to check whether a task is in the vfork sharing memory
with its parent and use it in oom_badness to skip over these tasks.

Changes since v1
- copy_process() doesn't disallow CLONE_VFORK without CLONE_VM, so with
  this patch it would be trivial to make the exploit which hides a
  memory hog from oom-killer - per Oleg
- comment in in_vfork by Oleg

Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/sched.h | 26 ++++++++++++++++++++++++++
 mm/oom_kill.c         |  6 ++++--
 2 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ec636400669f..7442f74b6d44 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1883,6 +1883,32 @@ extern int arch_task_struct_size __read_mostly;
 #define TNF_FAULT_LOCAL	0x08
 #define TNF_MIGRATE_FAIL 0x10
 
+static inline bool in_vfork(struct task_struct *tsk)
+{
+	bool ret;
+
+	/*
+	 * need RCU to access ->real_parent if CLONE_VM was used along with
+	 * CLONE_PARENT.
+	 *
+	 * We check real_parent->mm == tsk->mm because CLONE_VFORK does not
+	 * imply CLONE_VM
+	 *
+	 * CLONE_VFORK can be used with CLONE_PARENT/CLONE_THREAD and thus
+	 * ->real_parent is not necessarily the task doing vfork(), so in
+	 * theory we can't rely on task_lock() if we want to dereference it.
+	 *
+	 * And in this case we can't trust the real_parent->mm == tsk->mm
+	 * check, it can be false negative. But we do not care, if init or
+	 * another oom-unkillable task does this it should blame itself.
+	 */
+	rcu_read_lock();
+	ret = tsk->vfork_done && tsk->real_parent->mm == tsk->mm;
+	rcu_read_unlock();
+
+	return ret;
+}
+
 #ifdef CONFIG_NUMA_BALANCING
 extern void task_numa_fault(int last_node, int node, int pages, int flags);
 extern pid_t task_numa_group_id(struct task_struct *p);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d8220c5603a5..02da660b7c25 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -176,11 +176,13 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 
 	/*
 	 * Do not even consider tasks which are explicitly marked oom
-	 * unkillable or have been already oom reaped.
+	 * unkillable or have been already oom reaped or the are in
+	 * the middle of vfork
 	 */
 	adj = (long)p->signal->oom_score_adj;
 	if (adj == OOM_SCORE_ADJ_MIN ||
-			test_bit(MMF_OOM_REAPED, &p->mm->flags)) {
+			test_bit(MMF_OOM_REAPED, &p->mm->flags) ||
+			in_vfork(p)) {
 		task_unlock(p);
 		return 0;
 	}
-- 
2.8.1

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 05/10] mm, oom: skip vforked tasks from being selected
@ 2016-06-20 12:43   ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

vforked tasks are not really sitting on any memory. They are sharing
the mm with parent until they exec into a new code. Until then it is
just pinning the address space. OOM killer will kill the vforked task
along with its parent but we still can end up selecting vforked task
when the parent wouldn't be selected. E.g. init doing vfork to launch
a task or vforked being a child of oom unkillable task with an updated
oom_score_adj to be killable.

Add a new helper to check whether a task is in the vfork sharing memory
with its parent and use it in oom_badness to skip over these tasks.

Changes since v1
- copy_process() doesn't disallow CLONE_VFORK without CLONE_VM, so with
  this patch it would be trivial to make the exploit which hides a
  memory hog from oom-killer - per Oleg
- comment in in_vfork by Oleg

Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/sched.h | 26 ++++++++++++++++++++++++++
 mm/oom_kill.c         |  6 ++++--
 2 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ec636400669f..7442f74b6d44 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1883,6 +1883,32 @@ extern int arch_task_struct_size __read_mostly;
 #define TNF_FAULT_LOCAL	0x08
 #define TNF_MIGRATE_FAIL 0x10
 
+static inline bool in_vfork(struct task_struct *tsk)
+{
+	bool ret;
+
+	/*
+	 * need RCU to access ->real_parent if CLONE_VM was used along with
+	 * CLONE_PARENT.
+	 *
+	 * We check real_parent->mm == tsk->mm because CLONE_VFORK does not
+	 * imply CLONE_VM
+	 *
+	 * CLONE_VFORK can be used with CLONE_PARENT/CLONE_THREAD and thus
+	 * ->real_parent is not necessarily the task doing vfork(), so in
+	 * theory we can't rely on task_lock() if we want to dereference it.
+	 *
+	 * And in this case we can't trust the real_parent->mm == tsk->mm
+	 * check, it can be false negative. But we do not care, if init or
+	 * another oom-unkillable task does this it should blame itself.
+	 */
+	rcu_read_lock();
+	ret = tsk->vfork_done && tsk->real_parent->mm == tsk->mm;
+	rcu_read_unlock();
+
+	return ret;
+}
+
 #ifdef CONFIG_NUMA_BALANCING
 extern void task_numa_fault(int last_node, int node, int pages, int flags);
 extern pid_t task_numa_group_id(struct task_struct *p);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d8220c5603a5..02da660b7c25 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -176,11 +176,13 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 
 	/*
 	 * Do not even consider tasks which are explicitly marked oom
-	 * unkillable or have been already oom reaped.
+	 * unkillable or have been already oom reaped or the are in
+	 * the middle of vfork
 	 */
 	adj = (long)p->signal->oom_score_adj;
 	if (adj == OOM_SCORE_ADJ_MIN ||
-			test_bit(MMF_OOM_REAPED, &p->mm->flags)) {
+			test_bit(MMF_OOM_REAPED, &p->mm->flags) ||
+			in_vfork(p)) {
 		task_unlock(p);
 		return 0;
 	}
-- 
2.8.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 06/10] mm, oom: kill all tasks sharing the mm
  2016-06-20 12:43 ` Michal Hocko
@ 2016-06-20 12:43   ` Michal Hocko
  -1 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Currently oom_kill_process skips both the oom reaper and SIG_KILL if a
process sharing the same mm is unkillable via OOM_ADJUST_MIN. After "mm,
oom_adj: make sure processes sharing mm have same view of oom_score_adj"
all such processes are sharing the same value so we shouldn't see such a
task at all (oom_badness would rule them out).

We can still encounter oom disabled vforked task which has to be killed
as well if we want to have other tasks sharing the mm reapable
because it can access the memory before doing exec. Killing such a task
should be acceptable because it is highly unlikely it has done anything
useful because it cannot modify any memory before it calls exec. An
alternative would be to keep the task alive and skip the oom reaper and
risk all the weird corner cases where the OOM killer cannot make forward
progress because the oom victim hung somewhere on the way to exit.

[rientjes@google.com - drop printk when OOM_SCORE_ADJ_MIN killed task
 the setting is inherently racy and we cannot do much about it without
 introducing locks in hot paths]
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 02da660b7c25..38f89ac2df7f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -852,8 +852,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		if (same_thread_group(p, victim))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
-		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p)) {
 			/*
 			 * We cannot use oom_reaper for the mm shared by this
 			 * process because it wouldn't get killed and so the
-- 
2.8.1

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 06/10] mm, oom: kill all tasks sharing the mm
@ 2016-06-20 12:43   ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Currently oom_kill_process skips both the oom reaper and SIG_KILL if a
process sharing the same mm is unkillable via OOM_ADJUST_MIN. After "mm,
oom_adj: make sure processes sharing mm have same view of oom_score_adj"
all such processes are sharing the same value so we shouldn't see such a
task at all (oom_badness would rule them out).

We can still encounter oom disabled vforked task which has to be killed
as well if we want to have other tasks sharing the mm reapable
because it can access the memory before doing exec. Killing such a task
should be acceptable because it is highly unlikely it has done anything
useful because it cannot modify any memory before it calls exec. An
alternative would be to keep the task alive and skip the oom reaper and
risk all the weird corner cases where the OOM killer cannot make forward
progress because the oom victim hung somewhere on the way to exit.

[rientjes@google.com - drop printk when OOM_SCORE_ADJ_MIN killed task
 the setting is inherently racy and we cannot do much about it without
 introducing locks in hot paths]
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 02da660b7c25..38f89ac2df7f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -852,8 +852,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		if (same_thread_group(p, victim))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
-		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p)) {
 			/*
 			 * We cannot use oom_reaper for the mm shared by this
 			 * process because it wouldn't get killed and so the
-- 
2.8.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 07/10] mm, oom: fortify task_will_free_mem
  2016-06-20 12:43 ` Michal Hocko
@ 2016-06-20 12:43   ` Michal Hocko
  -1 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

task_will_free_mem is rather weak. It doesn't really tell whether
the task has chance to drop its mm. 98748bd72200 ("oom: consider
multi-threaded tasks in task_will_free_mem") made a first step
into making it more robust for multi-threaded applications so now we
know that the whole process is going down and probably drop the mm.

This patch builds on top for more complex scenarios where mm is shared
between different processes - CLONE_VM without CLONE_SIGHAND, or in
kernel use_mm().

Make sure that all processes sharing the mm are killed or exiting. This
will allow us to replace try_oom_reaper by wake_oom_reaper because
task_will_free_mem implies the task is reapable now. Therefore all paths
which bypass the oom killer are now reapable and so they shouldn't lock
up the oom killer.

Changes since v4
- do not call task_will_free_mem if current->mm is NULL to catch
  after exit_mm allocations - as per Tetsuo

Changes since v2 - per Oleg
- uninline task_will_free_mem and move it to oom proper
- reorganize checks in and simplify __task_will_free_mem
- add missing process_shares_mm in task_will_free_mem
- add same_thread_group to task_will_free_mem for clarity

Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/oom.h |  26 ++--------
 mm/memcontrol.c     |   4 +-
 mm/oom_kill.c       | 133 +++++++++++++++++++++++++++++++---------------------
 3 files changed, 85 insertions(+), 78 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 606137b3b778..5bc0457ee3a8 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -73,9 +73,9 @@ static inline bool oom_task_origin(const struct task_struct *p)
 extern void mark_oom_victim(struct task_struct *tsk);
 
 #ifdef CONFIG_MMU
-extern void try_oom_reaper(struct task_struct *tsk);
+extern void wake_oom_reaper(struct task_struct *tsk);
 #else
-static inline void try_oom_reaper(struct task_struct *tsk)
+static inline void wake_oom_reaper(struct task_struct *tsk)
 {
 }
 #endif
@@ -107,27 +107,7 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
-static inline bool task_will_free_mem(struct task_struct *task)
-{
-	struct signal_struct *sig = task->signal;
-
-	/*
-	 * A coredumping process may sleep for an extended period in exit_mm(),
-	 * so the oom killer cannot assume that the process will promptly exit
-	 * and release memory.
-	 */
-	if (sig->flags & SIGNAL_GROUP_COREDUMP)
-		return false;
-
-	if (!(task->flags & PF_EXITING))
-		return false;
-
-	/* Make sure that the whole thread group is going down */
-	if (!thread_group_empty(task) && !(sig->flags & SIGNAL_GROUP_EXIT))
-		return false;
-
-	return true;
-}
+bool task_will_free_mem(struct task_struct *task);
 
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 587aa0d110c9..b100215811ef 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1276,9 +1276,9 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
 	 */
-	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
+	if (task_will_free_mem(current)) {
 		mark_oom_victim(current);
-		try_oom_reaper(current);
+		wake_oom_reaper(current);
 		goto unlock;
 	}
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 38f89ac2df7f..8ee92fb76968 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -596,7 +596,7 @@ static int oom_reaper(void *unused)
 	return 0;
 }
 
-static void wake_oom_reaper(struct task_struct *tsk)
+void wake_oom_reaper(struct task_struct *tsk)
 {
 	if (!oom_reaper_th)
 		return;
@@ -614,46 +614,6 @@ static void wake_oom_reaper(struct task_struct *tsk)
 	wake_up(&oom_reaper_wait);
 }
 
-/* Check if we can reap the given task. This has to be called with stable
- * tsk->mm
- */
-void try_oom_reaper(struct task_struct *tsk)
-{
-	struct mm_struct *mm = tsk->mm;
-	struct task_struct *p;
-
-	if (!mm)
-		return;
-
-	/*
-	 * There might be other threads/processes which are either not
-	 * dying or even not killable.
-	 */
-	if (atomic_read(&mm->mm_users) > 1) {
-		rcu_read_lock();
-		for_each_process(p) {
-			if (!process_shares_mm(p, mm))
-				continue;
-			if (fatal_signal_pending(p))
-				continue;
-
-			/*
-			 * If the task is exiting make sure the whole thread group
-			 * is exiting and cannot acces mm anymore.
-			 */
-			if (signal_group_exit(p->signal))
-				continue;
-
-			/* Give up */
-			rcu_read_unlock();
-			return;
-		}
-		rcu_read_unlock();
-	}
-
-	wake_oom_reaper(tsk);
-}
-
 static int __init oom_init(void)
 {
 	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
@@ -665,10 +625,6 @@ static int __init oom_init(void)
 	return 0;
 }
 subsys_initcall(oom_init)
-#else
-static void wake_oom_reaper(struct task_struct *tsk)
-{
-}
 #endif
 
 /**
@@ -745,6 +701,81 @@ void oom_killer_enable(void)
 	oom_killer_disabled = false;
 }
 
+static inline bool __task_will_free_mem(struct task_struct *task)
+{
+	struct signal_struct *sig = task->signal;
+
+	/*
+	 * A coredumping process may sleep for an extended period in exit_mm(),
+	 * so the oom killer cannot assume that the process will promptly exit
+	 * and release memory.
+	 */
+	if (sig->flags & SIGNAL_GROUP_COREDUMP)
+		return false;
+
+	if (sig->flags & SIGNAL_GROUP_EXIT)
+		return true;
+
+	if (thread_group_empty(task) && (task->flags & PF_EXITING))
+		return true;
+
+	return false;
+}
+
+/*
+ * Checks whether the given task is dying or exiting and likely to
+ * release its address space. This means that all threads and processes
+ * sharing the same mm have to be killed or exiting.
+ */
+bool task_will_free_mem(struct task_struct *task)
+{
+	struct mm_struct *mm;
+	struct task_struct *p;
+	bool ret;
+
+	if (!__task_will_free_mem(task))
+		return false;
+
+	/*
+	 * If the process has passed exit_mm we have to skip it because
+	 * we have lost a link to other tasks sharing this mm, we do not
+	 * have anything to reap and the task might then get stuck waiting
+	 * for parent as zombie and we do not want it to hold TIF_MEMDIE
+	 */
+	p = find_lock_task_mm(task);
+	if (!p)
+		return false;
+
+	mm = p->mm;
+	if (atomic_read(&mm->mm_users) <= 1) {
+		task_unlock(p);
+		return true;
+	}
+
+	/* pin the mm to not get freed and reused */
+	atomic_inc(&mm->mm_count);
+	task_unlock(p);
+
+	/*
+	 * This is really pessimistic but we do not have any reliable way
+	 * to check that external processes share with our mm
+	 */
+	rcu_read_lock();
+	for_each_process(p) {
+		if (!process_shares_mm(p, mm))
+			continue;
+		if (same_thread_group(task, p))
+			continue;
+		ret = __task_will_free_mem(p);
+		if (!ret)
+			break;
+	}
+	rcu_read_unlock();
+	mmdrop(mm);
+
+	return ret;
+}
+
 /*
  * Must be called while holding a reference to p, which will be released upon
  * returning.
@@ -766,15 +797,12 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
-	task_lock(p);
-	if (p->mm && task_will_free_mem(p)) {
+	if (task_will_free_mem(p)) {
 		mark_oom_victim(p);
-		try_oom_reaper(p);
-		task_unlock(p);
+		wake_oom_reaper(p);
 		put_task_struct(p);
 		return;
 	}
-	task_unlock(p);
 
 	if (__ratelimit(&oom_rs))
 		dump_header(oc, p);
@@ -944,10 +972,9 @@ bool out_of_memory(struct oom_control *oc)
 	 * But don't select if current has already released its mm and cleared
 	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
 	 */
-	if (current->mm &&
-	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
+	if (current->mm && task_will_free_mem(current)) {
 		mark_oom_victim(current);
-		try_oom_reaper(current);
+		wake_oom_reaper(current);
 		return true;
 	}
 
-- 
2.8.1

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 07/10] mm, oom: fortify task_will_free_mem
@ 2016-06-20 12:43   ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

task_will_free_mem is rather weak. It doesn't really tell whether
the task has chance to drop its mm. 98748bd72200 ("oom: consider
multi-threaded tasks in task_will_free_mem") made a first step
into making it more robust for multi-threaded applications so now we
know that the whole process is going down and probably drop the mm.

This patch builds on top for more complex scenarios where mm is shared
between different processes - CLONE_VM without CLONE_SIGHAND, or in
kernel use_mm().

Make sure that all processes sharing the mm are killed or exiting. This
will allow us to replace try_oom_reaper by wake_oom_reaper because
task_will_free_mem implies the task is reapable now. Therefore all paths
which bypass the oom killer are now reapable and so they shouldn't lock
up the oom killer.

Changes since v4
- do not call task_will_free_mem if current->mm is NULL to catch
  after exit_mm allocations - as per Tetsuo

Changes since v2 - per Oleg
- uninline task_will_free_mem and move it to oom proper
- reorganize checks in and simplify __task_will_free_mem
- add missing process_shares_mm in task_will_free_mem
- add same_thread_group to task_will_free_mem for clarity

Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/oom.h |  26 ++--------
 mm/memcontrol.c     |   4 +-
 mm/oom_kill.c       | 133 +++++++++++++++++++++++++++++++---------------------
 3 files changed, 85 insertions(+), 78 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 606137b3b778..5bc0457ee3a8 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -73,9 +73,9 @@ static inline bool oom_task_origin(const struct task_struct *p)
 extern void mark_oom_victim(struct task_struct *tsk);
 
 #ifdef CONFIG_MMU
-extern void try_oom_reaper(struct task_struct *tsk);
+extern void wake_oom_reaper(struct task_struct *tsk);
 #else
-static inline void try_oom_reaper(struct task_struct *tsk)
+static inline void wake_oom_reaper(struct task_struct *tsk)
 {
 }
 #endif
@@ -107,27 +107,7 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
-static inline bool task_will_free_mem(struct task_struct *task)
-{
-	struct signal_struct *sig = task->signal;
-
-	/*
-	 * A coredumping process may sleep for an extended period in exit_mm(),
-	 * so the oom killer cannot assume that the process will promptly exit
-	 * and release memory.
-	 */
-	if (sig->flags & SIGNAL_GROUP_COREDUMP)
-		return false;
-
-	if (!(task->flags & PF_EXITING))
-		return false;
-
-	/* Make sure that the whole thread group is going down */
-	if (!thread_group_empty(task) && !(sig->flags & SIGNAL_GROUP_EXIT))
-		return false;
-
-	return true;
-}
+bool task_will_free_mem(struct task_struct *task);
 
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 587aa0d110c9..b100215811ef 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1276,9 +1276,9 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
 	 */
-	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
+	if (task_will_free_mem(current)) {
 		mark_oom_victim(current);
-		try_oom_reaper(current);
+		wake_oom_reaper(current);
 		goto unlock;
 	}
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 38f89ac2df7f..8ee92fb76968 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -596,7 +596,7 @@ static int oom_reaper(void *unused)
 	return 0;
 }
 
-static void wake_oom_reaper(struct task_struct *tsk)
+void wake_oom_reaper(struct task_struct *tsk)
 {
 	if (!oom_reaper_th)
 		return;
@@ -614,46 +614,6 @@ static void wake_oom_reaper(struct task_struct *tsk)
 	wake_up(&oom_reaper_wait);
 }
 
-/* Check if we can reap the given task. This has to be called with stable
- * tsk->mm
- */
-void try_oom_reaper(struct task_struct *tsk)
-{
-	struct mm_struct *mm = tsk->mm;
-	struct task_struct *p;
-
-	if (!mm)
-		return;
-
-	/*
-	 * There might be other threads/processes which are either not
-	 * dying or even not killable.
-	 */
-	if (atomic_read(&mm->mm_users) > 1) {
-		rcu_read_lock();
-		for_each_process(p) {
-			if (!process_shares_mm(p, mm))
-				continue;
-			if (fatal_signal_pending(p))
-				continue;
-
-			/*
-			 * If the task is exiting make sure the whole thread group
-			 * is exiting and cannot acces mm anymore.
-			 */
-			if (signal_group_exit(p->signal))
-				continue;
-
-			/* Give up */
-			rcu_read_unlock();
-			return;
-		}
-		rcu_read_unlock();
-	}
-
-	wake_oom_reaper(tsk);
-}
-
 static int __init oom_init(void)
 {
 	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
@@ -665,10 +625,6 @@ static int __init oom_init(void)
 	return 0;
 }
 subsys_initcall(oom_init)
-#else
-static void wake_oom_reaper(struct task_struct *tsk)
-{
-}
 #endif
 
 /**
@@ -745,6 +701,81 @@ void oom_killer_enable(void)
 	oom_killer_disabled = false;
 }
 
+static inline bool __task_will_free_mem(struct task_struct *task)
+{
+	struct signal_struct *sig = task->signal;
+
+	/*
+	 * A coredumping process may sleep for an extended period in exit_mm(),
+	 * so the oom killer cannot assume that the process will promptly exit
+	 * and release memory.
+	 */
+	if (sig->flags & SIGNAL_GROUP_COREDUMP)
+		return false;
+
+	if (sig->flags & SIGNAL_GROUP_EXIT)
+		return true;
+
+	if (thread_group_empty(task) && (task->flags & PF_EXITING))
+		return true;
+
+	return false;
+}
+
+/*
+ * Checks whether the given task is dying or exiting and likely to
+ * release its address space. This means that all threads and processes
+ * sharing the same mm have to be killed or exiting.
+ */
+bool task_will_free_mem(struct task_struct *task)
+{
+	struct mm_struct *mm;
+	struct task_struct *p;
+	bool ret;
+
+	if (!__task_will_free_mem(task))
+		return false;
+
+	/*
+	 * If the process has passed exit_mm we have to skip it because
+	 * we have lost a link to other tasks sharing this mm, we do not
+	 * have anything to reap and the task might then get stuck waiting
+	 * for parent as zombie and we do not want it to hold TIF_MEMDIE
+	 */
+	p = find_lock_task_mm(task);
+	if (!p)
+		return false;
+
+	mm = p->mm;
+	if (atomic_read(&mm->mm_users) <= 1) {
+		task_unlock(p);
+		return true;
+	}
+
+	/* pin the mm to not get freed and reused */
+	atomic_inc(&mm->mm_count);
+	task_unlock(p);
+
+	/*
+	 * This is really pessimistic but we do not have any reliable way
+	 * to check that external processes share with our mm
+	 */
+	rcu_read_lock();
+	for_each_process(p) {
+		if (!process_shares_mm(p, mm))
+			continue;
+		if (same_thread_group(task, p))
+			continue;
+		ret = __task_will_free_mem(p);
+		if (!ret)
+			break;
+	}
+	rcu_read_unlock();
+	mmdrop(mm);
+
+	return ret;
+}
+
 /*
  * Must be called while holding a reference to p, which will be released upon
  * returning.
@@ -766,15 +797,12 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
-	task_lock(p);
-	if (p->mm && task_will_free_mem(p)) {
+	if (task_will_free_mem(p)) {
 		mark_oom_victim(p);
-		try_oom_reaper(p);
-		task_unlock(p);
+		wake_oom_reaper(p);
 		put_task_struct(p);
 		return;
 	}
-	task_unlock(p);
 
 	if (__ratelimit(&oom_rs))
 		dump_header(oc, p);
@@ -944,10 +972,9 @@ bool out_of_memory(struct oom_control *oc)
 	 * But don't select if current has already released its mm and cleared
 	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
 	 */
-	if (current->mm &&
-	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
+	if (current->mm && task_will_free_mem(current)) {
 		mark_oom_victim(current);
-		try_oom_reaper(current);
+		wake_oom_reaper(current);
 		return true;
 	}
 
-- 
2.8.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 08/10] mm, oom: task_will_free_mem should skip oom_reaped tasks
  2016-06-20 12:43 ` Michal Hocko
@ 2016-06-20 12:43   ` Michal Hocko
  -1 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

0-day robot has encountered the following:
[   82.694232] Out of memory: Kill process 3914 (trinity-c0) score 167 or sacrifice child
[   82.695110] Killed process 3914 (trinity-c0) total-vm:55864kB, anon-rss:1512kB, file-rss:1088kB, shmem-rss:25616kB
[   82.706724] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26488kB
[   82.715540] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
[   82.717662] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
[   82.725804] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:27296kB
[   82.739091] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:28148kB

oom_reaper is trying to reap the same task again and again. This
is possible only when the oom killer is bypassed because of
task_will_free_mem because we skip over tasks with MMF_OOM_REAPED
already set during select_bad_process. Teach task_will_free_mem to skip
over MMF_OOM_REAPED tasks as well because they will be unlikely to free
anything more.

Analyzed-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8ee92fb76968..36d5dd88d990 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -747,6 +747,16 @@ bool task_will_free_mem(struct task_struct *task)
 		return false;
 
 	mm = p->mm;
+
+	/*
+	 * This task has already been drained by the oom reaper so there are
+	 * only small chances it will free some more
+	 */
+	if (test_bit(MMF_OOM_REAPED, &mm->flags)) {
+		task_unlock(p);
+		return false;
+	}
+
 	if (atomic_read(&mm->mm_users) <= 1) {
 		task_unlock(p);
 		return true;
-- 
2.8.1

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 08/10] mm, oom: task_will_free_mem should skip oom_reaped tasks
@ 2016-06-20 12:43   ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

0-day robot has encountered the following:
[   82.694232] Out of memory: Kill process 3914 (trinity-c0) score 167 or sacrifice child
[   82.695110] Killed process 3914 (trinity-c0) total-vm:55864kB, anon-rss:1512kB, file-rss:1088kB, shmem-rss:25616kB
[   82.706724] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26488kB
[   82.715540] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
[   82.717662] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
[   82.725804] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:27296kB
[   82.739091] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:28148kB

oom_reaper is trying to reap the same task again and again. This
is possible only when the oom killer is bypassed because of
task_will_free_mem because we skip over tasks with MMF_OOM_REAPED
already set during select_bad_process. Teach task_will_free_mem to skip
over MMF_OOM_REAPED tasks as well because they will be unlikely to free
anything more.

Analyzed-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8ee92fb76968..36d5dd88d990 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -747,6 +747,16 @@ bool task_will_free_mem(struct task_struct *task)
 		return false;
 
 	mm = p->mm;
+
+	/*
+	 * This task has already been drained by the oom reaper so there are
+	 * only small chances it will free some more
+	 */
+	if (test_bit(MMF_OOM_REAPED, &mm->flags)) {
+		task_unlock(p);
+		return false;
+	}
+
 	if (atomic_read(&mm->mm_users) <= 1) {
 		task_unlock(p);
 		return true;
-- 
2.8.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 09/10] mm, oom_reaper: do not attempt to reap a task more than twice
  2016-06-20 12:43 ` Michal Hocko
@ 2016-06-20 12:43   ` Michal Hocko
  -1 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

oom_reaper relies on the mmap_sem for read to do its job. Many places
which might block readers have been converted to use down_write_killable
and that has reduced chances of the contention a lot. Some paths where
the mmap_sem is held for write can take other locks and they might
either be not prepared to fail due to fatal signal pending or too
impractical to be changed.

This patch introduces MMF_OOM_NOT_REAPABLE flag which gets set after the
first attempt to reap a task's mm fails. If the flag is present after
the failure then we set MMF_OOM_REAPED to hide this mm from the oom
killer completely so it can go and chose another victim.

As a result a risk of OOM deadlock when the oom victim would be blocked
indefinetly and so the oom killer cannot make any progress should be
mitigated considerably while we still try really hard to perform all
reclaim attempts and stay predictable in the behavior.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/sched.h |  1 +
 mm/oom_kill.c         | 19 +++++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7442f74b6d44..6d81a1eb974a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -512,6 +512,7 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
 #define MMF_OOM_REAPED		21	/* mm has been already reaped */
+#define MMF_OOM_NOT_REAPABLE	22	/* mm couldn't be reaped */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 36d5dd88d990..bfddc93ccd34 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -556,8 +556,27 @@ static void oom_reap_task(struct task_struct *tsk)
 		schedule_timeout_idle(HZ/10);
 
 	if (attempts > MAX_OOM_REAP_RETRIES) {
+		struct task_struct *p;
+
 		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
 				task_pid_nr(tsk), tsk->comm);
+
+		/*
+		 * If we've already tried to reap this task in the past and
+		 * failed it probably doesn't make much sense to try yet again
+		 * so hide the mm from the oom killer so that it can move on
+		 * to another task with a different mm struct.
+		 */
+		p = find_lock_task_mm(tsk);
+		if (p) {
+			if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) {
+				pr_info("oom_reaper: giving up pid:%d (%s)\n",
+						task_pid_nr(tsk), tsk->comm);
+				set_bit(MMF_OOM_REAPED, &p->mm->flags);
+			}
+			task_unlock(p);
+		}
+
 		debug_show_all_locks();
 	}
 
-- 
2.8.1

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 09/10] mm, oom_reaper: do not attempt to reap a task more than twice
@ 2016-06-20 12:43   ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

oom_reaper relies on the mmap_sem for read to do its job. Many places
which might block readers have been converted to use down_write_killable
and that has reduced chances of the contention a lot. Some paths where
the mmap_sem is held for write can take other locks and they might
either be not prepared to fail due to fatal signal pending or too
impractical to be changed.

This patch introduces MMF_OOM_NOT_REAPABLE flag which gets set after the
first attempt to reap a task's mm fails. If the flag is present after
the failure then we set MMF_OOM_REAPED to hide this mm from the oom
killer completely so it can go and chose another victim.

As a result a risk of OOM deadlock when the oom victim would be blocked
indefinetly and so the oom killer cannot make any progress should be
mitigated considerably while we still try really hard to perform all
reclaim attempts and stay predictable in the behavior.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/sched.h |  1 +
 mm/oom_kill.c         | 19 +++++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7442f74b6d44..6d81a1eb974a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -512,6 +512,7 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
 #define MMF_OOM_REAPED		21	/* mm has been already reaped */
+#define MMF_OOM_NOT_REAPABLE	22	/* mm couldn't be reaped */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 36d5dd88d990..bfddc93ccd34 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -556,8 +556,27 @@ static void oom_reap_task(struct task_struct *tsk)
 		schedule_timeout_idle(HZ/10);
 
 	if (attempts > MAX_OOM_REAP_RETRIES) {
+		struct task_struct *p;
+
 		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
 				task_pid_nr(tsk), tsk->comm);
+
+		/*
+		 * If we've already tried to reap this task in the past and
+		 * failed it probably doesn't make much sense to try yet again
+		 * so hide the mm from the oom killer so that it can move on
+		 * to another task with a different mm struct.
+		 */
+		p = find_lock_task_mm(tsk);
+		if (p) {
+			if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) {
+				pr_info("oom_reaper: giving up pid:%d (%s)\n",
+						task_pid_nr(tsk), tsk->comm);
+				set_bit(MMF_OOM_REAPED, &p->mm->flags);
+			}
+			task_unlock(p);
+		}
+
 		debug_show_all_locks();
 	}
 
-- 
2.8.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init
  2016-06-20 12:43 ` Michal Hocko
@ 2016-06-20 12:43   ` Michal Hocko
  -1 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

The only case where the oom_reaper is not triggered for the oom victim
is when it shares the memory with a kernel thread (aka use_mm) or with
the global init. After "mm, oom: skip vforked tasks from being selected"
the victim cannot be a vforked task of the global init so we are left
with clone(CLONE_VM) (without CLONE_SIGHAND). use_mm() users are quite
rare as well.

In order to guarantee a forward progress for the OOM killer make
sure that this really rare cases will not get into the way and hide
the mm from the oom killer by setting MMF_OOM_REAPED flag for it.
oom_scan_process_thread will ignore any TIF_MEMDIE task if it has
MMF_OOM_REAPED flag set to catch these oom victims.

After this patch we should guarantee a forward progress for the OOM
killer even when the selected victim is sharing memory with a kernel
thread or global init.

Changes since v1
- do not exit_oom_victim because oom_scan_process_thread will handle
  those which couldn't terminate in time. exit_oom_victim is not safe
  wrt. oom_disable synchronization.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index bfddc93ccd34..4c21f744daa6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -283,10 +283,22 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 
 	/*
 	 * This task already has access to memory reserves and is being killed.
-	 * Don't allow any other task to have access to the reserves.
+	 * Don't allow any other task to have access to the reserves unless
+	 * the task has MMF_OOM_REAPED because chances that it would release
+	 * any memory is quite low.
 	 */
-	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims))
-		return OOM_SCAN_ABORT;
+	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) {
+		struct task_struct *p = find_lock_task_mm(task);
+		enum oom_scan_t ret = OOM_SCAN_ABORT;
+
+		if (p) {
+			if (test_bit(MMF_OOM_REAPED, &p->mm->flags))
+				ret = OOM_SCAN_CONTINUE;
+			task_unlock(p);
+		}
+
+		return ret;
+	}
 
 	/*
 	 * If task is allocating a lot of memory and has been marked to be
@@ -913,9 +925,14 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			/*
 			 * We cannot use oom_reaper for the mm shared by this
 			 * process because it wouldn't get killed and so the
-			 * memory might be still used.
+			 * memory might be still used. Hide the mm from the oom
+			 * killer to guarantee OOM forward progress.
 			 */
 			can_oom_reap = false;
+			set_bit(MMF_OOM_REAPED, &mm->flags);
+			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
+					task_pid_nr(victim), victim->comm,
+					task_pid_nr(p), p->comm);
 			continue;
 		}
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
-- 
2.8.1

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init
@ 2016-06-20 12:43   ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

The only case where the oom_reaper is not triggered for the oom victim
is when it shares the memory with a kernel thread (aka use_mm) or with
the global init. After "mm, oom: skip vforked tasks from being selected"
the victim cannot be a vforked task of the global init so we are left
with clone(CLONE_VM) (without CLONE_SIGHAND). use_mm() users are quite
rare as well.

In order to guarantee a forward progress for the OOM killer make
sure that this really rare cases will not get into the way and hide
the mm from the oom killer by setting MMF_OOM_REAPED flag for it.
oom_scan_process_thread will ignore any TIF_MEMDIE task if it has
MMF_OOM_REAPED flag set to catch these oom victims.

After this patch we should guarantee a forward progress for the OOM
killer even when the selected victim is sharing memory with a kernel
thread or global init.

Changes since v1
- do not exit_oom_victim because oom_scan_process_thread will handle
  those which couldn't terminate in time. exit_oom_victim is not safe
  wrt. oom_disable synchronization.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index bfddc93ccd34..4c21f744daa6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -283,10 +283,22 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 
 	/*
 	 * This task already has access to memory reserves and is being killed.
-	 * Don't allow any other task to have access to the reserves.
+	 * Don't allow any other task to have access to the reserves unless
+	 * the task has MMF_OOM_REAPED because chances that it would release
+	 * any memory is quite low.
 	 */
-	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims))
-		return OOM_SCAN_ABORT;
+	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) {
+		struct task_struct *p = find_lock_task_mm(task);
+		enum oom_scan_t ret = OOM_SCAN_ABORT;
+
+		if (p) {
+			if (test_bit(MMF_OOM_REAPED, &p->mm->flags))
+				ret = OOM_SCAN_CONTINUE;
+			task_unlock(p);
+		}
+
+		return ret;
+	}
 
 	/*
 	 * If task is allocating a lot of memory and has been marked to be
@@ -913,9 +925,14 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			/*
 			 * We cannot use oom_reaper for the mm shared by this
 			 * process because it wouldn't get killed and so the
-			 * memory might be still used.
+			 * memory might be still used. Hide the mm from the oom
+			 * killer to guarantee OOM forward progress.
 			 */
 			can_oom_reap = false;
+			set_bit(MMF_OOM_REAPED, &mm->flags);
+			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
+					task_pid_nr(victim), victim->comm,
+					task_pid_nr(p), p->comm);
 			continue;
 		}
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
-- 
2.8.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init
  2016-06-20 12:43   ` Michal Hocko
@ 2016-07-19 12:05     ` Michal Hocko
  -1 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-07-19 12:05 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML

Andrew,

On Mon 20-06-16 14:43:48, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> The only case where the oom_reaper is not triggered for the oom victim
> is when it shares the memory with a kernel thread (aka use_mm) or with
> the global init. After "mm, oom: skip vforked tasks from being selected"
> the victim cannot be a vforked task of the global init so we are left
> with clone(CLONE_VM) (without CLONE_SIGHAND). use_mm() users are quite
> rare as well.
> 
> In order to guarantee a forward progress for the OOM killer make
> sure that this really rare cases will not get into the way and hide
> the mm from the oom killer by setting MMF_OOM_REAPED flag for it.
> oom_scan_process_thread will ignore any TIF_MEMDIE task if it has
> MMF_OOM_REAPED flag set to catch these oom victims.
> 
> After this patch we should guarantee a forward progress for the OOM
> killer even when the selected victim is sharing memory with a kernel
> thread or global init.

Could you replace the last two paragraphs with the following. Tetsuo
didn't like the guarantee mentioned there because that is a too strong
statement as find_lock_task_mm might not find any mm and so we still
could end up looping on the oom victim if it gets stuck somewhere in
__mmput. This particular patch didn't aim at closing that case. Plugging
that hole is planned later after the next upcoming merge window closes.

"
In order to help a forward progress for the OOM killer, make sure
that this really rare cases will not get into the way and hide
the mm from the oom killer by setting MMF_OOM_REAPED flag for it.
oom_scan_process_thread will ignore any TIF_MEMDIE task if it has
MMF_OOM_REAPED flag set to catch these oom victims.
		        
After this patch we should guarantee a forward progress for the OOM
killer even when the selected victim is sharing memory with a kernel
thread or global init as long as the victims mm is still alive.
"

> 
> Changes since v1
> - do not exit_oom_victim because oom_scan_process_thread will handle
>   those which couldn't terminate in time. exit_oom_victim is not safe
>   wrt. oom_disable synchronization.
> 
> Acked-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/oom_kill.c | 25 +++++++++++++++++++++----
>  1 file changed, 21 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index bfddc93ccd34..4c21f744daa6 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -283,10 +283,22 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
>  
>  	/*
>  	 * This task already has access to memory reserves and is being killed.
> -	 * Don't allow any other task to have access to the reserves.
> +	 * Don't allow any other task to have access to the reserves unless
> +	 * the task has MMF_OOM_REAPED because chances that it would release
> +	 * any memory is quite low.
>  	 */
> -	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims))
> -		return OOM_SCAN_ABORT;
> +	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) {
> +		struct task_struct *p = find_lock_task_mm(task);
> +		enum oom_scan_t ret = OOM_SCAN_ABORT;
> +
> +		if (p) {
> +			if (test_bit(MMF_OOM_REAPED, &p->mm->flags))
> +				ret = OOM_SCAN_CONTINUE;
> +			task_unlock(p);
> +		}
> +
> +		return ret;
> +	}
>  
>  	/*
>  	 * If task is allocating a lot of memory and has been marked to be
> @@ -913,9 +925,14 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  			/*
>  			 * We cannot use oom_reaper for the mm shared by this
>  			 * process because it wouldn't get killed and so the
> -			 * memory might be still used.
> +			 * memory might be still used. Hide the mm from the oom
> +			 * killer to guarantee OOM forward progress.
>  			 */
>  			can_oom_reap = false;
> +			set_bit(MMF_OOM_REAPED, &mm->flags);
> +			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
> +					task_pid_nr(victim), victim->comm,
> +					task_pid_nr(p), p->comm);
>  			continue;
>  		}
>  		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
> -- 
> 2.8.1
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init
@ 2016-07-19 12:05     ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-07-19 12:05 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML

Andrew,

On Mon 20-06-16 14:43:48, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> The only case where the oom_reaper is not triggered for the oom victim
> is when it shares the memory with a kernel thread (aka use_mm) or with
> the global init. After "mm, oom: skip vforked tasks from being selected"
> the victim cannot be a vforked task of the global init so we are left
> with clone(CLONE_VM) (without CLONE_SIGHAND). use_mm() users are quite
> rare as well.
> 
> In order to guarantee a forward progress for the OOM killer make
> sure that this really rare cases will not get into the way and hide
> the mm from the oom killer by setting MMF_OOM_REAPED flag for it.
> oom_scan_process_thread will ignore any TIF_MEMDIE task if it has
> MMF_OOM_REAPED flag set to catch these oom victims.
> 
> After this patch we should guarantee a forward progress for the OOM
> killer even when the selected victim is sharing memory with a kernel
> thread or global init.

Could you replace the last two paragraphs with the following. Tetsuo
didn't like the guarantee mentioned there because that is a too strong
statement as find_lock_task_mm might not find any mm and so we still
could end up looping on the oom victim if it gets stuck somewhere in
__mmput. This particular patch didn't aim at closing that case. Plugging
that hole is planned later after the next upcoming merge window closes.

"
In order to help a forward progress for the OOM killer, make sure
that this really rare cases will not get into the way and hide
the mm from the oom killer by setting MMF_OOM_REAPED flag for it.
oom_scan_process_thread will ignore any TIF_MEMDIE task if it has
MMF_OOM_REAPED flag set to catch these oom victims.
		        
After this patch we should guarantee a forward progress for the OOM
killer even when the selected victim is sharing memory with a kernel
thread or global init as long as the victims mm is still alive.
"

> 
> Changes since v1
> - do not exit_oom_victim because oom_scan_process_thread will handle
>   those which couldn't terminate in time. exit_oom_victim is not safe
>   wrt. oom_disable synchronization.
> 
> Acked-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/oom_kill.c | 25 +++++++++++++++++++++----
>  1 file changed, 21 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index bfddc93ccd34..4c21f744daa6 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -283,10 +283,22 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
>  
>  	/*
>  	 * This task already has access to memory reserves and is being killed.
> -	 * Don't allow any other task to have access to the reserves.
> +	 * Don't allow any other task to have access to the reserves unless
> +	 * the task has MMF_OOM_REAPED because chances that it would release
> +	 * any memory is quite low.
>  	 */
> -	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims))
> -		return OOM_SCAN_ABORT;
> +	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) {
> +		struct task_struct *p = find_lock_task_mm(task);
> +		enum oom_scan_t ret = OOM_SCAN_ABORT;
> +
> +		if (p) {
> +			if (test_bit(MMF_OOM_REAPED, &p->mm->flags))
> +				ret = OOM_SCAN_CONTINUE;
> +			task_unlock(p);
> +		}
> +
> +		return ret;
> +	}
>  
>  	/*
>  	 * If task is allocating a lot of memory and has been marked to be
> @@ -913,9 +925,14 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  			/*
>  			 * We cannot use oom_reaper for the mm shared by this
>  			 * process because it wouldn't get killed and so the
> -			 * memory might be still used.
> +			 * memory might be still used. Hide the mm from the oom
> +			 * killer to guarantee OOM forward progress.
>  			 */
>  			can_oom_reap = false;
> +			set_bit(MMF_OOM_REAPED, &mm->flags);
> +			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
> +					task_pid_nr(victim), victim->comm,
> +					task_pid_nr(p), p->comm);
>  			continue;
>  		}
>  		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
> -- 
> 2.8.1
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init
  2016-07-19 12:05     ` Michal Hocko
@ 2016-07-19 23:27       ` Andrew Morton
  -1 siblings, 0 replies; 28+ messages in thread
From: Andrew Morton @ 2016-07-19 23:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Oleg Nesterov,
	Vladimir Davydov, LKML

On Tue, 19 Jul 2016 14:05:39 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> > After this patch we should guarantee a forward progress for the OOM
> > killer even when the selected victim is sharing memory with a kernel
> > thread or global init.
> 
> Could you replace the last two paragraphs with the following. Tetsuo
> didn't like the guarantee mentioned there because that is a too strong
> statement as find_lock_task_mm might not find any mm and so we still
> could end up looping on the oom victim if it gets stuck somewhere in
> __mmput. This particular patch didn't aim at closing that case. Plugging
> that hole is planned later after the next upcoming merge window closes.
> 
> "
> In order to help a forward progress for the OOM killer, make sure
> that this really rare cases will not get into the way and hide
> the mm from the oom killer by setting MMF_OOM_REAPED flag for it.
> oom_scan_process_thread will ignore any TIF_MEMDIE task if it has
> MMF_OOM_REAPED flag set to catch these oom victims.
> 		        
> After this patch we should guarantee a forward progress for the OOM
> killer even when the selected victim is sharing memory with a kernel
> thread or global init as long as the victims mm is still alive.
> "

I tweaked it a bit:

: In order to help forward progress for the OOM killer, make sure that
: this really rare case will not get in the way - we do this by hiding
: the mm from the oom killer by setting MMF_OOM_REAPED flag for it. 
: oom_scan_process_thread will ignore any TIF_MEMDIE task if it has
: MMF_OOM_REAPED flag set to catch these oom victims.
: 
: After this patch we should guarantee forward progress for the OOM
: killer even when the selected victim is sharing memory with a kernel
: thread or global init as long as the victims mm is still alive.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init
@ 2016-07-19 23:27       ` Andrew Morton
  0 siblings, 0 replies; 28+ messages in thread
From: Andrew Morton @ 2016-07-19 23:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Oleg Nesterov,
	Vladimir Davydov, LKML

On Tue, 19 Jul 2016 14:05:39 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> > After this patch we should guarantee a forward progress for the OOM
> > killer even when the selected victim is sharing memory with a kernel
> > thread or global init.
> 
> Could you replace the last two paragraphs with the following. Tetsuo
> didn't like the guarantee mentioned there because that is a too strong
> statement as find_lock_task_mm might not find any mm and so we still
> could end up looping on the oom victim if it gets stuck somewhere in
> __mmput. This particular patch didn't aim at closing that case. Plugging
> that hole is planned later after the next upcoming merge window closes.
> 
> "
> In order to help a forward progress for the OOM killer, make sure
> that this really rare cases will not get into the way and hide
> the mm from the oom killer by setting MMF_OOM_REAPED flag for it.
> oom_scan_process_thread will ignore any TIF_MEMDIE task if it has
> MMF_OOM_REAPED flag set to catch these oom victims.
> 		        
> After this patch we should guarantee a forward progress for the OOM
> killer even when the selected victim is sharing memory with a kernel
> thread or global init as long as the victims mm is still alive.
> "

I tweaked it a bit:

: In order to help forward progress for the OOM killer, make sure that
: this really rare case will not get in the way - we do this by hiding
: the mm from the oom killer by setting MMF_OOM_REAPED flag for it. 
: oom_scan_process_thread will ignore any TIF_MEMDIE task if it has
: MMF_OOM_REAPED flag set to catch these oom victims.
: 
: After this patch we should guarantee forward progress for the OOM
: killer even when the selected victim is sharing memory with a kernel
: thread or global init as long as the victims mm is still alive.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init
  2016-07-19 23:27       ` Andrew Morton
@ 2016-07-20  6:29         ` Michal Hocko
  -1 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-07-20  6:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Oleg Nesterov,
	Vladimir Davydov, LKML

On Tue 19-07-16 16:27:59, Andrew Morton wrote:
> On Tue, 19 Jul 2016 14:05:39 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > > After this patch we should guarantee a forward progress for the OOM
> > > killer even when the selected victim is sharing memory with a kernel
> > > thread or global init.
> > 
> > Could you replace the last two paragraphs with the following. Tetsuo
> > didn't like the guarantee mentioned there because that is a too strong
> > statement as find_lock_task_mm might not find any mm and so we still
> > could end up looping on the oom victim if it gets stuck somewhere in
> > __mmput. This particular patch didn't aim at closing that case. Plugging
> > that hole is planned later after the next upcoming merge window closes.
> > 
> > "
> > In order to help a forward progress for the OOM killer, make sure
> > that this really rare cases will not get into the way and hide
> > the mm from the oom killer by setting MMF_OOM_REAPED flag for it.
> > oom_scan_process_thread will ignore any TIF_MEMDIE task if it has
> > MMF_OOM_REAPED flag set to catch these oom victims.
> > 		        
> > After this patch we should guarantee a forward progress for the OOM
> > killer even when the selected victim is sharing memory with a kernel
> > thread or global init as long as the victims mm is still alive.
> > "
> 
> I tweaked it a bit:
> 
> : In order to help forward progress for the OOM killer, make sure that
> : this really rare case will not get in the way - we do this by hiding
> : the mm from the oom killer by setting MMF_OOM_REAPED flag for it. 
> : oom_scan_process_thread will ignore any TIF_MEMDIE task if it has
> : MMF_OOM_REAPED flag set to catch these oom victims.
> : 
> : After this patch we should guarantee forward progress for the OOM
> : killer even when the selected victim is sharing memory with a kernel
> : thread or global init as long as the victims mm is still alive.

Sounds good to me. Thanks!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init
@ 2016-07-20  6:29         ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-07-20  6:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Oleg Nesterov,
	Vladimir Davydov, LKML

On Tue 19-07-16 16:27:59, Andrew Morton wrote:
> On Tue, 19 Jul 2016 14:05:39 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > > After this patch we should guarantee a forward progress for the OOM
> > > killer even when the selected victim is sharing memory with a kernel
> > > thread or global init.
> > 
> > Could you replace the last two paragraphs with the following. Tetsuo
> > didn't like the guarantee mentioned there because that is a too strong
> > statement as find_lock_task_mm might not find any mm and so we still
> > could end up looping on the oom victim if it gets stuck somewhere in
> > __mmput. This particular patch didn't aim at closing that case. Plugging
> > that hole is planned later after the next upcoming merge window closes.
> > 
> > "
> > In order to help a forward progress for the OOM killer, make sure
> > that this really rare cases will not get into the way and hide
> > the mm from the oom killer by setting MMF_OOM_REAPED flag for it.
> > oom_scan_process_thread will ignore any TIF_MEMDIE task if it has
> > MMF_OOM_REAPED flag set to catch these oom victims.
> > 		        
> > After this patch we should guarantee a forward progress for the OOM
> > killer even when the selected victim is sharing memory with a kernel
> > thread or global init as long as the victims mm is still alive.
> > "
> 
> I tweaked it a bit:
> 
> : In order to help forward progress for the OOM killer, make sure that
> : this really rare case will not get in the way - we do this by hiding
> : the mm from the oom killer by setting MMF_OOM_REAPED flag for it. 
> : oom_scan_process_thread will ignore any TIF_MEMDIE task if it has
> : MMF_OOM_REAPED flag set to catch these oom victims.
> : 
> : After this patch we should guarantee forward progress for the OOM
> : killer even when the selected victim is sharing memory with a kernel
> : thread or global init as long as the victims mm is still alive.

Sounds good to me. Thanks!

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2016-07-20  6:29 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-20 12:43 [PATCH 0/10 -v5] Handle oom bypass more gracefully Michal Hocko
2016-06-20 12:43 ` Michal Hocko
2016-06-20 12:43 ` [PATCH 01/10] proc, oom: drop bogus task_lock and mm check Michal Hocko
2016-06-20 12:43   ` Michal Hocko
2016-06-20 12:43 ` [PATCH 02/10] proc, oom: drop bogus sighand lock Michal Hocko
2016-06-20 12:43   ` Michal Hocko
2016-06-20 12:43 ` [PATCH 03/10] proc, oom_adj: extract oom_score_adj setting into a helper Michal Hocko
2016-06-20 12:43   ` Michal Hocko
2016-06-20 12:43 ` [PATCH 04/10] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj Michal Hocko
2016-06-20 12:43   ` Michal Hocko
2016-06-20 12:43 ` [PATCH 05/10] mm, oom: skip vforked tasks from being selected Michal Hocko
2016-06-20 12:43   ` Michal Hocko
2016-06-20 12:43 ` [PATCH 06/10] mm, oom: kill all tasks sharing the mm Michal Hocko
2016-06-20 12:43   ` Michal Hocko
2016-06-20 12:43 ` [PATCH 07/10] mm, oom: fortify task_will_free_mem Michal Hocko
2016-06-20 12:43   ` Michal Hocko
2016-06-20 12:43 ` [PATCH 08/10] mm, oom: task_will_free_mem should skip oom_reaped tasks Michal Hocko
2016-06-20 12:43   ` Michal Hocko
2016-06-20 12:43 ` [PATCH 09/10] mm, oom_reaper: do not attempt to reap a task more than twice Michal Hocko
2016-06-20 12:43   ` Michal Hocko
2016-06-20 12:43 ` [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init Michal Hocko
2016-06-20 12:43   ` Michal Hocko
2016-07-19 12:05   ` Michal Hocko
2016-07-19 12:05     ` Michal Hocko
2016-07-19 23:27     ` Andrew Morton
2016-07-19 23:27       ` Andrew Morton
2016-07-20  6:29       ` Michal Hocko
2016-07-20  6:29         ` Michal Hocko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.