* [PATCH 01/10] proc, oom: drop bogus task_lock and mm check
2016-06-20 12:43 [PATCH 0/10 -v5] Handle oom bypass more gracefully Michal Hocko
@ 2016-06-20 12:43 ` Michal Hocko
2016-06-20 12:43 ` [PATCH 02/10] proc, oom: drop bogus sighand lock Michal Hocko
` (8 subsequent siblings)
9 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw)
To: linux-mm
Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
Andrew Morton, LKML, Michal Hocko
From: Michal Hocko <mhocko@suse.com>
both oom_adj_write and oom_score_adj_write are using task_lock,
check for task->mm and fail if it is NULL. This is not needed because
the oom_score_adj is per signal struct so we do not need mm at all.
The code has been introduced by 3d5992d2ac7d ("oom: add per-mm oom
disable count") but we do not do per-mm oom disable since c9f01245b6a7
("oom: remove oom_disable_count").
The task->mm check is even not correct because the current thread might
have exited but the thread group might be still alive - e.g. thread
group leader would lead that echo $VAL > /proc/pid/oom_score_adj would
always fail with EINVAL while /proc/pid/task/$other_tid/oom_score_adj
would succeed. This is unexpected at best.
Remove the lock along with the check to fix the unexpected behavior
and also because there is not real need for the lock in the first place.
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
fs/proc/base.c | 22 ++++------------------
1 file changed, 4 insertions(+), 18 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index be73f4d0cb01..a6014e45c516 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1083,15 +1083,9 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
goto out;
}
- task_lock(task);
- if (!task->mm) {
- err = -EINVAL;
- goto err_task_lock;
- }
-
if (!lock_task_sighand(task, &flags)) {
err = -ESRCH;
- goto err_task_lock;
+ goto err_put_task;
}
/*
@@ -1121,8 +1115,7 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
trace_oom_score_adj_update(task);
err_sighand:
unlock_task_sighand(task, &flags);
-err_task_lock:
- task_unlock(task);
+err_put_task:
put_task_struct(task);
out:
return err < 0 ? err : count;
@@ -1186,15 +1179,9 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
goto out;
}
- task_lock(task);
- if (!task->mm) {
- err = -EINVAL;
- goto err_task_lock;
- }
-
if (!lock_task_sighand(task, &flags)) {
err = -ESRCH;
- goto err_task_lock;
+ goto err_put_task;
}
if ((short)oom_score_adj < task->signal->oom_score_adj_min &&
@@ -1210,8 +1197,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
err_sighand:
unlock_task_sighand(task, &flags);
-err_task_lock:
- task_unlock(task);
+err_put_task:
put_task_struct(task);
out:
return err < 0 ? err : count;
--
2.8.1
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 02/10] proc, oom: drop bogus sighand lock
2016-06-20 12:43 [PATCH 0/10 -v5] Handle oom bypass more gracefully Michal Hocko
2016-06-20 12:43 ` [PATCH 01/10] proc, oom: drop bogus task_lock and mm check Michal Hocko
@ 2016-06-20 12:43 ` Michal Hocko
2016-06-20 12:43 ` [PATCH 03/10] proc, oom_adj: extract oom_score_adj setting into a helper Michal Hocko
` (7 subsequent siblings)
9 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw)
To: linux-mm
Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
Andrew Morton, LKML, Michal Hocko
From: Michal Hocko <mhocko@suse.com>
Oleg has pointed out that can simplify both oom_adj_{read,write}
and oom_score_adj_{read,write} even further and drop the sighand
lock. The main purpose of the lock was to protect p->signal from
going away but this will not happen since ea6d290ca34c ("signals:
make task_struct->signal immutable/refcountable").
The other role of the lock was to synchronize different writers,
especially those with CAP_SYS_RESOURCE. Introduce a mutex for this
purpose. Later patches will need this lock anyway.
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
fs/proc/base.c | 51 +++++++++++++++++----------------------------------
1 file changed, 17 insertions(+), 34 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index a6014e45c516..968d5ea06e62 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1024,23 +1024,21 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
char buffer[PROC_NUMBUF];
int oom_adj = OOM_ADJUST_MIN;
size_t len;
- unsigned long flags;
if (!task)
return -ESRCH;
- if (lock_task_sighand(task, &flags)) {
- if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MAX)
- oom_adj = OOM_ADJUST_MAX;
- else
- oom_adj = (task->signal->oom_score_adj * -OOM_DISABLE) /
- OOM_SCORE_ADJ_MAX;
- unlock_task_sighand(task, &flags);
- }
+ if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MAX)
+ oom_adj = OOM_ADJUST_MAX;
+ else
+ oom_adj = (task->signal->oom_score_adj * -OOM_DISABLE) /
+ OOM_SCORE_ADJ_MAX;
put_task_struct(task);
len = snprintf(buffer, sizeof(buffer), "%d\n", oom_adj);
return simple_read_from_buffer(buf, count, ppos, buffer, len);
}
+static DEFINE_MUTEX(oom_adj_mutex);
+
/*
* /proc/pid/oom_adj exists solely for backwards compatibility with previous
* kernels. The effective policy is defined by oom_score_adj, which has a
@@ -1057,7 +1055,6 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
struct task_struct *task;
char buffer[PROC_NUMBUF];
int oom_adj;
- unsigned long flags;
int err;
memset(buffer, 0, sizeof(buffer));
@@ -1083,11 +1080,6 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
goto out;
}
- if (!lock_task_sighand(task, &flags)) {
- err = -ESRCH;
- goto err_put_task;
- }
-
/*
* Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
* value is always attainable.
@@ -1097,10 +1089,11 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
else
oom_adj = (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;
+ mutex_lock(&oom_adj_mutex);
if (oom_adj < task->signal->oom_score_adj &&
!capable(CAP_SYS_RESOURCE)) {
err = -EACCES;
- goto err_sighand;
+ goto err_unlock;
}
/*
@@ -1113,9 +1106,8 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
task->signal->oom_score_adj = oom_adj;
trace_oom_score_adj_update(task);
-err_sighand:
- unlock_task_sighand(task, &flags);
-err_put_task:
+err_unlock:
+ mutex_unlock(&oom_adj_mutex);
put_task_struct(task);
out:
return err < 0 ? err : count;
@@ -1133,15 +1125,11 @@ static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
struct task_struct *task = get_proc_task(file_inode(file));
char buffer[PROC_NUMBUF];
short oom_score_adj = OOM_SCORE_ADJ_MIN;
- unsigned long flags;
size_t len;
if (!task)
return -ESRCH;
- if (lock_task_sighand(task, &flags)) {
- oom_score_adj = task->signal->oom_score_adj;
- unlock_task_sighand(task, &flags);
- }
+ oom_score_adj = task->signal->oom_score_adj;
put_task_struct(task);
len = snprintf(buffer, sizeof(buffer), "%hd\n", oom_score_adj);
return simple_read_from_buffer(buf, count, ppos, buffer, len);
@@ -1152,7 +1140,6 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
{
struct task_struct *task;
char buffer[PROC_NUMBUF];
- unsigned long flags;
int oom_score_adj;
int err;
@@ -1179,25 +1166,21 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
goto out;
}
- if (!lock_task_sighand(task, &flags)) {
- err = -ESRCH;
- goto err_put_task;
- }
-
+ mutex_lock(&oom_adj_mutex);
if ((short)oom_score_adj < task->signal->oom_score_adj_min &&
!capable(CAP_SYS_RESOURCE)) {
err = -EACCES;
- goto err_sighand;
+ goto err_unlock;
}
task->signal->oom_score_adj = (short)oom_score_adj;
if (has_capability_noaudit(current, CAP_SYS_RESOURCE))
task->signal->oom_score_adj_min = (short)oom_score_adj;
+
trace_oom_score_adj_update(task);
-err_sighand:
- unlock_task_sighand(task, &flags);
-err_put_task:
+err_unlock:
+ mutex_unlock(&oom_adj_mutex);
put_task_struct(task);
out:
return err < 0 ? err : count;
--
2.8.1
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 03/10] proc, oom_adj: extract oom_score_adj setting into a helper
2016-06-20 12:43 [PATCH 0/10 -v5] Handle oom bypass more gracefully Michal Hocko
2016-06-20 12:43 ` [PATCH 01/10] proc, oom: drop bogus task_lock and mm check Michal Hocko
2016-06-20 12:43 ` [PATCH 02/10] proc, oom: drop bogus sighand lock Michal Hocko
@ 2016-06-20 12:43 ` Michal Hocko
2016-06-20 12:43 ` [PATCH 04/10] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj Michal Hocko
` (6 subsequent siblings)
9 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw)
To: linux-mm
Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
Andrew Morton, LKML, Michal Hocko
From: Michal Hocko <mhocko@suse.com>
Currently we have two proc interfaces to set oom_score_adj. The legacy
/proc/<pid>/oom_adj and /proc/<pid>/oom_score_adj which both have their
specific handlers. Big part of the logic is duplicated so extract the
common code into __set_oom_adj helper. Legacy knob still expects some
details slightly different so make sure those are handled same way - e.g.
the legacy mode ignores oom_score_adj_min and it warns about the usage.
This patch shouldn't introduce any functional changes.
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
fs/proc/base.c | 94 +++++++++++++++++++++++++++-------------------------------
1 file changed, 43 insertions(+), 51 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 968d5ea06e62..a6a8fbdd5a1b 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1037,7 +1037,47 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
return simple_read_from_buffer(buf, count, ppos, buffer, len);
}
-static DEFINE_MUTEX(oom_adj_mutex);
+static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
+{
+ static DEFINE_MUTEX(oom_adj_mutex);
+ struct task_struct *task;
+ int err = 0;
+
+ task = get_proc_task(file_inode(file));
+ if (!task)
+ return -ESRCH;
+
+ mutex_lock(&oom_adj_mutex);
+ if (legacy) {
+ if (oom_adj < task->signal->oom_score_adj &&
+ !capable(CAP_SYS_RESOURCE)) {
+ err = -EACCES;
+ goto err_unlock;
+ }
+ /*
+ * /proc/pid/oom_adj is provided for legacy purposes, ask users to use
+ * /proc/pid/oom_score_adj instead.
+ */
+ pr_warn_once("%s (%d): /proc/%d/oom_adj is deprecated, please use /proc/%d/oom_score_adj instead.\n",
+ current->comm, task_pid_nr(current), task_pid_nr(task),
+ task_pid_nr(task));
+ } else {
+ if ((short)oom_adj < task->signal->oom_score_adj_min &&
+ !capable(CAP_SYS_RESOURCE)) {
+ err = -EACCES;
+ goto err_unlock;
+ }
+ }
+
+ task->signal->oom_score_adj = oom_adj;
+ if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
+ task->signal->oom_score_adj_min = (short)oom_adj;
+ trace_oom_score_adj_update(task);
+err_unlock:
+ mutex_unlock(&oom_adj_mutex);
+ put_task_struct(task);
+ return err;
+}
/*
* /proc/pid/oom_adj exists solely for backwards compatibility with previous
@@ -1052,7 +1092,6 @@ static DEFINE_MUTEX(oom_adj_mutex);
static ssize_t oom_adj_write(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
- struct task_struct *task;
char buffer[PROC_NUMBUF];
int oom_adj;
int err;
@@ -1074,12 +1113,6 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
goto out;
}
- task = get_proc_task(file_inode(file));
- if (!task) {
- err = -ESRCH;
- goto out;
- }
-
/*
* Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
* value is always attainable.
@@ -1089,26 +1122,7 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
else
oom_adj = (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;
- mutex_lock(&oom_adj_mutex);
- if (oom_adj < task->signal->oom_score_adj &&
- !capable(CAP_SYS_RESOURCE)) {
- err = -EACCES;
- goto err_unlock;
- }
-
- /*
- * /proc/pid/oom_adj is provided for legacy purposes, ask users to use
- * /proc/pid/oom_score_adj instead.
- */
- pr_warn_once("%s (%d): /proc/%d/oom_adj is deprecated, please use /proc/%d/oom_score_adj instead.\n",
- current->comm, task_pid_nr(current), task_pid_nr(task),
- task_pid_nr(task));
-
- task->signal->oom_score_adj = oom_adj;
- trace_oom_score_adj_update(task);
-err_unlock:
- mutex_unlock(&oom_adj_mutex);
- put_task_struct(task);
+ err = __set_oom_adj(file, oom_adj, true);
out:
return err < 0 ? err : count;
}
@@ -1138,7 +1152,6 @@ static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
- struct task_struct *task;
char buffer[PROC_NUMBUF];
int oom_score_adj;
int err;
@@ -1160,28 +1173,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
goto out;
}
- task = get_proc_task(file_inode(file));
- if (!task) {
- err = -ESRCH;
- goto out;
- }
-
- mutex_lock(&oom_adj_mutex);
- if ((short)oom_score_adj < task->signal->oom_score_adj_min &&
- !capable(CAP_SYS_RESOURCE)) {
- err = -EACCES;
- goto err_unlock;
- }
-
- task->signal->oom_score_adj = (short)oom_score_adj;
- if (has_capability_noaudit(current, CAP_SYS_RESOURCE))
- task->signal->oom_score_adj_min = (short)oom_score_adj;
-
- trace_oom_score_adj_update(task);
-
-err_unlock:
- mutex_unlock(&oom_adj_mutex);
- put_task_struct(task);
+ err = __set_oom_adj(file, oom_score_adj, false);
out:
return err < 0 ? err : count;
}
--
2.8.1
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 04/10] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
2016-06-20 12:43 [PATCH 0/10 -v5] Handle oom bypass more gracefully Michal Hocko
` (2 preceding siblings ...)
2016-06-20 12:43 ` [PATCH 03/10] proc, oom_adj: extract oom_score_adj setting into a helper Michal Hocko
@ 2016-06-20 12:43 ` Michal Hocko
2016-06-20 12:43 ` [PATCH 05/10] mm, oom: skip vforked tasks from being selected Michal Hocko
` (5 subsequent siblings)
9 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw)
To: linux-mm
Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
Andrew Morton, LKML, Michal Hocko
From: Michal Hocko <mhocko@suse.com>
oom_score_adj is shared for the thread groups (via struct signal) but
this is not sufficient to cover processes sharing mm (CLONE_VM without
CLONE_SIGHAND) and so we can easily end up in a situation when some
processes update their oom_score_adj and confuse the oom killer. In the
worst case some of those processes might hide from the oom killer altogether
via OOM_SCORE_ADJ_MIN while others are eligible. OOM killer would then
pick up those eligible but won't be allowed to kill others sharing the
same mm so the mm wouldn't release the mm and so the memory.
It would be ideal to have the oom_score_adj per mm_struct because that
is the natural entity OOM killer considers. But this will not work
because some programs are doing
vfork()
set_oom_adj()
exec()
We can achieve the same though. oom_score_adj write handler can set the
oom_score_adj for all processes sharing the same mm if the task is not
in the middle of vfork. As a result all the processes will share the
same oom_score_adj. The current implementation is rather pessimistic
and checks all the existing processes by default if there is more than
1 holder of the mm but we do not have any reliable way to check for
external users yet.
Changes since v2
- skip over same thread group
- skip over kernel threads and global init
Changes since v1
- note that we are changing oom_score_adj outside of the thread group
to the log
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
fs/proc/base.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
include/linux/mm.h | 2 ++
mm/oom_kill.c | 2 +-
3 files changed, 49 insertions(+), 1 deletion(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index a6a8fbdd5a1b..c986e92680e1 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1040,6 +1040,7 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
{
static DEFINE_MUTEX(oom_adj_mutex);
+ struct mm_struct *mm = NULL;
struct task_struct *task;
int err = 0;
@@ -1069,10 +1070,55 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
}
}
+ /*
+ * Make sure we will check other processes sharing the mm if this is
+ * not vfrok which wants its own oom_score_adj.
+ * pin the mm so it doesn't go away and get reused after task_unlock
+ */
+ if (!task->vfork_done) {
+ struct task_struct *p = find_lock_task_mm(task);
+
+ if (p) {
+ if (atomic_read(&p->mm->mm_users) > 1) {
+ mm = p->mm;
+ atomic_inc(&mm->mm_count);
+ }
+ task_unlock(p);
+ }
+ }
+
task->signal->oom_score_adj = oom_adj;
if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
task->signal->oom_score_adj_min = (short)oom_adj;
trace_oom_score_adj_update(task);
+
+ if (mm) {
+ struct task_struct *p;
+
+ rcu_read_lock();
+ for_each_process(p) {
+ if (same_thread_group(task, p))
+ continue;
+
+ /* do not touch kernel threads or the global init */
+ if (p->flags & PF_KTHREAD || is_global_init(p))
+ continue;
+
+ task_lock(p);
+ if (!p->vfork_done && process_shares_mm(p, mm)) {
+ pr_info("updating oom_score_adj for %d (%s) from %d to %d because it shares mm with %d (%s). Report if this is unexpected.\n",
+ task_pid_nr(p), p->comm,
+ p->signal->oom_score_adj, oom_adj,
+ task_pid_nr(task), task->comm);
+ p->signal->oom_score_adj = oom_adj;
+ if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
+ p->signal->oom_score_adj_min = (short)oom_adj;
+ }
+ task_unlock(p);
+ }
+ rcu_read_unlock();
+ mmdrop(mm);
+ }
err_unlock:
mutex_unlock(&oom_adj_mutex);
put_task_struct(task);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2e500ce06387..0c468aa2ea61 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2255,6 +2255,8 @@ static inline int in_gate_area(struct mm_struct *mm, unsigned long addr)
}
#endif /* __HAVE_ARCH_GATE_AREA */
+extern bool process_shares_mm(struct task_struct *p, struct mm_struct *mm);
+
#ifdef CONFIG_SYSCTL
extern int sysctl_drop_caches;
int drop_caches_sysctl_handler(struct ctl_table *, int,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d4a929d79470..d8220c5603a5 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -415,7 +415,7 @@ bool oom_killer_disabled __read_mostly;
* task's threads: if one of those is using this mm then this task was also
* using it.
*/
-static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
+bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
{
struct task_struct *t;
--
2.8.1
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 05/10] mm, oom: skip vforked tasks from being selected
2016-06-20 12:43 [PATCH 0/10 -v5] Handle oom bypass more gracefully Michal Hocko
` (3 preceding siblings ...)
2016-06-20 12:43 ` [PATCH 04/10] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj Michal Hocko
@ 2016-06-20 12:43 ` Michal Hocko
2016-06-20 12:43 ` [PATCH 06/10] mm, oom: kill all tasks sharing the mm Michal Hocko
` (4 subsequent siblings)
9 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw)
To: linux-mm
Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
Andrew Morton, LKML, Michal Hocko
From: Michal Hocko <mhocko@suse.com>
vforked tasks are not really sitting on any memory. They are sharing
the mm with parent until they exec into a new code. Until then it is
just pinning the address space. OOM killer will kill the vforked task
along with its parent but we still can end up selecting vforked task
when the parent wouldn't be selected. E.g. init doing vfork to launch
a task or vforked being a child of oom unkillable task with an updated
oom_score_adj to be killable.
Add a new helper to check whether a task is in the vfork sharing memory
with its parent and use it in oom_badness to skip over these tasks.
Changes since v1
- copy_process() doesn't disallow CLONE_VFORK without CLONE_VM, so with
this patch it would be trivial to make the exploit which hides a
memory hog from oom-killer - per Oleg
- comment in in_vfork by Oleg
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
include/linux/sched.h | 26 ++++++++++++++++++++++++++
mm/oom_kill.c | 6 ++++--
2 files changed, 30 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ec636400669f..7442f74b6d44 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1883,6 +1883,32 @@ extern int arch_task_struct_size __read_mostly;
#define TNF_FAULT_LOCAL 0x08
#define TNF_MIGRATE_FAIL 0x10
+static inline bool in_vfork(struct task_struct *tsk)
+{
+ bool ret;
+
+ /*
+ * need RCU to access ->real_parent if CLONE_VM was used along with
+ * CLONE_PARENT.
+ *
+ * We check real_parent->mm == tsk->mm because CLONE_VFORK does not
+ * imply CLONE_VM
+ *
+ * CLONE_VFORK can be used with CLONE_PARENT/CLONE_THREAD and thus
+ * ->real_parent is not necessarily the task doing vfork(), so in
+ * theory we can't rely on task_lock() if we want to dereference it.
+ *
+ * And in this case we can't trust the real_parent->mm == tsk->mm
+ * check, it can be false negative. But we do not care, if init or
+ * another oom-unkillable task does this it should blame itself.
+ */
+ rcu_read_lock();
+ ret = tsk->vfork_done && tsk->real_parent->mm == tsk->mm;
+ rcu_read_unlock();
+
+ return ret;
+}
+
#ifdef CONFIG_NUMA_BALANCING
extern void task_numa_fault(int last_node, int node, int pages, int flags);
extern pid_t task_numa_group_id(struct task_struct *p);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d8220c5603a5..02da660b7c25 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -176,11 +176,13 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
/*
* Do not even consider tasks which are explicitly marked oom
- * unkillable or have been already oom reaped.
+ * unkillable or have been already oom reaped or the are in
+ * the middle of vfork
*/
adj = (long)p->signal->oom_score_adj;
if (adj == OOM_SCORE_ADJ_MIN ||
- test_bit(MMF_OOM_REAPED, &p->mm->flags)) {
+ test_bit(MMF_OOM_REAPED, &p->mm->flags) ||
+ in_vfork(p)) {
task_unlock(p);
return 0;
}
--
2.8.1
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 06/10] mm, oom: kill all tasks sharing the mm
2016-06-20 12:43 [PATCH 0/10 -v5] Handle oom bypass more gracefully Michal Hocko
` (4 preceding siblings ...)
2016-06-20 12:43 ` [PATCH 05/10] mm, oom: skip vforked tasks from being selected Michal Hocko
@ 2016-06-20 12:43 ` Michal Hocko
2016-06-20 12:43 ` [PATCH 07/10] mm, oom: fortify task_will_free_mem Michal Hocko
` (3 subsequent siblings)
9 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw)
To: linux-mm
Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
Andrew Morton, LKML, Michal Hocko
From: Michal Hocko <mhocko@suse.com>
Currently oom_kill_process skips both the oom reaper and SIG_KILL if a
process sharing the same mm is unkillable via OOM_ADJUST_MIN. After "mm,
oom_adj: make sure processes sharing mm have same view of oom_score_adj"
all such processes are sharing the same value so we shouldn't see such a
task at all (oom_badness would rule them out).
We can still encounter oom disabled vforked task which has to be killed
as well if we want to have other tasks sharing the mm reapable
because it can access the memory before doing exec. Killing such a task
should be acceptable because it is highly unlikely it has done anything
useful because it cannot modify any memory before it calls exec. An
alternative would be to keep the task alive and skip the oom reaper and
risk all the weird corner cases where the OOM killer cannot make forward
progress because the oom victim hung somewhere on the way to exit.
[rientjes@google.com - drop printk when OOM_SCORE_ADJ_MIN killed task
the setting is inherently racy and we cannot do much about it without
introducing locks in hot paths]
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
mm/oom_kill.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 02da660b7c25..38f89ac2df7f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -852,8 +852,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
continue;
if (same_thread_group(p, victim))
continue;
- if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
- p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+ if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p)) {
/*
* We cannot use oom_reaper for the mm shared by this
* process because it wouldn't get killed and so the
--
2.8.1
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 07/10] mm, oom: fortify task_will_free_mem
2016-06-20 12:43 [PATCH 0/10 -v5] Handle oom bypass more gracefully Michal Hocko
` (5 preceding siblings ...)
2016-06-20 12:43 ` [PATCH 06/10] mm, oom: kill all tasks sharing the mm Michal Hocko
@ 2016-06-20 12:43 ` Michal Hocko
2016-06-20 12:43 ` [PATCH 08/10] mm, oom: task_will_free_mem should skip oom_reaped tasks Michal Hocko
` (2 subsequent siblings)
9 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw)
To: linux-mm
Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
Andrew Morton, LKML, Michal Hocko
From: Michal Hocko <mhocko@suse.com>
task_will_free_mem is rather weak. It doesn't really tell whether
the task has chance to drop its mm. 98748bd72200 ("oom: consider
multi-threaded tasks in task_will_free_mem") made a first step
into making it more robust for multi-threaded applications so now we
know that the whole process is going down and probably drop the mm.
This patch builds on top for more complex scenarios where mm is shared
between different processes - CLONE_VM without CLONE_SIGHAND, or in
kernel use_mm().
Make sure that all processes sharing the mm are killed or exiting. This
will allow us to replace try_oom_reaper by wake_oom_reaper because
task_will_free_mem implies the task is reapable now. Therefore all paths
which bypass the oom killer are now reapable and so they shouldn't lock
up the oom killer.
Changes since v4
- do not call task_will_free_mem if current->mm is NULL to catch
after exit_mm allocations - as per Tetsuo
Changes since v2 - per Oleg
- uninline task_will_free_mem and move it to oom proper
- reorganize checks in and simplify __task_will_free_mem
- add missing process_shares_mm in task_will_free_mem
- add same_thread_group to task_will_free_mem for clarity
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
include/linux/oom.h | 26 ++--------
mm/memcontrol.c | 4 +-
mm/oom_kill.c | 133 +++++++++++++++++++++++++++++++---------------------
3 files changed, 85 insertions(+), 78 deletions(-)
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 606137b3b778..5bc0457ee3a8 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -73,9 +73,9 @@ static inline bool oom_task_origin(const struct task_struct *p)
extern void mark_oom_victim(struct task_struct *tsk);
#ifdef CONFIG_MMU
-extern void try_oom_reaper(struct task_struct *tsk);
+extern void wake_oom_reaper(struct task_struct *tsk);
#else
-static inline void try_oom_reaper(struct task_struct *tsk)
+static inline void wake_oom_reaper(struct task_struct *tsk)
{
}
#endif
@@ -107,27 +107,7 @@ extern void oom_killer_enable(void);
extern struct task_struct *find_lock_task_mm(struct task_struct *p);
-static inline bool task_will_free_mem(struct task_struct *task)
-{
- struct signal_struct *sig = task->signal;
-
- /*
- * A coredumping process may sleep for an extended period in exit_mm(),
- * so the oom killer cannot assume that the process will promptly exit
- * and release memory.
- */
- if (sig->flags & SIGNAL_GROUP_COREDUMP)
- return false;
-
- if (!(task->flags & PF_EXITING))
- return false;
-
- /* Make sure that the whole thread group is going down */
- if (!thread_group_empty(task) && !(sig->flags & SIGNAL_GROUP_EXIT))
- return false;
-
- return true;
-}
+bool task_will_free_mem(struct task_struct *task);
/* sysctls */
extern int sysctl_oom_dump_tasks;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 587aa0d110c9..b100215811ef 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1276,9 +1276,9 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
* select it. The goal is to allow it to allocate so that it may
* quickly exit and free its memory.
*/
- if (fatal_signal_pending(current) || task_will_free_mem(current)) {
+ if (task_will_free_mem(current)) {
mark_oom_victim(current);
- try_oom_reaper(current);
+ wake_oom_reaper(current);
goto unlock;
}
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 38f89ac2df7f..8ee92fb76968 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -596,7 +596,7 @@ static int oom_reaper(void *unused)
return 0;
}
-static void wake_oom_reaper(struct task_struct *tsk)
+void wake_oom_reaper(struct task_struct *tsk)
{
if (!oom_reaper_th)
return;
@@ -614,46 +614,6 @@ static void wake_oom_reaper(struct task_struct *tsk)
wake_up(&oom_reaper_wait);
}
-/* Check if we can reap the given task. This has to be called with stable
- * tsk->mm
- */
-void try_oom_reaper(struct task_struct *tsk)
-{
- struct mm_struct *mm = tsk->mm;
- struct task_struct *p;
-
- if (!mm)
- return;
-
- /*
- * There might be other threads/processes which are either not
- * dying or even not killable.
- */
- if (atomic_read(&mm->mm_users) > 1) {
- rcu_read_lock();
- for_each_process(p) {
- if (!process_shares_mm(p, mm))
- continue;
- if (fatal_signal_pending(p))
- continue;
-
- /*
- * If the task is exiting make sure the whole thread group
- * is exiting and cannot acces mm anymore.
- */
- if (signal_group_exit(p->signal))
- continue;
-
- /* Give up */
- rcu_read_unlock();
- return;
- }
- rcu_read_unlock();
- }
-
- wake_oom_reaper(tsk);
-}
-
static int __init oom_init(void)
{
oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
@@ -665,10 +625,6 @@ static int __init oom_init(void)
return 0;
}
subsys_initcall(oom_init)
-#else
-static void wake_oom_reaper(struct task_struct *tsk)
-{
-}
#endif
/**
@@ -745,6 +701,81 @@ void oom_killer_enable(void)
oom_killer_disabled = false;
}
+static inline bool __task_will_free_mem(struct task_struct *task)
+{
+ struct signal_struct *sig = task->signal;
+
+ /*
+ * A coredumping process may sleep for an extended period in exit_mm(),
+ * so the oom killer cannot assume that the process will promptly exit
+ * and release memory.
+ */
+ if (sig->flags & SIGNAL_GROUP_COREDUMP)
+ return false;
+
+ if (sig->flags & SIGNAL_GROUP_EXIT)
+ return true;
+
+ if (thread_group_empty(task) && (task->flags & PF_EXITING))
+ return true;
+
+ return false;
+}
+
+/*
+ * Checks whether the given task is dying or exiting and likely to
+ * release its address space. This means that all threads and processes
+ * sharing the same mm have to be killed or exiting.
+ */
+bool task_will_free_mem(struct task_struct *task)
+{
+ struct mm_struct *mm;
+ struct task_struct *p;
+ bool ret;
+
+ if (!__task_will_free_mem(task))
+ return false;
+
+ /*
+ * If the process has passed exit_mm we have to skip it because
+ * we have lost a link to other tasks sharing this mm, we do not
+ * have anything to reap and the task might then get stuck waiting
+ * for parent as zombie and we do not want it to hold TIF_MEMDIE
+ */
+ p = find_lock_task_mm(task);
+ if (!p)
+ return false;
+
+ mm = p->mm;
+ if (atomic_read(&mm->mm_users) <= 1) {
+ task_unlock(p);
+ return true;
+ }
+
+ /* pin the mm to not get freed and reused */
+ atomic_inc(&mm->mm_count);
+ task_unlock(p);
+
+ /*
+ * This is really pessimistic but we do not have any reliable way
+ * to check that external processes share with our mm
+ */
+ rcu_read_lock();
+ for_each_process(p) {
+ if (!process_shares_mm(p, mm))
+ continue;
+ if (same_thread_group(task, p))
+ continue;
+ ret = __task_will_free_mem(p);
+ if (!ret)
+ break;
+ }
+ rcu_read_unlock();
+ mmdrop(mm);
+
+ return ret;
+}
+
/*
* Must be called while holding a reference to p, which will be released upon
* returning.
@@ -766,15 +797,12 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
* If the task is already exiting, don't alarm the sysadmin or kill
* its children or threads, just set TIF_MEMDIE so it can die quickly
*/
- task_lock(p);
- if (p->mm && task_will_free_mem(p)) {
+ if (task_will_free_mem(p)) {
mark_oom_victim(p);
- try_oom_reaper(p);
- task_unlock(p);
+ wake_oom_reaper(p);
put_task_struct(p);
return;
}
- task_unlock(p);
if (__ratelimit(&oom_rs))
dump_header(oc, p);
@@ -944,10 +972,9 @@ bool out_of_memory(struct oom_control *oc)
* But don't select if current has already released its mm and cleared
* TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
*/
- if (current->mm &&
- (fatal_signal_pending(current) || task_will_free_mem(current))) {
+ if (current->mm && task_will_free_mem(current)) {
mark_oom_victim(current);
- try_oom_reaper(current);
+ wake_oom_reaper(current);
return true;
}
--
2.8.1
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 08/10] mm, oom: task_will_free_mem should skip oom_reaped tasks
2016-06-20 12:43 [PATCH 0/10 -v5] Handle oom bypass more gracefully Michal Hocko
` (6 preceding siblings ...)
2016-06-20 12:43 ` [PATCH 07/10] mm, oom: fortify task_will_free_mem Michal Hocko
@ 2016-06-20 12:43 ` Michal Hocko
2016-06-20 12:43 ` [PATCH 09/10] mm, oom_reaper: do not attempt to reap a task more than twice Michal Hocko
2016-06-20 12:43 ` [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init Michal Hocko
9 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw)
To: linux-mm
Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
Andrew Morton, LKML, Michal Hocko
From: Michal Hocko <mhocko@suse.com>
0-day robot has encountered the following:
[ 82.694232] Out of memory: Kill process 3914 (trinity-c0) score 167 or sacrifice child
[ 82.695110] Killed process 3914 (trinity-c0) total-vm:55864kB, anon-rss:1512kB, file-rss:1088kB, shmem-rss:25616kB
[ 82.706724] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26488kB
[ 82.715540] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
[ 82.717662] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
[ 82.725804] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:27296kB
[ 82.739091] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:28148kB
oom_reaper is trying to reap the same task again and again. This
is possible only when the oom killer is bypassed because of
task_will_free_mem because we skip over tasks with MMF_OOM_REAPED
already set during select_bad_process. Teach task_will_free_mem to skip
over MMF_OOM_REAPED tasks as well because they will be unlikely to free
anything more.
Analyzed-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
mm/oom_kill.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8ee92fb76968..36d5dd88d990 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -747,6 +747,16 @@ bool task_will_free_mem(struct task_struct *task)
return false;
mm = p->mm;
+
+ /*
+ * This task has already been drained by the oom reaper so there are
+ * only small chances it will free some more
+ */
+ if (test_bit(MMF_OOM_REAPED, &mm->flags)) {
+ task_unlock(p);
+ return false;
+ }
+
if (atomic_read(&mm->mm_users) <= 1) {
task_unlock(p);
return true;
--
2.8.1
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 09/10] mm, oom_reaper: do not attempt to reap a task more than twice
2016-06-20 12:43 [PATCH 0/10 -v5] Handle oom bypass more gracefully Michal Hocko
` (7 preceding siblings ...)
2016-06-20 12:43 ` [PATCH 08/10] mm, oom: task_will_free_mem should skip oom_reaped tasks Michal Hocko
@ 2016-06-20 12:43 ` Michal Hocko
2016-06-20 12:43 ` [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init Michal Hocko
9 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw)
To: linux-mm
Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
Andrew Morton, LKML, Michal Hocko
From: Michal Hocko <mhocko@suse.com>
oom_reaper relies on the mmap_sem for read to do its job. Many places
which might block readers have been converted to use down_write_killable
and that has reduced chances of the contention a lot. Some paths where
the mmap_sem is held for write can take other locks and they might
either be not prepared to fail due to fatal signal pending or too
impractical to be changed.
This patch introduces MMF_OOM_NOT_REAPABLE flag which gets set after the
first attempt to reap a task's mm fails. If the flag is present after
the failure then we set MMF_OOM_REAPED to hide this mm from the oom
killer completely so it can go and chose another victim.
As a result a risk of OOM deadlock when the oom victim would be blocked
indefinetly and so the oom killer cannot make any progress should be
mitigated considerably while we still try really hard to perform all
reclaim attempts and stay predictable in the behavior.
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
include/linux/sched.h | 1 +
mm/oom_kill.c | 19 +++++++++++++++++++
2 files changed, 20 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7442f74b6d44..6d81a1eb974a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -512,6 +512,7 @@ static inline int get_dumpable(struct mm_struct *mm)
#define MMF_HAS_UPROBES 19 /* has uprobes */
#define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */
#define MMF_OOM_REAPED 21 /* mm has been already reaped */
+#define MMF_OOM_NOT_REAPABLE 22 /* mm couldn't be reaped */
#define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 36d5dd88d990..bfddc93ccd34 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -556,8 +556,27 @@ static void oom_reap_task(struct task_struct *tsk)
schedule_timeout_idle(HZ/10);
if (attempts > MAX_OOM_REAP_RETRIES) {
+ struct task_struct *p;
+
pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
task_pid_nr(tsk), tsk->comm);
+
+ /*
+ * If we've already tried to reap this task in the past and
+ * failed it probably doesn't make much sense to try yet again
+ * so hide the mm from the oom killer so that it can move on
+ * to another task with a different mm struct.
+ */
+ p = find_lock_task_mm(tsk);
+ if (p) {
+ if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) {
+ pr_info("oom_reaper: giving up pid:%d (%s)\n",
+ task_pid_nr(tsk), tsk->comm);
+ set_bit(MMF_OOM_REAPED, &p->mm->flags);
+ }
+ task_unlock(p);
+ }
+
debug_show_all_locks();
}
--
2.8.1
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init
2016-06-20 12:43 [PATCH 0/10 -v5] Handle oom bypass more gracefully Michal Hocko
` (8 preceding siblings ...)
2016-06-20 12:43 ` [PATCH 09/10] mm, oom_reaper: do not attempt to reap a task more than twice Michal Hocko
@ 2016-06-20 12:43 ` Michal Hocko
2016-07-19 12:05 ` Michal Hocko
9 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw)
To: linux-mm
Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
Andrew Morton, LKML, Michal Hocko
From: Michal Hocko <mhocko@suse.com>
The only case where the oom_reaper is not triggered for the oom victim
is when it shares the memory with a kernel thread (aka use_mm) or with
the global init. After "mm, oom: skip vforked tasks from being selected"
the victim cannot be a vforked task of the global init so we are left
with clone(CLONE_VM) (without CLONE_SIGHAND). use_mm() users are quite
rare as well.
In order to guarantee a forward progress for the OOM killer make
sure that this really rare cases will not get into the way and hide
the mm from the oom killer by setting MMF_OOM_REAPED flag for it.
oom_scan_process_thread will ignore any TIF_MEMDIE task if it has
MMF_OOM_REAPED flag set to catch these oom victims.
After this patch we should guarantee a forward progress for the OOM
killer even when the selected victim is sharing memory with a kernel
thread or global init.
Changes since v1
- do not exit_oom_victim because oom_scan_process_thread will handle
those which couldn't terminate in time. exit_oom_victim is not safe
wrt. oom_disable synchronization.
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
mm/oom_kill.c | 25 +++++++++++++++++++++----
1 file changed, 21 insertions(+), 4 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index bfddc93ccd34..4c21f744daa6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -283,10 +283,22 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
/*
* This task already has access to memory reserves and is being killed.
- * Don't allow any other task to have access to the reserves.
+ * Don't allow any other task to have access to the reserves unless
+ * the task has MMF_OOM_REAPED because chances that it would release
+ * any memory is quite low.
*/
- if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims))
- return OOM_SCAN_ABORT;
+ if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) {
+ struct task_struct *p = find_lock_task_mm(task);
+ enum oom_scan_t ret = OOM_SCAN_ABORT;
+
+ if (p) {
+ if (test_bit(MMF_OOM_REAPED, &p->mm->flags))
+ ret = OOM_SCAN_CONTINUE;
+ task_unlock(p);
+ }
+
+ return ret;
+ }
/*
* If task is allocating a lot of memory and has been marked to be
@@ -913,9 +925,14 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
/*
* We cannot use oom_reaper for the mm shared by this
* process because it wouldn't get killed and so the
- * memory might be still used.
+ * memory might be still used. Hide the mm from the oom
+ * killer to guarantee OOM forward progress.
*/
can_oom_reap = false;
+ set_bit(MMF_OOM_REAPED, &mm->flags);
+ pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
+ task_pid_nr(victim), victim->comm,
+ task_pid_nr(p), p->comm);
continue;
}
do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
--
2.8.1
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init
2016-06-20 12:43 ` [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init Michal Hocko
@ 2016-07-19 12:05 ` Michal Hocko
2016-07-19 23:27 ` Andrew Morton
0 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2016-07-19 12:05 UTC (permalink / raw)
To: linux-mm
Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
Andrew Morton, LKML
Andrew,
On Mon 20-06-16 14:43:48, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> The only case where the oom_reaper is not triggered for the oom victim
> is when it shares the memory with a kernel thread (aka use_mm) or with
> the global init. After "mm, oom: skip vforked tasks from being selected"
> the victim cannot be a vforked task of the global init so we are left
> with clone(CLONE_VM) (without CLONE_SIGHAND). use_mm() users are quite
> rare as well.
>
> In order to guarantee a forward progress for the OOM killer make
> sure that this really rare cases will not get into the way and hide
> the mm from the oom killer by setting MMF_OOM_REAPED flag for it.
> oom_scan_process_thread will ignore any TIF_MEMDIE task if it has
> MMF_OOM_REAPED flag set to catch these oom victims.
>
> After this patch we should guarantee a forward progress for the OOM
> killer even when the selected victim is sharing memory with a kernel
> thread or global init.
Could you replace the last two paragraphs with the following. Tetsuo
didn't like the guarantee mentioned there because that is a too strong
statement as find_lock_task_mm might not find any mm and so we still
could end up looping on the oom victim if it gets stuck somewhere in
__mmput. This particular patch didn't aim at closing that case. Plugging
that hole is planned later after the next upcoming merge window closes.
"
In order to help a forward progress for the OOM killer, make sure
that this really rare cases will not get into the way and hide
the mm from the oom killer by setting MMF_OOM_REAPED flag for it.
oom_scan_process_thread will ignore any TIF_MEMDIE task if it has
MMF_OOM_REAPED flag set to catch these oom victims.
After this patch we should guarantee a forward progress for the OOM
killer even when the selected victim is sharing memory with a kernel
thread or global init as long as the victims mm is still alive.
"
>
> Changes since v1
> - do not exit_oom_victim because oom_scan_process_thread will handle
> those which couldn't terminate in time. exit_oom_victim is not safe
> wrt. oom_disable synchronization.
>
> Acked-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
> mm/oom_kill.c | 25 +++++++++++++++++++++----
> 1 file changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index bfddc93ccd34..4c21f744daa6 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -283,10 +283,22 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
>
> /*
> * This task already has access to memory reserves and is being killed.
> - * Don't allow any other task to have access to the reserves.
> + * Don't allow any other task to have access to the reserves unless
> + * the task has MMF_OOM_REAPED because chances that it would release
> + * any memory is quite low.
> */
> - if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims))
> - return OOM_SCAN_ABORT;
> + if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) {
> + struct task_struct *p = find_lock_task_mm(task);
> + enum oom_scan_t ret = OOM_SCAN_ABORT;
> +
> + if (p) {
> + if (test_bit(MMF_OOM_REAPED, &p->mm->flags))
> + ret = OOM_SCAN_CONTINUE;
> + task_unlock(p);
> + }
> +
> + return ret;
> + }
>
> /*
> * If task is allocating a lot of memory and has been marked to be
> @@ -913,9 +925,14 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> /*
> * We cannot use oom_reaper for the mm shared by this
> * process because it wouldn't get killed and so the
> - * memory might be still used.
> + * memory might be still used. Hide the mm from the oom
> + * killer to guarantee OOM forward progress.
> */
> can_oom_reap = false;
> + set_bit(MMF_OOM_REAPED, &mm->flags);
> + pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
> + task_pid_nr(victim), victim->comm,
> + task_pid_nr(p), p->comm);
> continue;
> }
> do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
> --
> 2.8.1
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
--
Michal Hocko
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init
2016-07-19 12:05 ` Michal Hocko
@ 2016-07-19 23:27 ` Andrew Morton
2016-07-20 6:29 ` Michal Hocko
0 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2016-07-19 23:27 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm, Tetsuo Handa, David Rientjes, Oleg Nesterov,
Vladimir Davydov, LKML
On Tue, 19 Jul 2016 14:05:39 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> > After this patch we should guarantee a forward progress for the OOM
> > killer even when the selected victim is sharing memory with a kernel
> > thread or global init.
>
> Could you replace the last two paragraphs with the following. Tetsuo
> didn't like the guarantee mentioned there because that is a too strong
> statement as find_lock_task_mm might not find any mm and so we still
> could end up looping on the oom victim if it gets stuck somewhere in
> __mmput. This particular patch didn't aim at closing that case. Plugging
> that hole is planned later after the next upcoming merge window closes.
>
> "
> In order to help a forward progress for the OOM killer, make sure
> that this really rare cases will not get into the way and hide
> the mm from the oom killer by setting MMF_OOM_REAPED flag for it.
> oom_scan_process_thread will ignore any TIF_MEMDIE task if it has
> MMF_OOM_REAPED flag set to catch these oom victims.
>
> After this patch we should guarantee a forward progress for the OOM
> killer even when the selected victim is sharing memory with a kernel
> thread or global init as long as the victims mm is still alive.
> "
I tweaked it a bit:
: In order to help forward progress for the OOM killer, make sure that
: this really rare case will not get in the way - we do this by hiding
: the mm from the oom killer by setting MMF_OOM_REAPED flag for it.
: oom_scan_process_thread will ignore any TIF_MEMDIE task if it has
: MMF_OOM_REAPED flag set to catch these oom victims.
:
: After this patch we should guarantee forward progress for the OOM
: killer even when the selected victim is sharing memory with a kernel
: thread or global init as long as the victims mm is still alive.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init
2016-07-19 23:27 ` Andrew Morton
@ 2016-07-20 6:29 ` Michal Hocko
0 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2016-07-20 6:29 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, Tetsuo Handa, David Rientjes, Oleg Nesterov,
Vladimir Davydov, LKML
On Tue 19-07-16 16:27:59, Andrew Morton wrote:
> On Tue, 19 Jul 2016 14:05:39 +0200 Michal Hocko <mhocko@kernel.org> wrote:
>
> > > After this patch we should guarantee a forward progress for the OOM
> > > killer even when the selected victim is sharing memory with a kernel
> > > thread or global init.
> >
> > Could you replace the last two paragraphs with the following. Tetsuo
> > didn't like the guarantee mentioned there because that is a too strong
> > statement as find_lock_task_mm might not find any mm and so we still
> > could end up looping on the oom victim if it gets stuck somewhere in
> > __mmput. This particular patch didn't aim at closing that case. Plugging
> > that hole is planned later after the next upcoming merge window closes.
> >
> > "
> > In order to help a forward progress for the OOM killer, make sure
> > that this really rare cases will not get into the way and hide
> > the mm from the oom killer by setting MMF_OOM_REAPED flag for it.
> > oom_scan_process_thread will ignore any TIF_MEMDIE task if it has
> > MMF_OOM_REAPED flag set to catch these oom victims.
> >
> > After this patch we should guarantee a forward progress for the OOM
> > killer even when the selected victim is sharing memory with a kernel
> > thread or global init as long as the victims mm is still alive.
> > "
>
> I tweaked it a bit:
>
> : In order to help forward progress for the OOM killer, make sure that
> : this really rare case will not get in the way - we do this by hiding
> : the mm from the oom killer by setting MMF_OOM_REAPED flag for it.
> : oom_scan_process_thread will ignore any TIF_MEMDIE task if it has
> : MMF_OOM_REAPED flag set to catch these oom victims.
> :
> : After this patch we should guarantee forward progress for the OOM
> : killer even when the selected victim is sharing memory with a kernel
> : thread or global init as long as the victims mm is still alive.
Sounds good to me. Thanks!
--
Michal Hocko
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 20+ messages in thread