* [PATCH 0/10 -v5] Handle oom bypass more gracefully @ 2016-06-20 12:43 Michal Hocko 2016-06-20 12:43 ` [PATCH 01/10] proc, oom: drop bogus task_lock and mm check Michal Hocko ` (9 more replies) 0 siblings, 10 replies; 28+ messages in thread From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw) To: linux-mm Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov, Andrew Morton, LKML Hi, this is the v5 version of the patchset. The previous version was posted at http://lkml.kernel.org/r/1465473137-22531-1-git-send-email-mhocko@kernel.org There was one issue fixed in patch 7. Other than that we have discussed the nommu situation after patch 7, which might reintroduce a theoretical race condition when the oom victim passes exit_mm->exit_oom_victim after oom_kill_process checks task_will_free_mem and before mark_oom_victim is called, which was previously fixed by 83363b917a2982dd ("oom: make sure that TIF_MEMDIE is set under task_lock"). The issue was never seen in real life AFAIK and it would only affect nommu. On the other hand the patch fixes a class of potential lockups when we mark an oom victim on one thread while other threads stay alive. So I think it is a reasonable trade off. Tetsuo still thinks that a timeout based solution is a better way to address all the issues: " To me, timeout based one is sufficient for handling any traps that hit nommu kernels after the OOM killer is invoked. Anyway, I don't like this series because this series ignores theoretical cases. I can't make progress as long as you repeat "does it really matter/occur". Please go ahead without Reviewed-by: or Acked-by: from me. " I have added Acked-by from Oleg. What remains is to make the mm reapable even when it is shared with kthreads. I have already started working on that and hopefully will have something ready soon, but that should go independently of this series. After that we will have MMU basically covered and lockup free. I consider this a noticeable improvement, especially because the changes to achieve it are much less intrusive than I expected. Nommu still suffers from potential lockups theoretically, but a complete lack of reports suggests that this is not the case in real life. If we ever see any reports we should implement the reaper for nommu as well. Shouldn't be too hard... The original description: ------------------------- The following 10 patches should put some order to the very rare cases of mm shared between processes and make the paths which bypass the oom killer oom reapable and therefore much more reliable, finally. Even though mm shared outside of the thread group is rare (either vforked tasks for a short period, use_mm by kernel threads or the exotic thread model of clone(CLONE_VM) without CLONE_SIGHAND) it is better to cover them. Not only does it make the current oom killer logic quite hard to follow and reason about, it can lead to weird corner cases. E.g. it is possible to select an oom victim which shares the mm with an unkillable process, or bypass the oom killer even when other processes sharing the mm are still alive, and other weird cases. Patch 1 drops the bogus task_lock and mm check from oom_{score_}adj_write. This can be considered a bug fix with a low impact as nobody has noticed it for years. Patch 2 drops the sighand lock because it is not needed anymore as pointed out by Oleg. Patch 3 is a cleanup of the oom_score_adj handling and preparatory work for later patches. Patch 4 enforces oom_score_adj to be consistent between processes sharing the mm to behave consistently with regular thread groups.
This can be considered a user-visible behavior change because one thread group updating oom_score_adj will affect others which share the same mm via clone(CLONE_VM). I argue that this should be acceptable because we already have the same behavior for threads in the same thread group and sharing the mm without the signal struct is just a different model of threading. This is probably the most controversial part of the series; I would like to find some consensus here. There were some suggestions to hook some counter/oom_score_adj into the mm_struct but I feel that this is not necessary right now and we can rely on the proc handler + oom_kill_process to DTRT. I can be convinced otherwise but I strongly think that whatever we do the userspace has to have a way to see the current oom priority as consistently as possible. Patch 5 makes sure that no vforked task is selected if it is sharing the mm with an oom unkillable task. Patch 6 ensures that all user tasks sharing the mm are killed which in turn makes sure that all oom victims are oom reapable. Patch 7 guarantees that task_will_free_mem will always imply a reapable bypass of the oom killer. Patch 8 is new in this version and it addresses an issue pointed out by a 0-day OOM report where an oom victim was reaped several times. Patch 9 puts an upper bound on how many times oom_reaper tries to reap a task and then hides it from the oom killer so it can move on when no progress can be made. This will give an upper bound to how long an oom_reapable task can block the oom killer from selecting another victim if the oom_reaper is not able to reap the victim. Patch 10 tries to plug the (hopefully) last hole where we can still lock up when the oom victim's mm is shared with oom unkillable tasks (kthreads and global init). We just try to be best effort in that case and rather fall back to killing something else than risk a lockup. The patchset is based on the current mmotm tree (mmotm-2016-06-15-16-18). I have pushed the patchset to my git tree for easier review: git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git to branch attempts/process-share-mm-oom-sanitization Michal Hocko (10): proc, oom: drop bogus task_lock and mm check proc, oom: drop bogus sighand lock proc, oom_adj: extract oom_score_adj setting into a helper mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj mm, oom: skip vforked tasks from being selected mm, oom: kill all tasks sharing the mm mm, oom: fortify task_will_free_mem mm, oom: task_will_free_mem should skip oom_reaped tasks mm, oom_reaper: do not attempt to reap a task more than twice mm, oom: hide mm which is shared with kthread or global init fs/proc/base.c | 185 +++++++++++++++++++++++----------------------- include/linux/mm.h | 2 + include/linux/oom.h | 26 +------ include/linux/sched.h | 27 +++++++ mm/memcontrol.c | 4 +- mm/oom_kill.c | 198 ++++++++++++++++++++++++++++++++++---------------- 6 files changed, 266 insertions(+), 176 deletions(-) ^ permalink raw reply [flat|nested] 28+ messages in thread
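For illustration, the clone(CLONE_VM)-without-CLONE_SIGHAND model that the series keeps referring to can be sketched in a few lines of userspace C (a hypothetical demo, not part of the series; error handling trimmed). The child below is a separate process with its own signal struct - and thus its own oom_score_adj - yet it runs on the very same mm as the parent:

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int shared;			/* lives in the single shared mm */

static int child_fn(void *arg)
{
	shared = 42;			/* visible to the parent: same mm */
	pause();			/* keep the mm pinned */
	return 0;
}

int main(void)
{
	const size_t stack_sz = 1024 * 1024;
	char *stack = malloc(stack_sz);	/* child stack, grows down on x86 */
	pid_t pid;

	/* CLONE_VM without CLONE_SIGHAND/CLONE_THREAD: a new thread
	 * group sharing our mm but not our signal struct */
	pid = clone(child_fn, stack + stack_sz, CLONE_VM | SIGCHLD, NULL);
	sleep(1);
	printf("child %d shares the mm: shared=%d\n", (int)pid, shared);
	printf("compare /proc/%d/oom_score_adj with /proc/%d/oom_score_adj\n",
	       (int)getpid(), (int)pid);
	return 0;
}

Before patch 4, writing to either of the two oom_score_adj files leaves the other one untouched even though the oom killer can only ever free the one shared mm - which is exactly the inconsistency patch 4 closes.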
* [PATCH 01/10] proc, oom: drop bogus task_lock and mm check 2016-06-20 12:43 [PATCH 0/10 -v5] Handle oom bypass more gracefully Michal Hocko @ 2016-06-20 12:43 ` Michal Hocko 2016-06-20 12:43 ` [PATCH 02/10] proc, oom: drop bogus sighand lock Michal Hocko ` (8 subsequent siblings) 9 siblings, 0 replies; 28+ messages in thread From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw) To: linux-mm Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov, Andrew Morton, LKML, Michal Hocko From: Michal Hocko <mhocko@suse.com> Both oom_adj_write and oom_score_adj_write take task_lock, check for task->mm and fail if it is NULL. This is not needed because oom_score_adj is per signal struct so we do not need the mm at all. The code was introduced by 3d5992d2ac7d ("oom: add per-mm oom disable count") but we have not done per-mm oom disable since c9f01245b6a7 ("oom: remove oom_disable_count"). The task->mm check is not even correct because the current thread might have exited while the thread group is still alive - e.g. an exited thread group leader would mean that echo $VAL > /proc/pid/oom_score_adj always fails with EINVAL while /proc/pid/task/$other_tid/oom_score_adj succeeds. This is unexpected at best. Remove the lock along with the check to fix the unexpected behavior, and also because there is no real need for the lock in the first place. Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Michal Hocko <mhocko@suse.com> --- fs/proc/base.c | 22 ++++------------------ 1 file changed, 4 insertions(+), 18 deletions(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index be73f4d0cb01..a6014e45c516 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -1083,15 +1083,9 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf, goto out; } - task_lock(task); - if (!task->mm) { - err = -EINVAL; - goto err_task_lock; - } - if (!lock_task_sighand(task, &flags)) { err = -ESRCH; - goto err_task_lock; + goto err_put_task; } /* @@ -1121,8 +1115,7 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf, trace_oom_score_adj_update(task); err_sighand: unlock_task_sighand(task, &flags); -err_task_lock: - task_unlock(task); +err_put_task: put_task_struct(task); out: return err < 0 ? err : count; @@ -1186,15 +1179,9 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf, goto out; } - task_lock(task); - if (!task->mm) { - err = -EINVAL; - goto err_task_lock; - } - if (!lock_task_sighand(task, &flags)) { err = -ESRCH; - goto err_task_lock; + goto err_put_task; } if ((short)oom_score_adj < task->signal->oom_score_adj_min && @@ -1210,8 +1197,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf, err_sighand: unlock_task_sighand(task, &flags); -err_task_lock: - task_unlock(task); +err_put_task: put_task_struct(task); out: return err < 0 ? err : count; -- 2.8.1 ^ permalink raw reply related [flat|nested] 28+ messages in thread
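A minimal repro sketch for the surprising behavior fixed above (hypothetical, assumes a kernel without this patch): the thread group leader exits, and with it its ->mm goes away, while another thread keeps the group alive. Writing to /proc/<pid>/oom_score_adj then failed with EINVAL even though the same write to /proc/<pid>/task/<other_tid>/oom_score_adj succeeded:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *worker(void *arg)
{
	pause();			/* keep the thread group alive */
	return NULL;
}

int main(void)
{
	pthread_t tid;

	pthread_create(&tid, NULL, worker, NULL);
	printf("try: echo 0 > /proc/%d/oom_score_adj\n", (int)getpid());
	/* only the group leader exits; /proc/<pid> stays around but
	 * the leader's ->mm is gone, so the write hit the bogus check */
	pthread_exit(NULL);
}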
* [PATCH 02/10] proc, oom: drop bogus sighand lock 2016-06-20 12:43 [PATCH 0/10 -v5] Handle oom bypass more gracefully Michal Hocko 2016-06-20 12:43 ` [PATCH 01/10] proc, oom: drop bogus task_lock and mm check Michal Hocko @ 2016-06-20 12:43 ` Michal Hocko 2016-06-20 12:43 ` [PATCH 03/10] proc, oom_adj: extract oom_score_adj setting into a helper Michal Hocko ` (7 subsequent siblings) 9 siblings, 0 replies; 28+ messages in thread From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw) To: linux-mm Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov, Andrew Morton, LKML, Michal Hocko From: Michal Hocko <mhocko@suse.com> Oleg has pointed out that we can simplify both oom_adj_{read,write} and oom_score_adj_{read,write} even further and drop the sighand lock. The main purpose of the lock was to protect p->signal from going away but this will not happen since ea6d290ca34c ("signals: make task_struct->signal immutable/refcountable"). The other role of the lock was to synchronize different writers, especially those with CAP_SYS_RESOURCE. Introduce a mutex for this purpose. Later patches will need this lock anyway. Suggested-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Michal Hocko <mhocko@suse.com> --- fs/proc/base.c | 51 +++++++++++++++++---------------------------------- 1 file changed, 17 insertions(+), 34 deletions(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index a6014e45c516..968d5ea06e62 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -1024,23 +1024,21 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count, char buffer[PROC_NUMBUF]; int oom_adj = OOM_ADJUST_MIN; size_t len; - unsigned long flags; if (!task) return -ESRCH; - if (lock_task_sighand(task, &flags)) { - if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MAX) - oom_adj = OOM_ADJUST_MAX; - else - oom_adj = (task->signal->oom_score_adj * -OOM_DISABLE) / - OOM_SCORE_ADJ_MAX; - unlock_task_sighand(task, &flags); - } + if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MAX) + oom_adj = OOM_ADJUST_MAX; + else + oom_adj = (task->signal->oom_score_adj * -OOM_DISABLE) / + OOM_SCORE_ADJ_MAX; put_task_struct(task); len = snprintf(buffer, sizeof(buffer), "%d\n", oom_adj); return simple_read_from_buffer(buf, count, ppos, buffer, len); } +static DEFINE_MUTEX(oom_adj_mutex); + /* * /proc/pid/oom_adj exists solely for backwards compatibility with previous * kernels. The effective policy is defined by oom_score_adj, which has a @@ -1057,7 +1055,6 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf, struct task_struct *task; char buffer[PROC_NUMBUF]; int oom_adj; - unsigned long flags; int err; memset(buffer, 0, sizeof(buffer)); @@ -1083,11 +1080,6 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf, goto out; } - if (!lock_task_sighand(task, &flags)) { - err = -ESRCH; - goto err_put_task; - } - /* * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum * value is always attainable.
@@ -1097,10 +1089,11 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf, else oom_adj = (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE; + mutex_lock(&oom_adj_mutex); if (oom_adj < task->signal->oom_score_adj && !capable(CAP_SYS_RESOURCE)) { err = -EACCES; - goto err_sighand; + goto err_unlock; } /* @@ -1113,9 +1106,8 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf, task->signal->oom_score_adj = oom_adj; trace_oom_score_adj_update(task); -err_sighand: - unlock_task_sighand(task, &flags); -err_put_task: +err_unlock: + mutex_unlock(&oom_adj_mutex); put_task_struct(task); out: return err < 0 ? err : count; @@ -1133,15 +1125,11 @@ static ssize_t oom_score_adj_read(struct file *file, char __user *buf, struct task_struct *task = get_proc_task(file_inode(file)); char buffer[PROC_NUMBUF]; short oom_score_adj = OOM_SCORE_ADJ_MIN; - unsigned long flags; size_t len; if (!task) return -ESRCH; - if (lock_task_sighand(task, &flags)) { - oom_score_adj = task->signal->oom_score_adj; - unlock_task_sighand(task, &flags); - } + oom_score_adj = task->signal->oom_score_adj; put_task_struct(task); len = snprintf(buffer, sizeof(buffer), "%hd\n", oom_score_adj); return simple_read_from_buffer(buf, count, ppos, buffer, len); @@ -1152,7 +1140,6 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf, { struct task_struct *task; char buffer[PROC_NUMBUF]; - unsigned long flags; int oom_score_adj; int err; @@ -1179,25 +1166,21 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf, goto out; } - if (!lock_task_sighand(task, &flags)) { - err = -ESRCH; - goto err_put_task; - } - + mutex_lock(&oom_adj_mutex); if ((short)oom_score_adj < task->signal->oom_score_adj_min && !capable(CAP_SYS_RESOURCE)) { err = -EACCES; - goto err_sighand; + goto err_unlock; } task->signal->oom_score_adj = (short)oom_score_adj; if (has_capability_noaudit(current, CAP_SYS_RESOURCE)) task->signal->oom_score_adj_min = (short)oom_score_adj; + trace_oom_score_adj_update(task); -err_sighand: - unlock_task_sighand(task, &flags); -err_put_task: +err_unlock: + mutex_unlock(&oom_adj_mutex); put_task_struct(task); out: return err < 0 ? err : count; -- 2.8.1 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH 03/10] proc, oom_adj: extract oom_score_adj setting into a helper 2016-06-20 12:43 [PATCH 0/10 -v5] Handle oom bypass more gracefully Michal Hocko 2016-06-20 12:43 ` [PATCH 01/10] proc, oom: drop bogus task_lock and mm check Michal Hocko 2016-06-20 12:43 ` [PATCH 02/10] proc, oom: drop bogus sighand lock Michal Hocko @ 2016-06-20 12:43 ` Michal Hocko 2016-06-20 12:43 ` [PATCH 04/10] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj Michal Hocko ` (6 subsequent siblings) 9 siblings, 0 replies; 28+ messages in thread From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw) To: linux-mm Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov, Andrew Morton, LKML, Michal Hocko From: Michal Hocko <mhocko@suse.com> Currently we have two proc interfaces to set oom_score_adj: the legacy /proc/<pid>/oom_adj and /proc/<pid>/oom_score_adj, which both have their own specific handlers. A big part of the logic is duplicated, so extract the common code into a __set_oom_adj helper. The legacy knob still handles some details slightly differently - e.g. the legacy mode ignores oom_score_adj_min and it warns about the usage - so make sure those differences are preserved. This patch shouldn't introduce any functional changes. Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Michal Hocko <mhocko@suse.com> --- fs/proc/base.c | 94 +++++++++++++++++++++++++++------------------------------- 1 file changed, 43 insertions(+), 51 deletions(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index 968d5ea06e62..a6a8fbdd5a1b 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -1037,7 +1037,47 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count, return simple_read_from_buffer(buf, count, ppos, buffer, len); } -static DEFINE_MUTEX(oom_adj_mutex); +static int __set_oom_adj(struct file *file, int oom_adj, bool legacy) +{ + static DEFINE_MUTEX(oom_adj_mutex); + struct task_struct *task; + int err = 0; + + task = get_proc_task(file_inode(file)); + if (!task) + return -ESRCH; + + mutex_lock(&oom_adj_mutex); + if (legacy) { + if (oom_adj < task->signal->oom_score_adj && + !capable(CAP_SYS_RESOURCE)) { + err = -EACCES; + goto err_unlock; + } + /* + * /proc/pid/oom_adj is provided for legacy purposes, ask users to use + * /proc/pid/oom_score_adj instead. + */ + pr_warn_once("%s (%d): /proc/%d/oom_adj is deprecated, please use /proc/%d/oom_score_adj instead.\n", + current->comm, task_pid_nr(current), task_pid_nr(task), + task_pid_nr(task)); + } else { + if ((short)oom_adj < task->signal->oom_score_adj_min && + !capable(CAP_SYS_RESOURCE)) { + err = -EACCES; + goto err_unlock; + } + } + + task->signal->oom_score_adj = oom_adj; + if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE)) + task->signal->oom_score_adj_min = (short)oom_adj; + trace_oom_score_adj_update(task); +err_unlock: + mutex_unlock(&oom_adj_mutex); + put_task_struct(task); + return err; +} /* * /proc/pid/oom_adj exists solely for backwards compatibility with previous @@ -1052,7 +1092,6 @@ static DEFINE_MUTEX(oom_adj_mutex); static ssize_t oom_adj_write(struct file *file, const char __user *buf, size_t count, loff_t *ppos) { - struct task_struct *task; char buffer[PROC_NUMBUF]; int oom_adj; int err; @@ -1074,12 +1113,6 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf, goto out; } - task = get_proc_task(file_inode(file)); - if (!task) { - err = -ESRCH; - goto out; - } - /* * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum * value is always attainable.
@@ -1089,26 +1122,7 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf, else oom_adj = (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE; - mutex_lock(&oom_adj_mutex); - if (oom_adj < task->signal->oom_score_adj && - !capable(CAP_SYS_RESOURCE)) { - err = -EACCES; - goto err_unlock; - } - - /* - * /proc/pid/oom_adj is provided for legacy purposes, ask users to use - * /proc/pid/oom_score_adj instead. - */ - pr_warn_once("%s (%d): /proc/%d/oom_adj is deprecated, please use /proc/%d/oom_score_adj instead.\n", - current->comm, task_pid_nr(current), task_pid_nr(task), - task_pid_nr(task)); - - task->signal->oom_score_adj = oom_adj; - trace_oom_score_adj_update(task); -err_unlock: - mutex_unlock(&oom_adj_mutex); - put_task_struct(task); + err = __set_oom_adj(file, oom_adj, true); out: return err < 0 ? err : count; } @@ -1138,7 +1152,6 @@ static ssize_t oom_score_adj_read(struct file *file, char __user *buf, static ssize_t oom_score_adj_write(struct file *file, const char __user *buf, size_t count, loff_t *ppos) { - struct task_struct *task; char buffer[PROC_NUMBUF]; int oom_score_adj; int err; @@ -1160,28 +1173,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf, goto out; } - task = get_proc_task(file_inode(file)); - if (!task) { - err = -ESRCH; - goto out; - } - - mutex_lock(&oom_adj_mutex); - if ((short)oom_score_adj < task->signal->oom_score_adj_min && - !capable(CAP_SYS_RESOURCE)) { - err = -EACCES; - goto err_unlock; - } - - task->signal->oom_score_adj = (short)oom_score_adj; - if (has_capability_noaudit(current, CAP_SYS_RESOURCE)) - task->signal->oom_score_adj_min = (short)oom_score_adj; - - trace_oom_score_adj_update(task); - -err_unlock: - mutex_unlock(&oom_adj_mutex); - put_task_struct(task); + err = __set_oom_adj(file, oom_score_adj, false); out: return err < 0 ? err : count; } -- 2.8.1 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH 04/10] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj 2016-06-20 12:43 [PATCH 0/10 -v5] Handle oom bypass more gracefully Michal Hocko ` (2 preceding siblings ...) 2016-06-20 12:43 ` [PATCH 03/10] proc, oom_adj: extract oom_score_adj setting into a helper Michal Hocko @ 2016-06-20 12:43 ` Michal Hocko 2016-06-20 12:43 ` [PATCH 05/10] mm, oom: skip vforked tasks from being selected Michal Hocko ` (5 subsequent siblings) 9 siblings, 0 replies; 28+ messages in thread From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw) To: linux-mm Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov, Andrew Morton, LKML, Michal Hocko From: Michal Hocko <mhocko@suse.com> oom_score_adj is shared within the thread group (via struct signal) but this is not sufficient to cover processes sharing the mm (CLONE_VM without CLONE_SIGHAND), so we can easily end up in a situation where some processes update their oom_score_adj and confuse the oom killer. In the worst case some of those processes might hide from the oom killer altogether via OOM_SCORE_ADJ_MIN while others are eligible. The OOM killer would then pick up those eligible but wouldn't be allowed to kill the others sharing the same mm, so the mm - and therefore the memory - wouldn't be released. It would be ideal to have the oom_score_adj per mm_struct because that is the natural entity the OOM killer considers. But this will not work because some programs do vfork(); set_oom_adj(); exec(). We can achieve the same effect though: the oom_score_adj write handler can set the oom_score_adj for all processes sharing the same mm if the task is not in the middle of vfork. As a result all the processes will share the same oom_score_adj. The current implementation is rather pessimistic and checks all the existing processes by default if there is more than one holder of the mm, but we do not have any reliable way to check for external users yet. Changes since v2 - skip over same thread group - skip over kernel threads and global init Changes since v1 - note that we are changing oom_score_adj outside of the thread group to the log Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Michal Hocko <mhocko@suse.com> --- fs/proc/base.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ include/linux/mm.h | 2 ++ mm/oom_kill.c | 2 +- 3 files changed, 49 insertions(+), 1 deletion(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index a6a8fbdd5a1b..c986e92680e1 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -1040,6 +1040,7 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count, static int __set_oom_adj(struct file *file, int oom_adj, bool legacy) { static DEFINE_MUTEX(oom_adj_mutex); + struct mm_struct *mm = NULL; struct task_struct *task; int err = 0; @@ -1069,10 +1070,55 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy) } } + /* + * Make sure we will check other processes sharing the mm if this is + * not vfork which wants its own oom_score_adj.
+ * pin the mm so it doesn't go away and get reused after task_unlock + */ + if (!task->vfork_done) { + struct task_struct *p = find_lock_task_mm(task); + + if (p) { + if (atomic_read(&p->mm->mm_users) > 1) { + mm = p->mm; + atomic_inc(&mm->mm_count); + } + task_unlock(p); + } + } + task->signal->oom_score_adj = oom_adj; if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE)) task->signal->oom_score_adj_min = (short)oom_adj; trace_oom_score_adj_update(task); + + if (mm) { + struct task_struct *p; + + rcu_read_lock(); + for_each_process(p) { + if (same_thread_group(task, p)) + continue; + + /* do not touch kernel threads or the global init */ + if (p->flags & PF_KTHREAD || is_global_init(p)) + continue; + + task_lock(p); + if (!p->vfork_done && process_shares_mm(p, mm)) { + pr_info("updating oom_score_adj for %d (%s) from %d to %d because it shares mm with %d (%s). Report if this is unexpected.\n", + task_pid_nr(p), p->comm, + p->signal->oom_score_adj, oom_adj, + task_pid_nr(task), task->comm); + p->signal->oom_score_adj = oom_adj; + if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE)) + p->signal->oom_score_adj_min = (short)oom_adj; + } + task_unlock(p); + } + rcu_read_unlock(); + mmdrop(mm); + } err_unlock: mutex_unlock(&oom_adj_mutex); put_task_struct(task); diff --git a/include/linux/mm.h b/include/linux/mm.h index 2e500ce06387..0c468aa2ea61 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2255,6 +2255,8 @@ static inline int in_gate_area(struct mm_struct *mm, unsigned long addr) } #endif /* __HAVE_ARCH_GATE_AREA */ +extern bool process_shares_mm(struct task_struct *p, struct mm_struct *mm); + #ifdef CONFIG_SYSCTL extern int sysctl_drop_caches; int drop_caches_sysctl_handler(struct ctl_table *, int, diff --git a/mm/oom_kill.c b/mm/oom_kill.c index d4a929d79470..d8220c5603a5 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -415,7 +415,7 @@ bool oom_killer_disabled __read_mostly; * task's threads: if one of those is using this mm then this task was also * using it. */ -static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm) +bool process_shares_mm(struct task_struct *p, struct mm_struct *mm) { struct task_struct *t; -- 2.8.1 ^ permalink raw reply related [flat|nested] 28+ messages in thread
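The vfork() + set_oom_adj() + exec() sequence named in the changelog, spelled out as a hypothetical userspace sketch (the helper name and the written value are illustrative only). The child updates its own oom_score_adj while still running on the parent's mm, which is why a per-mm oom_score_adj would silently change the parent's setting too:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static void set_oom_adj(const char *val)
{
	int fd = open("/proc/self/oom_score_adj", O_WRONLY);

	if (fd >= 0) {
		write(fd, val, strlen(val));
		close(fd);
	}
}

int main(void)
{
	pid_t pid = vfork();

	if (pid == 0) {
		set_oom_adj("500");	/* still on the parent's mm here */
		execl("/bin/true", "true", (char *)NULL);
		_exit(127);		/* exec failed */
	}
	return pid < 0;
}

This is the reason the patch checks task->vfork_done and leaves a task in the middle of vfork alone rather than moving oom_score_adj into the mm_struct.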
* [PATCH 05/10] mm, oom: skip vforked tasks from being selected 2016-06-20 12:43 [PATCH 0/10 -v5] Handle oom bypass more gracefully Michal Hocko ` (3 preceding siblings ...) 2016-06-20 12:43 ` [PATCH 04/10] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj Michal Hocko @ 2016-06-20 12:43 ` Michal Hocko 2016-06-20 12:43 ` [PATCH 06/10] mm, oom: kill all tasks sharing the mm Michal Hocko ` (4 subsequent siblings) 9 siblings, 0 replies; 28+ messages in thread From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw) To: linux-mm Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov, Andrew Morton, LKML, Michal Hocko From: Michal Hocko <mhocko@suse.com> Vforked tasks are not really sitting on any memory. They share the mm with the parent until they exec into new code; until then the task is just pinning the address space. The OOM killer will kill the vforked task along with its parent, but we can still end up selecting a vforked task when the parent wouldn't be selected - e.g. init doing vfork to launch a task, or a vforked task being a child of an oom unkillable task with an oom_score_adj updated to make it killable. Add a new helper to check whether a task is in vfork, sharing memory with its parent, and use it in oom_badness to skip over these tasks. Changes since v1 - copy_process() doesn't disallow CLONE_VFORK without CLONE_VM, so with this patch it would be trivial to make the exploit which hides a memory hog from oom-killer - per Oleg - comment in in_vfork by Oleg Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Michal Hocko <mhocko@suse.com> --- include/linux/sched.h | 26 ++++++++++++++++++++++++++ mm/oom_kill.c | 6 ++++-- 2 files changed, 30 insertions(+), 2 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index ec636400669f..7442f74b6d44 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1883,6 +1883,32 @@ extern int arch_task_struct_size __read_mostly; #define TNF_FAULT_LOCAL 0x08 #define TNF_MIGRATE_FAIL 0x10 +static inline bool in_vfork(struct task_struct *tsk) +{ + bool ret; + + /* + * need RCU to access ->real_parent if CLONE_VM was used along with + * CLONE_PARENT. + * + * We check real_parent->mm == tsk->mm because CLONE_VFORK does not + * imply CLONE_VM + * + * CLONE_VFORK can be used with CLONE_PARENT/CLONE_THREAD and thus + * ->real_parent is not necessarily the task doing vfork(), so in + * theory we can't rely on task_lock() if we want to dereference it. + * + * And in this case we can't trust the real_parent->mm == tsk->mm + * check, it can be false negative. But we do not care, if init or + * another oom-unkillable task does this it should blame itself. + */ + rcu_read_lock(); + ret = tsk->vfork_done && tsk->real_parent->mm == tsk->mm; + rcu_read_unlock(); + + return ret; +} + #ifdef CONFIG_NUMA_BALANCING extern void task_numa_fault(int last_node, int node, int pages, int flags); extern pid_t task_numa_group_id(struct task_struct *p); diff --git a/mm/oom_kill.c b/mm/oom_kill.c index d8220c5603a5..02da660b7c25 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -176,11 +176,13 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, /* * Do not even consider tasks which are explicitly marked oom - * unkillable or have been already oom reaped.
+ * unkillable or have been already oom reaped or they are in + * the middle of vfork */ adj = (long)p->signal->oom_score_adj; if (adj == OOM_SCORE_ADJ_MIN || - test_bit(MMF_OOM_REAPED, &p->mm->flags)) { + test_bit(MMF_OOM_REAPED, &p->mm->flags) || + in_vfork(p)) { task_unlock(p); return 0; } -- 2.8.1 ^ permalink raw reply related [flat|nested] 28+ messages in thread
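A tiny demonstration (hypothetical, not from the series) of the point that a vforked task is just pinning the address space: until it execs or exits, the child runs on the parent's mm, so a store done in the child is visible in the parent afterwards:

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile int marker;

int main(void)
{
	pid_t pid = vfork();

	if (pid == 0) {
		marker = 1;		/* lands directly in the parent's mm */
		_exit(0);
	}
	waitpid(pid, NULL, 0);
	printf("marker=%d (1 shows the shared mm)\n", marker);
	return 0;
}

(POSIX formally allows the child to modify only the variable holding vfork's return value; the store above is solely to expose the sharing.) Such a task owns no memory of its own, which is why oom_badness now skips it via in_vfork.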
* [PATCH 06/10] mm, oom: kill all tasks sharing the mm 2016-06-20 12:43 [PATCH 0/10 -v5] Handle oom bypass more gracefully Michal Hocko ` (4 preceding siblings ...) 2016-06-20 12:43 ` [PATCH 05/10] mm, oom: skip vforked tasks from being selected Michal Hocko @ 2016-06-20 12:43 ` Michal Hocko 2016-06-20 12:43 ` [PATCH 07/10] mm, oom: fortify task_will_free_mem Michal Hocko ` (3 subsequent siblings) 9 siblings, 0 replies; 28+ messages in thread From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw) To: linux-mm Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov, Andrew Morton, LKML, Michal Hocko From: Michal Hocko <mhocko@suse.com> Currently oom_kill_process skips both the oom reaper and SIG_KILL if a process sharing the same mm is unkillable via OOM_SCORE_ADJ_MIN. After "mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj" all such processes share the same value, so we shouldn't see such a task at all (oom_badness would rule them out). We can still encounter an oom disabled vforked task which has to be killed as well if we want to have the other tasks sharing the mm reapable, because it can access the memory before doing exec. Killing such a task should be acceptable because it is highly unlikely it has done anything useful - it cannot modify any memory before it calls exec. An alternative would be to keep the task alive and skip the oom reaper, and risk all the weird corner cases where the OOM killer cannot make forward progress because the oom victim hung somewhere on the way to exit. [rientjes@google.com - drop printk when an OOM_SCORE_ADJ_MIN task is killed; the setting is inherently racy and we cannot do much about it without introducing locks in hot paths] Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Michal Hocko <mhocko@suse.com> --- mm/oom_kill.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 02da660b7c25..38f89ac2df7f 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -852,8 +852,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p, continue; if (same_thread_group(p, victim)) continue; - if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) || - p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) { + if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p)) { /* * We cannot use oom_reaper for the mm shared by this * process because it wouldn't get killed and so the -- 2.8.1 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH 07/10] mm, oom: fortify task_will_free_mem 2016-06-20 12:43 [PATCH 0/10 -v5] Handle oom bypass more gracefully Michal Hocko ` (5 preceding siblings ...) 2016-06-20 12:43 ` [PATCH 06/10] mm, oom: kill all tasks sharing the mm Michal Hocko @ 2016-06-20 12:43 ` Michal Hocko 2016-06-20 12:43 ` [PATCH 08/10] mm, oom: task_will_free_mem should skip oom_reaped tasks Michal Hocko ` (2 subsequent siblings) 9 siblings, 0 replies; 28+ messages in thread From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw) To: linux-mm Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov, Andrew Morton, LKML, Michal Hocko From: Michal Hocko <mhocko@suse.com> task_will_free_mem is rather weak. It doesn't really tell whether the task has a chance to drop its mm. 98748bd72200 ("oom: consider multi-threaded tasks in task_will_free_mem") took a first step toward making it more robust for multi-threaded applications, so now we know that the whole process is going down and will probably drop the mm. This patch builds on top of that for more complex scenarios where the mm is shared between different processes - CLONE_VM without CLONE_SIGHAND, or in-kernel use_mm(). Make sure that all processes sharing the mm are killed or exiting. This will allow us to replace try_oom_reaper by wake_oom_reaper because task_will_free_mem now implies the task is reapable. Therefore all paths which bypass the oom killer are now reapable and so they shouldn't lock up the oom killer. Changes since v4 - do not call task_will_free_mem if current->mm is NULL to catch after exit_mm allocations - as per Tetsuo Changes since v2 - per Oleg - uninline task_will_free_mem and move it to oom proper - reorganize checks in and simplify __task_will_free_mem - add missing process_shares_mm in task_will_free_mem - add same_thread_group to task_will_free_mem for clarity Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Michal Hocko <mhocko@suse.com> --- include/linux/oom.h | 26 ++-------- mm/memcontrol.c | 4 +- mm/oom_kill.c | 133 +++++++++++++++++++++++++++++++--------------------- 3 files changed, 85 insertions(+), 78 deletions(-) diff --git a/include/linux/oom.h b/include/linux/oom.h index 606137b3b778..5bc0457ee3a8 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -73,9 +73,9 @@ static inline bool oom_task_origin(const struct task_struct *p) extern void mark_oom_victim(struct task_struct *tsk); #ifdef CONFIG_MMU -extern void try_oom_reaper(struct task_struct *tsk); +extern void wake_oom_reaper(struct task_struct *tsk); #else -static inline void try_oom_reaper(struct task_struct *tsk) +static inline void wake_oom_reaper(struct task_struct *tsk) { } #endif @@ -107,27 +107,7 @@ extern void oom_killer_enable(void); extern struct task_struct *find_lock_task_mm(struct task_struct *p); -static inline bool task_will_free_mem(struct task_struct *task) -{ - struct signal_struct *sig = task->signal; - - /* - * A coredumping process may sleep for an extended period in exit_mm(), - * so the oom killer cannot assume that the process will promptly exit - * and release memory.
- */ - if (sig->flags & SIGNAL_GROUP_COREDUMP) - return false; - - if (!(task->flags & PF_EXITING)) - return false; - - /* Make sure that the whole thread group is going down */ - if (!thread_group_empty(task) && !(sig->flags & SIGNAL_GROUP_EXIT)) - return false; - - return true; -} +bool task_will_free_mem(struct task_struct *task); /* sysctls */ extern int sysctl_oom_dump_tasks; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 587aa0d110c9..b100215811ef 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1276,9 +1276,9 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, * select it. The goal is to allow it to allocate so that it may * quickly exit and free its memory. */ - if (fatal_signal_pending(current) || task_will_free_mem(current)) { + if (task_will_free_mem(current)) { mark_oom_victim(current); - try_oom_reaper(current); + wake_oom_reaper(current); goto unlock; } diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 38f89ac2df7f..8ee92fb76968 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -596,7 +596,7 @@ static int oom_reaper(void *unused) return 0; } -static void wake_oom_reaper(struct task_struct *tsk) +void wake_oom_reaper(struct task_struct *tsk) { if (!oom_reaper_th) return; @@ -614,46 +614,6 @@ static void wake_oom_reaper(struct task_struct *tsk) wake_up(&oom_reaper_wait); } -/* Check if we can reap the given task. This has to be called with stable - * tsk->mm - */ -void try_oom_reaper(struct task_struct *tsk) -{ - struct mm_struct *mm = tsk->mm; - struct task_struct *p; - - if (!mm) - return; - - /* - * There might be other threads/processes which are either not - * dying or even not killable. - */ - if (atomic_read(&mm->mm_users) > 1) { - rcu_read_lock(); - for_each_process(p) { - if (!process_shares_mm(p, mm)) - continue; - if (fatal_signal_pending(p)) - continue; - - /* - * If the task is exiting make sure the whole thread group - * is exiting and cannot acces mm anymore. - */ - if (signal_group_exit(p->signal)) - continue; - - /* Give up */ - rcu_read_unlock(); - return; - } - rcu_read_unlock(); - } - - wake_oom_reaper(tsk); -} - static int __init oom_init(void) { oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper"); @@ -665,10 +625,6 @@ static int __init oom_init(void) return 0; } subsys_initcall(oom_init) -#else -static void wake_oom_reaper(struct task_struct *tsk) -{ -} #endif /** @@ -745,6 +701,81 @@ void oom_killer_enable(void) oom_killer_disabled = false; } +static inline bool __task_will_free_mem(struct task_struct *task) +{ + struct signal_struct *sig = task->signal; + + /* + * A coredumping process may sleep for an extended period in exit_mm(), + * so the oom killer cannot assume that the process will promptly exit + * and release memory. + */ + if (sig->flags & SIGNAL_GROUP_COREDUMP) + return false; + + if (sig->flags & SIGNAL_GROUP_EXIT) + return true; + + if (thread_group_empty(task) && (task->flags & PF_EXITING)) + return true; + + return false; +} + +/* + * Checks whether the given task is dying or exiting and likely to + * release its address space. This means that all threads and processes + * sharing the same mm have to be killed or exiting. 
+ */ +bool task_will_free_mem(struct task_struct *task) +{ + struct mm_struct *mm; + struct task_struct *p; + bool ret = true; + + if (!__task_will_free_mem(task)) + return false; + + /* + * If the process has passed exit_mm we have to skip it because + * we have lost a link to other tasks sharing this mm, we do not + * have anything to reap and the task might then get stuck waiting + * for parent as zombie and we do not want it to hold TIF_MEMDIE + */ + p = find_lock_task_mm(task); + if (!p) + return false; + + mm = p->mm; + if (atomic_read(&mm->mm_users) <= 1) { + task_unlock(p); + return true; + } + + /* pin the mm to not get freed and reused */ + atomic_inc(&mm->mm_count); + task_unlock(p); + + /* + * This is really pessimistic but we do not have any reliable way + * to check that external processes share with our mm + */ + rcu_read_lock(); + for_each_process(p) { + if (!process_shares_mm(p, mm)) + continue; + if (same_thread_group(task, p)) + continue; + ret = __task_will_free_mem(p); + if (!ret) + break; + } + rcu_read_unlock(); + mmdrop(mm); + + return ret; +} + /* * Must be called while holding a reference to p, which will be released upon * returning. @@ -766,15 +797,12 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p, * If the task is already exiting, don't alarm the sysadmin or kill * its children or threads, just set TIF_MEMDIE so it can die quickly */ - task_lock(p); - if (p->mm && task_will_free_mem(p)) { + if (task_will_free_mem(p)) { mark_oom_victim(p); - try_oom_reaper(p); - task_unlock(p); + wake_oom_reaper(p); put_task_struct(p); return; } - task_unlock(p); if (__ratelimit(&oom_rs)) dump_header(oc, p); @@ -944,10 +972,9 @@ bool out_of_memory(struct oom_control *oc) * But don't select if current has already released its mm and cleared * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur. */ - if (current->mm && - (fatal_signal_pending(current) || task_will_free_mem(current))) { + if (current->mm && task_will_free_mem(current)) { mark_oom_victim(current); - try_oom_reaper(current); + wake_oom_reaper(current); return true; } -- 2.8.1 ^ permalink raw reply related [flat|nested] 28+ messages in thread
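For the in-kernel use_mm() case the changelog mentions, a rough sketch (hypothetical, not from this series) of how a kernel thread borrows a process' mm - started e.g. via kthread_run(borrow_mm_worker, mm, "borrower") after the mm's owner took an mm_users reference. While the kthread holds the mm, mm_users stays elevated even after every user task sharing it has exited:

#include <linux/kthread.h>
#include <linux/mmu_context.h>
#include <linux/sched.h>

static int borrow_mm_worker(void *data)
{
	struct mm_struct *mm = data;	/* caller handed over an mm_users ref */

	use_mm(mm);			/* this kthread now runs on mm */
	/* ... copy_{from,to}_user() against addresses valid in mm ... */
	unuse_mm(mm);
	mmput(mm);			/* drop the mm_users reference */
	return 0;
}

With the task_will_free_mem above, such a kthread shares the mm but is not exiting, so __task_will_free_mem fails for it and the oom killer bypass is correctly refused.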
* [PATCH 08/10] mm, oom: task_will_free_mem should skip oom_reaped tasks 2016-06-20 12:43 [PATCH 0/10 -v5] Handle oom bypass more gracefully Michal Hocko ` (6 preceding siblings ...) 2016-06-20 12:43 ` [PATCH 07/10] mm, oom: fortify task_will_free_mem Michal Hocko @ 2016-06-20 12:43 ` Michal Hocko 2016-06-20 12:43 ` [PATCH 09/10] mm, oom_reaper: do not attempt to reap a task more than twice Michal Hocko 2016-06-20 12:43 ` [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init Michal Hocko 9 siblings, 0 replies; 28+ messages in thread From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw) To: linux-mm Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov, Andrew Morton, LKML, Michal Hocko From: Michal Hocko <mhocko@suse.com> The 0-day robot has encountered the following: [ 82.694232] Out of memory: Kill process 3914 (trinity-c0) score 167 or sacrifice child [ 82.695110] Killed process 3914 (trinity-c0) total-vm:55864kB, anon-rss:1512kB, file-rss:1088kB, shmem-rss:25616kB [ 82.706724] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26488kB [ 82.715540] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB [ 82.717662] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB [ 82.725804] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:27296kB [ 82.739091] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:28148kB oom_reaper is trying to reap the same task again and again. This is possible only when the oom killer is bypassed because of task_will_free_mem, since tasks with MMF_OOM_REAPED already set are skipped during select_bad_process. Teach task_will_free_mem to skip over MMF_OOM_REAPED tasks as well because they are unlikely to free anything more. Analyzed-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Michal Hocko <mhocko@suse.com> --- mm/oom_kill.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 8ee92fb76968..36d5dd88d990 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -747,6 +747,16 @@ bool task_will_free_mem(struct task_struct *task) return false; mm = p->mm; + + /* + * This task has already been drained by the oom reaper so there are + * only small chances it will free some more + */ + if (test_bit(MMF_OOM_REAPED, &mm->flags)) { + task_unlock(p); + return false; + } + if (atomic_read(&mm->mm_users) <= 1) { task_unlock(p); return true; -- 2.8.1 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH 09/10] mm, oom_reaper: do not attempt to reap a task more than twice 2016-06-20 12:43 [PATCH 0/10 -v5] Handle oom bypass more gracefully Michal Hocko ` (7 preceding siblings ...) 2016-06-20 12:43 ` [PATCH 08/10] mm, oom: task_will_free_mem should skip oom_reaped tasks Michal Hocko @ 2016-06-20 12:43 ` Michal Hocko 2016-06-20 12:43 ` [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init Michal Hocko 9 siblings, 0 replies; 28+ messages in thread From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw) To: linux-mm Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov, Andrew Morton, LKML, Michal Hocko From: Michal Hocko <mhocko@suse.com> oom_reaper relies on the mmap_sem for read to do its job. Many places which might block readers have been converted to use down_write_killable and that has reduced the chances of contention a lot. Some paths where the mmap_sem is held for write can take other locks, and they might either not be prepared to fail when a fatal signal is pending or be too impractical to change. This patch introduces the MMF_OOM_NOT_REAPABLE flag, which gets set after the first attempt to reap a task's mm fails. If the flag is already set on a subsequent failure then we set MMF_OOM_REAPED to hide this mm from the oom killer completely so it can go and choose another victim. As a result the risk of an OOM deadlock, where the oom victim is blocked indefinitely so that the oom killer cannot make any progress, should be mitigated considerably, while we still try really hard to perform all reclaim attempts and stay predictable in behavior. Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Michal Hocko <mhocko@suse.com> --- include/linux/sched.h | 1 + mm/oom_kill.c | 19 +++++++++++++++++++ 2 files changed, 20 insertions(+) diff --git a/include/linux/sched.h b/include/linux/sched.h index 7442f74b6d44..6d81a1eb974a 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -512,6 +512,7 @@ static inline int get_dumpable(struct mm_struct *mm) #define MMF_HAS_UPROBES 19 /* has uprobes */ #define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */ #define MMF_OOM_REAPED 21 /* mm has been already reaped */ +#define MMF_OOM_NOT_REAPABLE 22 /* mm couldn't be reaped */ #define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 36d5dd88d990..bfddc93ccd34 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -556,8 +556,27 @@ static void oom_reap_task(struct task_struct *tsk) schedule_timeout_idle(HZ/10); if (attempts > MAX_OOM_REAP_RETRIES) { + struct task_struct *p; + pr_info("oom_reaper: unable to reap pid:%d (%s)\n", task_pid_nr(tsk), tsk->comm); + + /* + * If we've already tried to reap this task in the past and + * failed it probably doesn't make much sense to try yet again + * so hide the mm from the oom killer so that it can move on + * to another task with a different mm struct. + */ + p = find_lock_task_mm(tsk); + if (p) { + if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) { + pr_info("oom_reaper: giving up pid:%d (%s)\n", + task_pid_nr(tsk), tsk->comm); + set_bit(MMF_OOM_REAPED, &p->mm->flags); + } + task_unlock(p); + } + debug_show_all_locks(); } -- 2.8.1 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init 2016-06-20 12:43 [PATCH 0/10 -v5] Handle oom bypass more gracefully Michal Hocko ` (8 preceding siblings ...) 2016-06-20 12:43 ` [PATCH 09/10] mm, oom_reaper: do not attempt to reap a task more than twice Michal Hocko @ 2016-06-20 12:43 ` Michal Hocko 2016-07-19 12:05 ` Michal Hocko 9 siblings, 1 reply; 28+ messages in thread From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw) To: linux-mm Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov, Andrew Morton, LKML, Michal Hocko From: Michal Hocko <mhocko@suse.com> The only case where the oom_reaper is not triggered for the oom victim is when it shares the memory with a kernel thread (aka use_mm) or with the global init. After "mm, oom: skip vforked tasks from being selected" the victim cannot be a vforked task of the global init so we are left with clone(CLONE_VM) (without CLONE_SIGHAND). use_mm() users are quite rare as well. In order to guarantee a forward progress for the OOM killer make sure that this really rare cases will not get into the way and hide the mm from the oom killer by setting MMF_OOM_REAPED flag for it. oom_scan_process_thread will ignore any TIF_MEMDIE task if it has MMF_OOM_REAPED flag set to catch these oom victims. After this patch we should guarantee a forward progress for the OOM killer even when the selected victim is sharing memory with a kernel thread or global init. Changes since v1 - do not exit_oom_victim because oom_scan_process_thread will handle those which couldn't terminate in time. exit_oom_victim is not safe wrt. oom_disable synchronization. Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Michal Hocko <mhocko@suse.com> --- mm/oom_kill.c | 25 +++++++++++++++++++++---- 1 file changed, 21 insertions(+), 4 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index bfddc93ccd34..4c21f744daa6 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -283,10 +283,22 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc, /* * This task already has access to memory reserves and is being killed. - * Don't allow any other task to have access to the reserves. + * Don't allow any other task to have access to the reserves unless + * the task has MMF_OOM_REAPED because chances that it would release + * any memory is quite low. */ - if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) - return OOM_SCAN_ABORT; + if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) { + struct task_struct *p = find_lock_task_mm(task); + enum oom_scan_t ret = OOM_SCAN_ABORT; + + if (p) { + if (test_bit(MMF_OOM_REAPED, &p->mm->flags)) + ret = OOM_SCAN_CONTINUE; + task_unlock(p); + } + + return ret; + } /* * If task is allocating a lot of memory and has been marked to be @@ -913,9 +925,14 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p, /* * We cannot use oom_reaper for the mm shared by this * process because it wouldn't get killed and so the - * memory might be still used. + * memory might be still used. Hide the mm from the oom + * killer to guarantee OOM forward progress. */ can_oom_reap = false; + set_bit(MMF_OOM_REAPED, &mm->flags); + pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n", + task_pid_nr(victim), victim->comm, + task_pid_nr(p), p->comm); continue; } do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true); -- 2.8.1 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init 2016-06-20 12:43 ` [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init Michal Hocko @ 2016-07-19 12:05 ` Michal Hocko 2016-07-19 23:27 ` Andrew Morton 0 siblings, 1 reply; 28+ messages in thread From: Michal Hocko @ 2016-07-19 12:05 UTC (permalink / raw) To: linux-mm Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov, Andrew Morton, LKML Andrew, On Mon 20-06-16 14:43:48, Michal Hocko wrote: > From: Michal Hocko <mhocko@suse.com> > > The only case where the oom_reaper is not triggered for the oom victim > is when it shares the memory with a kernel thread (aka use_mm) or with > the global init. After "mm, oom: skip vforked tasks from being selected" > the victim cannot be a vforked task of the global init so we are left > with clone(CLONE_VM) (without CLONE_SIGHAND). use_mm() users are quite > rare as well. > > In order to guarantee a forward progress for the OOM killer make > sure that this really rare cases will not get into the way and hide > the mm from the oom killer by setting MMF_OOM_REAPED flag for it. > oom_scan_process_thread will ignore any TIF_MEMDIE task if it has > MMF_OOM_REAPED flag set to catch these oom victims. > > After this patch we should guarantee a forward progress for the OOM > killer even when the selected victim is sharing memory with a kernel > thread or global init. Could you replace the last two paragraphs with the following. Tetsuo didn't like the guarantee mentioned there because that is a too strong statement as find_lock_task_mm might not find any mm and so we still could end up looping on the oom victim if it gets stuck somewhere in __mmput. This particular patch didn't aim at closing that case. Plugging that hole is planned later after the next upcoming merge window closes. " In order to help a forward progress for the OOM killer, make sure that this really rare cases will not get into the way and hide the mm from the oom killer by setting MMF_OOM_REAPED flag for it. oom_scan_process_thread will ignore any TIF_MEMDIE task if it has MMF_OOM_REAPED flag set to catch these oom victims. After this patch we should guarantee a forward progress for the OOM killer even when the selected victim is sharing memory with a kernel thread or global init as long as the victims mm is still alive. " > > Changes since v1 > - do not exit_oom_victim because oom_scan_process_thread will handle > those which couldn't terminate in time. exit_oom_victim is not safe > wrt. oom_disable synchronization. > > Acked-by: Oleg Nesterov <oleg@redhat.com> > Signed-off-by: Michal Hocko <mhocko@suse.com> > --- > mm/oom_kill.c | 25 +++++++++++++++++++++---- > 1 file changed, 21 insertions(+), 4 deletions(-) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index bfddc93ccd34..4c21f744daa6 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -283,10 +283,22 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc, > > /* > * This task already has access to memory reserves and is being killed. > - * Don't allow any other task to have access to the reserves. > + * Don't allow any other task to have access to the reserves unless > + * the task has MMF_OOM_REAPED because chances that it would release > + * any memory is quite low. 
> */ > - if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) > - return OOM_SCAN_ABORT; > + if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) { > + struct task_struct *p = find_lock_task_mm(task); > + enum oom_scan_t ret = OOM_SCAN_ABORT; > + > + if (p) { > + if (test_bit(MMF_OOM_REAPED, &p->mm->flags)) > + ret = OOM_SCAN_CONTINUE; > + task_unlock(p); > + } > + > + return ret; > + } > > /* > * If task is allocating a lot of memory and has been marked to be > @@ -913,9 +925,14 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p, > /* > * We cannot use oom_reaper for the mm shared by this > * process because it wouldn't get killed and so the > - * memory might be still used. > + * memory might be still used. Hide the mm from the oom > + * killer to guarantee OOM forward progress. > */ > can_oom_reap = false; > + set_bit(MMF_OOM_REAPED, &mm->flags); > + pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n", > + task_pid_nr(victim), victim->comm, > + task_pid_nr(p), p->comm); > continue; > } > do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true); > -- > 2.8.1 > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init 2016-07-19 12:05 ` Michal Hocko @ 2016-07-19 23:27 ` Andrew Morton 2016-07-20 6:29 ` Michal Hocko 0 siblings, 1 reply; 28+ messages in thread From: Andrew Morton @ 2016-07-19 23:27 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov, LKML On Tue, 19 Jul 2016 14:05:39 +0200 Michal Hocko <mhocko@kernel.org> wrote: > > After this patch we should guarantee a forward progress for the OOM > > killer even when the selected victim is sharing memory with a kernel > > thread or global init. > > Could you replace the last two paragraphs with the following. Tetsuo > didn't like the guarantee mentioned there because that is a too strong > statement as find_lock_task_mm might not find any mm and so we still > could end up looping on the oom victim if it gets stuck somewhere in > __mmput. This particular patch didn't aim at closing that case. Plugging > that hole is planned later after the next upcoming merge window closes. > > " > In order to help a forward progress for the OOM killer, make sure > that this really rare cases will not get into the way and hide > the mm from the oom killer by setting MMF_OOM_REAPED flag for it. > oom_scan_process_thread will ignore any TIF_MEMDIE task if it has > MMF_OOM_REAPED flag set to catch these oom victims. > > After this patch we should guarantee a forward progress for the OOM > killer even when the selected victim is sharing memory with a kernel > thread or global init as long as the victims mm is still alive. > " I tweaked it a bit: : In order to help forward progress for the OOM killer, make sure that : this really rare case will not get in the way - we do this by hiding : the mm from the oom killer by setting MMF_OOM_REAPED flag for it. : oom_scan_process_thread will ignore any TIF_MEMDIE task if it has : MMF_OOM_REAPED flag set to catch these oom victims. : : After this patch we should guarantee forward progress for the OOM : killer even when the selected victim is sharing memory with a kernel : thread or global init as long as the victims mm is still alive. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init 2016-07-19 23:27 ` Andrew Morton @ 2016-07-20 6:29 ` Michal Hocko 0 siblings, 0 replies; 28+ messages in thread From: Michal Hocko @ 2016-07-20 6:29 UTC (permalink / raw) To: Andrew Morton Cc: linux-mm, Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov, LKML On Tue 19-07-16 16:27:59, Andrew Morton wrote: > On Tue, 19 Jul 2016 14:05:39 +0200 Michal Hocko <mhocko@kernel.org> wrote: > > > > After this patch we should guarantee a forward progress for the OOM > > > killer even when the selected victim is sharing memory with a kernel > > > thread or global init. > > > > Could you replace the last two paragraphs with the following. Tetsuo > > didn't like the guarantee mentioned there because that is a too strong > > statement as find_lock_task_mm might not find any mm and so we still > > could end up looping on the oom victim if it gets stuck somewhere in > > __mmput. This particular patch didn't aim at closing that case. Plugging > > that hole is planned later after the next upcoming merge window closes. > > > > " > > In order to help a forward progress for the OOM killer, make sure > > that this really rare cases will not get into the way and hide > > the mm from the oom killer by setting MMF_OOM_REAPED flag for it. > > oom_scan_process_thread will ignore any TIF_MEMDIE task if it has > > MMF_OOM_REAPED flag set to catch these oom victims. > > > > After this patch we should guarantee a forward progress for the OOM > > killer even when the selected victim is sharing memory with a kernel > > thread or global init as long as the victims mm is still alive. > > " > > I tweaked it a bit: > > : In order to help forward progress for the OOM killer, make sure that > : this really rare case will not get in the way - we do this by hiding > : the mm from the oom killer by setting MMF_OOM_REAPED flag for it. > : oom_scan_process_thread will ignore any TIF_MEMDIE task if it has > : MMF_OOM_REAPED flag set to catch these oom victims. > : > : After this patch we should guarantee forward progress for the OOM > : killer even when the selected victim is sharing memory with a kernel > : thread or global init as long as the victims mm is still alive. Sounds good to me. Thanks! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 0/10 -v4] Handle oom bypass more gracefully @ 2016-06-09 11:52 Michal Hocko 2016-06-09 11:52 ` [PATCH 07/10] mm, oom: fortify task_will_free_mem Michal Hocko 0 siblings, 1 reply; 28+ messages in thread From: Michal Hocko @ 2016-06-09 11:52 UTC (permalink / raw) To: linux-mm Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov, Andrew Morton, LKML Hi, this is the v4 version of the patchset. The previous version was posted at http://lkml.kernel.org/r/1464945404-30157-1-git-send-email-mhocko@kernel.org It was mostly my screwups during the rebase that were fixed, and patch 10 was reworked. The following 10 patches should put some order to very rare cases of mm shared between processes and make the paths which bypass the oom killer oom reapable and therefore, finally, much more reliable. Even though an mm shared outside of a thread group is rare (either vforked tasks for a short period, use_mm by kernel threads or the exotic thread model of clone(CLONE_VM) without CLONE_SIGHAND), it is better to cover those cases. Not only does the current handling make the oom killer logic quite hard to follow and reason about, it can lead to weird corner cases. E.g. it is possible to select an oom victim which shares the mm with an unkillable process, or to bypass the oom killer even when other processes sharing the mm are still alive, and other weird cases. Patch 1 drops the bogus task_lock and mm check from oom_{score_}adj_write. This can be considered a bug fix with a low impact as nobody has noticed it for years. Patch 2 drops the sighand lock because it is not needed anymore, as pointed out by Oleg. Patch 3 is a cleanup of the oom_score_adj handling and preparatory work for later patches. Patch 4 enforces oom_score_adj to be consistent between processes sharing the mm so that they behave consistently with regular thread groups. This can be considered a user visible behavior change because one thread group updating oom_score_adj will affect others which share the same mm via clone(CLONE_VM). I argue that this should be acceptable because we already have the same behavior for threads in the same thread group and sharing the mm without the signal struct is just a different model of threading. This is probably the most controversial part of the series; I would like to find some consensus here. There were some suggestions to hook some counter/oom_score_adj into the mm_struct, but I feel that this is not necessary right now and we can rely on the proc handler + oom_kill_process to DTRT. I can be convinced otherwise, but I strongly think that whatever we do, the userspace has to have a way to see the current oom priority as consistently as possible. Patch 5 makes sure that no vforked task is selected if it is sharing the mm with an oom-unkillable task. Patch 6 ensures that all user tasks sharing the mm are killed, which in turn makes sure that all oom victims are oom reapable. Patch 7 guarantees that task_will_free_mem will always imply a reapable bypass of the oom killer. Patch 8 is new in this version and addresses an issue pointed out by a 0-day OOM report where an oom victim was reaped several times. Patch 9 puts an upper bound on how many times oom_reaper tries to reap a task and hides it from the oom killer to move on when no progress can be made. This gives an upper bound on how long an oom-reapable task can block the oom killer from selecting another victim if the oom_reaper is not able to reap the victim. Patch 10 tries to plug the (hopefully) last hole, where we can still lock up when the oom victim is shared with oom-unkillable tasks (kthreads and global init).
We just try to be best effort in that case and rather fall back to killing something else than risk a lockup. I believe that timeout based heuristics are no longer really needed after this series. We still have some room for improvements, though. I would like to explore ways to remove the kthreads (use_mm) special case. It shouldn't be that hard; we just have to teach the page fault handler to recognize an oom victim's mm and enforce EFAULT for kthreads which have borrowed that mm. The patchset is based on the current mmotm tree (mmotm-2016-06-08-15-25). I would really appreciate a deep review as this area is full of land mines, but I hope I've made the code much cleaner with fewer kludges. I have pushed the patchset to my git tree git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git to branch attempts/process-share-mm-oom-sanitization Michal Hocko (10): proc, oom: drop bogus task_lock and mm check proc, oom: drop bogus sighand lock proc, oom_adj: extract oom_score_adj setting into a helper mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj mm, oom: skip vforked tasks from being selected mm, oom: kill all tasks sharing the mm mm, oom: fortify task_will_free_mem mm, oom: task_will_free_mem should skip oom_reaped tasks mm, oom_reaper: do not attempt to reap a task more than twice mm, oom: hide mm which is shared with kthread or global init fs/proc/base.c | 185 ++++++++++++++++++++++++---------------------- include/linux/mm.h | 2 + include/linux/oom.h | 26 +------ include/linux/sched.h | 27 +++++++ mm/memcontrol.c | 4 +- mm/oom_kill.c | 201 ++++++++++++++++++++++++++++++++++---------------- 6 files changed, 266 insertions(+), 179 deletions(-) ^ permalink raw reply [flat|nested] 28+ messages in thread
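The use_mm() direction sketched in the cover letter above could look roughly like the following. This is a hedged sketch only - MMF_UNSTABLE is an illustrative flag name and the hook placement in the fault path is an assumption, not something this series contains:

	/*
	 * Once the oom_reaper has unmapped parts of an mm, a page fault
	 * on that mm (e.g. from a kthread which borrowed it via use_mm())
	 * must not silently instantiate fresh zero pages. Failing the
	 * fault surfaces as -EFAULT to the kthread's copy_*_user access.
	 */
	static int check_stable_address_space(struct mm_struct *mm)
	{
		if (unlikely(test_bit(MMF_UNSTABLE, &mm->flags)))
			return VM_FAULT_SIGBUS;
		return 0;
	}

	/* assumed call site: early in the fault path, before page tables
	 * are touched */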
* [PATCH 07/10] mm, oom: fortify task_will_free_mem 2016-06-09 11:52 [PATCH 0/10 -v4] Handle oom bypass more gracefully Michal Hocko @ 2016-06-09 11:52 ` Michal Hocko 2016-06-09 13:18 ` Tetsuo Handa 0 siblings, 1 reply; 28+ messages in thread From: Michal Hocko @ 2016-06-09 11:52 UTC (permalink / raw) To: linux-mm Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov, Andrew Morton, LKML, Michal Hocko From: Michal Hocko <mhocko@suse.com> task_will_free_mem is rather weak. It doesn't really tell whether the task has chance to drop its mm. 98748bd72200 ("oom: consider multi-threaded tasks in task_will_free_mem") made a first step into making it more robust for multi-threaded applications so now we know that the whole process is going down and probably drop the mm. This patch builds on top for more complex scenarios where mm is shared between different processes - CLONE_VM without CLONE_SIGHAND, or in kernel use_mm(). Make sure that all processes sharing the mm are killed or exiting. This will allow us to replace try_oom_reaper by wake_oom_reaper because task_will_free_mem implies the task is reapable now. Therefore all paths which bypass the oom killer are now reapable and so they shouldn't lock up the oom killer. Changes since v2 - per Oleg - uninline task_will_free_mem and move it to oom proper - reorganize checks in and simplify __task_will_free_mem - add missing process_shares_mm in task_will_free_mem - add same_thread_group to task_will_free_mem for clarity Signed-off-by: Michal Hocko <mhocko@suse.com> --- include/linux/oom.h | 26 ++-------- mm/memcontrol.c | 4 +- mm/oom_kill.c | 136 ++++++++++++++++++++++++++++++---------------------- 3 files changed, 85 insertions(+), 81 deletions(-) diff --git a/include/linux/oom.h b/include/linux/oom.h index 606137b3b778..5bc0457ee3a8 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -73,9 +73,9 @@ static inline bool oom_task_origin(const struct task_struct *p) extern void mark_oom_victim(struct task_struct *tsk); #ifdef CONFIG_MMU -extern void try_oom_reaper(struct task_struct *tsk); +extern void wake_oom_reaper(struct task_struct *tsk); #else -static inline void try_oom_reaper(struct task_struct *tsk) +static inline void wake_oom_reaper(struct task_struct *tsk) { } #endif @@ -107,27 +107,7 @@ extern void oom_killer_enable(void); extern struct task_struct *find_lock_task_mm(struct task_struct *p); -static inline bool task_will_free_mem(struct task_struct *task) -{ - struct signal_struct *sig = task->signal; - - /* - * A coredumping process may sleep for an extended period in exit_mm(), - * so the oom killer cannot assume that the process will promptly exit - * and release memory. - */ - if (sig->flags & SIGNAL_GROUP_COREDUMP) - return false; - - if (!(task->flags & PF_EXITING)) - return false; - - /* Make sure that the whole thread group is going down */ - if (!thread_group_empty(task) && !(sig->flags & SIGNAL_GROUP_EXIT)) - return false; - - return true; -} +bool task_will_free_mem(struct task_struct *task); /* sysctls */ extern int sysctl_oom_dump_tasks; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 83a6a2b92301..de67daf0394e 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1276,9 +1276,9 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, * select it. The goal is to allow it to allocate so that it may * quickly exit and free its memory. 
*/ - if (fatal_signal_pending(current) || task_will_free_mem(current)) { + if (task_will_free_mem(current)) { mark_oom_victim(current); - try_oom_reaper(current); + wake_oom_reaper(current); goto unlock; } diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 38f89ac2df7f..21bf88a546d3 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -596,7 +596,7 @@ static int oom_reaper(void *unused) return 0; } -static void wake_oom_reaper(struct task_struct *tsk) +void wake_oom_reaper(struct task_struct *tsk) { if (!oom_reaper_th) return; @@ -614,46 +614,6 @@ static void wake_oom_reaper(struct task_struct *tsk) wake_up(&oom_reaper_wait); } -/* Check if we can reap the given task. This has to be called with stable - * tsk->mm - */ -void try_oom_reaper(struct task_struct *tsk) -{ - struct mm_struct *mm = tsk->mm; - struct task_struct *p; - - if (!mm) - return; - - /* - * There might be other threads/processes which are either not - * dying or even not killable. - */ - if (atomic_read(&mm->mm_users) > 1) { - rcu_read_lock(); - for_each_process(p) { - if (!process_shares_mm(p, mm)) - continue; - if (fatal_signal_pending(p)) - continue; - - /* - * If the task is exiting make sure the whole thread group - * is exiting and cannot acces mm anymore. - */ - if (signal_group_exit(p->signal)) - continue; - - /* Give up */ - rcu_read_unlock(); - return; - } - rcu_read_unlock(); - } - - wake_oom_reaper(tsk); -} - static int __init oom_init(void) { oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper"); @@ -665,10 +625,6 @@ static int __init oom_init(void) return 0; } subsys_initcall(oom_init) -#else -static void wake_oom_reaper(struct task_struct *tsk) -{ -} #endif /** @@ -745,6 +701,81 @@ void oom_killer_enable(void) oom_killer_disabled = false; } +static inline bool __task_will_free_mem(struct task_struct *task) +{ + struct signal_struct *sig = task->signal; + + /* + * A coredumping process may sleep for an extended period in exit_mm(), + * so the oom killer cannot assume that the process will promptly exit + * and release memory. + */ + if (sig->flags & SIGNAL_GROUP_COREDUMP) + return false; + + if (sig->flags & SIGNAL_GROUP_EXIT) + return true; + + if (thread_group_empty(task) && (task->flags & PF_EXITING)) + return true; + + return false; +} + +/* + * Checks whether the given task is dying or exiting and likely to + * release its address space. This means that all threads and processes + * sharing the same mm have to be killed or exiting. 
+ */ +bool task_will_free_mem(struct task_struct *task) +{ + struct mm_struct *mm; + struct task_struct *p; + bool ret = true; + + if (!__task_will_free_mem(task)) + return false; + + /* + * If the process has passed exit_mm we have to skip it because + * we have lost a link to other tasks sharing this mm, we do not + * have anything to reap and the task might then get stuck waiting + * for parent as zombie and we do not want it to hold TIF_MEMDIE + */ + p = find_lock_task_mm(task); + if (!p) + return false; + + mm = p->mm; + if (atomic_read(&mm->mm_users) <= 1) { + task_unlock(p); + return true; + } + + /* pin the mm to not get freed and reused */ + atomic_inc(&mm->mm_count); + task_unlock(p); + + /* + * This is really pessimistic but we do not have any reliable way + * to check that external processes share with our mm + */ + rcu_read_lock(); + for_each_process(p) { + if (!process_shares_mm(p, mm)) + continue; + if (same_thread_group(task, p)) + continue; + ret = __task_will_free_mem(p); + if (!ret) + break; + } + rcu_read_unlock(); + mmdrop(mm); + + return ret; +} + /* * Must be called while holding a reference to p, which will be released upon * returning. @@ -766,15 +797,12 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p, * If the task is already exiting, don't alarm the sysadmin or kill * its children or threads, just set TIF_MEMDIE so it can die quickly */ - task_lock(p); - if (p->mm && task_will_free_mem(p)) { + if (task_will_free_mem(p)) { mark_oom_victim(p); - try_oom_reaper(p); - task_unlock(p); + wake_oom_reaper(p); put_task_struct(p); return; } - task_unlock(p); if (__ratelimit(&oom_rs)) dump_header(oc, p); @@ -940,14 +968,10 @@ bool out_of_memory(struct oom_control *oc) * If current has a pending SIGKILL or is exiting, then automatically * select it. The goal is to allow it to allocate so that it may * quickly exit and free its memory. - * - * But don't select if current has already released its mm and cleared - * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur. */ - if (current->mm && - (fatal_signal_pending(current) || task_will_free_mem(current))) { + if (task_will_free_mem(current)) { mark_oom_victim(current); - try_oom_reaper(current); + wake_oom_reaper(current); return true; } -- 2.8.1 ^ permalink raw reply related [flat|nested] 28+ messages in thread
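A side note on the two reference counters the new task_will_free_mem() above juggles; this only restates what the patch does, it is not additional code from the series:

	/*
	 * mm_users protects the address space: the final mmput() runs
	 * __mmput() and tears the mappings down. mm_count only keeps
	 * struct mm_struct itself allocated. Taking mm_count under
	 * task_lock() therefore lets the process list be walked safely
	 * without blocking or delaying __mmput().
	 */
	task_lock(p);
	mm = p->mm;
	atomic_inc(&mm->mm_count);	/* pin the struct, not the address space */
	task_unlock(p);

	rcu_read_lock();
	/* ... check every process sharing mm ... */
	rcu_read_unlock();

	mmdrop(mm);	/* drop the pin; may free mm_struct if we were last */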
* Re: [PATCH 07/10] mm, oom: fortify task_will_free_mem 2016-06-09 11:52 ` [PATCH 07/10] mm, oom: fortify task_will_free_mem Michal Hocko @ 2016-06-09 13:18 ` Tetsuo Handa 2016-06-09 14:20 ` Michal Hocko 0 siblings, 1 reply; 28+ messages in thread From: Tetsuo Handa @ 2016-06-09 13:18 UTC (permalink / raw) To: mhocko, linux-mm; +Cc: rientjes, oleg, vdavydov, akpm, linux-kernel, mhocko Michal Hocko wrote: > @@ -766,15 +797,12 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p, > * If the task is already exiting, don't alarm the sysadmin or kill > * its children or threads, just set TIF_MEMDIE so it can die quickly > */ > - task_lock(p); > - if (p->mm && task_will_free_mem(p)) { > + if (task_will_free_mem(p)) { I think it is possible that p->mm becomes NULL here. Also, I think setting TIF_MEMDIE on p when find_lock_task_mm(p) != p is wrong. While oom_reap_task() will anyway clear TIF_MEMDIE even if we set TIF_MEMDIE on p when p->mm == NULL, it is not true for CONFIG_MMU=n case. > mark_oom_victim(p); > - try_oom_reaper(p); > - task_unlock(p); > + wake_oom_reaper(p); > put_task_struct(p); > return; > } > - task_unlock(p); > > if (__ratelimit(&oom_rs)) > dump_header(oc, p); > @@ -940,14 +968,10 @@ bool out_of_memory(struct oom_control *oc) > * If current has a pending SIGKILL or is exiting, then automatically > * select it. The goal is to allow it to allocate so that it may > * quickly exit and free its memory. > - * > - * But don't select if current has already released its mm and cleared > - * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur. > */ > - if (current->mm && > - (fatal_signal_pending(current) || task_will_free_mem(current))) { > + if (task_will_free_mem(current)) { Setting TIF_MEMDIE on current when current->mm == NULL and find_lock_task_mm(current) != NULL is wrong. > mark_oom_victim(current); > - try_oom_reaper(current); > + wake_oom_reaper(current); > return true; > } ^ permalink raw reply [flat|nested] 28+ messages in thread
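For readers following this exchange: find_lock_task_mm() returns any thread of the given process which still has a non-NULL mm, locked, which is why the returned thread may differ from the task that was passed in. Roughly (paraphrased from mm/oom_kill.c of that era, trimmed):

	struct task_struct *find_lock_task_mm(struct task_struct *p)
	{
		struct task_struct *t;

		rcu_read_lock();
		for_each_thread(p, t) {
			task_lock(t);
			if (likely(t->mm))
				goto found;
			task_unlock(t);
		}
		t = NULL;
	found:
		rcu_read_unlock();
		return t;	/* a locked thread with a live mm, or NULL */
	}

So find_lock_task_mm(p) != p happens exactly when p itself (typically a group leader past exit_mm()) has lost its mm while a sibling thread still holds one - the case Tetsuo is pointing at for CONFIG_MMU=n.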
* Re: [PATCH 07/10] mm, oom: fortify task_will_free_mem 2016-06-09 13:18 ` Tetsuo Handa @ 2016-06-09 14:20 ` Michal Hocko 2016-06-11 8:10 ` Tetsuo Handa 0 siblings, 1 reply; 28+ messages in thread From: Michal Hocko @ 2016-06-09 14:20 UTC (permalink / raw) To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel On Thu 09-06-16 22:18:28, Tetsuo Handa wrote: > Michal Hocko wrote: > > @@ -766,15 +797,12 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p, > > * If the task is already exiting, don't alarm the sysadmin or kill > > * its children or threads, just set TIF_MEMDIE so it can die quickly > > */ > > - task_lock(p); > > - if (p->mm && task_will_free_mem(p)) { > > + if (task_will_free_mem(p)) { > > I think it is possible that p->mm becomes NULL here. Yes p->mm can become NULL at any time after we drop the task_lock. > Also, I think setting TIF_MEMDIE on p when find_lock_task_mm(p) != p is > wrong. While oom_reap_task() will anyway clear TIF_MEMDIE even if we set > TIF_MEMDIE on p when p->mm == NULL, it is not true for CONFIG_MMU=n case. Yes this would be racy for !CONFIG_MMU but does it actually matter? AFAIU, in the !CONFIG_MMU model basically everything is allocated during initialization, so it is highly unlikely that we would do any allocations past exit_mm(), which releases the user memory - and that should be sufficient to move on and get out of the OOM situation. Also the programming model is really careful about memory usage (fork is basically a no go), large memory mappings are really hard due to memory fragmentation etc... So do we really have to nitpick about !CONFIG_MMU to move on here? I definitely do not want to break this configuration but I believe that this particular change has only a tiny chance to make any difference. I might be missing something of course... [...] > > @@ -940,14 +968,10 @@ bool out_of_memory(struct oom_control *oc) > > * If current has a pending SIGKILL or is exiting, then automatically > > * select it. The goal is to allow it to allocate so that it may > > * quickly exit and free its memory. > > - * > > - * But don't select if current has already released its mm and cleared > > - * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur. > > */ > > - if (current->mm && > > - (fatal_signal_pending(current) || task_will_free_mem(current))) { > > + if (task_will_free_mem(current)) { > > Setting TIF_MEMDIE on current when current->mm == NULL and > find_lock_task_mm(current) != NULL is wrong. Why? Or is this still about the !CONFIG_MMU? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 07/10] mm, oom: fortify task_will_free_mem 2016-06-09 14:20 ` Michal Hocko @ 2016-06-11 8:10 ` Tetsuo Handa 2016-06-13 11:27 ` Michal Hocko 0 siblings, 1 reply; 28+ messages in thread From: Tetsuo Handa @ 2016-06-11 8:10 UTC (permalink / raw) To: mhocko; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel Michal Hocko wrote: > > Also, I think setting TIF_MEMDIE on p when find_lock_task_mm(p) != p is > > wrong. While oom_reap_task() will anyway clear TIF_MEMDIE even if we set > > TIF_MEMDIE on p when p->mm == NULL, it is not true for CONFIG_MMU=n case. > > Yes this would be racy for !CONFIG_MMU but does it actually matter? I don't know because I've never used CONFIG_MMU=n kernels. But I think it actually matters. You fixed this race by commit 83363b917a2982dd ("oom: make sure that TIF_MEMDIE is set under task_lock"). > > > @@ -940,14 +968,10 @@ bool out_of_memory(struct oom_control *oc) > > > * If current has a pending SIGKILL or is exiting, then automatically > > > * select it. The goal is to allow it to allocate so that it may > > > * quickly exit and free its memory. > > > - * > > > - * But don't select if current has already released its mm and cleared > > > - * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur. > > > */ > > > - if (current->mm && > > > - (fatal_signal_pending(current) || task_will_free_mem(current))) { > > > + if (task_will_free_mem(current)) { > > > > Setting TIF_MEMDIE on current when current->mm == NULL and > > find_lock_task_mm(current) != NULL is wrong. > > Why? Or is this still about the !CONFIG_MMU? Yes, mainly about CONFIG_MMU=n kernels. This change reintroduces the possibility of failing to clear TIF_MEMDIE, which was fixed by commit d7a94e7e11badf84 ("oom: don't count on mm-less current process"). I was also thinking about the possibility of the OOM reaper failing to clear TIF_MEMDIE due to a race with PM/freezing. That is: (1) TIF_MEMDIE is set on a thread which has already released its mm; (2) try_to_freeze_tasks(true) from freeze_processes() freezes that thread; (3) oom_reap_task() fails to clear TIF_MEMDIE from that thread because oom_killer_disabled == true; (4) a subsequent memory allocation fails and hibernation aborts; (5) all threads are thawed, but the TIF_MEMDIE thread never clears TIF_MEMDIE; (6) finally the system locks up because nobody will clear TIF_MEMDIE. But as I posted in another thread, we can use an approach which does not introduce the possibility of failing to clear TIF_MEMDIE due to a race with PM/freezing. So, this is purely about CONFIG_MMU=n kernels. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 07/10] mm, oom: fortify task_will_free_mem 2016-06-11 8:10 ` Tetsuo Handa @ 2016-06-13 11:27 ` Michal Hocko 2016-06-16 12:54 ` Tetsuo Handa 0 siblings, 1 reply; 28+ messages in thread From: Michal Hocko @ 2016-06-13 11:27 UTC (permalink / raw) To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel On Sat 11-06-16 17:10:03, Tetsuo Handa wrote: > Michal Hocko wrote: > > > Also, I think setting TIF_MEMDIE on p when find_lock_task_mm(p) != p is > > > wrong. While oom_reap_task() will anyway clear TIF_MEMDIE even if we set > > > TIF_MEMDIE on p when p->mm == NULL, it is not true for CONFIG_MMU=n case. > > > > Yes this would be racy for !CONFIG_MMU but does it actually matter? > > I don't know because I've never used CONFIG_MMU=n kernels. But I think it > actually matters. You fixed this race by commit 83363b917a2982dd ("oom: > make sure that TIF_MEMDIE is set under task_lock"). Yes and that commit was trying to address a highly theoretical issue reported by you. Let me quote: :oom_kill_process is currently prone to a race condition when the OOM :victim is already exiting and TIF_MEMDIE is set after the task releases :its address space. This might theoretically lead to OOM livelock if the :OOM victim blocks on an allocation later during exiting because it :wouldn't kill any other process and the exiting one won't be able to exit. :The situation is highly unlikely because the OOM victim is expected to :release some memory which should help to sort out OOM situation. Even if such a race is possible it wouldn't be with the oom reaper. Regarding CONFIG_MMU=n I am even less sure it is possible and I would rather focus on CONFIG_MMU=y where we know that problems exist rather than speculating about something as special as nommu which even might not care at all. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 07/10] mm, oom: fortify task_will_free_mem 2016-06-13 11:27 ` Michal Hocko @ 2016-06-16 12:54 ` Tetsuo Handa 2016-06-16 14:29 ` Michal Hocko 0 siblings, 1 reply; 28+ messages in thread From: Tetsuo Handa @ 2016-06-16 12:54 UTC (permalink / raw) To: mhocko; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel Michal Hocko wrote: > On Sat 11-06-16 17:10:03, Tetsuo Handa wrote: > > Michal Hocko wrote: > > > > Also, I think setting TIF_MEMDIE on p when find_lock_task_mm(p) != p is > > > > wrong. While oom_reap_task() will anyway clear TIF_MEMDIE even if we set > > > > TIF_MEMDIE on p when p->mm == NULL, it is not true for CONFIG_MMU=n case. > > > > > > Yes this would be racy for !CONFIG_MMU but does it actually matter? > > > > I don't know because I've never used CONFIG_MMU=n kernels. But I think it > > actually matters. You fixed this race by commit 83363b917a2982dd ("oom: > > make sure that TIF_MEMDIE is set under task_lock"). > > Yes and that commit was trying to address a highly theoretical issue > reported by you. Let me quote: > :oom_kill_process is currently prone to a race condition when the OOM > :victim is already exiting and TIF_MEMDIE is set after the task releases > :its address space. This might theoretically lead to OOM livelock if the > :OOM victim blocks on an allocation later during exiting because it > :wouldn't kill any other process and the exiting one won't be able to exit. > :The situation is highly unlikely because the OOM victim is expected to > :release some memory which should help to sort out OOM situation. > > Even if such a race is possible it wouldn't be with the oom > reaper. Regarding CONFIG_MMU=n I am even less sure it is possible and > I would rather focus on CONFIG_MMU=y where we know that problems exist > rather than speculating about something as special as nommu which even > might not care at all. I still don't like it. current->mm == NULL in - if (current->mm && - (fatal_signal_pending(current) || task_will_free_mem(current))) { + if (task_will_free_mem(current)) { is not highly unlikely. You obviously break commit d7a94e7e11badf84 ("oom: don't count on mm-less current process") on CONFIG_MMU=n kernels. Also, since commit f44666b04605d1c7 ("mm,oom: speed up select_bad_process() loop") changed the scan to iterate over thread group leaders, it is no longer highly unlikely that p is a thread group leader which has already released its mm. What you call "a highly theoretical issue" (which was true when commit 83363b917a2982dd ("oom: make sure that TIF_MEMDIE is set under task_lock") was proposed) may not be true any more. Regarding CONFIG_MMU=n kernels, making sure that TIF_MEMDIE is set on a thread with a non-NULL mm does matter. ^ permalink raw reply [flat|nested] 28+ messages in thread
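A brief illustration of the iteration change Tetsuo references (simplified shape, not the verbatim kernel code): since f44666b04605d1c7 the victim scan visits only thread group leaders, so the candidate handed to the TIF_MEMDIE-setting paths can itself be mm-less even though its thread group still owns the mm:

	rcu_read_lock();
	for_each_process(p) {	/* group leaders only, not every thread */
		/*
		 * p->mm may already be NULL here (the leader did
		 * exit_mm() while siblings live on); only
		 * find_lock_task_mm() can tell whether the group
		 * still has an address space.
		 */
		...
	}
	rcu_read_unlock();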
* Re: [PATCH 07/10] mm, oom: fortify task_will_free_mem 2016-06-16 12:54 ` Tetsuo Handa @ 2016-06-16 14:29 ` Michal Hocko 2016-06-16 15:40 ` Tetsuo Handa 0 siblings, 1 reply; 28+ messages in thread From: Michal Hocko @ 2016-06-16 14:29 UTC (permalink / raw) To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel On Thu 16-06-16 21:54:27, Tetsuo Handa wrote: > Michal Hocko wrote: > > On Sat 11-06-16 17:10:03, Tetsuo Handa wrote: [...] > I still don't like it. current->mm == NULL in > > - if (current->mm && > - (fatal_signal_pending(current) || task_will_free_mem(current))) { > + if (task_will_free_mem(current)) { > > is not highly unlikely. You obviously break commit d7a94e7e11badf84 > ("oom: don't count on mm-less current process") on CONFIG_MMU=n kernels. I still fail to see why you care about that case so much. The heuristic was broken for other reasons before this patch. The patch fixes a class of issues for both mmu and nommu. I can restore the current->mm check for now but the more I am thinking about it the less I am sure the commit you are referring to is even correct/necessary. It claims that the OOM killer would be stuck because the child would be sitting in the final schedule() until the parent reaps it. That is not true, though, because the victim would be unhashed down the release_task() path so it is not visible to the oom killer when it is waiting for the parent. I have completely missed that part when reviewing the patch. Or am I missing something... Anyway, would you be OK with the patch if I added the current->mm check and resolve its necessity in a separate patch? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 28+ messages in thread
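The exit path ordering being argued about, very much condensed (kernels of that era; almost everything elided):

	void do_exit(long code)
	{
		...
		exit_mm(tsk);	/* tsk->mm = NULL; mmput() may free the mm */
		...
		exit_notify(tsk, group_dead);
		/*
		 * exit_notify() calls release_task() - which unhashes the
		 * task from the task lists - only when the task autoreaps;
		 * otherwise the zombie stays visible to for_each_process()
		 * until the parent wait()s and release_task() runs from
		 * the wait path.
		 */
		...
		tsk->state = TASK_DEAD;
		schedule();	/* final schedule(); never returns */
	}

Whether the zombie is still visible to the oom killer thus hinges on whether release_task() has run yet, which is exactly the crux of the disagreement in the next message.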
* Re: [PATCH 07/10] mm, oom: fortify task_will_free_mem 2016-06-16 14:29 ` Michal Hocko @ 2016-06-16 15:40 ` Tetsuo Handa 2016-06-16 15:53 ` Michal Hocko 0 siblings, 1 reply; 28+ messages in thread From: Tetsuo Handa @ 2016-06-16 15:40 UTC (permalink / raw) To: mhocko; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel Michal Hocko wrote: > On Thu 16-06-16 21:54:27, Tetsuo Handa wrote: > > Michal Hocko wrote: > > > On Sat 11-06-16 17:10:03, Tetsuo Handa wrote: > [...] > > I still don't like it. current->mm == NULL in > > > > - if (current->mm && > > - (fatal_signal_pending(current) || task_will_free_mem(current))) { > > + if (task_will_free_mem(current)) { > > > > is not highly unlikely. You obviously break commit d7a94e7e11badf84 > > ("oom: don't count on mm-less current process") on CONFIG_MMU=n kernels. > > I still fail to see why you care about that case so much. The heuristic > was broken for other reasons before this patch. The patch fixes a class > of issues for both mmu and nommu. I can restore the current->mm check > for now but the more I am thinking about it the less I am sure the > commit you are referring to is even correct/necessary. > > It claims that the OOM killer would be stuck because the child would be > sitting in the final schedule() until the parent reaps it. That is not > true, though, because the victim would be unhashed down the release_task() > path so it is not visible to the oom killer when it is waiting for the > parent. I have completely missed that part when reviewing the patch. Or > am I missing something... That explanation started from 201411292304.CGF68419.MOLHVQtSFFOOJF@I-love.SAKURA.ne.jp (Sat, 29 Nov 2014 23:04:33 +0900) in your mailbox. I confirmed that a TIF_MEMDIE zombie inside the final schedule() in do_exit() is waiting for its parent to reap it. release_task() will be called when the parent notices that there is a zombie, but in this OOM livelock situation the parent, looping inside the page allocator waiting for that TIF_MEMDIE zombie, was prevented from noticing that there is a zombie. > > Anyway, would you be OK with the patch if I added the current->mm check > > and resolve its necessity in a separate patch? Please correct task_will_free_mem() in oom_kill_process() as well. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 07/10] mm, oom: fortify task_will_free_mem 2016-06-16 15:40 ` Tetsuo Handa @ 2016-06-16 15:53 ` Michal Hocko 2016-06-17 11:38 ` Tetsuo Handa 0 siblings, 1 reply; 28+ messages in thread From: Michal Hocko @ 2016-06-16 15:53 UTC (permalink / raw) To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel On Fri 17-06-16 00:40:41, Tetsuo Handa wrote: > Michal Hocko wrote: > > On Thu 16-06-16 21:54:27, Tetsuo Handa wrote: > > > Michal Hocko wrote: > > > > On Sat 11-06-16 17:10:03, Tetsuo Handa wrote: > > [...] > > > I still don't like it. current->mm == NULL in > > > > > > - if (current->mm && > > > - (fatal_signal_pending(current) || task_will_free_mem(current))) { > > > + if (task_will_free_mem(current)) { > > > > > > is not highly unlikely. You obviously break commit d7a94e7e11badf84 > > > ("oom: don't count on mm-less current process") on CONFIG_MMU=n kernels. > > > > I still fail to see why you care about that case so much. The heuristic > > was broken for other reasons before this patch. The patch fixes a class > > of issues for both mmu and nommu. I can restore the current->mm check > > for now but the more I am thinking about it the less I am sure the > > commit you are referring to is even correct/necessary. > > > > It claims that the OOM killer would be stuck because the child would be > > sitting in the final schedule() until the parent reaps it. That is not > > true, though, because the victim would be unhashed down the release_task() > > path so it is not visible to the oom killer when it is waiting for the > > parent. I have completely missed that part when reviewing the patch. Or > > am I missing something... > > That explanation started from 201411292304.CGF68419.MOLHVQtSFFOOJF@I-love.SAKURA.ne.jp > (Sat, 29 Nov 2014 23:04:33 +0900) in your mailbox. I confirmed that a TIF_MEMDIE > zombie inside the final schedule() in do_exit() is waiting for its parent to reap it. > release_task() will be called when the parent notices that there is a zombie, but > in this OOM livelock situation the parent, looping inside the page allocator waiting > for that TIF_MEMDIE zombie, was prevented from noticing that there is a zombie. I cannot seem to find this msg-id. Anyway, let's forget it for now to not get sidetracked. I have to study that code more deeply to better understand it. > > Anyway, would you be OK with the patch if I added the current->mm check > > and resolve its necessity in a separate patch? > > Please correct task_will_free_mem() in oom_kill_process() as well. We cannot hold task_lock over all task_will_free_mem I am even not sure we have to develop an elaborate way to make it raceless just for the nommu case. The current case is simple as we cannot race here. Is that sufficient for you? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 07/10] mm, oom: fortify task_will_free_mem 2016-06-16 15:53 ` Michal Hocko @ 2016-06-17 11:38 ` Tetsuo Handa 2016-06-17 12:26 ` Michal Hocko 0 siblings, 1 reply; 28+ messages in thread From: Tetsuo Handa @ 2016-06-17 11:38 UTC (permalink / raw) To: mhocko; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel Michal Hocko wrote: > > > Anyway, would you be OK with the patch if I added the current->mm check > > > and resolve its necessity in a separate patch? > > > > Please correct task_will_free_mem() in oom_kill_process() as well. > > We cannot hold task_lock over all task_will_free_mem I am even not sure > we have to develop an elaborate way to make it raceless just for the nommu > case. The current case is simple as we cannot race here. Is that > sufficient for you? We can use find_lock_task_mm() inside mark_oom_victim(). That is, call wake_oom_reaper() from mark_oom_victim() like void mark_oom_victim(struct task_struct *tsk, bool can_use_oom_reaper) { WARN_ON(oom_killer_disabled); /* OOM killer might race with memcg OOM */ tsk = find_lock_task_mm(tsk); if (!tsk) return; if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE)) { task_unlock(tsk); return; } task_unlock(tsk); atomic_inc(&tsk->signal->oom_victims); /* * Make sure that the task is woken up from uninterruptible sleep * if it is frozen because OOM killer wouldn't be able to free * any memory and livelock. freezing_slow_path will tell the freezer * that TIF_MEMDIE tasks should be ignored. */ __thaw_task(tsk); atomic_inc(&oom_victims); if (can_use_oom_reaper) wake_oom_reaper(tsk); } and move mark_oom_victim() by normal path to after task_unlock(victim). do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true); - mark_oom_victim(victim); - if (can_oom_reap) - wake_oom_reaper(victim); + wake_oom_reaper(victim, can_oom_reap); If you don't like possibility of showing different pid for pr_err("Killed process %d (%s) and pr_info("oom_reaper: reaped process %d (%s) messages, you can defer the former till mark_oom_victim() locks that task. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 07/10] mm, oom: fortify task_will_free_mem 2016-06-17 11:38 ` Tetsuo Handa @ 2016-06-17 12:26 ` Michal Hocko 2016-06-17 13:12 ` Tetsuo Handa 0 siblings, 1 reply; 28+ messages in thread From: Michal Hocko @ 2016-06-17 12:26 UTC (permalink / raw) To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel On Fri 17-06-16 20:38:01, Tetsuo Handa wrote: > Michal Hocko wrote: > > > > Anyway, would you be OK with the patch if I added the current->mm check > > > > and resolve its necessity in a separate patch? > > > > > > Please correct task_will_free_mem() in oom_kill_process() as well. > > > > We cannot hold task_lock over all task_will_free_mem I am even not sure > > we have to develop an elaborate way to make it raceless just for the nommu > > case. The current case is simple as we cannot race here. Is that > > sufficient for you? > > We can use find_lock_task_mm() inside mark_oom_victim(). > That is, call wake_oom_reaper() from mark_oom_victim() like > > void mark_oom_victim(struct task_struct *tsk, bool can_use_oom_reaper) > { > WARN_ON(oom_killer_disabled); > /* OOM killer might race with memcg OOM */ > tsk = find_lock_task_mm(tsk); > if (!tsk) > return; > if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE)) { > task_unlock(tsk); > return; > } > task_unlock(tsk); > atomic_inc(&tsk->signal->oom_victims); > /* > * Make sure that the task is woken up from uninterruptible sleep > * if it is frozen because OOM killer wouldn't be able to free > * any memory and livelock. freezing_slow_path will tell the freezer > * that TIF_MEMDIE tasks should be ignored. > */ > __thaw_task(tsk); > atomic_inc(&oom_victims); > if (can_use_oom_reaper) > wake_oom_reaper(tsk); > } > > and move mark_oom_victim() by normal path to after task_unlock(victim). > > do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true); > - mark_oom_victim(victim); > > - if (can_oom_reap) > - wake_oom_reaper(victim); > + wake_oom_reaper(victim, can_oom_reap); I do not like this because then we would have to check the reapability from inside the oom_reaper again. But let me ask again. Does this really matter so much just because of nommu where we can fall in different traps? Can we simply focus on mmu (aka vast majority of cases) make it work reliably and see what we can do with nommu later? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 07/10] mm, oom: fortify task_will_free_mem 2016-06-17 12:26 ` Michal Hocko @ 2016-06-17 13:12 ` Tetsuo Handa 2016-06-17 13:29 ` Michal Hocko 0 siblings, 1 reply; 28+ messages in thread From: Tetsuo Handa @ 2016-06-17 13:12 UTC (permalink / raw) To: mhocko; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel Michal Hocko wrote: > On Fri 17-06-16 20:38:01, Tetsuo Handa wrote: > > Michal Hocko wrote: > > > > > Anyway, would you be OK with the patch if I added the current->mm check > > > > > and resolve its necessity in a separate patch? > > > > > > > > Please correct task_will_free_mem() in oom_kill_process() as well. > > > > > > We cannot hold task_lock over all task_will_free_mem I am even not sure > > > we have to develop an elaborate way to make it raceless just for the nommu > > > case. The current case is simple as we cannot race here. Is that > > > sufficient for you? > > > > We can use find_lock_task_mm() inside mark_oom_victim(). > > That is, call wake_oom_reaper() from mark_oom_victim() like > > > > void mark_oom_victim(struct task_struct *tsk, bool can_use_oom_reaper) > > { > > WARN_ON(oom_killer_disabled); > > /* OOM killer might race with memcg OOM */ > > tsk = find_lock_task_mm(tsk); > > if (!tsk) > > return; > > if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE)) { > > task_unlock(tsk); > > return; > > } > > task_unlock(tsk); > > atomic_inc(&tsk->signal->oom_victims); > > /* > > * Make sure that the task is woken up from uninterruptible sleep > > * if it is frozen because OOM killer wouldn't be able to free > > * any memory and livelock. freezing_slow_path will tell the freezer > > * that TIF_MEMDIE tasks should be ignored. > > */ > > __thaw_task(tsk); > > atomic_inc(&oom_victims); > > if (can_use_oom_reaper) > > wake_oom_reaper(tsk); > > } > > > > and move mark_oom_victim() by normal path to after task_unlock(victim). > > > > do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true); > > - mark_oom_victim(victim); > > > > - if (can_oom_reap) > > - wake_oom_reaper(victim); > > + wake_oom_reaper(victim, can_oom_reap); > > I do not like this because then we would have to check the reapability > from inside the oom_reaper again. I didn't understand why you think so. But strictly speaking, the can_oom_reap calculation in oom_kill_process() is always racy, and [PATCH 10/10] is not safe. Consider CPU0 (memory allocating task), CPU1 (kthread) and CPU2 (OOM victim), with events in chronological order:

  CPU1: Calls use_mm(victim->mm). Starts some worker.
  CPU0: Enters out_of_memory(). Enters oom_kill_process().
  CPU1: Finishes some worker.
  CPU0: Calls rcu_read_lock(). Sets can_oom_reap = false due to process_shares_mm() && !same_thread_group() && (p->flags & PF_KTHREAD).
  CPU1: Calls unuse_mm(victim->mm).
  CPU0: Continues scanning other processes.
  CPU1: Calls mmput(victim->mm).
  CPU0: Sends SIGKILL to victim. Calls rcu_read_unlock(). Leaves oom_kill_process().
  CPU2: Calls do_exit().
  CPU0: Leaves out_of_memory().
  CPU2: Sets victim->mm = NULL from exit_mm(). Calls mmput() from exit_mm(). __mmput() is called because victim was the last user.
  CPU0: Enters out_of_memory(). oom_scan_process_thread() returns OOM_SCAN_ABORT. Leaves out_of_memory().
  CPU2: __mmput() stalls but the oom_reaper is not called.

For correctness, can_oom_reap needs to be calculated inside the oom_reaper. > > But let me ask again. Does this really matter so much just because of > nommu where we can fall in different traps? Can we simply focus on mmu > (aka vast majority of cases) make it work reliably and see what we can > do with nommu later?
To me, timeout based one is sufficient for handling any traps that hit nommu kernels after the OOM killer is invoked. Anyway, I don't like this series because this series ignores theoretical cases. I can't make progress as long as you repeat "does it really matter/occur". Please go ahead without Reviewed-by: or Acked-by: from me. ^ permalink raw reply [flat|nested] 28+ messages in thread
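A hedged sketch of what "calculate can_oom_reap inside the oom_reaper" could mean in practice; this illustrates Tetsuo's suggestion and is not code from the series, and the helper name mm_is_reapable is assumed:

	/* re-evaluate reapability at reap time rather than at kill time */
	static bool mm_is_reapable(struct mm_struct *mm)
	{
		struct task_struct *p;

		rcu_read_lock();
		for_each_process(p) {
			if (!process_shares_mm(p, mm))
				continue;
			if (p->flags & PF_KTHREAD || is_global_init(p)) {
				/* a kthread or init still uses the mm */
				rcu_read_unlock();
				return false;
			}
		}
		rcu_read_unlock();
		return true;
	}

Checking at reap time narrows the window in the timeline above in the common case, because the kthread's unuse_mm()/mmput() has typically happened by then - but, as Michal replies next, without real serialization against use_mm() users or the exit path it is not fundamentally less racy.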
* Re: [PATCH 07/10] mm, oom: fortify task_will_free_mem 2016-06-17 13:12 ` Tetsuo Handa @ 2016-06-17 13:29 ` Michal Hocko 0 siblings, 0 replies; 28+ messages in thread From: Michal Hocko @ 2016-06-17 13:29 UTC (permalink / raw) To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel On Fri 17-06-16 22:12:22, Tetsuo Handa wrote: > Michal Hocko wrote: > > On Fri 17-06-16 20:38:01, Tetsuo Handa wrote: > > > Michal Hocko wrote: > > > > > > Anyway, would you be OK with the patch if I added the current->mm check > > > > > > and resolve its necessity in a separate patch? > > > > > > > > > > Please correct task_will_free_mem() in oom_kill_process() as well. > > > > > > > > We cannot hold task_lock over all task_will_free_mem I am even not sure > > > > we have to develop an elaborate way to make it raceless just for the nommu > > > > case. The current case is simple as we cannot race here. Is that > > > > sufficient for you? > > > > > > We can use find_lock_task_mm() inside mark_oom_victim(). > > > That is, call wake_oom_reaper() from mark_oom_victim() like > > > > > > void mark_oom_victim(struct task_struct *tsk, bool can_use_oom_reaper) > > > { > > > WARN_ON(oom_killer_disabled); > > > /* OOM killer might race with memcg OOM */ > > > tsk = find_lock_task_mm(tsk); > > > if (!tsk) > > > return; > > > if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE)) { > > > task_unlock(tsk); > > > return; > > > } > > > task_unlock(tsk); > > > atomic_inc(&tsk->signal->oom_victims); > > > /* > > > * Make sure that the task is woken up from uninterruptible sleep > > > * if it is frozen because OOM killer wouldn't be able to free > > > * any memory and livelock. freezing_slow_path will tell the freezer > > > * that TIF_MEMDIE tasks should be ignored. > > > */ > > > __thaw_task(tsk); > > > atomic_inc(&oom_victims); > > > if (can_use_oom_reaper) > > > wake_oom_reaper(tsk); > > > } > > > > > > and move mark_oom_victim() by normal path to after task_unlock(victim). > > > > > > do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true); > > > - mark_oom_victim(victim); > > > > > > - if (can_oom_reap) > > > - wake_oom_reaper(victim); > > > + wake_oom_reaper(victim, can_oom_reap); > > > > I do not like this because then we would have to check the reapability > > from inside the oom_reaper again. > > I didn't understand why you think so. But strictly speaking, the can_oom_reap calculation > in oom_kill_process() is always racy, and [PATCH 10/10] is not safe. Consider CPU0 (memory > allocating task), CPU1 (kthread) and CPU2 (OOM victim), with events in chronological order:
>
>   CPU1: Calls use_mm(victim->mm). Starts some worker.
>   CPU0: Enters out_of_memory(). Enters oom_kill_process().
>   CPU1: Finishes some worker.
>   CPU0: Calls rcu_read_lock(). Sets can_oom_reap = false due to process_shares_mm() && !same_thread_group() && (p->flags & PF_KTHREAD).
>   CPU1: Calls unuse_mm(victim->mm).
>   CPU0: Continues scanning other processes.
>   CPU1: Calls mmput(victim->mm).
>   CPU0: Sends SIGKILL to victim. Calls rcu_read_unlock(). Leaves oom_kill_process().
>   CPU2: Calls do_exit().
>   CPU0: Leaves out_of_memory().
>   CPU2: Sets victim->mm = NULL from exit_mm(). Calls mmput() from exit_mm(). __mmput() is called because victim was the last user.
>   CPU0: Enters out_of_memory(). oom_scan_process_thread() returns OOM_SCAN_ABORT. Leaves out_of_memory().
>   CPU2: __mmput() stalls but the oom_reaper is not called.
>
> For correctness, can_oom_reap needs to be calculated inside the oom_reaper. Why would it be any less racy than the above? It doesn't employ any serialization with use_mm users, nor does it serialize with the exit path.
The timing would be different, but not in a way that affects correctness. > > But let me ask again. Does this really matter so much just because of > > nommu where we can fall in different traps? Can we simply focus on mmu > > (aka vast majority of cases) make it work reliably and see what we can > > do with nommu later? > > To me, timeout based one is sufficient for handling any traps that hit > nommu kernels after the OOM killer is invoked. > > Anyway, I don't like this series because this series ignores > theoretical cases. I am pretty sure you would end up in the land of surprises and new classes of races with any timeout-based solution as well. But we've been through that discussion already. > I can't make progress as long as you repeat "does it really matter/occur". > Please go ahead without Reviewed-by: or Acked-by: from me. Fair enough. I appreciate your review, which has caught many real bugs and subtle issues! I will repost the series on Monday. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 0/10 -v3] Handle oom bypass more gracefully @ 2016-06-03 9:16 Michal Hocko 2016-06-03 9:16 ` [PATCH 07/10] mm, oom: fortify task_will_free_mem Michal Hocko 0 siblings, 1 reply; 28+ messages in thread From: Michal Hocko @ 2016-06-03 9:16 UTC (permalink / raw) To: linux-mm Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov, Andrew Morton, LKML Hi, this is the third version of the patchset. The previous version was posted at http://lkml.kernel.org/r/1464613556-16708-1-git-send-email-mhocko@kernel.org I have folded in all the fixes pointed out by Oleg (thanks). I hope I haven't missed anything. The following 10 patches should put some order to very rare cases of mm shared between processes and make the paths which bypass the oom killer oom reapable and thus, finally, much more reliable. Even though an mm shared outside of a thread group is rare (either vforked tasks for a short period, use_mm by kernel threads or the exotic thread model of clone(CLONE_VM) without CLONE_THREAD resp. CLONE_SIGHAND), it is better to cover those cases. Not only does the current handling make the oom killer logic quite hard to follow and evaluate, it can lead to weird corner cases. E.g. it is possible to select an oom victim which shares the mm with an unkillable process, or to bypass the oom killer even when other processes sharing the mm are still alive, and other weird cases. Patch 1 drops the bogus task_lock and mm check from oom_{score_}adj_write. This can be considered a bug fix with a low impact as nobody has noticed it for years. Patch 2 drops the sighand lock because it is not needed anymore, as pointed out by Oleg. Patch 3 is a cleanup of the oom_score_adj handling and preparatory work for later patches. Patch 4 enforces oom_score_adj to be consistent between processes sharing the mm so that they behave consistently with regular thread groups. This can be considered a user visible behavior change because one thread group updating oom_score_adj will affect others which share the same mm via clone(CLONE_VM). I argue that this should be acceptable because we already have the same behavior for threads in the same thread group and sharing the mm without the signal struct is just a different model of threading. This is probably the most controversial part of the series; I would like to find some consensus here, though. There were some suggestions to hook some counter/oom_score_adj into the mm_struct, but I feel that this is not necessary right now and we can rely on the proc handler + oom_kill_process to DTRT. I can be convinced otherwise, but I strongly think that whatever we do, the userspace has to have a way to see the current oom priority as consistently as possible. Patch 5 makes sure that no vforked task is selected if it is sharing the mm with an oom-unkillable task. Patch 6 ensures that all user tasks sharing the mm are killed, which in turn makes sure that all oom victims are oom reapable. Patch 7 guarantees that task_will_free_mem will always imply a reapable bypass of the oom killer. Patch 8 is new in this version and addresses an issue pointed out by a 0-day OOM report where an oom victim was reaped several times. Assuming there are no other bugs in those patches and no fundamental opposition to this direction, I think we should go ahead and merge them into the mmotm tree and target the 4.8 merge window. Finally, the last 2 patches are sent as an RFC because I am still not sure this direction is the correct one. Patch 9 puts an upper bound on how many times oom_reaper tries to reap a task and hides it from the oom killer to move on when no progress can be made.
Patch 10 tries to plug the (hopefully) last hole, where we can still lock up when the oom victim is shared with oom-unkillable tasks (kthreads and global init). We just try to be best effort in that case and rather fall back to killing something else than risk a lockup. The patchset is based on the current mmotm tree (mmotm-2016-05-27-15-19). I would really appreciate a deep review as this area is full of land mines, but I hope I've made the code much cleaner with fewer kludges. I have pushed the patchset to my git tree git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git to branch attempts/process-share-mm-oom-sanitization Michal Hocko (10): proc, oom: drop bogus task_lock and mm check proc, oom: drop bogus sighand lock proc, oom_adj: extract oom_score_adj setting into a helper mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj mm, oom: skip vforked tasks from being selected mm, oom: kill all tasks sharing the mm mm, oom: fortify task_will_free_mem mm, oom: task_will_free_mem should skip oom_reaped tasks mm, oom_reaper: do not attempt to reap a task more than twice mm, oom: hide mm which is shared with kthread or global init fs/proc/base.c | 185 ++++++++++++++++++++++--------------------- include/linux/mm.h | 2 + include/linux/oom.h | 26 +----- include/linux/sched.h | 27 +++++++ mm/memcontrol.c | 4 +- mm/oom_kill.c | 214 ++++++++++++++++++++++++++++++++++---------------- 6 files changed, 278 insertions(+), 180 deletions(-) ^ permalink raw reply [flat|nested] 28+ messages in thread
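The clone(CLONE_VM)-without-CLONE_SIGHAND sharing model that patch 4 targets can be demonstrated from userspace. A minimal, self-contained illustration (an assumed example, not part of the series); after the series, writing /proc/<pid>/oom_score_adj of either task updates both, just as it already does for ordinary threads:

	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/types.h>
	#include <unistd.h>

	static int child_fn(void *arg)
	{
		pause();	/* separate thread group, same mm as the parent */
		return 0;
	}

	int main(void)
	{
		const long stack_sz = 1024 * 1024;
		char *stack = malloc(stack_sz);
		pid_t pid;

		if (!stack)
			return 1;
		/* CLONE_VM alone: shares the address space, not the signal struct */
		pid = clone(child_fn, stack + stack_sz, CLONE_VM | SIGCHLD, NULL);
		if (pid < 0)
			return 1;
		printf("parent %d and child %d now share one mm\n", getpid(), pid);
		sleep(1);
		return 0;
	}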
* [PATCH 07/10] mm, oom: fortify task_will_free_mem 2016-06-03 9:16 [PATCH 0/10 -v3] Handle oom bypass more gracefully Michal Hocko @ 2016-06-03 9:16 ` Michal Hocko 0 siblings, 0 replies; 28+ messages in thread From: Michal Hocko @ 2016-06-03 9:16 UTC (permalink / raw) To: linux-mm Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov, Andrew Morton, LKML, Michal Hocko From: Michal Hocko <mhocko@suse.com> task_will_free_mem is rather weak. It doesn't really tell whether the task has chance to drop its mm. 98748bd72200 ("oom: consider multi-threaded tasks in task_will_free_mem") made a first step into making it more robust for multi-threaded applications so now we know that the whole process is going down and probably drop the mm. This patch builds on top for more complex scenarios where mm is shared between different processes - CLONE_VM without CLONE_THREAD resp CLONE_SIGHAND, or in kernel use_mm(). Make sure that all processes sharing the mm are killed or exiting. This will allow us to replace try_oom_reaper by wake_oom_reaper. Therefore all paths which bypass the oom killer are now reapable and so they shouldn't lock up the oom killer. Changes since v2 - per Oleg - uninline task_will_free_mem and move it to oom proper - reorganize checks in and simplify __task_will_free_mem - add missing process_shares_mm in task_will_free_mem - add same_thread_group to task_will_free_mem for clarity Signed-off-by: Michal Hocko <mhocko@suse.com> --- include/linux/oom.h | 26 ++-------- mm/memcontrol.c | 4 +- mm/oom_kill.c | 136 ++++++++++++++++++++++++++++++---------------------- 3 files changed, 85 insertions(+), 81 deletions(-) diff --git a/include/linux/oom.h b/include/linux/oom.h index cbc24a5fe28d..87d911c604b9 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -73,9 +73,9 @@ static inline bool oom_task_origin(const struct task_struct *p) extern void mark_oom_victim(struct task_struct *tsk); #ifdef CONFIG_MMU -extern void try_oom_reaper(struct task_struct *tsk); +extern void wake_oom_reaper(struct task_struct *tsk); #else -static inline void try_oom_reaper(struct task_struct *tsk) +static inline void wake_oom_reaper(struct task_struct *tsk) { } #endif @@ -107,27 +107,7 @@ extern void oom_killer_enable(void); extern struct task_struct *find_lock_task_mm(struct task_struct *p); -static inline bool task_will_free_mem(struct task_struct *task) -{ - struct signal_struct *sig = task->signal; - - /* - * A coredumping process may sleep for an extended period in exit_mm(), - * so the oom killer cannot assume that the process will promptly exit - * and release memory. - */ - if (sig->flags & SIGNAL_GROUP_COREDUMP) - return false; - - if (!(task->flags & PF_EXITING)) - return false; - - /* Make sure that the whole thread group is going down */ - if (!thread_group_empty(task) && !(sig->flags & SIGNAL_GROUP_EXIT)) - return false; - - return true; -} +bool task_will_free_mem(struct task_struct *task); /* sysctls */ extern int sysctl_oom_dump_tasks; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index eeb3b14de01a..0ae1abe6cd39 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1276,9 +1276,9 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, * select it. The goal is to allow it to allocate so that it may * quickly exit and free its memory. 
*/ - if (fatal_signal_pending(current) || task_will_free_mem(current)) { + if (task_will_free_mem(current)) { mark_oom_victim(current); - try_oom_reaper(current); + wake_oom_reaper(current); goto unlock; } diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 22affacaf38b..64dbffa708fd 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -591,7 +591,7 @@ static int oom_reaper(void *unused) return 0; } -static void wake_oom_reaper(struct task_struct *tsk) +void wake_oom_reaper(struct task_struct *tsk) { if (!oom_reaper_th) return; @@ -609,46 +609,6 @@ static void wake_oom_reaper(struct task_struct *tsk) wake_up(&oom_reaper_wait); } -/* Check if we can reap the given task. This has to be called with stable - * tsk->mm - */ -void try_oom_reaper(struct task_struct *tsk) -{ - struct mm_struct *mm = tsk->mm; - struct task_struct *p; - - if (!mm) - return; - - /* - * There might be other threads/processes which are either not - * dying or even not killable. - */ - if (atomic_read(&mm->mm_users) > 1) { - rcu_read_lock(); - for_each_process(p) { - if (!process_shares_mm(p, mm)) - continue; - if (fatal_signal_pending(p)) - continue; - - /* - * If the task is exiting make sure the whole thread group - * is exiting and cannot acces mm anymore. - */ - if (signal_group_exit(p->signal)) - continue; - - /* Give up */ - rcu_read_unlock(); - return; - } - rcu_read_unlock(); - } - - wake_oom_reaper(tsk); -} - static int __init oom_init(void) { oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper"); @@ -660,10 +620,6 @@ static int __init oom_init(void) return 0; } subsys_initcall(oom_init) -#else -static void wake_oom_reaper(struct task_struct *tsk) -{ -} #endif /** @@ -740,6 +696,81 @@ void oom_killer_enable(void) oom_killer_disabled = false; } +static inline bool __task_will_free_mem(struct task_struct *task) +{ + struct signal_struct *sig = task->signal; + + /* + * A coredumping process may sleep for an extended period in exit_mm(), + * so the oom killer cannot assume that the process will promptly exit + * and release memory. + */ + if (sig->flags & SIGNAL_GROUP_COREDUMP) + return false; + + if (sig->flags & SIGNAL_GROUP_EXIT) + return true; + + if (thread_group_empty(task) && (task->flags & PF_EXITING)) + return true; + + return false; +} + +/* + * Checks whether the given task is dying or exiting and likely to + * release its address space. This means that all threads and processes + * sharing the same mm have to be killed or exiting.
+ */ +bool task_will_free_mem(struct task_struct *task) +{ + struct mm_struct *mm; + struct task_struct *p; + bool ret = true; + + if (!__task_will_free_mem(task)) + return false; + + /* + * If the process has passed exit_mm we have to skip it because + * we have lost a link to other tasks sharing this mm, we do not + * have anything to reap and the task might then get stuck waiting + * for parent as zombie and we do not want it to hold TIF_MEMDIE + */ + p = find_lock_task_mm(task); + if (!p) + return false; + + mm = p->mm; + if (atomic_read(&mm->mm_users) <= 1) { + task_unlock(p); + return true; + } + + /* pin the mm to not get freed and reused */ + atomic_inc(&mm->mm_count); + task_unlock(p); + + /* + * This is really pessimistic but we do not have any reliable way + * to check that external processes share with our mm + */ + rcu_read_lock(); + for_each_process(p) { + if (!process_shares_mm(p, mm)) + continue; + if (same_thread_group(task, p)) + continue; + ret = __task_will_free_mem(p); + if (!ret) + break; + } + rcu_read_unlock(); + mmdrop(mm); + + return ret; +} + /* * Must be called while holding a reference to p, which will be released upon * returning. @@ -761,15 +792,12 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p, * If the task is already exiting, don't alarm the sysadmin or kill * its children or threads, just set TIF_MEMDIE so it can die quickly */ - task_lock(p); - if (p->mm && task_will_free_mem(p)) { + if (task_will_free_mem(p)) { mark_oom_victim(p); - try_oom_reaper(p); - task_unlock(p); + wake_oom_reaper(p); put_task_struct(p); return; } - task_unlock(p); if (__ratelimit(&oom_rs)) dump_header(oc, p); @@ -940,14 +968,10 @@ bool out_of_memory(struct oom_control *oc) * If current has a pending SIGKILL or is exiting, then automatically * select it. The goal is to allow it to allocate so that it may * quickly exit and free its memory. - * - * But don't select if current has already released its mm and cleared - * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur. */ - if (current->mm && - (fatal_signal_pending(current) || task_will_free_mem(current))) { + if (task_will_free_mem(current)) { mark_oom_victim(current); - try_oom_reaper(current); + wake_oom_reaper(current); return true; } -- 2.8.1 ^ permalink raw reply related [flat|nested] 28+ messages in thread
Thread overview: 28+ messages -- 2016-06-20 12:43 [PATCH 0/10 -v5] Handle oom bypass more gracefully Michal Hocko 2016-06-20 12:43 ` [PATCH 01/10] proc, oom: drop bogus task_lock and mm check Michal Hocko 2016-06-20 12:43 ` [PATCH 02/10] proc, oom: drop bogus sighand lock Michal Hocko 2016-06-20 12:43 ` [PATCH 03/10] proc, oom_adj: extract oom_score_adj setting into a helper Michal Hocko 2016-06-20 12:43 ` [PATCH 04/10] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj Michal Hocko 2016-06-20 12:43 ` [PATCH 05/10] mm, oom: skip vforked tasks from being selected Michal Hocko 2016-06-20 12:43 ` [PATCH 06/10] mm, oom: kill all tasks sharing the mm Michal Hocko 2016-06-20 12:43 ` [PATCH 07/10] mm, oom: fortify task_will_free_mem Michal Hocko 2016-06-20 12:43 ` [PATCH 08/10] mm, oom: task_will_free_mem should skip oom_reaped tasks Michal Hocko 2016-06-20 12:43 ` [PATCH 09/10] mm, oom_reaper: do not attempt to reap a task more than twice Michal Hocko 2016-06-20 12:43 ` [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init Michal Hocko 2016-07-19 12:05 ` Michal Hocko 2016-07-19 23:27 ` Andrew Morton 2016-07-20 6:29 ` Michal Hocko -- strict thread matches above, loose matches on Subject: below -- 2016-06-09 11:52 [PATCH 0/10 -v4] Handle oom bypass more gracefully Michal Hocko 2016-06-09 11:52 ` [PATCH 07/10] mm, oom: fortify task_will_free_mem Michal Hocko 2016-06-09 13:18 ` Tetsuo Handa 2016-06-09 14:20 ` Michal Hocko 2016-06-11 8:10 ` Tetsuo Handa 2016-06-13 11:27 ` Michal Hocko 2016-06-16 12:54 ` Tetsuo Handa 2016-06-16 14:29 ` Michal Hocko 2016-06-16 15:40 ` Tetsuo Handa 2016-06-16 15:53 ` Michal Hocko 2016-06-17 11:38 ` Tetsuo Handa 2016-06-17 12:26 ` Michal Hocko 2016-06-17 13:12 ` Tetsuo Handa 2016-06-17 13:29 ` Michal Hocko 2016-06-03 9:16 [PATCH 0/10 -v3] Handle oom bypass more gracefully Michal Hocko 2016-06-03 9:16 ` [PATCH 07/10] mm, oom: fortify task_will_free_mem Michal Hocko