I got a question about commit e26796066fdf929c ("oom: make oom_reaper
freezable").

static int enter_state(suspend_state_t state)
{
	pr_debug("PM: Preparing system for sleep (%s)\n", pm_states[state]);
	error = suspend_prepare(state) {
		error = suspend_freeze_processes() {
			error = freeze_processes();
			if (error)
				return error;
			error = freeze_kernel_threads();
		}
	}
}

int freeze_processes(void)
{
	if (!pm_freezing)
		atomic_inc(&system_freezing_cnt); // system_freezing_cnt becomes non-0 due to pm_freezing == false.
	pr_info("Freezing user space processes ... ");
	pm_freezing = true;
	error = try_to_freeze_tasks(true) {
		for_each_process_thread(g, p) {
			if (p == current || !freeze_task(p)) // freeze_task(oom_reaper_th) is called.
				continue;
		}
	}
	if (!error && !oom_killer_disable())
		error = -EBUSY;
}

bool freeze_task(struct task_struct *p)
{
	if (!freezing(p) || frozen(p)) { // freezing(oom_reaper_th) is called.
		return false;
	}
	if (!(p->flags & PF_KTHREAD))
		fake_signal_wake_up(p);
	else
		wake_up_state(p, TASK_INTERRUPTIBLE);
	return true;
}

static inline bool freezing(struct task_struct *p)
{
	if (likely(!atomic_read(&system_freezing_cnt))) // system_freezing_cnt is non-0 due to freeze_processes().
		return false;
	return freezing_slow_path(p); // freezing_slow_path(oom_reaper_th) is called.
}

bool freezing_slow_path(struct task_struct *p)
{
	if (p->flags & (PF_NOFREEZE | PF_SUSPEND_TASK)) // false due to set_freezable() by oom_reaper().
		return false;
	if (test_thread_flag(TIF_MEMDIE)) // false due to unkillable thread.
		return false;
	if (pm_nosig_freezing || cgroup_freezing(p)) // pm_nosig_freezing == false for now. I assume cgroup_freezing(oom_reaper_th) is also false.
		return true;
	if (pm_freezing && !(p->flags & PF_KTHREAD)) // pm_freezing == true due to freeze_processes(), but oom_reaper_th has PF_KTHREAD.
		return true;
	return false; // After all, freezing(oom_reaper_th) returns false for now.
}

Therefore, freeze_task(oom_reaper_th) from freeze_processes() is a no-op.
int freeze_kernel_threads(void)
{
	pr_info("Freezing remaining freezable tasks ... ");
	pm_nosig_freezing = true;
	error = try_to_freeze_tasks(false) {
		for_each_process_thread(g, p) {
			if (p == current || !freeze_task(p)) // freeze_task(oom_reaper_th) is called again.
				continue;
		}
	}
}

bool freeze_task(struct task_struct *p)
{
	if (!freezing(p) || frozen(p)) { // freezing(oom_reaper_th) is called again.
		return false;
	}
	if (!(p->flags & PF_KTHREAD))
		fake_signal_wake_up(p);
	else
		wake_up_state(p, TASK_INTERRUPTIBLE);
	return true;
}

This time, freezing(oom_reaper_th) == true due to pm_nosig_freezing == true,
and frozen(oom_reaper_th) == false. Thus, wake_up_state() wakes up the
oom_reaper_th sleeping at wait_event_freezable(), and the oom_reaper_th
calls __refrigerator() via try_to_freeze().

Therefore, the OOM reaper kernel thread might clear TIF_MEMDIE after
oom_killer_disable() set oom_killer_disabled = true, can't it? This in turn
allows the sequence shown below, doesn't it?

(1) freeze_processes() starts freezing user space threads.
(2) Somebody (maybe a kernel thread) calls out_of_memory().
(3) The OOM killer calls mark_oom_victim() on a user space thread
    P1 which is already in __refrigerator().
(4) oom_killer_disable() sets oom_killer_disabled = true.
(5) P1 leaves __refrigerator() and enters do_exit().
(6) The OOM reaper calls exit_oom_victim(P1) before P1 can call
    exit_oom_victim(P1).
(7) oom_killer_disable() returns while P1 is not yet frozen
    again (i.e. not yet marked as PF_FROZEN).
(8) P1 performs IO / interferes with the freezer.

try_to_freeze_tasks(false) from freeze_kernel_threads() will freeze P1
again, but it seems to me that freeze_kernel_threads() is not always called
when freeze_processes() succeeded.

Therefore, we need to do something like

-	exit_oom_victim(tsk);
+	mutex_lock(&oom_lock);
+	if (!oom_killer_disabled)
+		exit_oom_victim(tsk);
+	mutex_unlock(&oom_lock);

in oom_reap_task(), don't we?
----------------------------------------
>From 331842e86a6778cfce6de13ae6c029bc5123a714 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 10 Jun 2016 19:46:26 +0900
Subject: [PATCH] mm, oom_reaper: close race with oom_killer_disable()

Commit e26796066fdf929c ("oom: make oom_reaper freezable") meant to close
the race with oom_killer_disable() by adding set_freezable() to
oom_reaper(). But freezable kernel threads are not frozen until
freeze_kernel_threads() is called, which occurs after oom_killer_disable()
from freeze_processes() is called. Therefore, oom_reaper() needs to check,
right before calling exit_oom_victim(), whether oom_killer_disable() is
already in progress.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
---
 mm/oom_kill.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index acbc432..29f42ad 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -559,9 +559,15 @@ static void oom_reap_task(struct task_struct *tsk)
 	 * reasonably reclaimable memory anymore or it is not a good candidate
 	 * for the oom victim right now because it cannot release its memory
 	 * itself nor by the oom reaper.
+	 *
+	 * But be sure to check race with oom_killer_disable() because
+	 * oom_reaper() is frozen after oom_killer_disable() returned.
 	 */
 	tsk->oom_reaper_list = NULL;
-	exit_oom_victim(tsk);
+	mutex_lock(&oom_lock);
+	if (!oom_killer_disabled)
+		exit_oom_victim(tsk);
+	mutex_unlock(&oom_lock);
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
-- 
1.8.3.1
----------------------------------------

But we might be able to do the below patch rather than the above patch.
If the below approach is OK, "[PATCH 10/10] mm, oom: hide mm which is shared
with kthread or global init" will be able to call exit_oom_victim() when
can_oom_reap became false.
----------------------------------------
>From e2b7a68dcb2ed6ace7038be0f6865efcc27b51bd Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 10 Jun 2016 22:59:11 +0900
Subject: [PATCH] mm, oom: stop thawing victim threads

Commit e26796066fdf929c ("oom: make oom_reaper freezable") meant to close
the race with oom_killer_disable() by adding set_freezable() to
oom_reaper(). But freezable kernel threads are not frozen until
freeze_kernel_threads() is called, which occurs after oom_killer_disable()
from freeze_processes() is called. Therefore, oom_reaper() needs to check,
right before calling exit_oom_victim(), whether oom_killer_disable() is
already in progress. But doing such a check introduces a branch where the
OOM reaper fails to clear TIF_MEMDIE. There is a different approach which
does not introduce such a branch.

Currently we thaw only one thread even if that thread's mm is shared by
multiple threads. While that thread might release some memory by e.g.
closing file descriptors after arriving at do_exit(), it won't help much
if such a resource is shared by multiple threads. Unless we thaw all
threads using that memory, we can't guarantee that that memory is released
before oom_killer_disable() returns. If the victim was selected by memcg
OOM, memory allocation requests after oom_killer_disabled became true will
likely succeed because the OOM situation is local. But if the victim was
selected by global OOM, memory allocation requests after
oom_killer_disabled became true are unlikely to succeed; we will see
allocation failures caused by oom_killer_disabled == true.
Then, rather than risking memory allocation failures caused by not waiting
until the victim's memory is released (the thawed victim might be able to
call exit_oom_victim(tsk) before the OOM reaper starts reaping the victim's
memory, because the mmput(mm) which is called between tsk->mm = NULL and
exit_oom_victim(tsk) in exit_mm() will not wait while there are other
still-frozen threads sharing that memory), waiting until the victim's
memory is reaped by the OOM reaper (by not thawing the victim thread) will
help reduce such risk.

This heuristic depends on the assumption that the amount of memory released
by reaping the mm is larger than that released by terminating one
TIF_MEMDIE thread. Since we assume that the mm_struct is the primary source
of memory usage, the OOM reaper considers that there is no reasonably
reclaimable memory left after it has reaped the victim's memory. So it
should be worth trying.

This patch stops thawing the TIF_MEMDIE thread because the OOM reaper will
reap memory on behalf of that thread, and defers out_of_memory() returning
false after oom_killer_disabled became true until the OOM reaper has
cleared all pending TIF_MEMDIE. Since the OOM killer does not thaw victim
threads, we won't break oom_killer_disable() semantics even if somebody
else clears TIF_MEMDIE.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
---
 mm/oom_kill.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index acbc432..11dfb16 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -678,6 +678,7 @@ void mark_oom_victim(struct task_struct *tsk)
 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
 		return;
 	atomic_inc(&tsk->signal->oom_victims);
+#ifndef CONFIG_MMU
 	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
 	 * if it is frozen because OOM killer wouldn't be able to free
 	 * that TIF_MEMDIE tasks should be ignored.
 	 */
 	__thaw_task(tsk);
+#endif
 	atomic_inc(&oom_victims);
 }
@@ -923,8 +925,24 @@ bool out_of_memory(struct oom_control *oc)
 	unsigned int uninitialized_var(points);
 	enum oom_constraint constraint = CONSTRAINT_NONE;
 
+#ifdef CONFIG_MMU
+	/*
+	 * If oom_killer_disable() is called, wait for the OOM reaper to reap
+	 * all victim's memory. Since oom_killer_disable() is called between
+	 * after all userspace threads except the one calling
+	 * oom_killer_disable() are frozen and before freezable kernel threads
+	 * are frozen, the OOM reaper is known to be not yet frozen. Also,
+	 * since anyone who calls out_of_memory() while oom_killer_disable() is
+	 * waiting for all victims to terminate is known to be a kernel thread,
+	 * it is better to retry allocation as long as the OOM reaper can find
+	 * reapable memory.
+	 */
+	if (oom_killer_disabled)
+		return atomic_read(&oom_victims);
+#else
 	if (oom_killer_disabled)
 		return false;
+#endif
 
 	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
 	if (freed > 0)
-- 
1.8.3.1
----------------------------------------
On Fri 10-06-16 23:23:36, Tetsuo Handa wrote:
[...]
> (1) freeze_processes() starts freezing user space threads.
> (2) Somebody (maybe a kernel thread) calls out_of_memory().
> (3) The OOM killer calls mark_oom_victim() on a user space thread
>     P1 which is already in __refrigerator().
> (4) oom_killer_disable() sets oom_killer_disabled = true.
> (5) P1 leaves __refrigerator() and enters do_exit().
> (6) The OOM reaper calls exit_oom_victim(P1) before P1 can call
>     exit_oom_victim(P1).
> (7) oom_killer_disable() returns while P1 is not yet frozen
>     again (i.e. not yet marked as PF_FROZEN).
> (8) P1 performs IO / interferes with the freezer.

You are right. I missed that kernel threads are still alive when writing
e26796066fdf929c ("oom: make oom_reaper freezable").

I am trying to remember why we are disabling the oom killer before kernel
threads are frozen but I am not really sure about that right away. I guess
it has something to do with freeze_kernel_threads being called from
different contexts as well, so freeze_processes was just more convenient
and was OK for correctness at the time.

> try_to_freeze_tasks(false) from freeze_kernel_threads() will freeze
> P1 again, but it seems to me that freeze_kernel_threads() is not
> always called when freeze_processes() succeeded.
>
> Therefore, we need to do something like
>
> -	exit_oom_victim(tsk);
> +	mutex_lock(&oom_lock);
> +	if (!oom_killer_disabled)
> +		exit_oom_victim(tsk);
> +	mutex_unlock(&oom_lock);
>
> in oom_reap_task(), don't we?

I do not like this very much. I would rather make sure that all freezable
kernel threads are frozen when disabling the oom killer.

[...]
> But we might be able to do the below patch rather than the above patch.
> If the below approach is OK, "[PATCH 10/10] mm, oom: hide mm which is
> shared with kthread or global init" will be able to call exit_oom_victim()
> when can_oom_reap became false.

I believe this is not really needed. I will follow up on the 10/10 later.
-- 
Michal Hocko
SUSE Labs
On Mon 13-06-16 13:19:43, Michal Hocko wrote:
[...]
> I am trying to remember why we are disabling the oom killer before kernel
> threads are frozen but I am not really sure about that right away.

OK, I guess I remember now. Say that a task would depend on a freezable
kernel thread to get to do_exit (stuck in wait_event etc...). We would
simply get stuck in oom_killer_disable for ever. So we need to address it
a different way.

One way would be what you are proposing, but I guess it would be more
systematic to never call exit_oom_victim on a remote task. After [1] we
have a solid foundation to rely only on MMF_REAPED even when TIF_MEMDIE
is set. It is more code than your patch so I can see a reason to go with
yours if the following one seems too large or ugly.

[1] http://lkml.kernel.org/r/1466426628-15074-1-git-send-email-mhocko@kernel.org

What do you think about the following?
---
>From f7ffecf944648f16d98fc70275041433ed68f7e4 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Mon, 20 Jun 2016 17:28:18 +0200
Subject: [PATCH] oom, suspend: fix oom_reaper vs. oom_killer_disable race

Tetsuo has reported the following potential oom_killer_disable vs.
oom_reaper race:

(1) freeze_processes() starts freezing user space threads.
(2) Somebody (maybe a kernel thread) calls out_of_memory().
(3) The OOM killer calls mark_oom_victim() on a user space thread
    P1 which is already in __refrigerator().
(4) oom_killer_disable() sets oom_killer_disabled = true.
(5) P1 leaves __refrigerator() and enters do_exit().
(6) The OOM reaper calls exit_oom_victim(P1) before P1 can call
    exit_oom_victim(P1).
(7) oom_killer_disable() returns while P1 has not yet finished.
(8) P1 performs IO / interferes with the freezer.

This patch effectively reverts 449d777d7ad6 ("mm, oom_reaper: clear
TIF_MEMDIE for all tasks queued for oom_reaper"). We have a more
efficient way to achieve the same.
We can rely on MMF_REAPED because "mm, oom: hide mm which is shared with
kthread or global init" made sure that we skip tasks with MMF_REAPED even
when they have TIF_MEMDIE. The only missing part is that we have to handle
__oom_reap_task failure gracefully. Currently we mark the mm with
MMF_OOM_NOT_REAPABLE to later set MMF_REAPED should we fail
__oom_reap_task again. This relies on clearing TIF_MEMDIE to select that
task again. We can achieve a similar thing by requeuing the task to the
oom_reaper_list. Introduce oom_reaper_add_tail to add the task to the tail
of the oom_reaper_list. While we are at it, reuse it for wake_oom_reaper
as well because the queueing should be FIFO.

As a result, only the oom victim will be responsible for clearing its
TIF_MEMDIE flag, so the above race is not possible anymore. oom_reaper
will be allowed to play only with MMF_* flags, so the responsibility model
will be clearer.

Fixes: 449d777d7ad6 ("mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper")
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 86 ++++++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 55 insertions(+), 31 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4c21f744daa6..dbc62a69fee8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -558,6 +558,51 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	return ret;
 }
 
+static void oom_reaper_add_tail(struct task_struct *tsk)
+{
+	struct task_struct *q;
+
+	spin_lock(&oom_reaper_lock);
+	if (!oom_reaper_list) {
+		oom_reaper_list = tsk;
+		goto unlock;
+	}
+
+	for (q = oom_reaper_list; q->oom_reaper_list; q = q->oom_reaper_list)
+		;
+	q->oom_reaper_list = tsk;
+
+unlock:
+	spin_unlock(&oom_reaper_lock);
+}
+
+static inline bool requeue_oom_victim(struct task_struct *tsk)
+{
+	struct task_struct *p;
+	bool ret = false;
+
+	/*
+	 * If we've already tried to reap this task in the past and
+	 * failed it probably doesn't make much sense to try yet again
+	 * so hide the mm from the oom killer so that it can move on
+	 * to another task with a different mm struct.
+	 */
+	p = find_lock_task_mm(tsk);
+	if (!p)
+		return false;
+
+	if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags))
+		set_bit(MMF_OOM_REAPED, &p->mm->flags);
+	else
+		ret = true;
+
+	task_unlock(p);
+
+	if (ret)
+		oom_reaper_add_tail(tsk);
+	return ret;
+}
+
 #define MAX_OOM_REAP_RETRIES 10
 static void oom_reap_task(struct task_struct *tsk)
 {
@@ -567,40 +612,23 @@ static void oom_reap_task(struct task_struct *tsk)
 	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
 		schedule_timeout_idle(HZ/10);
 
-	if (attempts > MAX_OOM_REAP_RETRIES) {
-		struct task_struct *p;
+	tsk->oom_reaper_list = NULL;
 
+	if (attempts > MAX_OOM_REAP_RETRIES) {
 		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
 				task_pid_nr(tsk), tsk->comm);
 
-		/*
-		 * If we've already tried to reap this task in the past and
-		 * failed it probably doesn't make much sense to try yet again
-		 * so hide the mm from the oom killer so that it can move on
-		 * to another task with a different mm struct.
-		 */
-		p = find_lock_task_mm(tsk);
-		if (p) {
-			if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) {
-				pr_info("oom_reaper: giving up pid:%d (%s)\n",
-						task_pid_nr(tsk), tsk->comm);
-				set_bit(MMF_OOM_REAPED, &p->mm->flags);
-			}
-			task_unlock(p);
+		debug_show_all_locks();
+
+		if (requeue_oom_victim(tsk)) {
+			cond_resched();
+			return;
 		}
 
-		debug_show_all_locks();
+		pr_info("oom_reaper: giving up pid:%d (%s)\n",
+				task_pid_nr(tsk), tsk->comm);
 	}
 
-	/*
-	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
-	 * reasonably reclaimable memory anymore or it is not a good candidate
-	 * for the oom victim right now because it cannot release its memory
-	 * itself nor by the oom reaper.
-	 */
-	tsk->oom_reaper_list = NULL;
-	exit_oom_victim(tsk);
-
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
 }
@@ -637,11 +665,7 @@ void wake_oom_reaper(struct task_struct *tsk)
 		return;
 
 	get_task_struct(tsk);
-
-	spin_lock(&oom_reaper_lock);
-	tsk->oom_reaper_list = oom_reaper_list;
-	oom_reaper_list = tsk;
-	spin_unlock(&oom_reaper_lock);
+	oom_reaper_add_tail(tsk);
 
 	wake_up(&oom_reaper_wait);
 }
-- 
2.8.1
-- 
Michal Hocko
SUSE Labs
Michal Hocko wrote:
> On Mon 13-06-16 13:19:43, Michal Hocko wrote:
> [...]
> > I am trying to remember why we are disabling the oom killer before kernel
> > threads are frozen but I am not really sure about that right away.
>
> OK, I guess I remember now. Say that a task would depend on a freezable
> kernel thread to get to do_exit (stuck in wait_event etc...). We would
> simply get stuck in oom_killer_disable for ever. So we need to address
> it a different way.
>
> One way would be what you are proposing, but I guess it would be more
> systematic to never call exit_oom_victim on a remote task. After [1] we
> have a solid foundation to rely only on MMF_REAPED even when TIF_MEMDIE
> is set. It is more code than your patch so I can see a reason to go with
> yours if the following one seems too large or ugly.
>
> [1] http://lkml.kernel.org/r/1466426628-15074-1-git-send-email-mhocko@kernel.org
>
> What do you think about the following?

I'm OK with not clearing TIF_MEMDIE from a remote task. But this patch is
racy.

> @@ -567,40 +612,23 @@ static void oom_reap_task(struct task_struct *tsk)
> 	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
> 		schedule_timeout_idle(HZ/10);
>
> -	if (attempts > MAX_OOM_REAP_RETRIES) {
> -		struct task_struct *p;
> +	tsk->oom_reaper_list = NULL;
>
> +	if (attempts > MAX_OOM_REAP_RETRIES) {

attempts > MAX_OOM_REAP_RETRIES means that down_read_trylock()
continuously failed. But it does not guarantee that the offending task
will not call up_write(&mm->mmap_sem) and arrive at mmput() from exit_mm()
(and likewise that other threads which are blocked at
down_read(&mm->mmap_sem) in exit_mm() by the offending task will not
arrive at mmput() from exit_mm()) while the OOM reaper is preempted at
this point.
Therefore, find_lock_task_mm() in requeue_oom_victim() could return NULL
and the OOM reaper could fail to set MMF_OOM_REAPED (and
find_lock_task_mm() in oom_scan_process_thread() could return NULL and the
OOM killer could fail to select the next OOM victim as well) when
__mmput() got stuck.

So, from the point of view of correctness, there remains an unhandled race
window as long as you depend on find_lock_task_mm() not returning NULL.
You will again ask "does it really matter/occur?", and I can't make
progress.

> 	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> 			task_pid_nr(tsk), tsk->comm);
>
On Tue 21-06-16 20:03:17, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Mon 13-06-16 13:19:43, Michal Hocko wrote:
> > [...]
> > > I am trying to remember why we are disabling the oom killer before kernel
> > > threads are frozen but I am not really sure about that right away.
> >
> > OK, I guess I remember now. Say that a task would depend on a freezable
> > kernel thread to get to do_exit (stuck in wait_event etc...). We would
> > simply get stuck in oom_killer_disable for ever. So we need to address
> > it a different way.
> >
> > One way would be what you are proposing, but I guess it would be more
> > systematic to never call exit_oom_victim on a remote task. After [1] we
> > have a solid foundation to rely only on MMF_REAPED even when TIF_MEMDIE
> > is set. It is more code than your patch so I can see a reason to go with
> > yours if the following one seems too large or ugly.
> >
> > [1] http://lkml.kernel.org/r/1466426628-15074-1-git-send-email-mhocko@kernel.org
> >
> > What do you think about the following?
>
> I'm OK with not clearing TIF_MEMDIE from a remote task. But this patch is
> racy.
>
> > @@ -567,40 +612,23 @@ static void oom_reap_task(struct task_struct *tsk)
> > 	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
> > 		schedule_timeout_idle(HZ/10);
> >
> > -	if (attempts > MAX_OOM_REAP_RETRIES) {
> > -		struct task_struct *p;
> > +	tsk->oom_reaper_list = NULL;
> >
> > +	if (attempts > MAX_OOM_REAP_RETRIES) {
>
> attempts > MAX_OOM_REAP_RETRIES means that down_read_trylock()
> continuously failed. But it does not guarantee that the offending task
> will not call up_write(&mm->mmap_sem) and arrive at mmput() from exit_mm()
> (and likewise that other threads which are blocked at
> down_read(&mm->mmap_sem) in exit_mm() by the offending task will not
> arrive at mmput() from exit_mm()) while the OOM reaper is preempted at
> this point.
>
> Therefore, find_lock_task_mm() in requeue_oom_victim() could return NULL
> and the OOM reaper could fail to set MMF_OOM_REAPED (and
> find_lock_task_mm() in oom_scan_process_thread() could return NULL and the
> OOM killer could fail to select the next OOM victim as well) when
> __mmput() got stuck.

Fair enough. As this would break the no-lockup requirement, we cannot go
that way. Let me think about it more.

> So, from the point of view of correctness, there remains an unhandled race
> window as long as you depend on find_lock_task_mm() not returning NULL.
> You will again ask "does it really matter/occur?", and I can't make
> progress.

Sigh...
-- 
Michal Hocko
SUSE Labs
On Tue 21-06-16 13:46:43, Michal Hocko wrote:
> On Tue 21-06-16 20:03:17, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Mon 13-06-16 13:19:43, Michal Hocko wrote:
> > > [...]
> > > > I am trying to remember why we are disabling the oom killer before kernel
> > > > threads are frozen but I am not really sure about that right away.
> > >
> > > OK, I guess I remember now. Say that a task would depend on a freezable
> > > kernel thread to get to do_exit (stuck in wait_event etc...). We would
> > > simply get stuck in oom_killer_disable for ever. So we need to address
> > > it a different way.
> > >
> > > One way would be what you are proposing, but I guess it would be more
> > > systematic to never call exit_oom_victim on a remote task. After [1] we
> > > have a solid foundation to rely only on MMF_REAPED even when TIF_MEMDIE
> > > is set. It is more code than your patch so I can see a reason to go with
> > > yours if the following one seems too large or ugly.
> > >
> > > [1] http://lkml.kernel.org/r/1466426628-15074-1-git-send-email-mhocko@kernel.org
> > >
> > > What do you think about the following?
> >
> > I'm OK with not clearing TIF_MEMDIE from a remote task. But this patch is
> > racy.
> >
> > > @@ -567,40 +612,23 @@ static void oom_reap_task(struct task_struct *tsk)
> > > 	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
> > > 		schedule_timeout_idle(HZ/10);
> > >
> > > -	if (attempts > MAX_OOM_REAP_RETRIES) {
> > > -		struct task_struct *p;
> > > +	tsk->oom_reaper_list = NULL;
> > >
> > > +	if (attempts > MAX_OOM_REAP_RETRIES) {
> >
> > attempts > MAX_OOM_REAP_RETRIES means that down_read_trylock()
> > continuously failed. But it does not guarantee that the offending task
> > will not call up_write(&mm->mmap_sem) and arrive at mmput() from exit_mm()
> > (and likewise that other threads which are blocked at
> > down_read(&mm->mmap_sem) in exit_mm() by the offending task will not
> > arrive at mmput() from exit_mm()) while the OOM reaper is preempted at
> > this point.
> >
> > Therefore, find_lock_task_mm() in requeue_oom_victim() could return NULL
> > and the OOM reaper could fail to set MMF_OOM_REAPED (and
> > find_lock_task_mm() in oom_scan_process_thread() could return NULL and the
> > OOM killer could fail to select the next OOM victim as well) when
> > __mmput() got stuck.
>
> Fair enough. As this would break the no-lockup requirement, we cannot go
> that way. Let me think about it more.

Hmm, what about the following instead? It is rather a workaround than a
full-fledged fix, but it seems much easier and shouldn't introduce new
issues.
---
>From 86bf010d2a6086491bb356494fab0e0fca80dee9 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Mon, 20 Jun 2016 17:28:18 +0200
Subject: [PATCH] oom, suspend: fix oom_reaper vs. oom_killer_disable race

Tetsuo has reported the following potential oom_killer_disable vs.
oom_reaper race:

(1) freeze_processes() starts freezing user space threads.
(2) Somebody (maybe a kernel thread) calls out_of_memory().
(3) The OOM killer calls mark_oom_victim() on a user space thread
    P1 which is already in __refrigerator().
(4) oom_killer_disable() sets oom_killer_disabled = true.
(5) P1 leaves __refrigerator() and enters do_exit().
(6) The OOM reaper calls exit_oom_victim(P1) before P1 can call
    exit_oom_victim(P1).
(7) oom_killer_disable() returns while P1 has not yet finished.
(8) P1 performs IO / interferes with the freezer.

This situation is unfortunate. We cannot move oom_killer_disable after
all the freezable kernel threads are frozen because the oom victim might
depend on some of those kthreads to make forward progress to exit, so we
could deadlock. It is also far from trivial to teach the oom_reaper not
to call exit_oom_victim() because then we would lose the guarantee of
OOM killer and oom_killer_disable forward progress, because
exit_mm->mmput might block and never call exit_oom_victim.
It seems the easiest way forward is to work around this race by calling
try_to_freeze_tasks again after oom_killer_disable. This will make sure
that all the tasks are frozen or it bails out.

Fixes: 449d777d7ad6 ("mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper")
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 kernel/power/process.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/kernel/power/process.c b/kernel/power/process.c
index df058bed53ce..0c2ee9761d57 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -146,6 +146,18 @@ int freeze_processes(void)
 	if (!error && !oom_killer_disable())
 		error = -EBUSY;
 
+	/*
+	 * There is a hard to fix race between oom_reaper kernel thread
+	 * and oom_killer_disable. oom_reaper calls exit_oom_victim
+	 * before the victim reaches exit_mm so try to freeze all the tasks
+	 * again and catch such a left over task.
+	 */
+	if (!error) {
+		pr_info("Double checking all user space processes after OOM killer disable... ");
+		error = try_to_freeze_tasks(true);
+		pr_cont("\n");
+	}
+
 	if (error)
 		thaw_processes();
 	return error;
-- 
2.8.1
-- 
Michal Hocko
SUSE Labs
Michal Hocko wrote:
> On Tue 21-06-16 13:46:43, Michal Hocko wrote:
> > On Tue 21-06-16 20:03:17, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Mon 13-06-16 13:19:43, Michal Hocko wrote:
> > > > [...]
> > > > > I am trying to remember why we are disabling the oom killer before kernel
> > > > > threads are frozen but I am not really sure about that right away.
> > > >
> > > > OK, I guess I remember now. Say that a task would depend on a freezable
> > > > kernel thread to get to do_exit (stuck in wait_event etc...). We would
> > > > simply get stuck in oom_killer_disable for ever. So we need to address
> > > > it a different way.
> > > >
> > > > One way would be what you are proposing, but I guess it would be more
> > > > systematic to never call exit_oom_victim on a remote task. After [1] we
> > > > have a solid foundation to rely only on MMF_REAPED even when TIF_MEMDIE
> > > > is set. It is more code than your patch so I can see a reason to go with
> > > > yours if the following one seems too large or ugly.
> > > >
> > > > [1] http://lkml.kernel.org/r/1466426628-15074-1-git-send-email-mhocko@kernel.org
> > > >
> > > > What do you think about the following?
> > >
> > > I'm OK with not clearing TIF_MEMDIE from a remote task. But this patch is
> > > racy.
> > >
> > > > @@ -567,40 +612,23 @@ static void oom_reap_task(struct task_struct *tsk)
> > > > 	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
> > > > 		schedule_timeout_idle(HZ/10);
> > > >
> > > > -	if (attempts > MAX_OOM_REAP_RETRIES) {
> > > > -		struct task_struct *p;
> > > > +	tsk->oom_reaper_list = NULL;
> > > >
> > > > +	if (attempts > MAX_OOM_REAP_RETRIES) {
> > >
> > > attempts > MAX_OOM_REAP_RETRIES means that down_read_trylock()
> > > continuously failed. But it does not guarantee that the offending task
> > > will not call up_write(&mm->mmap_sem) and arrive at mmput() from exit_mm()
> > > (and likewise that other threads which are blocked at
> > > down_read(&mm->mmap_sem) in exit_mm() by the offending task will not
> > > arrive at mmput() from exit_mm()) while the OOM reaper is preempted at
> > > this point.
> > >
> > > Therefore, find_lock_task_mm() in requeue_oom_victim() could return NULL
> > > and the OOM reaper could fail to set MMF_OOM_REAPED (and
> > > find_lock_task_mm() in oom_scan_process_thread() could return NULL and the
> > > OOM killer could fail to select the next OOM victim as well) when
> > > __mmput() got stuck.
> >
> > Fair enough. As this would break the no-lockup requirement, we cannot go
> > that way. Let me think about it more.
>
> Hmm, what about the following instead? It is rather a workaround than a
> full-fledged fix, but it seems much easier and shouldn't introduce new
> issues.

Yes, I think that will work. But I think the below patch (marking
signal_struct to ignore TIF_MEMDIE instead of clearing TIF_MEMDIE from
task_struct) on top of current linux.git will implement the no-lockup
requirement. No race is possible, unlike "[PATCH 10/10] mm, oom: hide mm
which is shared with kthread or global init".
 include/linux/oom.h   |  1 +
 include/linux/sched.h |  2 ++
 mm/memcontrol.c       |  3 ++-
 mm/oom_kill.c         | 60 ++++++++++++++++++++++++++++++---------------------
 4 files changed, 40 insertions(+), 26 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 8346952..f072c6c 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -69,6 +69,7 @@ static inline bool oom_task_origin(const struct task_struct *p)
 
 extern void mark_oom_victim(struct task_struct *tsk);
 
+extern bool task_is_reapable(struct task_struct *tsk);
 #ifdef CONFIG_MMU
 extern void try_oom_reaper(struct task_struct *tsk);
 #else
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e42ada..9248f90 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -799,6 +799,7 @@ struct signal_struct {
 	 * oom
 	 */
 	bool oom_flag_origin;
+	bool oom_ignore_me;
 	short oom_score_adj;		/* OOM kill score adjustment */
 	short oom_score_adj_min;	/* OOM kill score adjustment min value.
 					 * Only settable by CAP_SYS_RESOURCE. */
@@ -1545,6 +1546,7 @@ struct task_struct {
 	/* unserialized, strictly 'current' */
 	unsigned in_execve:1;	/* bit to tell LSMs we're in execve */
 	unsigned in_iowait:1;
+	unsigned oom_shortcut_done:1;
 #ifdef CONFIG_MEMCG
 	unsigned memcg_may_oom:1;
 #ifndef CONFIG_SLOB
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 75e7440..af162f6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1275,7 +1275,8 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * select it. The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
 	 */
-	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
+	if (task_is_reapable(current) && !current->oom_shortcut_done) {
+		current->oom_shortcut_done = true;
 		mark_oom_victim(current);
 		try_oom_reaper(current);
 		goto unlock;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index acbc432..e20d889 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -149,7 +149,7 @@ static bool oom_unkillable_task(struct task_struct *p,
 	if (!has_intersects_mems_allowed(p, nodemask))
 		return true;
 
-	return false;
+	return p->signal->oom_ignore_me;
 }
 
 /**
@@ -555,15 +555,15 @@ static void oom_reap_task(struct task_struct *tsk)
 	}
 
 	/*
-	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
+	 * Ignore TIF_MEMDIE because the task shouldn't be sitting on a
 	 * reasonably reclaimable memory anymore or it is not a good candidate
 	 * for the oom victim right now because it cannot release its memory
 	 * itself nor by the oom reaper.
 	 */
 	tsk->oom_reaper_list = NULL;
-	exit_oom_victim(tsk);
+	tsk->signal->oom_ignore_me = true;
 
-	/* Drop a reference taken by wake_oom_reaper */
+	/* Drop a reference taken by try_oom_reaper */
 	put_task_struct(tsk);
 }
 
@@ -589,7 +589,7 @@ static int oom_reaper(void *unused)
 	return 0;
 }
 
-static void wake_oom_reaper(struct task_struct *tsk)
+void try_oom_reaper(struct task_struct *tsk)
 {
 	if (!oom_reaper_th)
 		return;
@@ -610,13 +610,13 @@ static void wake_oom_reaper(struct task_struct *tsk)
 
 /* Check if we can reap the given task. This has to be called with stable
  * tsk->mm
  */
-void try_oom_reaper(struct task_struct *tsk)
+bool task_is_reapable(struct task_struct *tsk)
 {
 	struct mm_struct *mm = tsk->mm;
 	struct task_struct *p;
 
 	if (!mm)
-		return;
+		return false;
 
 	/*
 	 * There might be other threads/processes which are either not
@@ -639,12 +639,11 @@ void try_oom_reaper(struct task_struct *tsk)
 			/* Give up */
 			rcu_read_unlock();
-			return;
+			return false;
 		}
 		rcu_read_unlock();
 	}
-
-	wake_oom_reaper(tsk);
+	return true;
 }
 
 static int __init oom_init(void)
@@ -659,8 +658,10 @@ static int __init oom_init(void)
 }
 subsys_initcall(oom_init)
 #else
-static void wake_oom_reaper(struct task_struct *tsk)
+bool task_is_reapable(struct task_struct *tsk)
 {
+	return tsk->mm &&
+	       (fatal_signal_pending(tsk) || task_will_free_mem(tsk));
 }
 #endif
@@ -753,20 +754,28 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap = true;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
 	task_lock(p);
-	if (p->mm && task_will_free_mem(p)) {
+#ifdef CONFIG_MMU
+	if (task_is_reapable(p)) {
 		mark_oom_victim(p);
 		try_oom_reaper(p);
 		task_unlock(p);
 		put_task_struct(p);
 		return;
 	}
+#else
+	if (p->mm && task_will_free_mem(p)) {
+		mark_oom_victim(p);
+		task_unlock(p);
+		put_task_struct(p);
+		return;
+	}
+#endif
 	task_unlock(p);
 
 	if (__ratelimit(&oom_rs))
@@ -846,21 +855,22 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 		if (same_thread_group(p, victim))
 			continue;
 		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
-		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
-			/*
-			 * We cannot use oom_reaper for the mm shared by this
-			 * process because it wouldn't get killed and so the
-			 * memory might be still used.
-			 */
-			can_oom_reap = false;
+		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
 			continue;
-		}
+
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
 
-	if (can_oom_reap)
-		wake_oom_reaper(victim);
+#ifdef CONFIG_MMU
+	p = find_lock_task_mm(victim);
+	if (p && task_is_reapable(p))
+		try_oom_reaper(victim);
+	else
+		victim->signal->oom_ignore_me = true;
+	if (p)
+		task_unlock(p);
+#endif
 
 	mmdrop(mm);
 	put_task_struct(victim);
@@ -939,8 +949,8 @@ bool out_of_memory(struct oom_control *oc)
 	 * But don't select if current has already released its mm and cleared
 	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
 	 */
-	if (current->mm &&
-	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
+	if (!current->oom_shortcut_done && task_is_reapable(current)) {
+		current->oom_shortcut_done = true;
 		mark_oom_victim(current);
 		try_oom_reaper(current);
 		return true;
-- 
1.8.3.1
On Wed 22-06-16 00:32:29, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > Hmm, what about the following instead. It is rather a workaround than a
> > full fledged fix but it seems much easier and shouldn't introduce
> > new issues.
>
> Yes, I think that will work. But I think below patch (marking signal_struct
> to ignore TIF_MEMDIE instead of clearing TIF_MEMDIE from task_struct) on top of
> current linux.git will implement no-lockup requirement. No race is possible unlike
> "[PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init".

Not really. Because without the exit_oom_victim from oom_reaper you have
no guarantee that the oom_killer_disable will ever return. I have
mentioned that in the changelog. There is simply no guarantee the oom
victim will ever reach exit_mm->exit_oom_victim.
-- 
Michal Hocko
SUSE Labs
Michal Hocko wrote:
> On Wed 22-06-16 00:32:29, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> [...]
> > > Hmm, what about the following instead. It is rather a workaround than a
> > > full fledged fix but it seems much easier and shouldn't introduce
> > > new issues.
> >
> > Yes, I think that will work. But I think below patch (marking signal_struct
> > to ignore TIF_MEMDIE instead of clearing TIF_MEMDIE from task_struct) on top of
> > current linux.git will implement no-lockup requirement. No race is possible unlike
> > "[PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init".
>
> Not really. Because without the exit_oom_victim from oom_reaper you have
> no guarantee that the oom_killer_disable will ever return. I have
> mentioned that in the changelog. There is simply no guarantee the oom
> victim will ever reach exit_mm->exit_oom_victim.
Why? Since any allocation after setting oom_killer_disabled = true will be
forced to fail, nobody will be blocked waiting for memory allocation. Thus,
the TIF_MEMDIE tasks will eventually reach exit_mm->exit_oom_victim, won't they?
The only possibility that the TIF_MEMDIE tasks won't reach exit_mm->exit_oom_victim
is __GFP_NOFAIL allocations failing to make forward progress even after
ALLOC_NO_WATERMARKS is used. But that is a different problem, for which I think
we can call panic() when a __GFP_NOFAIL allocation fails after
oom_killer_disabled = true has been set.
On Wed 22-06-16 06:47:48, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 22-06-16 00:32:29, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > [...]
> > > > Hmm, what about the following instead. It is rather a workaround than a
> > > > full fledged fix but it seems much easier and shouldn't introduce
> > > > new issues.
> > >
> > > Yes, I think that will work. But I think below patch (marking signal_struct
> > > to ignore TIF_MEMDIE instead of clearing TIF_MEMDIE from task_struct) on top of
> > > current linux.git will implement no-lockup requirement. No race is possible unlike
> > > "[PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init".
> >
> > Not really. Because without the exit_oom_victim from oom_reaper you have
> > no guarantee that the oom_killer_disable will ever return. I have
> > mentioned that in the changelog. There is simply no guarantee the oom
> > victim will ever reach exit_mm->exit_oom_victim.
>
> Why? Since any allocation after setting oom_killer_disabled = true will be
> forced to fail, nobody will be blocked waiting for memory allocation. Thus,
> the TIF_MEMDIE tasks will eventually reach exit_mm->exit_oom_victim, won't they?
What if it gets blocked waiting for an operation which cannot make any
forward progress because it cannot proceed with an allocation (e.g.
an open coded allocation retry loop - not that uncommon when sending
a bio)? I mean if we want to guarantee forward progress then there has
to be something to clear the flag no matter what state the oom victim
is in, or give up on oom_killer_disable.
--
Michal Hocko
SUSE Labs
On Wed 22-06-16 08:40:15, Michal Hocko wrote:
> On Wed 22-06-16 06:47:48, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Wed 22-06-16 00:32:29, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > [...]
> > > > > Hmm, what about the following instead. It is rather a workaround than a
> > > > > full fledged fix but it seems much easier and shouldn't introduce
> > > > > new issues.
> > > >
> > > > Yes, I think that will work. But I think below patch (marking signal_struct
> > > > to ignore TIF_MEMDIE instead of clearing TIF_MEMDIE from task_struct) on top of
> > > > current linux.git will implement no-lockup requirement. No race is possible unlike
> > > > "[PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init".
> > >
> > > Not really. Because without the exit_oom_victim from oom_reaper you have
> > > no guarantee that the oom_killer_disable will ever return. I have
> > > mentioned that in the changelog. There is simply no guarantee the oom
> > > victim will ever reach exit_mm->exit_oom_victim.
> >
> > Why? Since any allocation after setting oom_killer_disabled = true will be
> > forced to fail, nobody will be blocked waiting for memory allocation. Thus,
> > the TIF_MEMDIE tasks will eventually reach exit_mm->exit_oom_victim, won't they?
>
> What if it gets blocked waiting for an operation which cannot make any
> forward progress because it cannot proceed with an allocation (e.g.
> an open coded allocation retry loop - not that uncommon when sending
> a bio)? I mean if we want to guarantee forward progress then there has
> to be something to clear the flag no matter what state the oom victim
> is in, or give up on oom_killer_disable.
That being said I guess the patch to try_to_freeze_tasks after
oom_killer_disable should be simple enough to go for now and stable
trees and we can come up with something less hackish later. I do not
like the fact that oom_killer_disable doesn't act as a full "barrier"
anymore.
What do you think?
--
Michal Hocko
SUSE Labs
Michal Hocko wrote:
> On Wed 22-06-16 08:40:15, Michal Hocko wrote:
> > On Wed 22-06-16 06:47:48, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Wed 22-06-16 00:32:29, Tetsuo Handa wrote:
> > > > > Michal Hocko wrote:
> > > > [...]
> > > > > > Hmm, what about the following instead. It is rather a workaround than a
> > > > > > full fledged fix but it seems much easier and shouldn't introduce
> > > > > > new issues.
> > > > >
> > > > > Yes, I think that will work. But I think below patch (marking signal_struct
> > > > > to ignore TIF_MEMDIE instead of clearing TIF_MEMDIE from task_struct) on top of
> > > > > current linux.git will implement no-lockup requirement. No race is possible unlike
> > > > > "[PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init".
> > > >
> > > > Not really. Because without the exit_oom_victim from oom_reaper you have
> > > > no guarantee that the oom_killer_disable will ever return. I have
> > > > mentioned that in the changelog. There is simply no guarantee the oom
> > > > victim will ever reach exit_mm->exit_oom_victim.
> > >
> > > Why? Since any allocation after setting oom_killer_disabled = true will be
> > > forced to fail, nobody will be blocked waiting for memory allocation. Thus,
> > > the TIF_MEMDIE tasks will eventually reach exit_mm->exit_oom_victim, won't they?
> >
> > What if it gets blocked waiting for an operation which cannot make any
> > forward progress because it cannot proceed with an allocation (e.g.
> > an open coded allocation retry loop - not that uncommon when sending
> > a bio)? I mean if we want to guarantee forward progress then there has
> > to be something to clear the flag no matter what state the oom victim
> > is in, or give up on oom_killer_disable.

That sounds as if CONFIG_MMU=n kernels do OOM livelock at __mmput()
regardless of oom_killer_disabled.
>
> That being said I guess the patch to try_to_freeze_tasks after
> oom_killer_disable should be simple enough to go for now and stable
> trees and we can come up with something less hackish later. I do not
> like the fact that oom_killer_disable doesn't act as a full "barrier"
> anymore.
>
> What do you think?

I'm OK with calling try_to_freeze_tasks(true) again for Linux 4.6 and 4.7 kernels.

But if free memory is little such that oom_killer_disable() cannot expect TIF_MEMDIE
threads to clear TIF_MEMDIE by themselves (and therefore has to depend on the OOM
reaper to clear TIF_MEMDIE on behalf of them after the OOM reaper reaped some memory),
subsequent operations would be as well blocked waiting for an operation which cannot
make any forward progress because it cannot proceed with an allocation. Then, making
oom_killer_disable() return false after some timeout (i.e. "do not try to suspend
when the system is almost OOM") will be a safer reaction.
On Wed 22-06-16 19:57:17, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > That being said I guess the patch to try_to_freeze_tasks after
> > oom_killer_disable should be simple enough to go for now and stable
> > trees and we can come up with something less hackish later. I do not
> > like the fact that oom_killer_disable doesn't act as a full "barrier"
> > anymore.
> >
> > What do you think?
>
> I'm OK with calling try_to_freeze_tasks(true) again for Linux 4.6 and 4.7 kernels.

OK, I will resend the patch CC Rafael and stable.

> But if free memory is little such that oom_killer_disable() cannot expect TIF_MEMDIE
> threads to clear TIF_MEMDIE by themselves (and therefore has to depend on the OOM
> reaper to clear TIF_MEMDIE on behalf of them after the OOM reaper reaped some memory),
> subsequent operations would be as well blocked waiting for an operation which cannot
> make any forward progress because it cannot proceed with an allocation. Then, making
> oom_killer_disable() return false after some timeout (i.e. "do not try to suspend
> when the system is almost OOM") will be a safer reaction.

Yes, that is exactly what I meant by the "oom_killer_disable has to give up"
alternative. pm suspend already has a notion of timeout for back off and
oom_killer_disable can use wait_event_timeout. But let's do that separately.
-- 
Michal Hocko
SUSE Labs
On Wed 22-06-16 14:08:43, Michal Hocko wrote:
> On Wed 22-06-16 19:57:17, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> [...]
> > > That being said I guess the patch to try_to_freeze_tasks after
> > > oom_killer_disable should be simple enough to go for now and stable
> > > trees and we can come up with something less hackish later. I do not
> > > like the fact that oom_killer_disable doesn't act as a full "barrier"
> > > anymore.
> > >
> > > What do you think?
> >
> > I'm OK with calling try_to_freeze_tasks(true) again for Linux 4.6 and 4.7 kernels.
>
> OK, I will resend the patch CC Rafael and stable.
Ohh, I've just realized that 449d777d7ad6 ("mm, oom_reaper: clear
TIF_MEMDIE for all tasks queued for oom_reaper") went in in this merge
window. So I've asked to push it to the next 4.7 rc.
--
Michal Hocko
SUSE Labs