* [PATCH 0/3] oom reaper follow ups v1
From: Michal Hocko @ 2016-04-06 14:13 UTC
To: linux-mm
Cc: David Rientjes, Tetsuo Handa, Andrew Morton, LKML, Daniel Vetter,
    Michael S. Tsirkin, Michal Hocko, Oleg Nesterov, Paul E. McKenney,
    Raushaniya Maksudova

Hi,
the following three patches should help to reduce the corner case space
for oom livelocks even further.

Patch 1 is something we should probably have done quite some time ago.
GFP_NOFS requests never got access to memory reserves, even when a task
was killed. As this has some side effects on oom notifiers, I have CCed
the current users of this interface to hear from them. The patch contains
more detailed information.

Patch 2 builds on top of that and allows tasks which skip the regular OOM
killer (e.g. those with fatal_signal_pending) to be queued for the oom
reaper when there is no risk that somebody sharing the mm with them could
observe this from userspace. I have CCed Oleg on this patch because I am
not entirely sure I am doing it properly.

Finally, the last patch relaxes TIF_MEMDIE clearing and makes sure that
no task queued for the oom reaper keeps it once it has been processed
(successfully or not).

Any feedback is highly appreciated.
* [PATCH 1/3] mm, oom: move GFP_NOFS check to out_of_memory
From: Michal Hocko @ 2016-04-06 14:13 UTC
To: linux-mm
Cc: David Rientjes, Tetsuo Handa, Andrew Morton, LKML, Michal Hocko,
    Daniel Vetter, Raushaniya Maksudova, Michael S. Tsirkin, Paul E. McKenney

From: Michal Hocko <mhocko@suse.com>

__alloc_pages_may_oom is the central place to decide when out_of_memory
should be invoked. This is a good approach for most checks there because
they are page allocator specific and the allocation fails right afterwards
for all of them. The notable exception is the GFP_NOFS context, which
fakes did_some_progress and keeps the page allocator looping even though
there couldn't have been any progress from the OOM killer. This patch
doesn't change that behavior, because we are not ready to allow those
allocation requests to fail yet (and maybe we will face the reality that
we will never manage to fail these requests safely). Instead, the
__GFP_FS check is moved down to out_of_memory and prevents OOM victim
selection there. There are two reasons for that:
- OOM notifiers might release some memory even from this context, as none
  of the registered notifiers seems to be FS related
- this might help a dying thread get access to memory reserves and move
  on, which makes the behavior more consistent with the case when the
  task gets killed from a different context.

Keep a comment in __alloc_pages_may_oom to make sure we do not forget how
GFP_NOFS is special and that we really want to do something about it.

Note to the current oom_notifier users: the observable difference for you
is that oom notifiers cannot depend on any fs locks, because we could
deadlock. Not that this would be allowed today, because that would just
lock up the machine in most cases, ruling out the OOM killer along the
way.

Another difference is that callbacks might be invoked sooner now, because
GFP_NOFS is a weaker reclaim context and so there could be reclaimable
memory which is just not reachable yet. That would require GFP_NOFS-only
loads, which are really rare; more importantly, the observable result
would be the dropping of reconstructible objects and a potential
performance drop, which is not such a big deal when we are struggling to
fulfill other important allocation requests.

Cc: Daniel Vetter <daniel.vetter@intel.com>
Cc: Raushaniya Maksudova <rmaksudova@parallels.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c   |  9 +++++++++
 mm/page_alloc.c | 24 ++++++++++--------------
 2 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 86349586eacb..32d8210b8773 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -877,6 +877,15 @@ bool out_of_memory(struct oom_control *oc)
 	}
 
 	/*
+	 * The OOM killer does not compensate for IO-less reclaim.
+	 * pagefault_out_of_memory lost its gfp context so we have to
+	 * make sure to exclude the 0 mask - all other users should have
+	 * at least ___GFP_DIRECT_RECLAIM to get here.
+	 */
+	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
+		return true;
+
+	/*
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
 	 */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1b889dba7bd4..736ea28abfcf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2872,22 +2872,18 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	/* The OOM killer does not needlessly kill tasks for lowmem */
 	if (ac->high_zoneidx < ZONE_NORMAL)
 		goto out;
-	/* The OOM killer does not compensate for IO-less reclaim */
-	if (!(gfp_mask & __GFP_FS)) {
-		/*
-		 * XXX: Page reclaim didn't yield anything,
-		 * and the OOM killer can't be invoked, but
-		 * keep looping as per tradition.
-		 *
-		 * But do not keep looping if oom_killer_disable()
-		 * was already called, for the system is trying to
-		 * enter a quiescent state during suspend.
-		 */
-		*did_some_progress = !oom_killer_disabled;
-		goto out;
-	}
 	if (pm_suspended_storage())
 		goto out;
+	/*
+	 * XXX: GFP_NOFS allocations should rather fail than rely on
+	 * other requests to make forward progress.
+	 * We are in an unfortunate situation where out_of_memory cannot
+	 * do much for this context, but let's try it to at least get
+	 * access to memory reserves if the current task is killed (see
+	 * out_of_memory). Once filesystems are ready to handle allocation
+	 * failures more gracefully we should just bail out here.
+	 */
+
 	/* The OOM killer may not free memory on a specific node */
 	if (gfp_mask & __GFP_THISNODE)
 		goto out;
-- 
2.8.0.rc3
* [PATCH 2/3] oom, oom_reaper: Try to reap tasks which skip regular OOM killer path
From: Michal Hocko @ 2016-04-06 14:13 UTC
To: linux-mm
Cc: David Rientjes, Tetsuo Handa, Andrew Morton, LKML, Michal Hocko, Oleg Nesterov

From: Michal Hocko <mhocko@suse.com>

If either the current task is already killed or PF_EXITING, or a selected
task is PF_EXITING, then the oom killer is suppressed and so is the oom
reaper. This patch adds try_oom_reaper, which checks the given task and
queues it for the oom reaper if that is safe to do, meaning that the task
doesn't share its mm with an alive process. This might help to release
memory pressure while the task tries to exit.

Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 72 insertions(+), 18 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 32d8210b8773..74c38f5fffef 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -412,6 +412,25 @@ bool oom_killer_disabled __read_mostly;
 
 #define K(x) ((x) << (PAGE_SHIFT-10))
 
+/*
+ * task->mm can be NULL if the task is the exited group leader. So to
+ * determine whether the task is using a particular mm, we examine all the
+ * task's threads: if one of those is using this mm then this task was also
+ * using it.
+ */
+static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
+{
+	struct task_struct *t;
+
+	for_each_thread(p, t) {
+		struct mm_struct *t_mm = READ_ONCE(t->mm);
+		if (t_mm)
+			return t_mm == mm;
+	}
+	return false;
+}
+
+
 #ifdef CONFIG_MMU
 /*
  * OOM Reaper kernel thread which tries to reap the memory used by the OOM
@@ -563,6 +582,53 @@ static void wake_oom_reaper(struct task_struct *tsk)
 	wake_up(&oom_reaper_wait);
 }
 
+/* Check if we can reap the given task. This has to be called with stable
+ * tsk->mm
+ */
+static void try_oom_reaper(struct task_struct *tsk)
+{
+	struct mm_struct *mm = tsk->mm;
+	struct task_struct *p;
+
+	if (!mm)
+		return;
+
+	/*
+	 * There might be other threads/processes which are either not
+	 * dying or even not killable.
+	 */
+	if (atomic_read(&mm->mm_users) > 1) {
+		rcu_read_lock();
+		for_each_process(p) {
+			bool exiting;
+
+			if (!process_shares_mm(p, mm))
+				continue;
+			if (same_thread_group(p, tsk))
+				continue;
+			if (fatal_signal_pending(p))
+				continue;
+
+			/*
+			 * If the task is exiting make sure the whole thread group
+			 * is exiting and cannot access mm anymore.
+			 */
+			spin_lock_irq(&p->sighand->siglock);
+			exiting = signal_group_exit(p->signal);
+			spin_unlock_irq(&p->sighand->siglock);
+			if (exiting)
+				continue;
+
+			/* Give up */
+			rcu_read_unlock();
+			return;
+		}
+		rcu_read_unlock();
+	}
+
+	wake_oom_reaper(tsk);
+}
+
 static int __init oom_init(void)
 {
 	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
@@ -575,6 +641,10 @@ static int __init oom_init(void)
 }
 subsys_initcall(oom_init)
 #else
+static void try_oom_reaper(struct task_struct *tsk)
+{
+}
+
 static void wake_oom_reaper(struct task_struct *tsk)
 {
 }
@@ -653,24 +723,6 @@ void oom_killer_enable(void)
 }
 
 /*
- * task->mm can be NULL if the task is the exited group leader. So to
- * determine whether the task is using a particular mm, we examine all the
- * task's threads: if one of those is using this mm then this task was also
- * using it.
- */
-static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
-{
-	struct task_struct *t;
-
-	for_each_thread(p, t) {
-		struct mm_struct *t_mm = READ_ONCE(t->mm);
-		if (t_mm)
-			return t_mm == mm;
-	}
-	return false;
-}
-
-/*
  * Must be called while holding a reference to p, which will be released upon
  * returning.
  */
@@ -694,6 +746,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	task_lock(p);
 	if (p->mm && task_will_free_mem(p)) {
 		mark_oom_victim(p);
+		try_oom_reaper(p);
 		task_unlock(p);
 		put_task_struct(p);
 		return;
@@ -873,6 +926,7 @@ bool out_of_memory(struct oom_control *oc)
 	if (current->mm &&
 	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
 		mark_oom_victim(current);
+		try_oom_reaper(current);
 		return true;
 	}
-- 
2.8.0.rc3
* Re: [PATCH 2/3] oom, oom_reaper: Try to reap tasks which skip regular OOM killer path
From: Tetsuo Handa @ 2016-04-07 11:38 UTC
To: mhocko, linux-mm
Cc: rientjes, akpm, linux-kernel, mhocko, oleg

Michal Hocko wrote:
> @@ -563,6 +582,53 @@ static void wake_oom_reaper(struct task_struct *tsk)
> 	wake_up(&oom_reaper_wait);
> }
> 
> +/* Check if we can reap the given task. This has to be called with stable
> + * tsk->mm
> + */
> +static void try_oom_reaper(struct task_struct *tsk)
> +{
> +	struct mm_struct *mm = tsk->mm;
> +	struct task_struct *p;
> +
> +	if (!mm)
> +		return;
> +
> +	/*
> +	 * There might be other threads/processes which are either not
> +	 * dying or even not killable.
> +	 */
> +	if (atomic_read(&mm->mm_users) > 1) {
> +		rcu_read_lock();
> +		for_each_process(p) {
> +			bool exiting;
> +
> +			if (!process_shares_mm(p, mm))
> +				continue;
> +			if (same_thread_group(p, tsk))
> +				continue;
> +			if (fatal_signal_pending(p))
> +				continue;
> +
> +			/*
> +			 * If the task is exiting make sure the whole thread group
> +			 * is exiting and cannot access mm anymore.
> +			 */
> +			spin_lock_irq(&p->sighand->siglock);
> +			exiting = signal_group_exit(p->signal);
> +			spin_unlock_irq(&p->sighand->siglock);
> +			if (exiting)
> +				continue;
> +
> +			/* Give up */
> +			rcu_read_unlock();
> +			return;
> +		}
> +		rcu_read_unlock();
> +	}
> +
> +	wake_oom_reaper(tsk);
> +}
> +

I think you want to factor out the "try_oom_reaper() without
wake_oom_reaper()" part as mm_is_reapable() and use it from
oom_kill_process(), in order to skip the
p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN test which needlessly
makes can_oom_reap false.

> @@ -694,6 +746,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> 	task_lock(p);
> 	if (p->mm && task_will_free_mem(p)) {
> 		mark_oom_victim(p);
> +		try_oom_reaper(p);
> 		task_unlock(p);
> 		put_task_struct(p);
> 		return;
> @@ -873,6 +926,7 @@ bool out_of_memory(struct oom_control *oc)
> 	if (current->mm &&
> 	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
> 		mark_oom_victim(current);
> +		try_oom_reaper(current);
> 		return true;
> 	}
> 

Why don't you call try_oom_reaper() from the shortcuts in
mem_cgroup_out_of_memory() as well?

Why don't you embed try_oom_reaper() into mark_oom_victim() like I did at
http://lkml.kernel.org/r/201602052014.HBG52666.HFMOQVLFOSFJtO@I-love.SAKURA.ne.jp ?
* Re: [PATCH 2/3] oom, oom_reaper: Try to reap tasks which skip regular OOM killer path
From: Tetsuo Handa @ 2016-04-08 11:19 UTC
To: mhocko, linux-mm
Cc: rientjes, akpm, linux-kernel, mhocko, oleg

Tetsuo Handa wrote:
> Michal Hocko wrote:
> > @@ -694,6 +746,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> > 	task_lock(p);
> > 	if (p->mm && task_will_free_mem(p)) {
> > 		mark_oom_victim(p);
> > +		try_oom_reaper(p);
> > 		task_unlock(p);
> > 		put_task_struct(p);
> > 		return;
> > @@ -873,6 +926,7 @@ bool out_of_memory(struct oom_control *oc)
> > 	if (current->mm &&
> > 	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
> > 		mark_oom_victim(current);
> > +		try_oom_reaper(current);
> > 		return true;
> > 	}
> > 

oom_reaper() will need to do "tsk->oom_reaper_list = NULL;" because of the

	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
		return;

test in wake_oom_reaper(), if "[PATCH 3/3] mm, oom_reaper: clear
TIF_MEMDIE for all tasks queued for oom_reaper" allows the same thread
to be selected again. Though I think we should not allow the OOM killer
to select the same thread again.

> Why don't you call try_oom_reaper() from the shortcuts in
> mem_cgroup_out_of_memory() as well?

I looked at next-20160408 but I again came to think that we should remove
these shortcuts (something like the patch shown at the bottom). These
shortcuts might be called without sending SIGKILL to threads sharing the
victim's memory. It is possible that try_oom_reaper() fails to call
wake_oom_reaper() because of a thread without fatal_signal_pending(). If
such a thread is holding mmap_sem for write without
fatal_signal_pending(), the victim will get stuck at exit_mm() but
wake_oom_reaper() is not called. Therefore, use of these shortcuts can
put us back to square one.

My perspective on the OOM reaper is that it should behave as a guaranteed
unlocking mechanism rather than a preemptive memory reaping mechanism.
Therefore, it is critically important that the OOM reaper kernel thread
is always woken up to release TIF_MEMDIE some time later, even if it is
known that the memory used by the caller of try_oom_reaper() is not
reapable. Therefore, I moved the mm_is_reapable() test into the OOM
reaper kernel thread in the patch shown at the bottom.

Also, setting TIF_MEMDIE on all threads at oom_kill_process() not only
eliminates the need for "[PATCH 1/3] mm, oom: move GFP_NOFS check to
out_of_memory" but also suppresses needless OOM killer messages by
holding off the OOM killer until either all threads which should release
the victim's mm have released it, or the OOM reaper has marked all
threads which failed to release the victim's mm as being stuck.

Setting TIF_MEMDIE on all threads at oom_kill_process() also increases
the possibility of reclaiming memory when racing with
oom_killer_disable(), by waiting until either all threads which should
release the victim's mm have released it, or oom_killer_disable() gives
up waiting for oom_victims to become 0. It sounds strange to me that we
currently thaw only the victim's thread when we need to thaw all threads
sharing the victim's mm in order to reclaim the memory used by the
victim.

(And the crazier thing is that we loop forever without providing a
guarantee of forward progress at

	/* Exhausted what can be done so it's blamo time */
	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
		*did_some_progress = 1;

		if (gfp_mask & __GFP_NOFAIL) {
			page = get_page_from_freelist(gfp_mask, order,
					ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
			/*
			 * fallback to ignore cpuset restriction if our nodes
			 * are depleted
			 */
			if (!page)
				page = get_page_from_freelist(gfp_mask, order,
					ALLOC_NO_WATERMARKS, ac);
		}
	}

but that is outside of this discussion.)

----------
 include/linux/oom.h   |    4 -
 include/linux/sched.h |    2 
 kernel/exit.c         |    2 
 mm/memcontrol.c       |   13 ---
 mm/oom_kill.c         |  200 +++++++++++++++++++------------------------------
 5 files changed, 83 insertions(+), 138 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index abaab8e..9c99956 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -67,8 +67,6 @@ static inline bool oom_task_origin(const struct task_struct *p)
 	return p->signal->oom_flag_origin;
 }
 
-extern void mark_oom_victim(struct task_struct *tsk);
-
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
 		unsigned long totalpages);
@@ -86,7 +84,7 @@ extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 
 extern bool out_of_memory(struct oom_control *oc);
 
-extern void exit_oom_victim(struct task_struct *tsk);
+extern void exit_oom_victim(void);
 
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index dd286b1..a93b24d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -786,6 +786,8 @@ struct signal_struct {
 	 * oom
 	 */
 	bool oom_flag_origin;
+	/* Already OOM-killed but cannot terminate. Don't count on me. */
+	bool oom_skip_me;
 	short oom_score_adj;		/* OOM kill score adjustment */
 	short oom_score_adj_min;	/* OOM kill score adjustment min value.
 					 * Only settable by CAP_SYS_RESOURCE. */
diff --git a/kernel/exit.c b/kernel/exit.c
index 9e6e135..c742c37 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -435,7 +435,7 @@ static void exit_mm(struct task_struct *tsk)
 	mm_update_next_owner(mm);
 	mmput(mm);
 	if (test_thread_flag(TIF_MEMDIE))
-		exit_oom_victim(tsk);
+		exit_oom_victim();
 }
 
 static struct task_struct *find_alive_thread(struct task_struct *p)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3e90d48..57611b8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1278,17 +1278,8 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	unsigned int points = 0;
 	struct task_struct *chosen = NULL;
 
-	mutex_lock(&oom_lock);
-
-	/*
-	 * If current has a pending SIGKILL or is exiting, then automatically
-	 * select it. The goal is to allow it to allocate so that it may
-	 * quickly exit and free its memory.
-	 */
-	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
-		mark_oom_victim(current);
-		goto unlock;
-	}
+	if (mutex_lock_killable(&oom_lock))
+		return NULL;
 
 	check_panic_on_oom(&oc, CONSTRAINT_MEMCG, memcg);
 	totalpages = mem_cgroup_get_limit(memcg) ? : 1;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7098104..7bda655 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -279,6 +279,8 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 	 * Don't allow any other task to have access to the reserves.
 	 */
 	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
+		if (task->signal->oom_skip_me)
+			return OOM_SCAN_CONTINUE;
 		if (!is_sysrq_oom(oc))
 			return OOM_SCAN_ABORT;
 	}
@@ -441,6 +443,42 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
+static bool mm_is_reapable(struct mm_struct *mm)
+{
+	struct task_struct *p;
+
+	if (atomic_read(&mm->mm_users) <= 2)
+		return true;
+	/*
+	 * There might be other threads/processes which are either not
+	 * dying or even not killable.
+	 */
+	rcu_read_lock();
+	for_each_process(p) {
+		bool exiting;
+
+		if (!process_shares_mm(p, mm))
+			continue;
+		if (fatal_signal_pending(p))
+			continue;
+
+		/*
+		 * If the task is exiting make sure the whole thread group
+		 * is exiting and cannot access mm anymore.
+		 */
+		spin_lock_irq(&p->sighand->siglock);
+		exiting = signal_group_exit(p->signal);
+		spin_unlock_irq(&p->sighand->siglock);
+		if (exiting)
+			continue;
+
+		/* Give up */
+		rcu_read_unlock();
+		return false;
+	}
+	rcu_read_unlock();
+	return true;
+}
 
 static bool __oom_reap_task(struct task_struct *tsk)
 {
@@ -470,7 +508,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
 
 	task_unlock(p);
 
-	if (!down_read_trylock(&mm->mmap_sem)) {
+	if (!mm_is_reapable(mm) || !down_read_trylock(&mm->mmap_sem)) {
 		ret = false;
 		goto out;
 	}
@@ -509,11 +547,6 @@ static bool __oom_reap_task(struct task_struct *tsk)
 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
 	up_read(&mm->mmap_sem);
 
-	/*
-	 * This task can be safely ignored because we cannot do much more
-	 * to release its memory.
-	 */
-	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
 out:
 	mmput(mm);
 	return ret;
@@ -535,12 +568,12 @@ static void oom_reap_task(struct task_struct *tsk)
 	}
 
 	/*
-	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
-	 * reasonably reclaimable memory anymore or it is not a good candidate
-	 * for the oom victim right now because it cannot release its memory
-	 * itself nor by the oom reaper.
+	 * Tell oom_scan_process_thread() not to wait for this process, for
+	 * the task shouldn't be sitting on a reasonably reclaimable memory
+	 * anymore or it is not a good candidate for the oom victim right now
+	 * because it cannot release its memory itself nor by the oom reaper.
 	 */
-	exit_oom_victim(tsk);
+	tsk->signal->oom_skip_me = true;
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
@@ -586,53 +619,6 @@ static void wake_oom_reaper(struct task_struct *tsk)
 	wake_up(&oom_reaper_wait);
 }
 
-/* Check if we can reap the given task.
This has to be called with stable - * tsk->mm - */ -static void try_oom_reaper(struct task_struct *tsk) -{ - struct mm_struct *mm = tsk->mm; - struct task_struct *p; - - if (!mm) - return; - - /* - * There might be other threads/processes which are either not - * dying or even not killable. - */ - if (atomic_read(&mm->mm_users) > 1) { - rcu_read_lock(); - for_each_process(p) { - bool exiting; - - if (!process_shares_mm(p, mm)) - continue; - if (same_thread_group(p, tsk)) - continue; - if (fatal_signal_pending(p)) - continue; - - /* - * If the task is exiting make sure the whole thread group - * is exiting and cannot acces mm anymore. - */ - spin_lock_irq(&p->sighand->siglock); - exiting = signal_group_exit(p->signal); - spin_unlock_irq(&p->sighand->siglock); - if (exiting) - continue; - - /* Give up */ - rcu_read_unlock(); - return; - } - rcu_read_unlock(); - } - - wake_oom_reaper(tsk); -} - static int __init oom_init(void) { oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper"); @@ -645,10 +631,6 @@ static int __init oom_init(void) } subsys_initcall(oom_init) #else -static void try_oom_reaper(struct task_struct *tsk) -{ -} - static void wake_oom_reaper(struct task_struct *tsk) { } @@ -661,7 +643,7 @@ static void wake_oom_reaper(struct task_struct *tsk) * Has to be called with oom_lock held and never after * oom has been disabled already. 
*/ -void mark_oom_victim(struct task_struct *tsk) +static void mark_oom_victim(struct task_struct *tsk) { WARN_ON(oom_killer_disabled); /* OOM killer might race with memcg OOM */ @@ -680,10 +662,9 @@ void mark_oom_victim(struct task_struct *tsk) /** * exit_oom_victim - note the exit of an OOM victim */ -void exit_oom_victim(struct task_struct *tsk) +void exit_oom_victim(void) { - if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE)) - return; + clear_thread_flag(TIF_MEMDIE); if (!atomic_dec_return(&oom_victims)) wake_up_all(&oom_victims_wait); @@ -741,21 +722,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p, unsigned int victim_points = 0; static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST); - bool can_oom_reap = true; - - /* - * If the task is already exiting, don't alarm the sysadmin or kill - * its children or threads, just set TIF_MEMDIE so it can die quickly - */ - task_lock(p); - if (p->mm && task_will_free_mem(p)) { - mark_oom_victim(p); - try_oom_reaper(p); - task_unlock(p); - put_task_struct(p); - return; - } - task_unlock(p); if (__ratelimit(&oom_rs)) dump_header(oc, p, memcg); @@ -804,13 +770,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p, /* Get a reference to safely compare mm after task_unlock(victim) */ mm = victim->mm; atomic_inc(&mm->mm_count); - /* - * We should send SIGKILL before setting TIF_MEMDIE in order to prevent - * the OOM victim from depleting the memory reserves from the user - * space under its control. 
- */ - do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true); - mark_oom_victim(victim); pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", task_pid_nr(victim), victim->comm, K(victim->mm->total_vm), K(get_mm_counter(victim->mm, MM_ANONPAGES)), @@ -819,37 +778,47 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p, task_unlock(victim); /* - * Kill all user processes sharing victim->mm in other thread groups, if - * any. They don't get access to memory reserves, though, to avoid - * depletion of all memory. This prevents mm->mmap_sem livelock when an - * oom killed thread cannot exit because it requires the semaphore and - * its contended by another thread trying to allocate memory itself. - * That thread will now get access to memory reserves since it has a - * pending fatal signal. + * Kill all user processes sharing victim->mm in other thread groups, + * if any. This reduces possibility of hitting mm->mmap_sem livelock + * when an oom killed thread cannot exit because it requires the + * semaphore and its contended by another thread trying to allocate + * memory itself. */ rcu_read_lock(); for_each_process(p) { if (!process_shares_mm(p, mm)) continue; - if (same_thread_group(p, victim)) + if (unlikely(p->flags & PF_KTHREAD)) continue; - if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) || - p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) { - /* - * We cannot use oom_reaper for the mm shared by this - * process because it wouldn't get killed and so the - * memory might be still used. - */ - can_oom_reap = false; + if (is_global_init(p)) continue; - } + if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) + continue; + + /* + * We should make sure that oom_badness() will treat this + * process as unkillable because wake_oom_reaper() might do + * nothing. + * Note that this will change sysctl_oom_kill_allocating_task + * behavior. 
+ */ + p->signal->oom_score_adj = OOM_SCORE_ADJ_MIN; + /* + * We should send SIGKILL before setting TIF_MEMDIE in order to + * prevent the OOM victim from depleting the memory reserves + * from the user space under its control. + */ do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true); + for_each_thread(p, t) { + task_lock(t); + if (t->mm) + mark_oom_victim(t); + task_unlock(t); + } + wake_oom_reaper(p); } rcu_read_unlock(); - if (can_oom_reap) - wake_oom_reaper(victim); - mmdrop(mm); put_task_struct(victim); } @@ -920,21 +889,6 @@ bool out_of_memory(struct oom_control *oc) return true; /* - * If current has a pending SIGKILL or is exiting, then automatically - * select it. The goal is to allow it to allocate so that it may - * quickly exit and free its memory. - * - * But don't select if current has already released its mm and cleared - * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur. - */ - if (current->mm && - (fatal_signal_pending(current) || task_will_free_mem(current))) { - mark_oom_victim(current); - try_oom_reaper(current); - return true; - } - - /* * The OOM killer does not compensate for IO-less reclaim. * pagefault_out_of_memory lost its gfp context so we have to * make sure exclude 0 mask - all other users should have at least ---------- ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH 2/3] oom, oom_reaper: Try to reap tasks which skip regular OOM killer path @ 2016-04-08 11:19 ` Tetsuo Handa 0 siblings, 0 replies; 40+ messages in thread From: Tetsuo Handa @ 2016-04-08 11:19 UTC (permalink / raw) To: mhocko, linux-mm; +Cc: rientjes, akpm, linux-kernel, mhocko, oleg Tetsuo Handa wrote: > Michal Hocko wrote: > > @@ -694,6 +746,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p, > > task_lock(p); > > if (p->mm && task_will_free_mem(p)) { > > mark_oom_victim(p); > > + try_oom_reaper(p); > > task_unlock(p); > > put_task_struct(p); > > return; > > @@ -873,6 +926,7 @@ bool out_of_memory(struct oom_control *oc) > > if (current->mm && > > (fatal_signal_pending(current) || task_will_free_mem(current))) { > > mark_oom_victim(current); > > + try_oom_reaper(current); > > return true; > > } > > oom_reaper() will need to do "tsk->oom_reaper_list = NULL;" because of the if (tsk == oom_reaper_list || tsk->oom_reaper_list) return; test in wake_oom_reaper(), if "[PATCH 3/3] mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper" selects the same thread again. Though I think we should not allow the OOM killer to select the same thread again. > > Why don't you call try_oom_reaper() from the shortcuts in > mem_cgroup_out_of_memory() as well? I looked at next-20160408, but I again came to think that we should remove these shortcuts (something like the patch shown at the bottom). These shortcuts might be called without sending SIGKILL to threads sharing the victim's memory. It is possible that try_oom_reaper() fails to call wake_oom_reaper() because of a thread without fatal_signal_pending(). If such a thread is holding mmap_sem for write without fatal_signal_pending(), the victim will get stuck at exit_mm(), but wake_oom_reaper() is not called. Therefore, use of these shortcuts can put us back to square one. My perspective is that the OOM reaper should behave more as a guaranteed unlocking mechanism than as a preemptive memory reaping mechanism. 
Therefore, it is critically important that the OOM reaper kernel thread is always woken up and eventually unblocks TIF_MEMDIE some time later, even if it is known that the memory used by the caller of try_oom_reaper() is not reapable. That is why I moved the mm_is_reapable() test into the OOM reaper kernel thread in the patch shown at the bottom. Also, setting TIF_MEMDIE on all threads at oom_kill_process() not only eliminates the need for "[PATCH 1/3] mm, oom: move GFP_NOFS check to out_of_memory" but also suppresses needless OOM killer messages by holding off the OOM killer until "all threads which should release the victim's mm have released it" or "the OOM reaper marks all threads which failed to release the victim's mm as being stuck". Setting TIF_MEMDIE on all threads at oom_kill_process() also increases the possibility of reclaiming memory when racing with oom_killer_disable(), by waiting until "all threads which should release the victim's mm have released it" or "oom_killer_disable() gives up waiting for oom_victims to become 0". It sounds strange to me that we currently thaw only the victim's thread when we need to thaw all threads sharing the victim's mm in order to reclaim the memory used by the victim. (And the crazier thing is that we loop forever without providing a guarantee of forward progress at /* Exhausted what can be done so it's blamo time */ if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) { *did_some_progress = 1; if (gfp_mask & __GFP_NOFAIL) { page = get_page_from_freelist(gfp_mask, order, ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac); /* * fallback to ignore cpuset restriction if our nodes * are depleted */ if (!page) page = get_page_from_freelist(gfp_mask, order, ALLOC_NO_WATERMARKS, ac); } } but that thing is outside of this discussion.) 
---------- include/linux/oom.h | 4 - include/linux/sched.h | 2 kernel/exit.c | 2 mm/memcontrol.c | 13 --- mm/oom_kill.c | 200 +++++++++++++++++++------------------------------- 5 files changed, 83 insertions(+), 138 deletions(-) diff --git a/include/linux/oom.h b/include/linux/oom.h index abaab8e..9c99956 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -67,8 +67,6 @@ static inline bool oom_task_origin(const struct task_struct *p) return p->signal->oom_flag_origin; } -extern void mark_oom_victim(struct task_struct *tsk); - extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, const nodemask_t *nodemask, unsigned long totalpages); @@ -86,7 +84,7 @@ extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc, extern bool out_of_memory(struct oom_control *oc); -extern void exit_oom_victim(struct task_struct *tsk); +extern void exit_oom_victim(void); extern int register_oom_notifier(struct notifier_block *nb); extern int unregister_oom_notifier(struct notifier_block *nb); diff --git a/include/linux/sched.h b/include/linux/sched.h index dd286b1..a93b24d 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -786,6 +786,8 @@ struct signal_struct { * oom */ bool oom_flag_origin; + /* Already OOM-killed but cannot terminate. Don't count on me. */ + bool oom_skip_me; short oom_score_adj; /* OOM kill score adjustment */ short oom_score_adj_min; /* OOM kill score adjustment min value. * Only settable by CAP_SYS_RESOURCE. 
*/ diff --git a/kernel/exit.c b/kernel/exit.c index 9e6e135..c742c37 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -435,7 +435,7 @@ static void exit_mm(struct task_struct *tsk) mm_update_next_owner(mm); mmput(mm); if (test_thread_flag(TIF_MEMDIE)) - exit_oom_victim(tsk); + exit_oom_victim(); } static struct task_struct *find_alive_thread(struct task_struct *p) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 3e90d48..57611b8 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1278,17 +1278,8 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, unsigned int points = 0; struct task_struct *chosen = NULL; - mutex_lock(&oom_lock); - - /* - * If current has a pending SIGKILL or is exiting, then automatically - * select it. The goal is to allow it to allocate so that it may - * quickly exit and free its memory. - */ - if (fatal_signal_pending(current) || task_will_free_mem(current)) { - mark_oom_victim(current); - goto unlock; - } + if (mutex_lock_killable(&oom_lock)) + return NULL; check_panic_on_oom(&oc, CONSTRAINT_MEMCG, memcg); totalpages = mem_cgroup_get_limit(memcg) ? : 1; diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 7098104..7bda655 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -279,6 +279,8 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc, * Don't allow any other task to have access to the reserves. */ if (test_tsk_thread_flag(task, TIF_MEMDIE)) { + if (task->signal->oom_skip_me) + return OOM_SCAN_CONTINUE; if (!is_sysrq_oom(oc)) return OOM_SCAN_ABORT; } @@ -441,6 +443,42 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait); static struct task_struct *oom_reaper_list; static DEFINE_SPINLOCK(oom_reaper_lock); +static bool mm_is_reapable(struct mm_struct *mm) +{ + struct task_struct *p; + + if (atomic_read(&mm->mm_users) <= 2) + return true; + /* + * There might be other threads/processes which are either not + * dying or even not killable. 
+ */ + rcu_read_lock(); + for_each_process(p) { + bool exiting; + + if (!process_shares_mm(p, mm)) + continue; + if (fatal_signal_pending(p)) + continue; + + /* + * If the task is exiting make sure the whole thread group + * is exiting and cannot access mm anymore. + */ + spin_lock_irq(&p->sighand->siglock); + exiting = signal_group_exit(p->signal); + spin_unlock_irq(&p->sighand->siglock); + if (exiting) + continue; + + /* Give up */ + rcu_read_unlock(); + return false; + } + rcu_read_unlock(); + return true; +} static bool __oom_reap_task(struct task_struct *tsk) { @@ -470,7 +508,7 @@ static bool __oom_reap_task(struct task_struct *tsk) task_unlock(p); - if (!down_read_trylock(&mm->mmap_sem)) { + if (!mm_is_reapable(mm) || !down_read_trylock(&mm->mmap_sem)) { ret = false; goto out; } @@ -509,11 +547,6 @@ static bool __oom_reap_task(struct task_struct *tsk) K(get_mm_counter(mm, MM_SHMEMPAGES))); up_read(&mm->mmap_sem); - /* - * This task can be safely ignored because we cannot do much more - * to release its memory. - */ - tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN; out: mmput(mm); return ret; @@ -535,12 +568,12 @@ static void oom_reap_task(struct task_struct *tsk) } /* - * Clear TIF_MEMDIE because the task shouldn't be sitting on a - * reasonably reclaimable memory anymore or it is not a good candidate - * for the oom victim right now because it cannot release its memory - * itself nor by the oom reaper. + * Tell oom_scan_process_thread() not to wait for this process, for + * the task shouldn't be sitting on a reasonably reclaimable memory + * anymore or it is not a good candidate for the oom victim right now + * because it cannot release its memory itself nor by the oom reaper. */ - exit_oom_victim(tsk); + tsk->signal->oom_skip_me = true; /* Drop a reference taken by wake_oom_reaper */ put_task_struct(tsk); @@ -586,53 +619,6 @@ static void wake_oom_reaper(struct task_struct *tsk) wake_up(&oom_reaper_wait); } -/* Check if we can reap the given task. 
This has to be called with stable - * tsk->mm - */ -static void try_oom_reaper(struct task_struct *tsk) -{ - struct mm_struct *mm = tsk->mm; - struct task_struct *p; - - if (!mm) - return; - - /* - * There might be other threads/processes which are either not - * dying or even not killable. - */ - if (atomic_read(&mm->mm_users) > 1) { - rcu_read_lock(); - for_each_process(p) { - bool exiting; - - if (!process_shares_mm(p, mm)) - continue; - if (same_thread_group(p, tsk)) - continue; - if (fatal_signal_pending(p)) - continue; - - /* - * If the task is exiting make sure the whole thread group - * is exiting and cannot acces mm anymore. - */ - spin_lock_irq(&p->sighand->siglock); - exiting = signal_group_exit(p->signal); - spin_unlock_irq(&p->sighand->siglock); - if (exiting) - continue; - - /* Give up */ - rcu_read_unlock(); - return; - } - rcu_read_unlock(); - } - - wake_oom_reaper(tsk); -} - static int __init oom_init(void) { oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper"); @@ -645,10 +631,6 @@ static int __init oom_init(void) } subsys_initcall(oom_init) #else -static void try_oom_reaper(struct task_struct *tsk) -{ -} - static void wake_oom_reaper(struct task_struct *tsk) { } @@ -661,7 +643,7 @@ static void wake_oom_reaper(struct task_struct *tsk) * Has to be called with oom_lock held and never after * oom has been disabled already. 
*/ -void mark_oom_victim(struct task_struct *tsk) +static void mark_oom_victim(struct task_struct *tsk) { WARN_ON(oom_killer_disabled); /* OOM killer might race with memcg OOM */ @@ -680,10 +662,9 @@ void mark_oom_victim(struct task_struct *tsk) /** * exit_oom_victim - note the exit of an OOM victim */ -void exit_oom_victim(struct task_struct *tsk) +void exit_oom_victim(void) { - if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE)) - return; + clear_thread_flag(TIF_MEMDIE); if (!atomic_dec_return(&oom_victims)) wake_up_all(&oom_victims_wait); @@ -741,21 +722,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p, unsigned int victim_points = 0; static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST); - bool can_oom_reap = true; - - /* - * If the task is already exiting, don't alarm the sysadmin or kill - * its children or threads, just set TIF_MEMDIE so it can die quickly - */ - task_lock(p); - if (p->mm && task_will_free_mem(p)) { - mark_oom_victim(p); - try_oom_reaper(p); - task_unlock(p); - put_task_struct(p); - return; - } - task_unlock(p); if (__ratelimit(&oom_rs)) dump_header(oc, p, memcg); @@ -804,13 +770,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p, /* Get a reference to safely compare mm after task_unlock(victim) */ mm = victim->mm; atomic_inc(&mm->mm_count); - /* - * We should send SIGKILL before setting TIF_MEMDIE in order to prevent - * the OOM victim from depleting the memory reserves from the user - * space under its control. 
- */ - do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true); - mark_oom_victim(victim); pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", task_pid_nr(victim), victim->comm, K(victim->mm->total_vm), K(get_mm_counter(victim->mm, MM_ANONPAGES)), @@ -819,37 +778,47 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p, task_unlock(victim); /* - * Kill all user processes sharing victim->mm in other thread groups, if - * any. They don't get access to memory reserves, though, to avoid - * depletion of all memory. This prevents mm->mmap_sem livelock when an - * oom killed thread cannot exit because it requires the semaphore and - * its contended by another thread trying to allocate memory itself. - * That thread will now get access to memory reserves since it has a - * pending fatal signal. + * Kill all user processes sharing victim->mm in other thread groups, + * if any. This reduces possibility of hitting mm->mmap_sem livelock + * when an oom killed thread cannot exit because it requires the + * semaphore and its contended by another thread trying to allocate + * memory itself. */ rcu_read_lock(); for_each_process(p) { if (!process_shares_mm(p, mm)) continue; - if (same_thread_group(p, victim)) + if (unlikely(p->flags & PF_KTHREAD)) continue; - if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) || - p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) { - /* - * We cannot use oom_reaper for the mm shared by this - * process because it wouldn't get killed and so the - * memory might be still used. - */ - can_oom_reap = false; + if (is_global_init(p)) continue; - } + if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) + continue; + + /* + * We should make sure that oom_badness() will treat this + * process as unkillable because wake_oom_reaper() might do + * nothing. + * Note that this will change sysctl_oom_kill_allocating_task + * behavior. 
+ */ + p->signal->oom_score_adj = OOM_SCORE_ADJ_MIN; + /* + * We should send SIGKILL before setting TIF_MEMDIE in order to + * prevent the OOM victim from depleting the memory reserves + * from the user space under its control. + */ do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true); + for_each_thread(p, t) { + task_lock(t); + if (t->mm) + mark_oom_victim(t); + task_unlock(t); + } + wake_oom_reaper(p); } rcu_read_unlock(); - if (can_oom_reap) - wake_oom_reaper(victim); - mmdrop(mm); put_task_struct(victim); } @@ -920,21 +889,6 @@ bool out_of_memory(struct oom_control *oc) return true; /* - * If current has a pending SIGKILL or is exiting, then automatically - * select it. The goal is to allow it to allocate so that it may - * quickly exit and free its memory. - * - * But don't select if current has already released its mm and cleared - * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur. - */ - if (current->mm && - (fatal_signal_pending(current) || task_will_free_mem(current))) { - mark_oom_victim(current); - try_oom_reaper(current); - return true; - } - - /* * The OOM killer does not compensate for IO-less reclaim. * pagefault_out_of_memory lost its gfp context so we have to * make sure exclude 0 mask - all other users should have at least ---------- ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH 2/3] oom, oom_reaper: Try to reap tasks which skip regular OOM killer path 2016-04-08 11:19 ` Tetsuo Handa @ 2016-04-08 11:50 ` Michal Hocko 1 sibling, 0 replies; 40+ messages in thread From: Michal Hocko @ 2016-04-08 11:50 UTC (permalink / raw) To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm, linux-kernel, oleg On Fri 08-04-16 20:19:28, Tetsuo Handa wrote: > Tetsuo Handa wrote: > > Michal Hocko wrote: > > > @@ -694,6 +746,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p, > > > task_lock(p); > > > if (p->mm && task_will_free_mem(p)) { > > > mark_oom_victim(p); > > > + try_oom_reaper(p); > > > task_unlock(p); > > > put_task_struct(p); > > > return; > > > @@ -873,6 +926,7 @@ bool out_of_memory(struct oom_control *oc) > > > if (current->mm && > > > (fatal_signal_pending(current) || task_will_free_mem(current))) { > > > mark_oom_victim(current); > > > + try_oom_reaper(current); > > > return true; > > > } > > > > > oom_reaper() will need to do "tsk->oom_reaper_list = NULL;" due to > > if (tsk == oom_reaper_list || tsk->oom_reaper_list) > return; > > test in wake_oom_reaper() if "[PATCH 3/3] mm, oom_reaper: clear > TIF_MEMDIE for all tasks queued for oom_reaper" will select the same > thread again. True, I will update my patch. > Though I think we should not allow the OOM killer to select the same > thread again. > > > > > Why don't you call try_oom_reaper() from the shortcuts in > > mem_cgroup_out_of_memory() as well? > > I looked at next-20160408 but I again came to think that we should remove > these shortcuts (something like a patch shown bottom). Feel free to send the patch with the full description. But I would really encourage you to check the history to learn why those have been added and describe why those concerns are not valid/important anymore. Your way of throwing a large patch based on an extreme load which is basically DoSing the machine is not the ideal one. I do respect your different opinion. 
It is quite possible that you are right here and that you can convince all the reviewers that your changes are safe. I would be more than happy to drop my smaller-steps approach then. But I will be honest with you: you haven't convinced me yet, and I have seen so many subtle issues in this code area that the risk is really non-trivial for any larger changes. This is the primary reason I am taking small steps, each focusing on a single improvement which can be argued about and is known to help a particular case without introducing a risk of different problems. I am not the maintainer, so it is not up to me to select the right approach. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 2/3] oom, oom_reaper: Try to reap tasks which skip regular OOM killer path 2016-04-08 11:50 ` Michal Hocko @ 2016-04-09 4:39 ` Tetsuo Handa 1 sibling, 0 replies; 40+ messages in thread From: Tetsuo Handa @ 2016-04-09 4:39 UTC (permalink / raw) To: mhocko; +Cc: linux-mm, rientjes, akpm, linux-kernel, oleg Michal Hocko wrote: > On Fri 08-04-16 20:19:28, Tetsuo Handa wrote: > > I looked at next-20160408 but I again came to think that we should remove > > these shortcuts (something like a patch shown bottom). > > feel free to send the patch with the full description. But I would > really encourage you to check the history to learn why those have been > added and describe why those concerns are not valid/important anymore. I believe that past discussions and decisions about the current code are too optimistic, because they did not take 'The "too small to fail" memory-allocation rule' problem into account. If you ignore me with "check the history to learn why those have been added and describe why those concerns are not valid/important anymore", I can do nothing. What are the valid/important concerns that have higher priority than keeping 'The "too small to fail" memory-allocation rule' problem unsolved and continuing to tell a lie to end users? Please enumerate such concerns. > Your way of throwing a large patch based on an extreme load which is > basically DoSing the machine is not the ideal one. You are not paying attention to the real-world limitations I'm facing. I have to waste my resources trying to identify and fix problems on behalf of customers before they determine the kernel version to use for their systems, for your way of thinking is that "We don't need to worry about it because I have never received such a report", while the reality for customers is that "I'm not skillful enough to catch the problematic behavior and make a reproducer" or "I have a little skill but I'm not permitted to modify my systems for reporting the problematic behavior". 
If you listen to me, I don't need to do such thing. It is very very sad. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 2/3] oom, oom_reaper: Try to reap tasks which skipregular OOM killer path 2016-04-09 4:39 ` Tetsuo Handa @ 2016-04-11 12:02 ` Michal Hocko -1 siblings, 0 replies; 40+ messages in thread From: Michal Hocko @ 2016-04-11 12:02 UTC (permalink / raw) To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm, linux-kernel, oleg On Sat 09-04-16 13:39:30, Tetsuo Handa wrote: > Michal Hocko wrote: > > On Fri 08-04-16 20:19:28, Tetsuo Handa wrote: > > > I looked at next-20160408 but I again came to think that we should remove > > > these shortcuts (something like a patch shown bottom). > > > > feel free to send the patch with the full description. But I would > > really encourage you to check the history to learn why those have been > > added and describe why those concerns are not valid/important anymore. > > I believe that past discussions and decisions about current code are too > optimistic because they did not take 'The "too small to fail" memory- > allocation rule' problem into account. In most cases they were driven by _real_ usecases though. And that is what matters. Theoretically possible issues which happen under crazy workloads which are DoSing the machine already are not something to optimize for. Sure we should try to cope with them as gracefully as possible, no questions about that, but we should try hard not to reintroduce previous issues during _sensible_ workloads. > If you ignore me with "check the history to learn why those have been added > and describe why those concerns are not valid/important anymore", I can do > nothing. What are valid/important concerns that have higher priority than > keeping 'The "too small to fail" memory-allocation rule' problem and continue > telling a lie to end users? Please enumerate such concerns. I feel like we are looping in a circle and I do not want to waste my time repeating arguments which were already mentioned several times. I have already told you that you have to justify potentially disruptive changes properly. 
So far you are more focused on extreme cases, while you do not seem to care all that much about those which happen most of the time. We surely do not want to regress there. If I am telling you to study the history of our heuristics, it is to _help_ you understand why they have been introduced, so that you can argue against the reasoning and/or come up with improvements. Unless you start doing this, chances are that your patches will not see an overly warm welcome. > > Your way of throwing a large patch based on an extreme load which is > > basically DoSing the machine is not the ideal one. > > You are not paying attention to real world's limitations I'm facing. So far I haven't seen any _real_world_ example from you, to be honest. All I can see is hammering the system with DoS scenarios which trigger different corner cases in the behavior. Those are good to make us think about our limitations and about longterm solutions. > I have to waste my resource trying to identify and fix on behalf of > customers before they determine the kernel version to use for their > systems, for your way of thinking is that "We don't need to worry about > it because I have never received such report" No, I am not saying that. I am saying that I have never seen a _properly_ configured system blow up in a way that would trigger the pathological cases you are mentioning. And that is a big difference. You can misconfigure your system in so many ways and bring it to its knees without a way out. With all due respect, I will not continue in this line of discussion because it doesn't lead anywhere. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 2/3] oom, oom_reaper: Try to reap tasks which skip regular OOM killer path 2016-04-11 12:02 ` Michal Hocko @ 2016-04-11 13:26 ` Tetsuo Handa -1 siblings, 0 replies; 40+ messages in thread From: Tetsuo Handa @ 2016-04-11 13:26 UTC (permalink / raw) To: mhocko; +Cc: linux-mm, rientjes, akpm, linux-kernel, oleg Michal Hocko wrote: > On Sat 09-04-16 13:39:30, Tetsuo Handa wrote: > > Michal Hocko wrote: > > > On Fri 08-04-16 20:19:28, Tetsuo Handa wrote: > > > > I looked at next-20160408 but I again came to think that we should remove > > > > these shortcuts (something like a patch shown bottom). > > > > > > feel free to send the patch with the full description. But I would > > > really encourage you to check the history to learn why those have been > > > added and describe why those concerns are not valid/important anymore. > > > > I believe that past discussions and decisions about current code are too > > optimistic because they did not take 'The "too small to fail" memory- > > allocation rule' problem into account. > > In most cases they were driven by _real_ usecases though. And that > is what matters. Theoretically possible issues which happen under > crazy workloads which are DoSing the machine already are not something > to optimize for. Sure we should try to cope with them as gracefully > as possible, no questions about that, but we should try hard not to > reintroduce previous issues during _sensible_ workloads. I'm not asking you to optimize for crazy workloads. None of my customers intentionally run crazy workloads, but they are getting silent hangups, and I suspect that something went wrong with memory management. But there is no evidence, because the memory management subsystem remains silent. You call my testcases a DoS, but there is no evidence that my customers are not hitting the same problems my testcases found. I'm suggesting that you at least emit diagnostic messages when something goes wrong. That is what kmallocwd is for.
And if you do not want to emit diagnostic messages, I'm fine with a timeout-based approach. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 2/3] oom, oom_reaper: Try to reap tasks which skip regular OOM killer path 2016-04-11 13:26 ` Tetsuo Handa @ 2016-04-11 13:43 ` Michal Hocko -1 siblings, 0 replies; 40+ messages in thread From: Michal Hocko @ 2016-04-11 13:43 UTC (permalink / raw) To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm, linux-kernel, oleg On Mon 11-04-16 22:26:09, Tetsuo Handa wrote: > Michal Hocko wrote: > > On Sat 09-04-16 13:39:30, Tetsuo Handa wrote: > > > Michal Hocko wrote: > > > > On Fri 08-04-16 20:19:28, Tetsuo Handa wrote: > > > > > I looked at next-20160408 but I again came to think that we should remove > > > > > these shortcuts (something like a patch shown bottom). > > > > > > > > feel free to send the patch with the full description. But I would > > > > really encourage you to check the history to learn why those have been > > > > added and describe why those concerns are not valid/important anymore. > > > > > > I believe that past discussions and decisions about current code are too > > > optimistic because they did not take 'The "too small to fail" memory- > > > allocation rule' problem into account. > > > > In most cases they were driven by _real_ usecases though. And that > > is what matters. Theoretically possible issues which happen under > > crazy workloads which are DoSing the machine already are not something > > to optimize for. Sure we should try to cope with them as gracefully > > as possible, no questions about that, but we should try hard not to > > reintroduce previous issues during _sensible_ workloads. > > I'm not requesting you to optimize for crazy workloads. None of my > customers intentionally put crazy workloads, but they are getting silent > hangups and I'm suspecting that something went wrong with memory management. There are many other possible reasons for these symptoms. Have you actually seen any _evidence_ that the hang they are seeing is due to an OOM deadlock, though?
A single crash dump or consistent sysrq output pointing in that direction would do. > But there is no evidence because memory management subsystem remains silent. > You call my testcases DoS, but there is no evidence that my customers > are not hitting the same problem my testcases found. This is really impossible to comment on. > I'm suggesting you to at least emit diagnostic messages when something went > wrong. That is what kmallocwd is for. And if you do not want to emit > diagnostic messages, I'm fine with timeout based approach. I am all for more diagnostics, but what you were proposing was so heavyweight that it doesn't really seem worth it. Anyway, yet again, this is getting largely off-topic... -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 2/3] oom, oom_reaper: Try to reap tasks which skip regular OOM killer path 2016-04-11 13:43 ` Michal Hocko @ 2016-04-13 11:08 ` Tetsuo Handa -1 siblings, 0 replies; 40+ messages in thread From: Tetsuo Handa @ 2016-04-13 11:08 UTC (permalink / raw) To: mhocko; +Cc: linux-mm, rientjes, akpm, linux-kernel, oleg Michal Hocko wrote: > There are many other possible reasons for these symptoms. Have you > actually seen any _evidence_ that the hang they are seeing is due to > an OOM deadlock, though? A single crash dump or consistent sysrq output > pointing in that direction would do. Yes. I saw several OOM livelock cases occur on customers' servers. One case where I was able to identify the cause was a request_module() local DoS ( https://bugzilla.redhat.com/show_bug.cgi?id=CVE-2012-4398 ). That server was running a Java-based enterprise application. The java process tried to create an IPv6 socket on a system where IPv6 was configured to be disabled, so request_module() was called from the socket() syscall every time in an attempt to load the ipv6.ko module; the OOM killer was invoked and the java process was selected as the OOM victim, and since request_module() was not killable, the system hit an OOM livelock. Another case I saw was interrupts from virtio being disabled due to a bug in qemu-kvm. Since the cron daemon continuously started cron jobs even after storage I/O began stalling (because qemu-kvm had stopped sending interrupts), all memory was consumed by cron jobs and async file write requests, and the OOM killer was invoked. And since a cron job which was selected as the OOM victim while trying to write to a file was unable to terminate because it was waiting for fs writeback, the system hit an OOM livelock. Yet another case I saw was a hangup where a process was blocked at down_read(&mm->mmap_sem) in __access_remote_vm() while reading /proc/pid/ entries. Since I had zero knowledge about OOM livelocks at that time, I was not able to tell whether it was an OOM livelock or not.
There would be some more, but I can't recall them because I left the support center one year ago and I have no chance to re-examine these cases. But in general, it is rare that I can find the OOM killer messages because their servers are force rebooted without capturing kdump or SysRq. Hints I can use are limited to /var/log/messages which lacks suspicious messages, /var/log/sa/ which shows that there was little free memory and /proc/sys/kernel/hung_task_warnings already being 0 (if sosreport is also provided). > > I'm suggesting you to at least emit diagnostic messages when something went > > wrong. That is what kmallocwd is for. And if you do not want to emit > > diagnostic messages, I'm fine with timeout based approach. > > I am all for more diagnostic but what you were proposing was so heavy > weight it doesn't really seem worth it. I suspect that the reason hung_task_warnings becomes 0 is related to use of the same watermark for GFP_KERNEL/GFP_NOFS/GFP_NOIO, but I can't ask customers to replace their kernels for debugging. So, the first step is to merge kmallocwd upstream, then wait until customers start using that kernel (it may be within a few months if they are about to develop a new server, but it may be 10 years away if they already decided not to update kernels for their servers' lifetime). > Anyway yet again this is getting largely off-topic... OK. I'll stop posting to this thread. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 2/3] oom, oom_reaper: Try to reap tasks which skip regular OOM killer path 2016-04-07 11:38 ` Tetsuo Handa @ 2016-04-08 11:34 ` Michal Hocko -1 siblings, 0 replies; 40+ messages in thread From: Michal Hocko @ 2016-04-08 11:34 UTC (permalink / raw) To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm, linux-kernel, oleg On Thu 07-04-16 20:38:43, Tetsuo Handa wrote: > Michal Hocko wrote: > > @@ -563,6 +582,53 @@ static void wake_oom_reaper(struct task_struct *tsk) > > wake_up(&oom_reaper_wait); > > } > > > > +/* Check if we can reap the given task. This has to be called with stable > > + * tsk->mm > > + */ > > +static void try_oom_reaper(struct task_struct *tsk) > > +{ > > + struct mm_struct *mm = tsk->mm; > > + struct task_struct *p; > > + > > + if (!mm) > > + return; > > + > > + /* > > + * There might be other threads/processes which are either not > > + * dying or even not killable. > > + */ > > + if (atomic_read(&mm->mm_users) > 1) { > > + rcu_read_lock(); > > + for_each_process(p) { > > + bool exiting; > > + > > + if (!process_shares_mm(p, mm)) > > + continue; > > + if (same_thread_group(p, tsk)) > > + continue; > > + if (fatal_signal_pending(p)) > > + continue; > > + > > + /* > > + * If the task is exiting make sure the whole thread group > > + * is exiting and cannot acces mm anymore. > > + */ > > + spin_lock_irq(&p->sighand->siglock); > > + exiting = signal_group_exit(p->signal); > > + spin_unlock_irq(&p->sighand->siglock); > > + if (exiting) > > + continue; > > + > > + /* Give up */ > > + rcu_read_unlock(); > > + return; > > + } > > + rcu_read_unlock(); > > + } > > + > > + wake_oom_reaper(tsk); > > +} > > + > > I think you want to change "try_oom_reaper() without wake_oom_reaper()" > as mm_is_reapable() and use it from oom_kill_process() in order to skip > p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN test which needlessly makes > can_oom_reap false. Not sure I understand the OOM_SCORE_ADJ_MIN part. 
We cannot reap the task if somebody sharing the mm is OOM_SCORE_ADJ_MIN. We have to check this in oom_kill_process because we are sending SIGKILL but we do not have to check for this explicitly in try_oom_reaper because we only care about exiting/killed tasks. [...] > > @@ -873,6 +926,7 @@ bool out_of_memory(struct oom_control *oc) > > if (current->mm && > > (fatal_signal_pending(current) || task_will_free_mem(current))) { > > mark_oom_victim(current); > > + try_oom_reaper(current); > > return true; > > } > > > > Why don't you call try_oom_reaper() from the shortcuts in > mem_cgroup_out_of_memory() as well? I have focused on the global case and the correctness for now. But I agree we can safely squash mem_cgroup_out_of_memory part into the patch as well. Thanks for pointing this out. > Why don't you embed try_oom_reaper() into mark_oom_victim() like I did at > http://lkml.kernel.org/r/201602052014.HBG52666.HFMOQVLFOSFJtO@I-love.SAKURA.ne.jp ? it didn't fit in the current flow of oom_kill_process where we do: do_send_sig_info(victim) mark_oom_victim(victim) kill_sharing_tasks so in the case of shared mm we wouldn't schedule the task for the reaper most likely because we have to kill them first. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 2/3] oom, oom_reaper: Try to reap tasks which skip regular OOM killer path 2016-04-06 14:13 ` Michal Hocko @ 2016-04-08 13:14 ` Michal Hocko -1 siblings, 0 replies; 40+ messages in thread From: Michal Hocko @ 2016-04-08 13:14 UTC (permalink / raw) To: Andrew Morton; +Cc: David Rientjes, Tetsuo Handa, linux-mm, LKML, Oleg Nesterov Andrew, could you fold this in? --- >From 9645b0518c20da67decfd2a94b1cc99ef1b4218a Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@suse.com> Date: Fri, 8 Apr 2016 15:11:59 +0200 Subject: [PATCH] mm, oom_reaper: Try to reap tasks which skip regular memcg OOM killer path no reason to include memcg oom killed tasks as well. Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.com> --- include/linux/oom.h | 8 ++++++++ mm/memcontrol.c | 1 + mm/oom_kill.c | 2 +- 3 files changed, 10 insertions(+), 1 deletion(-) diff --git a/include/linux/oom.h b/include/linux/oom.h index 628a43242a34..83b9c39bd8b7 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -72,6 +72,14 @@ static inline bool oom_task_origin(const struct task_struct *p) extern void mark_oom_victim(struct task_struct *tsk); +#ifdef CONFIG_MMU +extern void try_oom_reaper(struct task_struct *tsk); +#else +static inline void try_oom_reaper(struct task_struct *tsk) +{ +} +#endif + extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, const nodemask_t *nodemask, unsigned long totalpages); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 17a847c96618..6d2f699e4d3d 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1256,6 +1256,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, */ if (fatal_signal_pending(current) || task_will_free_mem(current)) { mark_oom_victim(current); + try_oom_reaper(current); goto unlock; } diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 74c38f5fffef..e656ee2c0be4 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -585,7 +585,7 
@@ static void wake_oom_reaper(struct task_struct *tsk) /* Check if we can reap the given task. This has to be called with stable * tsk->mm */ -static void try_oom_reaper(struct task_struct *tsk) +void try_oom_reaper(struct task_struct *tsk) { struct mm_struct *mm = tsk->mm; struct task_struct *p; -- 2.8.0.rc3 -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH 3/3] mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper 2016-04-06 14:13 ` Michal Hocko @ 2016-04-06 14:13 ` Michal Hocko -1 siblings, 0 replies; 40+ messages in thread From: Michal Hocko @ 2016-04-06 14:13 UTC (permalink / raw) To: linux-mm; +Cc: David Rientjes, Tetsuo Handa, Andrew Morton, LKML, Michal Hocko From: Michal Hocko <mhocko@suse.com> Right now the oom reaper will clear TIF_MEMDIE only for tasks which were successfully reaped. This is the safest option because we know that such an oom victim would only block forward progress of the oom killer without a good reason because it is highly unlikely it would release much more memory. Basically most of its memory has already been torn down. We can relax this assumption to catch more corner cases though. The first obvious one is when the oom victim clears its mm and gets stuck later on. oom_reaper would back off on find_lock_task_mm returning NULL. We can safely try to clear TIF_MEMDIE in this case because such a task would be ignored by the oom killer anyway. The flag would be cleared by that time already most of the time anyway. The less obvious one is when the oom reaper fails due to mmap_sem contention. Even if we clear TIF_MEMDIE for this task then it is not very likely that we would select another task too easily because we haven't reaped the last victim and so it would still be the #1 candidate. There is a rare race condition possible when the current victim terminates before the next select_bad_process but considering that oom_reap_task had retried several times before giving up then this sounds like a borderline thing. After this patch we should have a guarantee that the OOM killer will not be blocked for an unbounded amount of time in most cases. 
Signed-off-by: Michal Hocko <mhocko@suse.com> --- mm/oom_kill.c | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 74c38f5fffef..7098104b7475 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -510,14 +510,10 @@ static bool __oom_reap_task(struct task_struct *tsk) up_read(&mm->mmap_sem); /* - * Clear TIF_MEMDIE because the task shouldn't be sitting on a - * reasonably reclaimable memory anymore. OOM killer can continue - * by selecting other victim if unmapping hasn't led to any - * improvements. This also means that selecting this task doesn't - * make any sense. + * This task can be safely ignored because we cannot do much more + * to release its memory. */ tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN; - exit_oom_victim(tsk); out: mmput(mm); return ret; @@ -538,6 +534,14 @@ static void oom_reap_task(struct task_struct *tsk) debug_show_all_locks(); } + /* + * Clear TIF_MEMDIE because the task shouldn't be sitting on a + * reasonably reclaimable memory anymore or it is not a good candidate + * for the oom victim right now because it cannot release its memory + * itself nor by the oom reaper. + */ + exit_oom_victim(tsk); + /* Drop a reference taken by wake_oom_reaper */ put_task_struct(tsk); } -- 2.8.0.rc3 ^ permalink raw reply related [flat|nested] 40+ messages in thread
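The control-flow change of this patch — the TIF_MEMDIE-equivalent flag is now cleared once the victim has been processed, whether or not the reap succeeded — can be sketched in user-space C. The names and the retry constant below are illustrative stand-ins, not the kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of oom_reap_task() after this patch: retry the reap
 * a bounded number of times, then clear the victim flag unconditionally,
 * so the OOM killer is never blocked forever on a victim the reaper has
 * given up on. */
#define MAX_OOM_REAP_RETRIES 10   /* made-up value for the sketch */

struct victim {
    bool memdie;    /* stands in for TIF_MEMDIE */
    bool reapable;  /* whether a reap attempt can succeed */
};

static bool reap_once(struct victim *v)
{
    return v->reapable;   /* e.g. fails on mmap_sem contention */
}

static void oom_reap_task(struct victim *v)
{
    int attempts = 0;

    while (attempts++ < MAX_OOM_REAP_RETRIES && !reap_once(v))
        ;   /* the real code sleeps between retries */

    /* Cleared here now, even after a failed reap (previously this
     * happened only on the success path in __oom_reap_task). */
    v->memdie = false;
}
```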
* Re: [PATCH 3/3] mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper 2016-04-06 14:13 ` Michal Hocko @ 2016-04-07 11:55 ` Tetsuo Handa -1 siblings, 0 replies; 40+ messages in thread From: Tetsuo Handa @ 2016-04-07 11:55 UTC (permalink / raw) To: mhocko, linux-mm; +Cc: rientjes, akpm, linux-kernel, mhocko Michal Hocko wrote: > The first obvious one is when the oom victim clears its mm and gets > stuck later on. oom_reaper would back of on find_lock_task_mm returning > NULL. We can safely try to clear TIF_MEMDIE in this case because such a > task would be ignored by the oom killer anyway. The flag would be > cleared by that time already most of the time anyway. I didn't understand what this wants to tell. The OOM victim will clear TIF_MEMDIE as soon as it sets current->mm = NULL. Even if the oom victim clears its mm and gets stuck later on (e.g. at exit_task_work()), TIF_MEMDIE was already cleared by that moment by the OOM victim. > > The less obvious one is when the oom reaper fails due to mmap_sem > contention. Even if we clear TIF_MEMDIE for this task then it is not > very likely that we would select another task too easily because > we haven't reaped the last victim and so it would be still the #1 > candidate. There is a rare race condition possible when the current > victim terminates before the next select_bad_process but considering > that oom_reap_task had retried several times before giving up then > this sounds like a borderline thing. Is it helpful? Allowing the OOM killer to select the same thread again simply makes the kernel log buffer flooded with the OOM kill messages. I think we should not allow the OOM killer to select the same thread again by e.g. doing tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN regardless of whether reaping that thread's memory succeeded or not. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 3/3] mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper 2016-04-07 11:55 ` Tetsuo Handa @ 2016-04-08 11:34 ` Michal Hocko -1 siblings, 0 replies; 40+ messages in thread From: Michal Hocko @ 2016-04-08 11:34 UTC (permalink / raw) To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm, linux-kernel On Thu 07-04-16 20:55:34, Tetsuo Handa wrote: > Michal Hocko wrote: > > The first obvious one is when the oom victim clears its mm and gets > > stuck later on. oom_reaper would back of on find_lock_task_mm returning > > NULL. We can safely try to clear TIF_MEMDIE in this case because such a > > task would be ignored by the oom killer anyway. The flag would be > > cleared by that time already most of the time anyway. > > I didn't understand what this wants to tell. The OOM victim will clear > TIF_MEMDIE as soon as it sets current->mm = NULL. No it clears the flag _after_ it returns from mmput. There is no guarantee it won't get stuck somewhere on the way there - e.g. exit_aio waits for completion and who knows what else might get stuck. > Even if the oom victim > clears its mm and gets stuck later on (e.g. at exit_task_work()), > TIF_MEMDIE was already cleared by that moment by the OOM victim. > > > > > The less obvious one is when the oom reaper fails due to mmap_sem > > contention. Even if we clear TIF_MEMDIE for this task then it is not > > very likely that we would select another task too easily because > > we haven't reaped the last victim and so it would be still the #1 > > candidate. There is a rare race condition possible when the current > > victim terminates before the next select_bad_process but considering > > that oom_reap_task had retried several times before giving up then > > this sounds like a borderline thing. > > Is it helpful? Allowing the OOM killer to select the same thread again > simply makes the kernel log buffer flooded with the OOM kill messages. I am trying to be as conservative as possible here. 
The likelihood of mmap_sem contention will be reduced considerably once my down_write_killable series gets merged. If this turns out to be a problem (trivial to spot because the same task will be killed again) then we can think about a fix for that (e.g. ignore the task if it has been selected more than N times). > I think we should not allow the OOM killer to select the same thread again > by e.g. doing tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN regardless of > whether reaping that thread's memory succeeded or not. I think this comes with some risk and so it should go as a separate patch with a full justification why the outcome is better. Especially after the mmap_sem contention is reduced by other means. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 3/3] mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper 2016-04-08 11:34 ` Michal Hocko (?) @ 2016-04-16 2:51 ` Tetsuo Handa 2016-04-17 11:54 ` Michal Hocko -1 siblings, 1 reply; 40+ messages in thread From: Tetsuo Handa @ 2016-04-16 2:51 UTC (permalink / raw) To: mhocko; +Cc: linux-mm, rientjes, akpm Michal Hocko wrote: > On Thu 07-04-16 20:55:34, Tetsuo Handa wrote: > > Michal Hocko wrote: > > > The first obvious one is when the oom victim clears its mm and gets > > > stuck later on. oom_reaper would back of on find_lock_task_mm returning > > > NULL. We can safely try to clear TIF_MEMDIE in this case because such a > > > task would be ignored by the oom killer anyway. The flag would be > > > cleared by that time already most of the time anyway. > > > > I didn't understand what this wants to tell. The OOM victim will clear > > TIF_MEMDIE as soon as it sets current->mm = NULL. > > No it clears the flag _after_ it returns from mmput. There is no > guarantee it won't get stuck somewhere on the way there - e.g. exit_aio > waits for completion and who knows what else might get stuck. OK. Then, I think an OOM livelock scenario shown below is possible. (1) First OOM victim (where mm->mm_users == 1) is selected by the first round of out_of_memory() call. (2) The OOM reaper calls atomic_inc_not_zero(&mm->mm_users). (3) The OOM victim calls mmput() from exit_mm() from do_exit(). mmput() returns immediately because atomic_dec_and_test(&mm->mm_users) returns false because of (2). (4) The OOM reaper reaps memory and then calls mmput(). mmput() calls exit_aio() etc. and waits for completion because atomic_dec_and_test(&mm->mm_users) is now true. (5) Second OOM victim (which is the parent of the first OOM victim) is selected by the next round of out_of_memory() call. (6) The OOM reaper is waiting for completion of the first OOM victim's memory while the second OOM victim is waiting for the OOM reaper to reap memory. 
(6) The OOM reaper is waiting for completion of the first OOM victim's memory while the second OOM victim is waiting for the OOM reaper to reap memory. Where is the guarantee that exit_aio() etc. called from mmput() by the OOM reaper does not depend on memory allocation (i.e. the OOM reaper is not blocked forever inside __oom_reap_task())? ^ permalink raw reply [flat|nested] 40+ messages in thread
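The mm_users handoff at the heart of this scenario can be demonstrated with C11 atomics in user space: whoever drops the last reference runs the expensive teardown, so the reaper's extra reference shifts exit_aio() and friends into the reaper's context. The struct and function names below are hypothetical simplifications:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Minimal stand-in for the kernel's mm_users refcounting. */
struct mm {
    atomic_int mm_users;
    bool torn_down;
};

/* Like the kernel's atomic_inc_not_zero(): pin the mm unless it is
 * already on its way out. */
static bool mmget_not_zero(struct mm *mm)
{
    int old = atomic_load(&mm->mm_users);
    while (old != 0)
        if (atomic_compare_exchange_weak(&mm->mm_users, &old, old + 1))
            return true;
    return false;
}

/* Returns true when this caller dropped the final reference and hence
 * performed the slow teardown (exit_aio() etc. in the kernel). */
static bool mmput(struct mm *mm)
{
    if (atomic_fetch_sub(&mm->mm_users, 1) == 1) {
        mm->torn_down = true;
        return true;
    }
    return false;
}
```

Walking the numbered steps above through this model shows the victim's mmput() returning early and the reaper inheriting the blocking teardown.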
* Re: [PATCH 3/3] mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper 2016-04-16 2:51 ` Tetsuo Handa @ 2016-04-17 11:54 ` Michal Hocko 2016-04-18 11:59 ` Tetsuo Handa 0 siblings, 1 reply; 40+ messages in thread From: Michal Hocko @ 2016-04-17 11:54 UTC (permalink / raw) To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm On Sat 16-04-16 11:51:11, Tetsuo Handa wrote: > Michal Hocko wrote: > > On Thu 07-04-16 20:55:34, Tetsuo Handa wrote: > > > Michal Hocko wrote: > > > > The first obvious one is when the oom victim clears its mm and gets > > > > stuck later on. oom_reaper would back of on find_lock_task_mm returning > > > > NULL. We can safely try to clear TIF_MEMDIE in this case because such a > > > > task would be ignored by the oom killer anyway. The flag would be > > > > cleared by that time already most of the time anyway. > > > > > > I didn't understand what this wants to tell. The OOM victim will clear > > > TIF_MEMDIE as soon as it sets current->mm = NULL. > > > > No it clears the flag _after_ it returns from mmput. There is no > > guarantee it won't get stuck somewhere on the way there - e.g. exit_aio > > waits for completion and who knows what else might get stuck. > > OK. Then, I think an OOM livelock scenario shown below is possible. > > (1) First OOM victim (where mm->mm_users == 1) is selected by the first > round of out_of_memory() call. > > (2) The OOM reaper calls atomic_inc_not_zero(&mm->mm_users). > > (3) The OOM victim calls mmput() from exit_mm() from do_exit(). > mmput() returns immediately because atomic_dec_and_test(&mm->mm_users) > returns false because of (2). > > (4) The OOM reaper reaps memory and then calls mmput(). > mmput() calls exit_aio() etc. and waits for completion because > atomic_dec_and_test(&mm->mm_users) is now true. > > (5) Second OOM victim (which is the parent of the first OOM victim) > is selected by the next round of out_of_memory() call. 
> > (6) The OOM reaper is waiting for completion of the first OOM victim's > memory while the second OOM victim is waiting for the OOM reaper to > reap memory. > > Where is the guarantee that exit_aio() etc. called from mmput() by the > OOM reaper does not depend on memory allocation (i.e. the OOM reaper is > not blocked forever inside __oom_reap_task())? You should realize that the mmput is called _after_ we have reclaimed victim's address space. So there should be some memory freed by that time which reduce the likelyhood of a lockup due to memory allocation request if it is really needed for exit_aio. But you have a good point here. We want to strive for robustness of oom_reaper as much as possible. We have dropped the munlock patch because of the robustness so I guess we want this to be fixed as well. The reason for blocking might be different from memory pressure I guess. Here is what should work - I have only compile tested it. I will prepare the proper patch later this week with other oom reaper patches or after I come back from LSF/MM. --- diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 624b78b848b8..5113e0e7e8ef 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -509,6 +509,7 @@ struct mm_struct { #ifdef CONFIG_HUGETLB_PAGE atomic_long_t hugetlb_usage; #endif + struct work_struct async_put_work; }; static inline void mm_init_cpumask(struct mm_struct *mm) diff --git a/include/linux/sched.h b/include/linux/sched.h index 7bd0fa9db199..df8778e72211 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2604,6 +2604,11 @@ static inline void mmdrop(struct mm_struct * mm) /* mmput gets rid of the mappings and all user-space */ extern void mmput(struct mm_struct *); +/* same as above but performs the slow path from the async kontext. 
Can + * be called from the atomic context as well + */ +extern void mmput_async(struct mm_struct *); + /* Grab a reference to a task's mm, if it is not already going away */ extern struct mm_struct *get_task_mm(struct task_struct *task); /* diff --git a/kernel/fork.c b/kernel/fork.c index accb7221d547..10b0f771d795 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -696,6 +696,26 @@ void __mmdrop(struct mm_struct *mm) } EXPORT_SYMBOL_GPL(__mmdrop); +static inline void __mmput(struct mm_struct *mm) +{ + VM_BUG_ON(atomic_read(&mm->mm_users)); + + uprobe_clear_state(mm); + exit_aio(mm); + ksm_exit(mm); + khugepaged_exit(mm); /* must run before exit_mmap */ + exit_mmap(mm); + set_mm_exe_file(mm, NULL); + if (!list_empty(&mm->mmlist)) { + spin_lock(&mmlist_lock); + list_del(&mm->mmlist); + spin_unlock(&mmlist_lock); + } + if (mm->binfmt) + module_put(mm->binfmt->module); + mmdrop(mm); +} + /* * Decrement the use count and release all resources for an mm. */ @@ -703,24 +723,24 @@ void mmput(struct mm_struct *mm) { might_sleep(); + if (atomic_dec_and_test(&mm->mm_users)) + __mmput(mm); +} +EXPORT_SYMBOL_GPL(mmput); + +static void mmput_async_fn(struct work_struct *work) +{ + struct mm_struct *mm = container_of(work, struct mm_struct, async_put_work); + __mmput(mm); +} + +void mmput_async(struct mm_struct *mm) +{ if (atomic_dec_and_test(&mm->mm_users)) { - uprobe_clear_state(mm); - exit_aio(mm); - ksm_exit(mm); - khugepaged_exit(mm); /* must run before exit_mmap */ - exit_mmap(mm); - set_mm_exe_file(mm, NULL); - if (!list_empty(&mm->mmlist)) { - spin_lock(&mmlist_lock); - list_del(&mm->mmlist); - spin_unlock(&mmlist_lock); - } - if (mm->binfmt) - module_put(mm->binfmt->module); - mmdrop(mm); + INIT_WORK(&mm->async_put_work, mmput_async_fn); + schedule_work(&mm->async_put_work); } } -EXPORT_SYMBOL_GPL(mmput); /** * set_mm_exe_file - change a reference to the mm's executable file diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 44683a2f8fa7..65f2acbaad29 100644 --- 
a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -446,7 +446,6 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait); static struct task_struct *oom_reaper_list; static DEFINE_SPINLOCK(oom_reaper_lock); - static bool __oom_reap_task(struct task_struct *tsk) { struct mmu_gather tlb; @@ -520,7 +519,12 @@ static bool __oom_reap_task(struct task_struct *tsk) */ set_bit(MMF_OOM_REAPED, &mm->flags); out: - mmput(mm); + /* + * Drop our reference but make sure the mmput slow path is called from a + * different context because we shouldn't risk we get stuck there and + * put the oom_reaper out of the way. + */ + mmput_async(mm); return ret; } -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 40+ messages in thread
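The idea behind mmput_async() above — when the reaper's reference turns out to be the last one, hand the potentially blocking teardown to a different execution context rather than doing it in the reaper — can be sketched in user-space C, with a detached-by-join pthread standing in for the kernel workqueue. All names here are illustrative, not the kernel API:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space analog of the mmput/mmput_async split. */
struct mm {
    atomic_int mm_users;
    atomic_bool torn_down;
    pthread_t worker;
};

static void *teardown_fn(void *arg)
{
    struct mm *mm = arg;
    /* The part that may block for a long time (exit_aio(), exit_mmap(),
     * ...) and therefore must not run in the reaper's own context. */
    atomic_store(&mm->torn_down, true);
    return NULL;
}

/* Drop a reference; if it was the last one, queue the slow teardown on
 * another thread and return immediately (the kernel patch uses
 * schedule_work() on a work item embedded in the mm instead). */
static void mmput_async(struct mm *mm)
{
    if (atomic_fetch_sub(&mm->mm_users, 1) == 1)
        pthread_create(&mm->worker, NULL, teardown_fn, mm);
}
```

The design point is the same one the patch comment makes: the caller (the reaper) never blocks in the slow path, at the cost of the teardown running at some later, unpredictable time.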
* Re: [PATCH 3/3] mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper
2016-04-17 11:54 ` Michal Hocko @ 2016-04-18 11:59 ` Tetsuo Handa
2016-04-19 14:17 ` Michal Hocko
0 siblings, 1 reply; 40+ messages in thread
From: Tetsuo Handa @ 2016-04-18 11:59 UTC (permalink / raw)
To: mhocko; +Cc: linux-mm, rientjes, akpm

Michal Hocko wrote:
> On Sat 16-04-16 11:51:11, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Thu 07-04-16 20:55:34, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > The first obvious one is when the oom victim clears its mm and gets
> > > > > stuck later on. oom_reaper would back off on find_lock_task_mm returning
> > > > > NULL. We can safely try to clear TIF_MEMDIE in this case because such a
> > > > > task would be ignored by the oom killer anyway. The flag would be
> > > > > cleared by that time already most of the time anyway.
> > > >
> > > > I didn't understand what this wants to tell. The OOM victim will clear
> > > > TIF_MEMDIE as soon as it sets current->mm = NULL.
> > >
> > > No, it clears the flag _after_ it returns from mmput. There is no
> > > guarantee it won't get stuck somewhere on the way there - e.g. exit_aio
> > > waits for completion and who knows what else might get stuck.
> >
> > OK. Then, I think an OOM livelock scenario shown below is possible.
> >
> > (1) First OOM victim (where mm->mm_users == 1) is selected by the first
> >     round of out_of_memory() call.
> >
> > (2) The OOM reaper calls atomic_inc_not_zero(&mm->mm_users).
> >
> > (3) The OOM victim calls mmput() from exit_mm() from do_exit().
> >     mmput() returns immediately because atomic_dec_and_test(&mm->mm_users)
> >     returns false because of (2).
> >
> > (4) The OOM reaper reaps memory and then calls mmput().
> >     mmput() calls exit_aio() etc. and waits for completion because
> >     atomic_dec_and_test(&mm->mm_users) is now true.
> >
> > (5) Second OOM victim (which is the parent of the first OOM victim)
> >     is selected by the next round of out_of_memory() call.
> >
> > (6) The OOM reaper is waiting for completion of the first OOM victim's
> >     memory while the second OOM victim is waiting for the OOM reaper to
> >     reap memory.
> >
> > Where is the guarantee that exit_aio() etc. called from mmput() by the
> > OOM reaper does not depend on memory allocation (i.e. the OOM reaper is
> > not blocked forever inside __oom_reap_task())?
>
> You should realize that the mmput is called _after_ we have reclaimed
> victim's address space. So there should be some memory freed by that
> time which reduces the likelihood of a lockup due to memory allocation
> request if it is really needed for exit_aio.

Not always true. mmput() is called even when down_read_trylock(&mm->mmap_sem)
failed. It is possible that the OOM victim was about to call
up_write(&mm->mmap_sem) when down_read_trylock(&mm->mmap_sem) failed, and it
is possible that the OOM victim runs until returning from mmput() from
exit_mm() from do_exit() while the OOM reaper was preempted between
down_read_trylock(&mm->mmap_sem) and mmput(). Under such a race, the OOM
reaper will call mmput() without having reclaimed the OOM victim's address
space.

> But you have a good point here. We want to strive for robustness of
> oom_reaper as much as possible. We have dropped the munlock patch because
> of the robustness so I guess we want this to be fixed as well. The
> reason for blocking might be different from memory pressure I guess.

The reality of races and dependencies is more complicated than we can imagine.

> Here is what should work - I have only compile tested it. I will prepare
> the proper patch later this week with other oom reaper patches or after
> I come back from LSF/MM.

Excuse me, but is system_wq suitable for queuing operations which may take
an unpredictable duration to flush?

system_wq is the one used by schedule[_delayed]_work[_on]().
Multi-CPU, multi-threaded. There are users which expect a relatively
short queue flush time. Don't queue works which can run for too long.

Many users, including SysRq-f, depend on system_wq being flushed shortly.
We haven't guaranteed that SysRq-f can always fire and select a different OOM
victim, but you proposed always clearing TIF_MEMDIE without considering the
possibility of the OOM victim with mmap_sem held for write being stuck in an
unkillable wait.

I wonder about your definition of "robustness". You are almost always missing
the worst scenario. You are trying to manage OOM without defining a default:
label in a switch statement. I don't think your approach is robust.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org

^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 3/3] mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper
2016-04-18 11:59 ` Tetsuo Handa @ 2016-04-19 14:17 ` Michal Hocko
2016-04-19 15:07 ` Tetsuo Handa
0 siblings, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2016-04-19 14:17 UTC (permalink / raw)
To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm

On Mon 18-04-16 20:59:51, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Sat 16-04-16 11:51:11, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Thu 07-04-16 20:55:34, Tetsuo Handa wrote:
> > > > > Michal Hocko wrote:
> > > > > > The first obvious one is when the oom victim clears its mm and gets
> > > > > > stuck later on. oom_reaper would back off on find_lock_task_mm returning
> > > > > > NULL. We can safely try to clear TIF_MEMDIE in this case because such a
> > > > > > task would be ignored by the oom killer anyway. The flag would be
> > > > > > cleared by that time already most of the time anyway.
> > > > >
> > > > > I didn't understand what this wants to tell. The OOM victim will clear
> > > > > TIF_MEMDIE as soon as it sets current->mm = NULL.
> > > >
> > > > No, it clears the flag _after_ it returns from mmput. There is no
> > > > guarantee it won't get stuck somewhere on the way there - e.g. exit_aio
> > > > waits for completion and who knows what else might get stuck.
> > >
> > > OK. Then, I think an OOM livelock scenario shown below is possible.
> > >
> > > (1) First OOM victim (where mm->mm_users == 1) is selected by the first
> > >     round of out_of_memory() call.
> > >
> > > (2) The OOM reaper calls atomic_inc_not_zero(&mm->mm_users).
> > >
> > > (3) The OOM victim calls mmput() from exit_mm() from do_exit().
> > >     mmput() returns immediately because atomic_dec_and_test(&mm->mm_users)
> > >     returns false because of (2).
> > >
> > > (4) The OOM reaper reaps memory and then calls mmput().
> > >     mmput() calls exit_aio() etc. and waits for completion because
> > >     atomic_dec_and_test(&mm->mm_users) is now true.
> > >
> > > (5) Second OOM victim (which is the parent of the first OOM victim)
> > >     is selected by the next round of out_of_memory() call.
> > >
> > > (6) The OOM reaper is waiting for completion of the first OOM victim's
> > >     memory while the second OOM victim is waiting for the OOM reaper to
> > >     reap memory.
> > >
> > > Where is the guarantee that exit_aio() etc. called from mmput() by the
> > > OOM reaper does not depend on memory allocation (i.e. the OOM reaper is
> > > not blocked forever inside __oom_reap_task())?
> >
> > You should realize that the mmput is called _after_ we have reclaimed
> > victim's address space. So there should be some memory freed by that
> > time which reduces the likelihood of a lockup due to memory allocation
> > request if it is really needed for exit_aio.
>
> Not always true. mmput() is called even when down_read_trylock(&mm->mmap_sem)
> failed. It is possible that the OOM victim was about to call
> up_write(&mm->mmap_sem) when down_read_trylock(&mm->mmap_sem) failed, and it
> is possible that the OOM victim runs until returning from mmput() from
> exit_mm() from do_exit() while the OOM reaper was preempted between
> down_read_trylock(&mm->mmap_sem) and mmput(). Under such a race, the OOM
> reaper will call mmput() without having reclaimed the OOM victim's address
> space.

You are right! For some reason I have missed that.

> > But you have a good point here. We want to strive for robustness of
> > oom_reaper as much as possible. We have dropped the munlock patch because
> > of the robustness so I guess we want this to be fixed as well. The
> > reason for blocking might be different from memory pressure I guess.
>
> The reality of races and dependencies is more complicated than we can imagine.
>
> > Here is what should work - I have only compile tested it. I will prepare
> > the proper patch later this week with other oom reaper patches or after
> > I come back from LSF/MM.
>
> Excuse me, but is system_wq suitable for queuing operations which may take
> an unpredictable duration to flush?
>
> system_wq is the one used by schedule[_delayed]_work[_on]().
> Multi-CPU, multi-threaded. There are users which expect a relatively
> short queue flush time. Don't queue works which can run for too long.

An alternative would be using a dedicated WQ with WQ_MEM_RECLAIM, which I am
not really sure would be justified considering we are talking about a highly
unlikely event. You do not want to consume resources permanently for an
unlikely, non-fatal event.

> Many users, including SysRq-f, depend on system_wq being flushed shortly.

Critical work shouldn't really rely on system_wq, full stop. There is just too
much going on on that WQ and it is simply impossible to guarantee anything.

> We haven't guaranteed that SysRq-f can always fire and select a different OOM
> victim, but you proposed always clearing TIF_MEMDIE without considering the
> possibility of the OOM victim with mmap_sem held for write being stuck in an
> unkillable wait.
>
> I wonder about your definition of "robustness". You are almost always missing
> the worst scenario. You are trying to manage OOM without defining a default:
> label in a switch statement. I don't think your approach is robust.

I am trying to be as robust as it is viable. You have to realize we are in the
catastrophic path already and there is simply no deterministic way out.

--
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 3/3] mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper
2016-04-19 14:17 ` Michal Hocko @ 2016-04-19 15:07 ` Tetsuo Handa
2016-04-19 19:32 ` Michal Hocko
0 siblings, 1 reply; 40+ messages in thread
From: Tetsuo Handa @ 2016-04-19 15:07 UTC (permalink / raw)
To: mhocko; +Cc: linux-mm, rientjes, akpm

Michal Hocko wrote:
> On Mon 18-04-16 20:59:51, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > Here is what should work - I have only compile tested it. I will prepare
> > > the proper patch later this week with other oom reaper patches or after
> > > I come back from LSF/MM.
> >
> > Excuse me, but is system_wq suitable for queuing operations which may take
> > an unpredictable duration to flush?
> >
> > system_wq is the one used by schedule[_delayed]_work[_on]().
> > Multi-CPU, multi-threaded. There are users which expect a relatively
> > short queue flush time. Don't queue works which can run for too long.
>
> An alternative would be using a dedicated WQ with WQ_MEM_RECLAIM, which I
> am not really sure would be justified considering we are talking about a
> highly unlikely event. You do not want to consume resources permanently
> for an unlikely, non-fatal event.

Yes; the reason SysRq-f is still not using a dedicated WQ with WQ_MEM_RECLAIM
will be the same.

> > We haven't guaranteed that SysRq-f can always fire and select a different OOM
> > victim, but you proposed always clearing TIF_MEMDIE without considering the
> > possibility of the OOM victim with mmap_sem held for write being stuck in an
> > unkillable wait.
> >
> > I wonder about your definition of "robustness". You are almost always missing
> > the worst scenario. You are trying to manage OOM without defining a default:
> > label in a switch statement. I don't think your approach is robust.
>
> I am trying to be as robust as it is viable. You have to realize we are
> in the catastrophic path already and there is simply no deterministic
> way out.

I know we are talking about the catastrophic situation. Since you insist on a
deterministic approach, we are struggling so much. If you tolerate the
http://lkml.kernel.org/r/201604152111.JBD95763.LMFOOHQOtFSFJV@I-love.SAKURA.ne.jp
approach as the fastpath (deterministic but could fail) and the
http://lkml.kernel.org/r/201604200006.FBG45192.SOHFQJFOOLFMtV@I-love.SAKURA.ne.jp
approach as the slowpath (non-deterministic but never fails), we don't need to
use a dedicated WQ with WQ_MEM_RECLAIM for avoiding this mmput() trap and the
SysRq-f trap. What a simple answer. ;-)

^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 3/3] mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper
2016-04-19 15:07 ` Tetsuo Handa @ 2016-04-19 19:32 ` Michal Hocko
0 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2016-04-19 19:32 UTC (permalink / raw)
To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm

On Wed 20-04-16 00:07:50, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Mon 18-04-16 20:59:51, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > Here is what should work - I have only compile tested it. I will prepare
> > > > the proper patch later this week with other oom reaper patches or after
> > > > I come back from LSF/MM.
> > >
> > > Excuse me, but is system_wq suitable for queuing operations which may take
> > > an unpredictable duration to flush?
> > >
> > > system_wq is the one used by schedule[_delayed]_work[_on]().
> > > Multi-CPU, multi-threaded. There are users which expect a relatively
> > > short queue flush time. Don't queue works which can run for too long.
> >
> > An alternative would be using a dedicated WQ with WQ_MEM_RECLAIM, which I
> > am not really sure would be justified considering we are talking about a
> > highly unlikely event. You do not want to consume resources permanently
> > for an unlikely, non-fatal event.
>
> Yes; the reason SysRq-f is still not using a dedicated WQ with WQ_MEM_RECLAIM
> will be the same.

sysrq+f can use the oom_reaper kernel thread.

--
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 3/3] mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper
2016-04-06 14:13 ` Michal Hocko @ 2016-04-08 13:07 ` Michal Hocko
-1 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2016-04-08 13:07 UTC (permalink / raw)
To: Andrew Morton; +Cc: David Rientjes, Tetsuo Handa, linux-mm, LKML

Andrew, could you please fold this in?
---
>From 4e23df3584966f6d7885245f3071c15669b8a5fc Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 8 Apr 2016 15:04:29 +0200
Subject: [PATCH] mm, oom_reaper: clear oom_reaper_list before clearing
 TIF_MEMDIE

As per Tetsuo:

: oom_reaper() will need to do "tsk->oom_reaper_list = NULL;" due to
:
:	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
:		return;
:
: test in wake_oom_reaper() if "[PATCH 3/3] mm, oom_reaper: clear
: TIF_MEMDIE for all tasks queued for oom_reaper" will select the same
: thread again

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7098104b7475..ca34036f3ae1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -540,6 +540,7 @@ static void oom_reap_task(struct task_struct *tsk)
 	 * for the oom victim right now because it cannot release its memory
 	 * itself nor by the oom reaper.
 	 */
+	tsk->oom_reaper_list = NULL;
 	exit_oom_victim(tsk);
 
 	/* Drop a reference taken by wake_oom_reaper */
--
2.8.0.rc3

--
Michal Hocko
SUSE Labs

^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH 3/3] mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper @ 2016-04-08 13:07 ` Michal Hocko 0 siblings, 0 replies; 40+ messages in thread From: Michal Hocko @ 2016-04-08 13:07 UTC (permalink / raw) To: Andrew Morton; +Cc: David Rientjes, Tetsuo Handa, linux-mm, LKML Andrew, could you please fold this in? --- ^ permalink raw reply [flat|nested] 40+ messages in thread
end of thread, other threads:[~2016-04-19 19:32 UTC | newest] Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-04-06 14:13 [PATCH 0/3] oom reaper follow ups v1 Michal Hocko 2016-04-06 14:13 ` Michal Hocko 2016-04-06 14:13 ` [PATCH 1/3] mm, oom: move GFP_NOFS check to out_of_memory Michal Hocko 2016-04-06 14:13 ` Michal Hocko 2016-04-06 14:13 ` [PATCH 2/3] oom, oom_reaper: Try to reap tasks which skip regular OOM killer path Michal Hocko 2016-04-06 14:13 ` Michal Hocko 2016-04-07 11:38 ` Tetsuo Handa 2016-04-07 11:38 ` Tetsuo Handa 2016-04-08 11:19 ` Tetsuo Handa 2016-04-08 11:19 ` Tetsuo Handa 2016-04-08 11:50 ` Michal Hocko 2016-04-08 11:50 ` Michal Hocko 2016-04-09 4:39 ` [PATCH 2/3] oom, oom_reaper: Try to reap tasks which skipregular " Tetsuo Handa 2016-04-09 4:39 ` Tetsuo Handa 2016-04-11 12:02 ` Michal Hocko 2016-04-11 12:02 ` Michal Hocko 2016-04-11 13:26 ` [PATCH 2/3] oom, oom_reaper: Try to reap tasks which skip regular " Tetsuo Handa 2016-04-11 13:26 ` Tetsuo Handa 2016-04-11 13:43 ` Michal Hocko 2016-04-11 13:43 ` Michal Hocko 2016-04-13 11:08 ` Tetsuo Handa 2016-04-13 11:08 ` Tetsuo Handa 2016-04-08 11:34 ` Michal Hocko 2016-04-08 11:34 ` Michal Hocko 2016-04-08 13:14 ` Michal Hocko 2016-04-08 13:14 ` Michal Hocko 2016-04-06 14:13 ` [PATCH 3/3] mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper Michal Hocko 2016-04-06 14:13 ` Michal Hocko 2016-04-07 11:55 ` Tetsuo Handa 2016-04-07 11:55 ` Tetsuo Handa 2016-04-08 11:34 ` Michal Hocko 2016-04-08 11:34 ` Michal Hocko 2016-04-16 2:51 ` Tetsuo Handa 2016-04-17 11:54 ` Michal Hocko 2016-04-18 11:59 ` Tetsuo Handa 2016-04-19 14:17 ` Michal Hocko 2016-04-19 15:07 ` Tetsuo Handa 2016-04-19 19:32 ` Michal Hocko 2016-04-08 13:07 ` Michal Hocko 2016-04-08 13:07 ` Michal Hocko