* [RFC 0/3] oom: few enhancements
@ 2016-01-12 21:00 ` Michal Hocko
  0 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2016-01-12 21:00 UTC (permalink / raw)
To: linux-mm; +Cc: Tetsuo Handa, David Rientjes, LKML

Hi,
based on the recent discussions I have accumulated the following three
patches. I haven't tested them yet but I would like to hear your opinion.

The first patch only affects the sysrq+f OOM killer. I believe it should
be relatively uncontroversial.

Patch 2 tweaks how we handle children tasks standing in for the parent
oom victim. This should help the test case Tetsuo has shown [1].

Patch 3 is just a rough idea. I can see objections there but this is
mainly to start a discussion about how to deal with small children which
basically do not sit on any memory. Maybe we do not need anything like
that at all and can rely on multiple OOM invocations as a safer option.
I dunno but I would like to hear your opinions.
---
[1] http://lkml.kernel.org/r/201512292258.ABF87505.OFOSJLHMFVOQFt%40I-love.SAKURA.ne.jp

^ permalink raw reply	[flat|nested] 44+ messages in thread
* [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks
  2016-01-12 21:00 ` Michal Hocko
@ 2016-01-12 21:00 ` Michal Hocko
  -1 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2016-01-12 21:00 UTC (permalink / raw)
To: linux-mm; +Cc: Tetsuo Handa, David Rientjes, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

When the OOM killer is invoked by the administrator via sysrq+f, it is
expected that a task will be killed to release the memory pressure.
Unlike the regular OOM killer, the forced one doesn't abort when there
is already an OOM victim selected. Instead oom_scan_process_thread
forces select_bad_process to check this thread. If this happens to be
the largest OOM hog then it will be selected again, and the forced OOM
killer wouldn't make any change in case the current OOM victim is not
able to terminate and free up the resources it is sitting on.

This patch makes sure that the forced oom killer will skip over all oom
victims (with TIF_MEMDIE) and tasks with fatal_signal_pending, on the
basis that there is no guarantee those tasks are making progress and
there is no way to check that they ever will. It is more conservative
to simply try another task.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index abefeeb42504..2b9dc5129a89 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -326,6 +326,17 @@ static struct task_struct *select_bad_process(struct oom_control *oc,
 		case OOM_SCAN_OK:
 			break;
 		};
+
+		/*
+		 * If we are doing sysrq+f then it doesn't make any sense to
+		 * check OOM victim or killed task because it might be stuck
+		 * and unable to terminate while the forced OOM might be the
+		 * only option left to get the system back to work.
+		 */
+		if (is_sysrq_oom(oc) && (test_tsk_thread_flag(p, TIF_MEMDIE) ||
+					fatal_signal_pending(p)))
+			continue;
+
 		points = oom_badness(p, NULL, oc->nodemask, totalpages);
 		if (!points || points < chosen_points)
 			continue;
--
2.6.4
* Re: [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks
  2016-01-12 21:00 ` Michal Hocko
@ 2016-01-13  0:41 ` David Rientjes
  -1 siblings, 0 replies; 44+ messages in thread
From: David Rientjes @ 2016-01-13  0:41 UTC (permalink / raw)
To: Michal Hocko; +Cc: linux-mm, Tetsuo Handa, LKML, Michal Hocko

On Tue, 12 Jan 2016, Michal Hocko wrote:

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index abefeeb42504..2b9dc5129a89 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -326,6 +326,17 @@ static struct task_struct *select_bad_process(struct oom_control *oc,
> 		case OOM_SCAN_OK:
> 			break;
> 		};
> +
> +		/*
> +		 * If we are doing sysrq+f then it doesn't make any sense to
> +		 * check OOM victim or killed task because it might be stuck
> +		 * and unable to terminate while the forced OOM might be the
> +		 * only option left to get the system back to work.
> +		 */
> +		if (is_sysrq_oom(oc) && (test_tsk_thread_flag(p, TIF_MEMDIE) ||
> +					fatal_signal_pending(p)))
> +			continue;
> +
> 		points = oom_badness(p, NULL, oc->nodemask, totalpages);
> 		if (!points || points < chosen_points)
> 			continue;

I think you can make a case for testing TIF_MEMDIE here since there is
no chance of a panic from the sysrq trigger. However, I'm not convinced
that checking fatal_signal_pending() is appropriate. I think it would
be better for sysrq+f to first select a process with
fatal_signal_pending() set so it silently gets access to memory
reserves, and then a second sysrq+f to choose a different process, if
necessary, because of TIF_MEMDIE.
* Re: [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks
  2016-01-13  0:41 ` David Rientjes
@ 2016-01-13  9:30 ` Michal Hocko
  -1 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2016-01-13  9:30 UTC (permalink / raw)
To: David Rientjes; +Cc: linux-mm, Tetsuo Handa, LKML

On Tue 12-01-16 16:41:50, David Rientjes wrote:
> On Tue, 12 Jan 2016, Michal Hocko wrote:
>
> > @@ -326,6 +326,17 @@ static struct task_struct *select_bad_process(struct oom_control *oc,
[...]
> > +		if (is_sysrq_oom(oc) && (test_tsk_thread_flag(p, TIF_MEMDIE) ||
> > +					fatal_signal_pending(p)))
> > +			continue;
[...]
>
> I think you can make a case for testing TIF_MEMDIE here since there is no
> chance of a panic from the sysrq trigger. However, I'm not convinced that
> checking fatal_signal_pending() is appropriate.

My thinking was that such a process would get TIF_MEMDIE if it hits the
OOM killer from the allocator.

> I think it would be
> better for sysrq+f to first select a process with fatal_signal_pending()
> set so it silently gets access to memory reserves and then a second
> sysrq+f to choose a different process, if necessary, because of
> TIF_MEMDIE.

The disadvantage of this approach is that sysrq+f might silently be
ignored and the administrator doesn't have any signal about that. IMHO
sysrq+f would be much better defined if it _always_ selected and killed
a task. After all it is an explicit administrator action.
--
Michal Hocko
SUSE Labs
* Re: [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks
  2016-01-13  9:30 ` Michal Hocko
@ 2016-01-14  0:38 ` David Rientjes
  -1 siblings, 0 replies; 44+ messages in thread
From: David Rientjes @ 2016-01-14  0:38 UTC (permalink / raw)
To: Michal Hocko; +Cc: linux-mm, Tetsuo Handa, LKML

On Wed, 13 Jan 2016, Michal Hocko wrote:

> > > +		if (is_sysrq_oom(oc) && (test_tsk_thread_flag(p, TIF_MEMDIE) ||
> > > +					fatal_signal_pending(p)))
> > > +			continue;
[...]
> > I think you can make a case for testing TIF_MEMDIE here since there is no
> > chance of a panic from the sysrq trigger. However, I'm not convinced that
> > checking fatal_signal_pending() is appropriate.
>
> My thinking was that such a process would get TIF_MEMDIE if it hits the
> OOM from the allocator.

It certainly would get TIF_MEMDIE set if it needs to allocate memory
itself and it calls the oom killer. That doesn't mean that we should
kill a different process, though, when the killed process should exit
and free its memory. So NACK to the fatal_signal_pending() check here.

> > I think it would be
> > better for sysrq+f to first select a process with fatal_signal_pending()
> > set so it silently gets access to memory reserves and then a second
> > sysrq+f to choose a different process, if necessary, because of
> > TIF_MEMDIE.
>
> The disadvantage of this approach is that sysrq+f might silently be
> ignored and the administrator doesn't have any signal about that.

The administrator can check the kernel log for an oom kill. Killing
additional processes is not going to help, and it has never been the
semantics of the sysrq trigger; it is quite clearly defined as killing
a process when out of memory, not serially killing everything on the
machine.
* Re: [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks
  2016-01-14  0:38 ` David Rientjes
@ 2016-01-14 11:00 ` Michal Hocko
  -1 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2016-01-14 11:00 UTC (permalink / raw)
To: David Rientjes; +Cc: linux-mm, Tetsuo Handa, LKML

On Wed 13-01-16 16:38:26, David Rientjes wrote:
> On Wed, 13 Jan 2016, Michal Hocko wrote:
[...]
> > > I think it would be
> > > better for sysrq+f to first select a process with fatal_signal_pending()
> > > set so it silently gets access to memory reserves and then a second
> > > sysrq+f to choose a different process, if necessary, because of
> > > TIF_MEMDIE.
> >
> > The disadvantage of this approach is that sysrq+f might silently be
> > ignored and the administrator doesn't have any signal about that.

Sorry, I meant to say "the administrator doesn't know why it has been
ignored". But it would have been better to say "the administrator
cannot do anything about that".

> The administrator can check the kernel log for an oom kill.

What should the admin do when the request got ignored, though? sysrq+i?
sysrq+b?

> Killing additional processes is not going to help

Whether it is going to help or not is a different topic. My point is
that we have a sysrq action which might get ignored without providing
any explanation. But what I consider a much bigger issue is that the
deliberate request of the admin is ignored in the first place. As an
admin, I do not expect the system to know better than me when I perform
some action.

> and has never been the semantics
> of the sysrq trigger, it is quite clearly defined as killing a process
> when out of memory,

I disagree. Being OOM has never been a requirement for sysrq+f to kill
a task. It should kill _a_ memory hog. Your system might be thrashing
to the point where you are not able to log in and resolve the situation
in a reasonable time, yet you are still not OOM. sysrq+f is your only
choice then.

> not serial killing everything on the machine.

Which is not proposed here. The only thing I would like to achieve is
to get rid of the OOM killer heuristics which assume some forward
progress in order to avoid killing a task. Those make perfect sense
when the system tries to resolve the OOM condition, but when the
administrator has clearly asked to _kill_a_ memory hog then the result
should be killing a task which consumes memory (ideally the largest
one).

What would be a regression scenario for this change? I can clearly see
a deficiency in the current implementation, so we should weigh the cons
and pros here, I believe.
--
Michal Hocko
SUSE Labs
* Re: [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks
  2016-01-14 11:00 ` Michal Hocko
@ 2016-01-14 21:51 ` David Rientjes
  -1 siblings, 0 replies; 44+ messages in thread
From: David Rientjes @ 2016-01-14 21:51 UTC (permalink / raw)
To: Michal Hocko; +Cc: linux-mm, Tetsuo Handa, LKML

On Thu, 14 Jan 2016, Michal Hocko wrote:

> > > > I think it would be
> > > > better for sysrq+f to first select a process with fatal_signal_pending()
> > > > set so it silently gets access to memory reserves and then a second
> > > > sysrq+f to choose a different process, if necessary, because of
> > > > TIF_MEMDIE.
> > >
> > > The disadvantage of this approach is that sysrq+f might silently be
> > > ignored and the administrator doesn't have any signal about that.
>
> Sorry I meant to say "administrator doesn't know why it has been
> ignored". But it would have been better to say "administrator cannot do
> anything about that".
>
> > The administrator can check the kernel log for an oom kill.
>
> What should the admin do when the request got ignored, though? sysrq+i?
> sysrq+b?

We're not striving for a solution to general process exiting issues or
oom livelock situations by requiring admins to use a sysrq trigger.
Sysrq+F could arguably be removed at this point since it solely existed
to trigger the oom killer when newly rewritten reclaim thrashed and the
page allocator didn't call it fast enough. Since the oom killer allows
killed processes to gain access to memory reserves, we could extend
that to contexts that do not allow calling the oom killer to set
TIF_MEMDIE, and then simply require root to send a SIGKILL rather than
do sysrq+F.

I think it's time to kill sysrq+F and I'll send those two patches
unless there is a usecase I'm not aware of.
* Re: [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks
  2016-01-14 21:51 ` David Rientjes
@ 2016-01-15 10:12 ` Michal Hocko
  -1 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2016-01-15 10:12 UTC (permalink / raw)
To: David Rientjes; +Cc: linux-mm, Tetsuo Handa, LKML

On Thu 14-01-16 13:51:16, David Rientjes wrote:
> I think it's time to kill sysrq+F and I'll send those two patches
> unless there is a usecase I'm not aware of.

I have described one in the part you haven't quoted here. Let me repeat:
: Your system might be thrashing to the point where you are not able to
: log in and resolve the situation in a reasonable time, yet you are
: still not OOM. sysrq+f is your only choice then.

Could you clarify why it is better to ditch a potentially useful
emergency tool rather than to make it work reliably and predictably?
--
Michal Hocko
SUSE Labs
* Re: [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks
  2016-01-15 10:12 ` Michal Hocko
@ 2016-01-15 15:37 ` One Thousand Gnomes
  -1 siblings, 0 replies; 44+ messages in thread
From: One Thousand Gnomes @ 2016-01-15 15:37 UTC (permalink / raw)
To: Michal Hocko; +Cc: David Rientjes, linux-mm, Tetsuo Handa, LKML

On Fri, 15 Jan 2016 11:12:18 +0100 Michal Hocko <mhocko@kernel.org> wrote:
> On Thu 14-01-16 13:51:16, David Rientjes wrote:
> > I think it's time to kill sysrq+F and I'll send those two patches
> > unless there is a usecase I'm not aware of.
>
> I have described one in the part you haven't quoted here. Let me repeat:
> : Your system might be thrashing to the point where you are not able to
> : log in and resolve the situation in a reasonable time, yet you are
> : still not OOM. sysrq+f is your only choice then.
>
> Could you clarify why it is better to ditch a potentially useful
> emergency tool rather than to make it work reliably and predictably?

Even if it doesn't work reliably and predictably, it is *still* better
than removing it as it works currently. Today we have "might save you a
reboot"; the removal turns it into "you'll have to reboot". That's a
regression.

Alan
* Re: [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks 2016-01-15 15:37 ` One Thousand Gnomes @ 2016-01-19 23:01 ` David Rientjes -1 siblings, 0 replies; 44+ messages in thread From: David Rientjes @ 2016-01-19 23:01 UTC (permalink / raw) To: One Thousand Gnomes; +Cc: Michal Hocko, linux-mm, Tetsuo Handa, LKML On Fri, 15 Jan 2016, One Thousand Gnomes wrote: > > > I think it's time to kill sysrq+F and I'll send those two patches > > > unless there is a usecase I'm not aware of. > > > > I have described one in the part you haven't quoted here. Let me repeat: > > : Your system might be thrashing to the point you are not able to log in > > : and resolve the situation in a reasonable time yet you are still not > > : OOM. sysrq+f is your only choice then. > > > > Could you clarify why it is better to ditch a potentially useful > > emergency tool rather than to make it work reliably and predictably? > > Even if it doesn't work reliably and predictably it is *still* better > than removing it as it works currently. Today we have "might save you a > reboot", the removal turns it into "you'll have to reboot". That's a > regression. > Under what circumstance are you supposing to use sysrq+f in your hypothetical? If you have access to the shell, then you can kill any process at random (and you may even be able to make better realtime decisions than the oom killer) and it will gain access to memory reserves immediately under my proposal when it tries to allocate memory. The net result is that calling the oom killer is no better than you issuing the SIGKILL yourself. This doesn't work if you are supposing to use sysrq+f without the ability to get access to the shell. That's the point, I believe, that Michal has raised in this thread. I'd like to address that issue directly rather than requiring human intervention to fix. If you have deployed a very large number of machines to your datacenters, you don't possibly have the resources to do this. 
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks 2016-01-15 10:12 ` Michal Hocko @ 2016-01-19 22:57 ` David Rientjes -1 siblings, 0 replies; 44+ messages in thread From: David Rientjes @ 2016-01-19 22:57 UTC (permalink / raw) To: Michal Hocko; +Cc: linux-mm, Tetsuo Handa, LKML On Fri, 15 Jan 2016, Michal Hocko wrote: > > I think it's time to kill sysrq+F and I'll send those two patches > > unless there is a usecase I'm not aware of. > > I have described one in the part you haven't quoted here. Let me repeat: > : Your system might be thrashing to the point you are not able to log in > : and resolve the situation in a reasonable time yet you are still not > : OOM. sysrq+f is your only choice then. > > Could you clarify why it is better to ditch a potentially useful > emergency tool rather than to make it work reliably and predictably? I'm concerned about your usecase where the kernel requires admin intervention to resolve such an issue and there is nothing in the VM we can do to fix it. If you have a specific test that demonstrates when your usecase is needed, please provide it so we can address the issue that it triggers. I'd prefer to fix the issue in the VM rather than require human intervention, especially when we try to keep a very large number of machines running in our datacenters. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks 2016-01-19 22:57 ` David Rientjes @ 2016-01-20 9:49 ` Michal Hocko -1 siblings, 0 replies; 44+ messages in thread From: Michal Hocko @ 2016-01-20 9:49 UTC (permalink / raw) To: David Rientjes; +Cc: linux-mm, Tetsuo Handa, LKML On Tue 19-01-16 14:57:33, David Rientjes wrote: > On Fri, 15 Jan 2016, Michal Hocko wrote: > > > > I think it's time to kill sysrq+F and I'll send those two patches > > > unless there is a usecase I'm not aware of. > > > > I have described one in the part you haven't quoted here. Let me repeat: > > : Your system might be thrashing to the point you are not able to log in > > : and resolve the situation in a reasonable time yet you are still not > > : OOM. sysrq+f is your only choice then. > > > > Could you clarify why it is better to ditch a potentially useful > > emergency tool rather than to make it work reliably and predictably? > > I'm concerned about your usecase where the kernel requires admin > intervention to resolve such an issue and there is nothing in the VM we > can do to fix it. > > If you have a specific test that demonstrates when your usecase is needed, > please provide it so we can address the issue that it triggers. No, I do not have a specific load in mind. But let's be realistic. There will _always_ be corner cases where the VM cannot react properly or in a timely fashion. > I'd prefer to fix the issue in the VM rather than require human > intervention, especially when we try to keep a very large number of > machines running in our datacenters. It is always preferable to resolve the mm related issue automagically, of course. We should strive for robustness as much as possible but that doesn't mean we should take the only emergency tool out of the administrator's hands. To be honest I really fail to understand your line of argumentation here. 
Just because you think that sysrq+f might not be helpful in large datacenters which you seem to care about doesn't mean that it is not helpful in other setups. Removing the functionality is out of the question IMHO, so can we please start discussing how to make it more predictable? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks 2016-01-20 9:49 ` Michal Hocko @ 2016-01-21 0:01 ` David Rientjes -1 siblings, 0 replies; 44+ messages in thread From: David Rientjes @ 2016-01-21 0:01 UTC (permalink / raw) To: Michal Hocko; +Cc: linux-mm, Tetsuo Handa, LKML On Wed, 20 Jan 2016, Michal Hocko wrote: > No, I do not have a specific load in mind. But let's be realistic. There > will _always_ be corner cases where the VM cannot react properly or in a > timely fashion. > Then let's identify it and fix it, like we do with any other bug? I'm 99% certain you are not advocating that human intervention is the ideal solution to prevent lengthy stalls or livelocks. I can't speak for all possible configurations and workloads; the only thing we use sysrq+f for is automated testing of the oom killer itself. It would help to know of any situations when people actually need to use this to solve issues and then fix those issues rather than insisting that this is the ideal solution. > To be honest I really fail to understand your line of argumentation > here. Just because you think that sysrq+f might not be helpful in large > datacenters which you seem to care about doesn't mean that it is not > helpful in other setups. > This type of message isn't really contributing anything. You don't have a specific load in mind, you can't identify a pending bug that people have complained about, you presumably can't show a testcase that demonstrates how it's required, yet you're arguing that we should keep a debugging tool around because you think somebody somewhere sometime might use it. [ I would imagine that users would be unhappy they have to kill processes already, and would have reported how ridiculous it is that they had to use sysrq+f, but I haven't seen those bug reports. ] I want the VM to be responsive, I don't want it to thrash forever, and I want it to not require root to trigger a sysrq to have the kernel kill a process for the VM to work properly. 
We either need to fix the issue that causes the unresponsiveness or oom kill processes earlier. This is very simple. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks 2016-01-21 0:01 ` David Rientjes @ 2016-01-21 9:15 ` Michal Hocko -1 siblings, 0 replies; 44+ messages in thread From: Michal Hocko @ 2016-01-21 9:15 UTC (permalink / raw) To: David Rientjes; +Cc: linux-mm, Tetsuo Handa, LKML On Wed 20-01-16 16:01:54, David Rientjes wrote: > On Wed, 20 Jan 2016, Michal Hocko wrote: > > > No, I do not have a specific load in mind. But let's be realistic. There > > will _always_ be corner cases where the VM cannot react properly or in a > > timely fashion. > > > > Then let's identify it and fix it, like we do with any other bug? I'm 99% > certain you are not advocating that human intervention is the ideal > solution to prevent lengthy stalls or livelocks. I didn't claim that! Please read what I have written. I consider sysrq+f a _last resort_ emergency tool for when the system doesn't behave in the expected way. > I can't speak for all possible configurations and workloads; the only > thing we use sysrq+f for is automated testing of the oom killer itself. That is your use case, and it is not the reason this functionality was introduced. This is _not a debugging_ tool. Back in 2005 it was added precisely to allow for an immediate intervention while the system was thrashing heavily. > It would help to know of any situations when people actually need to use > this to solve issues and then fix those issues rather than insisting that > this is the ideal solution. I fully agree that such issues should be investigated and fixed. That is nothing against having the emergency tool and allowing the admin to intervene right away when they happen. > > To be honest I really fail to understand your line of argumentation > > here. Just because you think that sysrq+f might not be helpful in large > > datacenters which you seem to care about doesn't mean that it is not > > helpful in other setups. > > > > This type of message isn't really contributing anything. 
You don't have a > specific load in mind, you can't identify a pending bug that people have > complained about, you presumably can't show a testcase that demonstrates > how it's required, yet you're arguing that we should keep a debugging tool > around because you think somebody somewhere sometime might use it. Look, I am getting tired of this discussion. You seem to completely ignore the emergency aspect of sysrq+f just because it doesn't seem to fit in _your_ particular usecase. I have seen admins using sysrq+f when a large application got crazy and started thrashing to the point when even ssh to the machine took ages, and sysrq+f over a serial console was the only deterministic way to make the system usable. Such things are still real. Just look at the linux-mm ML (just off hand http://lkml.kernel.org/r/20151221123557.GE3060%40orkisz). You can argue we should fix them, and I agree, but swap/page cache thrashing has been real for ages; those are hard problems and very likely to be with us for some time to come. Until our MM subsystem and all others that might interfere are perfect, we need a sledgehammer. And if we have a hammer then we should really make sure it hits something when used rather than hitting thin air. The patch proposed here doesn't make the code more complicated or harder to maintain. It doesn't even have any side effects outside of sysrq+f triggered OOM. Your only argument so far was: " : It certainly would get TIF_MEMDIE set if it needs to allocate memory : itself and it calls the oom killer. That doesn't mean that we should : kill a different process, though, when the killed process should exit : and free its memory. So NACK to the fatal_signal_pending() check here. " And that argument is fundamentally broken because a killed process is not guaranteed to exit and free its memory. Moreover sysrq+f is by definition an async action which might race by passing a killed task, and that should deactivate it. 
The race is quite unlikely but emergency tools should be as robust/reliable as possible. You have also ignored my question about what kind of regression such a change would cause. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 44+ messages in thread
* [RFC 2/3] oom: Do not sacrifice already OOM killed children 2016-01-12 21:00 ` Michal Hocko @ 2016-01-12 21:00 ` Michal Hocko -1 siblings, 0 replies; 44+ messages in thread From: Michal Hocko @ 2016-01-12 21:00 UTC (permalink / raw) To: linux-mm; +Cc: Tetsuo Handa, David Rientjes, LKML, Michal Hocko From: Michal Hocko <mhocko@suse.com>

oom_kill_process tries to sacrifice a child process of the selected victim to save as much work done as possible. This is all well and good but the current heuristic doesn't check the status of the children before they are examined. So it might be possible that we will end up selecting the same child all over again just because it cannot terminate. Tetsuo Handa has reported exactly this when trying to use sysrq+f to resolve an OOM situation:

[   86.767482] a.out invoked oom-killer: order=0, oom_score_adj=0, gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|GFP_ZERO)
[   86.769905] a.out cpuset=/ mems_allowed=0
[   86.771393] CPU: 2 PID: 9573 Comm: a.out Not tainted 4.4.0-next-20160112+ #279
(...snipped...)
[   86.874710] [ pid ]   uid  tgid total_vm    rss nr_ptes nr_pmds swapents oom_score_adj name
(...snipped...)
[   86.945286] [ 9573]  1000  9573   541717   402522     796       6        0             0 a.out
[   86.947457] [ 9574]  1000  9574     1078       21       7       3        0             0 a.out
[   86.949568] Out of memory: Kill process 9573 (a.out) score 908 or sacrifice child
[   86.951538] Killed process 9574 (a.out) total-vm:4312kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[   86.955296] systemd-journal invoked oom-killer: order=0, oom_score_adj=0, gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|GFP_COLD)
[   86.958035] systemd-journal cpuset=/ mems_allowed=0
(...snipped...)
[   87.128808] [ 9573]  1000  9573   541717   402522     796       6        0             0 a.out
[   87.130926] [ 9575]  1000  9574     1078        0       7       3        0             0 a.out
[   87.133055] Out of memory: Kill process 9573 (a.out) score 908 or sacrifice child
[   87.134989] Killed process 9575 (a.out) total-vm:4312kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  116.979564] sysrq: SysRq : Manual OOM execution
[  116.984119] kworker/0:8 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
[  116.986367] kworker/0:8 cpuset=/ mems_allowed=0
(...snipped...)
[  117.157045] [ 9573]  1000  9573   541717   402522     797       6        0             0 a.out
[  117.159191] [ 9575]  1000  9574     1078        0       7       3        0             0 a.out
[  117.161302] Out of memory: Kill process 9573 (a.out) score 908 or sacrifice child
[  117.163250] Killed process 9575 (a.out) total-vm:4312kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  119.043685] sysrq: SysRq : Manual OOM execution
[  119.046239] kworker/0:8 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
[  119.048453] kworker/0:8 cpuset=/ mems_allowed=0
(...snipped...)
[  119.215982] [ 9573]  1000  9573   541717   402522     797       6        0             0 a.out
[  119.218122] [ 9575]  1000  9574     1078        0       7       3        0             0 a.out
[  119.220237] Out of memory: Kill process 9573 (a.out) score 908 or sacrifice child
[  119.222129] Killed process 9575 (a.out) total-vm:4312kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  120.179644] sysrq: SysRq : Manual OOM execution
[  120.206938] kworker/0:8 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
[  120.209152] kworker/0:8 cpuset=/ mems_allowed=0
[  120.376821] [ 9573]  1000  9573   541717   402522     797       6        0             0 a.out
[  120.378924] [ 9575]  1000  9574     1078        0       7       3        0             0 a.out
[  120.381065] Out of memory: Kill process 9573 (a.out) score 908 or sacrifice child
[  120.382929] Killed process 9575 (a.out) total-vm:4312kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

This patch simply rules out all the children which are already OOM victims, have fatal_signal_pending, or are already exiting. 
It simply doesn't make any sense to kill such tasks again when they have already been killed and the OOM situation wasn't resolved, as we had to select a new OOM victim. This is true for both the regular and the forced oom killer invocation. While we are at it, let's separate this specific logic into a separate function.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 89 ++++++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 58 insertions(+), 31 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 2b9dc5129a89..8bca0b1e97f7 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -671,6 +671,63 @@ static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 }
 
 #define K(x) ((x) << (PAGE_SHIFT-10))
+
+/*
+ * If any of victim's children has a different mm and is eligible for kill,
+ * the one with the highest oom_badness() score is sacrificed for its
+ * parent. This attempts to lose the minimal amount of work done while
+ * still freeing memory.
+ */
+static struct task_struct *
+try_to_sacrifice_child(struct oom_control *oc, struct task_struct *victim,
+		unsigned long totalpages, struct mem_cgroup *memcg)
+{
+	struct task_struct *child_victim = NULL;
+	unsigned int victim_points = 0;
+	struct task_struct *t;
+
+	read_lock(&tasklist_lock);
+	for_each_thread(victim, t) {
+		struct task_struct *child;
+
+		list_for_each_entry(child, &t->children, sibling) {
+			unsigned int child_points;
+
+			/*
+			 * Skip over already OOM killed children as this hasn't
+			 * helped to resolve the situation obviously.
+			 */
+			if (test_tsk_thread_flag(child, TIF_MEMDIE) ||
+					fatal_signal_pending(child) ||
+					task_will_free_mem(child))
+				continue;
+
+			if (process_shares_mm(child, victim->mm))
+				continue;
+
+			child_points = oom_badness(child, memcg, oc->nodemask,
+					totalpages);
+			if (child_points > victim_points) {
+				if (child_victim)
+					put_task_struct(child_victim);
+				child_victim = child;
+				victim_points = child_points;
+				get_task_struct(child_victim);
+			}
+		}
+	}
+	read_unlock(&tasklist_lock);
+
+	if (!child_victim)
+		goto out;
+
+	put_task_struct(victim);
+	victim = child_victim;
+
+out:
+	return victim;
+}
+
 /*
  * Must be called while holding a reference to p, which will be released upon
  * returning.
@@ -680,10 +737,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 		struct mem_cgroup *memcg, const char *message)
 {
 	struct task_struct *victim = p;
-	struct task_struct *child;
-	struct task_struct *t;
 	struct mm_struct *mm;
-	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					  DEFAULT_RATELIMIT_BURST);
 	bool can_oom_reap = true;
@@ -707,34 +761,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
 		message, task_pid_nr(p), p->comm, points);
 
-	/*
-	 * If any of p's children has a different mm and is eligible for kill,
-	 * the one with the highest oom_badness() score is sacrificed for its
-	 * parent. This attempts to lose the minimal amount of work done while
-	 * still freeing memory.
-	 */
-	read_lock(&tasklist_lock);
-	for_each_thread(p, t) {
-		list_for_each_entry(child, &t->children, sibling) {
-			unsigned int child_points;
-
-			if (process_shares_mm(child, p->mm))
-				continue;
-			/*
-			 * oom_badness() returns 0 if the thread is unkillable
-			 */
-			child_points = oom_badness(child, memcg, oc->nodemask,
-					totalpages);
-			if (child_points > victim_points) {
-				put_task_struct(victim);
-				victim = child;
-				victim_points = child_points;
-				get_task_struct(victim);
-			}
-		}
-	}
-	read_unlock(&tasklist_lock);
-
+	victim = try_to_sacrifice_child(oc, victim, totalpages, memcg);
 	p = find_lock_task_mm(victim);
 	if (!p) {
 		put_task_struct(victim);
-- 
2.6.4

^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: [RFC 2/3] oom: Do not sacrifice already OOM killed children
  2016-01-12 21:00 ` Michal Hocko
@ 2016-01-13  0:45 ` David Rientjes
  -1 siblings, 0 replies; 44+ messages in thread
From: David Rientjes @ 2016-01-13  0:45 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Tetsuo Handa, LKML, Michal Hocko

On Tue, 12 Jan 2016, Michal Hocko wrote:

> +	read_lock(&tasklist_lock);
> +	for_each_thread(victim, t) {
> +		struct task_struct *child;
> +
> +		list_for_each_entry(child, &t->children, sibling) {
> +			unsigned int child_points;
> +
> +			/*
> +			 * Skip over already OOM killed children as this hasn't
> +			 * helped to resolve the situation obviously.
> +			 */
> +			if (test_tsk_thread_flag(child, TIF_MEMDIE) ||
> +			    fatal_signal_pending(child) ||
> +			    task_will_free_mem(child))
> +				continue;
> +

What guarantees that the child had time to exit after it has been oom
killed (better yet, what guarantees that it has even scheduled after it
has been oom killed)? It seems like this would quickly kill many children
unnecessarily.

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [RFC 2/3] oom: Do not sacrifice already OOM killed children
  2016-01-13  0:45 ` David Rientjes
@ 2016-01-13  9:36 ` Michal Hocko
  -1 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2016-01-13  9:36 UTC (permalink / raw)
  To: David Rientjes; +Cc: linux-mm, Tetsuo Handa, LKML

On Tue 12-01-16 16:45:35, David Rientjes wrote:
> On Tue, 12 Jan 2016, Michal Hocko wrote:
> > +			/*
> > +			 * Skip over already OOM killed children as this hasn't
> > +			 * helped to resolve the situation obviously.
> > +			 */
> > +			if (test_tsk_thread_flag(child, TIF_MEMDIE) ||
> > +			    fatal_signal_pending(child) ||
> > +			    task_will_free_mem(child))
> > +				continue;
> > +
>
> What guarantees that child had time to exit after it has been oom killed
> (better yet, what guarantees that it has even scheduled after it has been
> oom killed)? It seems like this would quickly kill many children
> unnecessarily.

If the child hasn't released any memory after all the allocator's attempts
to free memory, which take quite some time, then what is the advantage of
waiting even longer and possibly getting stuck?

This is a heuristic: we should have killed the selected victim, but we
chose to reduce the impact by selecting the child process instead. If that
hasn't led to any improvement, I believe we should move on rather than
loop on a potentially unresolvable situation _just because_ of the said
heuristic.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [RFC 2/3] oom: Do not sacrifice already OOM killed children
  2016-01-13  9:36 ` Michal Hocko
@ 2016-01-14  0:42 ` David Rientjes
  -1 siblings, 0 replies; 44+ messages in thread
From: David Rientjes @ 2016-01-14  0:42 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Tetsuo Handa, LKML

On Wed, 13 Jan 2016, Michal Hocko wrote:

> > What guarantees that child had time to exit after it has been oom killed
> > (better yet, what guarantees that it has even scheduled after it has been
> > oom killed)? It seems like this would quickly kill many children
> > unnecessarily.
>
> If the child hasn't released any memory after all the allocator attempts
> to free a memory, which takes quite some time, then what is the advantage
> of waiting even more and possibly get stuck?

No, we don't rely on implicit page allocator behavior or implementation to
decide when additional processes should randomly be killed. It is quite
simple to get dozens of processes oom killed if your patch is introduced,
just as it is possible to get dozens of processes oom killed unnecessarily
if you remove TIF_MEMDIE checks from select_bad_process(). If you are
concerned about the child never exiting, then it is quite simple to
provide access to memory reserves in the page allocator in such
situations; this is no different from TIF_MEMDIE threads failing to exit.

^ permalink raw reply	[flat|nested] 44+ messages in thread
* [RFC 3/3] oom: Do not try to sacrifice small children
  2016-01-12 21:00 ` Michal Hocko
@ 2016-01-12 21:00 ` Michal Hocko
  -1 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2016-01-12 21:00 UTC (permalink / raw)
  To: linux-mm; +Cc: Tetsuo Handa, David Rientjes, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

try_to_sacrifice_child selects the largest child of the chosen OOM victim
in order to spare the parent and potentially save some of the work it has
done. We can, however, select a small child which has barely touched any
memory; killing it wouldn't lead to OOM recovery and would only prolong
the OOM condition, which is not desirable.

This patch ignores the selected child and falls back to the parent (the
original victim) if the child hasn't accumulated even 1MB worth of oom
score. We do not check the memory consumption directly because we want to
honor oom_score_adj here; it is the only way to exempt children from this
heuristic when they are more important than the parent.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8bca0b1e97f7..b5c0021c6462 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -721,8 +721,16 @@ try_to_sacrifice_child(struct oom_control *oc, struct task_struct *victim,
 	if (!child_victim)
 		goto out;
 
-	put_task_struct(victim);
-	victim = child_victim;
+	/*
+	 * Protecting the parent makes sense only if killing the child
+	 * would release at least some memory (at least 1MB).
+	 */
+	if (K(victim_points) >= 1024) {
+		put_task_struct(victim);
+		victim = child_victim;
+	} else {
+		put_task_struct(child_victim);
+	}
 
 out:
 	return victim;
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread
* Re: [RFC 3/3] oom: Do not try to sacrifice small children
  2016-01-12 21:00 ` Michal Hocko
@ 2016-01-13  0:51 ` David Rientjes
  -1 siblings, 0 replies; 44+ messages in thread
From: David Rientjes @ 2016-01-13  0:51 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Tetsuo Handa, LKML, Michal Hocko

On Tue, 12 Jan 2016, Michal Hocko wrote:

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 8bca0b1e97f7..b5c0021c6462 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -721,8 +721,16 @@ try_to_sacrifice_child(struct oom_control *oc, struct task_struct *victim,
>  	if (!child_victim)
>  		goto out;
>  
> -	put_task_struct(victim);
> -	victim = child_victim;
> +	/*
> +	 * Protecting the parent makes sense only if killing the child
> +	 * would release at least some memory (at least 1MB).
> +	 */
> +	if (K(victim_points) >= 1024) {
> +		put_task_struct(victim);
> +		victim = child_victim;
> +	} else {
> +		put_task_struct(child_victim);
> +	}
>  
>  out:
>  	return victim;

The purpose of sacrificing a child has always been to prevent a process
that has been running with a substantial amount of work done from being
terminated and losing all that work if it can be avoided. This happens a
lot: imagine a long-lived front-end client forking a child which simply
collects stats and malloc information at regular intervals and writes
them out to disk or over the network. These processes may be quite small,
and we're willing to happily sacrifice them if it will save the parent.
This was, and still is, the intent of the sacrifice in the first place.

We must be able to deal with oom victims that are very small, since
userspace has complete control in prioritizing these processes in the
first place.

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [RFC 3/3] oom: Do not try to sacrifice small children 2016-01-13 0:51 ` David Rientjes @ 2016-01-13 9:40 ` Michal Hocko -1 siblings, 0 replies; 44+ messages in thread From: Michal Hocko @ 2016-01-13 9:40 UTC (permalink / raw) To: David Rientjes; +Cc: linux-mm, Tetsuo Handa, LKML On Tue 12-01-16 16:51:43, David Rientjes wrote: > On Tue, 12 Jan 2016, Michal Hocko wrote: > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > index 8bca0b1e97f7..b5c0021c6462 100644 > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -721,8 +721,16 @@ try_to_sacrifice_child(struct oom_control *oc, struct task_struct *victim, > > if (!child_victim) > > goto out; > > > > - put_task_struct(victim); > > - victim = child_victim; > > + /* > > + * Protecting the parent makes sense only if killing the child > > + * would release at least some memory (at least 1MB). > > + */ > > + if (K(victim_points) >= 1024) { > > + put_task_struct(victim); > > + victim = child_victim; > > + } else { > > + put_task_struct(child_victim); > > + } > > > > out: > > return victim; > > The purpose of sacrificing a child has always been to prevent a process > that has been running with a substantial amount of work done from being > terminated and losing all that work if it can be avoided. This happens a > lot: imagine a long-living front end client forking a child which simply > collects stats and malloc information at a regular intervals and writes > them out to disk or over the network. These processes may be quite small, > and we're willing to happily sacrifice them if it will save the parent. > This was, and still is, the intent of the sacrifice in the first place. Yes, I understand the intention of the heuristic. I am just contemplating what is too small to be worth sacrificing: it clearly doesn't make much sense to kill a task which is sitting on basically no memory (well, just a few pages backing page tables and stack), because that would just prolong the OOM agony. 
> We must be able to deal with oom victims that are very small, since > userspace has complete control in prioritizing these processes in the > first place. Sure, the patch is not great, but I would like to come up with some threshold below which children are too small to be worth considering. Or maybe there is another measure we can use. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC 3/3] oom: Do not try to sacrifice small children 2016-01-13 9:40 ` Michal Hocko @ 2016-01-14 0:43 ` David Rientjes -1 siblings, 0 replies; 44+ messages in thread From: David Rientjes @ 2016-01-14 0:43 UTC (permalink / raw) To: Michal Hocko; +Cc: linux-mm, Tetsuo Handa, LKML On Wed, 13 Jan 2016, Michal Hocko wrote: > > The purpose of sacrificing a child has always been to prevent a process > > that has been running with a substantial amount of work done from being > > terminated and losing all that work if it can be avoided. This happens a > > lot: imagine a long-living front end client forking a child which simply > > collects stats and malloc information at a regular intervals and writes > > them out to disk or over the network. These processes may be quite small, > > and we're willing to happily sacrifice them if it will save the parent. > > This was, and still is, the intent of the sacrifice in the first place. > > Yes I understand the intention of the heuristic. I am just contemplating > about what is way too small to sacrifice because it clearly doesn't make > much sense to kill a task which is sitting on basically no memory (well > just few pages backing page tables and stack) because this would just > prolong the OOM agony. > Nothing is ever too small to kill, since we allow userspace the ability to define its own oom priority. We would rather kill your bash shell when you log in than your sshd. ^ permalink raw reply [flat|nested] 44+ messages in thread
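For reference, the userspace knob David is referring to is /proc/<pid>/oom_score_adj, which feeds directly into the badness score the heuristic compares. A minimal sketch of a process adjusting its own value (the helper names are mine, not from the kernel tree; raising the value needs no privileges, while lowering it below the current value requires CAP_SYS_RESOURCE):

```c
#include <stdio.h>

/* Mark the calling process as a more (positive adj) or less (negative
 * adj) preferred OOM victim. The valid range is -1000 to 1000. */
static int set_oom_score_adj(int adj)
{
	FILE *f = fopen("/proc/self/oom_score_adj", "w");

	if (!f)
		return -1;
	fprintf(f, "%d\n", adj);
	return fclose(f);	/* 0 on success, EOF if the write failed */
}

static int get_oom_score_adj(void)
{
	int adj = -9999;	/* sentinel outside the valid range */
	FILE *f = fopen("/proc/self/oom_score_adj", "r");

	if (!f)
		return adj;
	if (fscanf(f, "%d", &adj) != 1)
		adj = -9999;
	fclose(f);
	return adj;
}
```

This is the scenario behind David's objection: a tiny child with a high oom_score_adj is an explicitly preferred victim, yet the proposed 1MB cut-off could override that userspace decision.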
end of thread, other threads:[~2016-01-21 9:15 UTC | newest] Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-01-12 21:00 [RFC 0/3] oom: few enahancements Michal Hocko 2016-01-12 21:00 ` Michal Hocko 2016-01-12 21:00 ` [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks Michal Hocko 2016-01-12 21:00 ` Michal Hocko 2016-01-13 0:41 ` David Rientjes 2016-01-13 0:41 ` David Rientjes 2016-01-13 9:30 ` Michal Hocko 2016-01-13 9:30 ` Michal Hocko 2016-01-14 0:38 ` David Rientjes 2016-01-14 0:38 ` David Rientjes 2016-01-14 11:00 ` Michal Hocko 2016-01-14 11:00 ` Michal Hocko 2016-01-14 21:51 ` David Rientjes 2016-01-14 21:51 ` David Rientjes 2016-01-15 10:12 ` Michal Hocko 2016-01-15 10:12 ` Michal Hocko 2016-01-15 15:37 ` One Thousand Gnomes 2016-01-15 15:37 ` One Thousand Gnomes 2016-01-19 23:01 ` David Rientjes 2016-01-19 23:01 ` David Rientjes 2016-01-19 22:57 ` David Rientjes 2016-01-19 22:57 ` David Rientjes 2016-01-20 9:49 ` Michal Hocko 2016-01-20 9:49 ` Michal Hocko 2016-01-21 0:01 ` David Rientjes 2016-01-21 0:01 ` David Rientjes 2016-01-21 9:15 ` Michal Hocko 2016-01-21 9:15 ` Michal Hocko 2016-01-12 21:00 ` [RFC 2/3] oom: Do not sacrifice already OOM killed children Michal Hocko 2016-01-12 21:00 ` Michal Hocko 2016-01-13 0:45 ` David Rientjes 2016-01-13 0:45 ` David Rientjes 2016-01-13 9:36 ` Michal Hocko 2016-01-13 9:36 ` Michal Hocko 2016-01-14 0:42 ` David Rientjes 2016-01-14 0:42 ` David Rientjes 2016-01-12 21:00 ` [RFC 3/3] oom: Do not try to sacrifice small children Michal Hocko 2016-01-12 21:00 ` Michal Hocko 2016-01-13 0:51 ` David Rientjes 2016-01-13 0:51 ` David Rientjes 2016-01-13 9:40 ` Michal Hocko 2016-01-13 9:40 ` Michal Hocko 2016-01-14 0:43 ` David Rientjes 2016-01-14 0:43 ` David Rientjes