Subject: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
From: Kyle Walker <kwalker@redhat.com>
Date: 2015-09-17 17:59 UTC
To: akpm
Cc: mhocko, rientjes, hannes, vdavydov, oleg, linux-mm, linux-kernel

Currently, the oom killer will attempt to kill a process that is in
TASK_UNINTERRUPTIBLE state. For tasks in this state for an exceptional
period of time, such as processes writing to a frozen filesystem during
a lengthy backup operation, this can result in a deadlock condition, as
memory accesses by related processes will stall within the page fault
handler.

Within oom_unkillable_task(), check for processes in TASK_UNINTERRUPTIBLE
(TASK_KILLABLE omitted). The oom killer will move on to another task.

Signed-off-by: Kyle Walker <kwalker@redhat.com>
---
 mm/oom_kill.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1ecc0bc..66f03f8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -131,6 +131,10 @@ static bool oom_unkillable_task(struct task_struct *p,
 	if (memcg && !task_in_mem_cgroup(p, memcg))
 		return true;

+	/* Uninterruptible tasks should not be killed unless in TASK_WAKEKILL */
+	if (p->state == TASK_UNINTERRUPTIBLE)
+		return true;
+
 	/* p may not have freeable memory in nodemask */
 	if (!has_intersects_mems_allowed(p, nodemask))
 		return true;
--
2.4.3
Subject: Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
From: Oleg Nesterov
Date: 2015-09-17 19:22 UTC
To: Kyle Walker
Cc: akpm, mhocko, rientjes, hannes, vdavydov, linux-mm, linux-kernel,
    Tetsuo Handa, Stanislav Kozina

Add cc's.

On 09/17, Kyle Walker wrote:
>
> Currently, the oom killer will attempt to kill a process that is in
> TASK_UNINTERRUPTIBLE state. For tasks in this state for an exceptional
> period of time, such as processes writing to a frozen filesystem during
> a lengthy backup operation, this can result in a deadlock condition, as
> memory accesses by related processes will stall within the page fault
> handler.
>
> Within oom_unkillable_task(), check for processes in
> TASK_UNINTERRUPTIBLE (TASK_KILLABLE omitted). The oom killer will
> move on to another task.
>
> Signed-off-by: Kyle Walker <kwalker@redhat.com>
> ---
> @@ -131,6 +131,10 @@ static bool oom_unkillable_task(struct task_struct *p,
>  	if (memcg && !task_in_mem_cgroup(p, memcg))
>  		return true;
>
> +	/* Uninterruptible tasks should not be killed unless in TASK_WAKEKILL */
> +	if (p->state == TASK_UNINTERRUPTIBLE)
> +		return true;
> +

So we can skip a memory hog which, say, does mutex_lock(). And this
can't help if the task is multithreaded: unless all of its sub-threads
are in "D" state too, the oom killer will simply pick another thread
with the same ->mm. Plus other problems.

But yes, such a deadlock is possible. I would really like to see
comments from the maintainers. In particular, I seem to recall that
someone suggested trying to kill another !TIF_MEMDIE process after a
timeout; perhaps that is what we should actually do...

Oleg.
Subject: Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
From: Christoph Lameter <cl@linux.com>
Date: 2015-09-18 15:41 UTC
To: Oleg Nesterov
Cc: Kyle Walker, akpm, mhocko, rientjes, hannes, vdavydov, linux-mm,
    linux-kernel, Tetsuo Handa, Stanislav Kozina

> But yes, such a deadlock is possible. I would really like to see
> comments from the maintainers. In particular, I seem to recall that
> someone suggested trying to kill another !TIF_MEMDIE process after a
> timeout; perhaps that is what we should actually do...

Well, yes. Here is a patch that lets the OOM killer pick another victim
even though one process already has TIF_MEMDIE, but such an approach
carries some risk of overusing the reserves.

Subject: Allow multiple kills from the OOM killer

The OOM killer currently aborts if it finds a process that already has
access to the reserve memory pool for exit processing. This is done so
that the reserves are not overcommitted, but on the other hand it also
allows only one process to be OOM-killed at a time. That process may be
stuck in D state. The patch simply removes the aborting of the scan so
that other processes may be killed if one is stuck in D state.

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/mm/oom_kill.c
===================================================================
--- linux.orig/mm/oom_kill.c	2015-09-18 10:38:29.601963726 -0500
+++ linux/mm/oom_kill.c	2015-09-18 10:39:55.911699017 -0500
@@ -265,8 +265,8 @@ enum oom_scan_t oom_scan_process_thread(
 	 * Don't allow any other task to have access to the reserves.
 	 */
 	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
-		if (oc->order != -1)
-			return OOM_SCAN_ABORT;
+		if (unlikely(frozen(task)))
+			__thaw_task(task);
 	}
 	if (!task->mm)
 		return OOM_SCAN_CONTINUE;
Subject: Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
From: Oleg Nesterov
Date: 2015-09-18 16:24 UTC
To: Christoph Lameter
Cc: Kyle Walker, akpm, mhocko, rientjes, hannes, vdavydov, linux-mm,
    linux-kernel, Tetsuo Handa, Stanislav Kozina

On 09/18, Christoph Lameter wrote:
>
> Well, yes. Here is a patch that lets the OOM killer pick another victim
> even though one process already has TIF_MEMDIE, but such an approach
> carries some risk of overusing the reserves.

Yes, I understand it is not that simple. And probably this is all I can
understand ;)

> --- linux.orig/mm/oom_kill.c	2015-09-18 10:38:29.601963726 -0500
> +++ linux/mm/oom_kill.c	2015-09-18 10:39:55.911699017 -0500
> @@ -265,8 +265,8 @@ enum oom_scan_t oom_scan_process_thread(
>  	 * Don't allow any other task to have access to the reserves.
>  	 */
>  	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> -		if (oc->order != -1)
> -			return OOM_SCAN_ABORT;
> +		if (unlikely(frozen(task)))
> +			__thaw_task(task);

To simplify the discussion, let's ignore PF_FROZEN; that is another
issue.

I am not sure this change is enough: we need to ensure that
select_bad_process() won't pick the same task (or one of its
sub-threads) again.

And perhaps something like

	wait_event_timeout(oom_victims_wait, !oom_victims,
			   configurable_timeout);

before select_bad_process() makes sense?

Oleg.
Subject: Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
From: Tetsuo Handa
Date: 2015-09-18 16:39 UTC
To: oleg, cl
Cc: kwalker, akpm, mhocko, rientjes, hannes, vdavydov, linux-mm,
    linux-kernel, skozina

Oleg Nesterov wrote:
> To simplify the discussion, let's ignore PF_FROZEN; that is another
> issue.
>
> I am not sure this change is enough: we need to ensure that
> select_bad_process() won't pick the same task (or one of its
> sub-threads) again.

SysRq-f is sometimes unusable because it keeps choosing the same thread.
oom_kill_process() should not choose a thread which already has
TIF_MEMDIE. I think we need to rewrite oom_kill_process().

> And perhaps something like
>
> 	wait_event_timeout(oom_victims_wait, !oom_victims,
> 			   configurable_timeout);
>
> before select_bad_process() makes sense?

I think you should not sleep for long with the oom_lock mutex held.
http://marc.info/?l=linux-mm&m=143031212312459
Subject: Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
From: Oleg Nesterov
Date: 2015-09-18 16:54 UTC
To: Tetsuo Handa
Cc: cl, kwalker, akpm, mhocko, rientjes, hannes, vdavydov, linux-mm,
    linux-kernel, skozina

On 09/19, Tetsuo Handa wrote:
>
> SysRq-f is sometimes unusable because it keeps choosing the same
> thread. oom_kill_process() should not choose a thread which already
> has TIF_MEMDIE.

So I was right, this is really not enough...

> I think we need to rewrite oom_kill_process().

Heh. I can only ack the intent and wish you good luck ;)

> > And perhaps something like
> >
> > 	wait_event_timeout(oom_victims_wait, !oom_victims,
> > 			   configurable_timeout);
> >
> > before select_bad_process() makes sense?
>
> I think you should not sleep for long with the oom_lock mutex held.
> http://marc.info/?l=linux-mm&m=143031212312459

Yes, yes, sure, I didn't mean we should wait under oom_lock.

Oleg.
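[Editor's note] The timeout idea being discussed could be sketched roughly as below. This is only an illustration, not a tested patch: `oom_victims` (an atomic count of TIF_MEMDIE tasks) and `oom_victims_wait` (a waitqueue woken from exit_oom_victim()) did exist in mm/oom_kill.c around this time, but `oom_kill_timeout` is a hypothetical tunable, and per Tetsuo's objection the wait is placed before oom_lock is taken.

```c
/*
 * Sketch only: give the previous OOM victim a bounded amount of time
 * to exit before another victim is selected.  oom_victims is the count
 * of tasks holding TIF_MEMDIE; oom_victims_wait is woken when that
 * count drops.  oom_kill_timeout is a hypothetical, configurable
 * timeout in jiffies.
 */
static bool oom_wait_for_victim_exit(unsigned long timeout)
{
	/* wait_event_timeout() returns nonzero if the condition became true */
	return wait_event_timeout(oom_victims_wait,
				  atomic_read(&oom_victims) == 0,
				  timeout) != 0;
}

	/* Called in the OOM path, *before* mutex_lock(&oom_lock): */
	if (!oom_wait_for_victim_exit(oom_kill_timeout))
		pr_warn("OOM victim refused to die, selecting another task\n");
```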
Subject: Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
From: Christoph Lameter <cl@linux.com>
Date: 2015-09-18 17:00 UTC
To: Oleg Nesterov
Cc: Kyle Walker, akpm, mhocko, rientjes, hannes, vdavydov, linux-mm,
    linux-kernel, Tetsuo Handa, Stanislav Kozina

On Fri, 18 Sep 2015, Oleg Nesterov wrote:
> To simplify the discussion, let's ignore PF_FROZEN; that is another
> issue.

Ok.

Subject: Allow multiple kills from the OOM killer

The OOM killer currently aborts if it finds a process that already has
access to the reserve memory pool for exit processing. This is done so
that the reserves are not overcommitted, but on the other hand it also
allows only one process to be OOM-killed at a time. That process may be
stuck in D state.

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/mm/oom_kill.c
===================================================================
--- linux.orig/mm/oom_kill.c	2015-09-18 11:58:52.963946782 -0500
+++ linux/mm/oom_kill.c	2015-09-18 11:59:42.010684778 -0500
@@ -264,10 +264,9 @@ enum oom_scan_t oom_scan_process_thread(
 	 * This task already has access to memory reserves and is being killed.
 	 * Don't allow any other task to have access to the reserves.
 	 */
-	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
-		if (oc->order != -1)
-			return OOM_SCAN_ABORT;
-	}
+	if (test_tsk_thread_flag(task, TIF_MEMDIE))
+		return OOM_SCAN_CONTINUE;
+
 	if (!task->mm)
 		return OOM_SCAN_CONTINUE;
Subject: Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
From: Oleg Nesterov
Date: 2015-09-18 19:07 UTC
To: Christoph Lameter
Cc: Kyle Walker, akpm, mhocko, rientjes, hannes, vdavydov, linux-mm,
    linux-kernel, Tetsuo Handa, Stanislav Kozina

On 09/18, Christoph Lameter wrote:
>
> -	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> -		if (oc->order != -1)
> -			return OOM_SCAN_ABORT;
> -	}
> +	if (test_tsk_thread_flag(task, TIF_MEMDIE))
> +		return OOM_SCAN_CONTINUE;
> +

Well, I can't really comment. Hopefully we will see more comments from
those who understand the oom-killer. But I still think this is not
enough, and that we need some (configurable?) timeout before we pick
another victim...

And btw, yes, this is a bit off-topic, but I think another change makes
sense too: we should report the fact that we are going to kill another
task because the previous victim refuses to die, and print its stack
trace.

Oleg.
Subject: Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
From: Christoph Lameter
Date: 2015-09-18 19:19 UTC
To: Oleg Nesterov
Cc: Kyle Walker, akpm, mhocko, rientjes, hannes, vdavydov, linux-mm,
    linux-kernel, Tetsuo Handa, Stanislav Kozina

On Fri, 18 Sep 2015, Oleg Nesterov wrote:
> And btw, yes, this is a bit off-topic, but I think another change makes
> sense too: we should report the fact that we are going to kill another
> task because the previous victim refuses to die, and print its stack
> trace.

What happens is that the previous victim did not enter exit processing;
if it had, it would be excluded by other checks. The first victim never
reacted and never started using the memory resources available for
exiting. That's why I thought it may be safe to go this way.

An issue could result from another process being terminated while the
first victim finally reacts to the signal and also begins termination.
Then we have contention on the reserves.
Subject: Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
From: Kyle Walker
Date: 2015-09-18 21:28 UTC
To: Christoph Lameter
Cc: Oleg Nesterov, akpm, mhocko, rientjes, hannes, vdavydov, linux-mm,
    linux-kernel, Tetsuo Handa, Stanislav Kozina

> On Fri, 18 Sep 2015, Oleg Nesterov wrote:
> > And btw, yes, this is a bit off-topic, but I think another change
> > makes sense too: we should report the fact that we are going to kill
> > another task because the previous victim refuses to die, and print
> > its stack trace.

Thank you for the review and feedback! I think that would definitely be
a nice touch. I would definitely throw my hat in as wanting the above,
but in the interest of keeping things as simple as possible, I kept
myself out of that level of change.

> What happens is that the previous victim did not enter exit
> processing; if it had, it would be excluded by other checks. The first
> victim never reacted and never started using the memory resources
> available for exiting. That's why I thought it may be safe to go this
> way.
>
> An issue could result from another process being terminated while the
> first victim finally reacts to the signal and also begins termination.
> Then we have contention on the reserves.

I do like the idea of not stalling completely in an oom just because the
first attempt didn't go so well. Is there any possibility of simply
having our cake and eating it too? Specifically, omitting
TASK_UNINTERRUPTIBLE tasks as low-hanging fruit and allowing the oom
kill to continue in the event that the first attempt stalls? Just a
thought.
Subject: Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
From: Christoph Lameter
Date: 2015-09-18 22:07 UTC
To: Kyle Walker
Cc: Oleg Nesterov, akpm, mhocko, rientjes, hannes, vdavydov, linux-mm,
    linux-kernel, Tetsuo Handa, Stanislav Kozina

On Fri, 18 Sep 2015, Kyle Walker wrote:
> I do like the idea of not stalling completely in an oom just because
> the first attempt didn't go so well. Is there any possibility of
> simply having our cake and eating it too? Specifically, omitting
> TASK_UNINTERRUPTIBLE tasks as low-hanging fruit and allowing the oom
> kill to continue in the event that the first attempt stalls?

TASK_UNINTERRUPTIBLE tasks should not be sleeping that long, and they
*should react* in a reasonable timeframe. There is an alternative API
for the cases that cannot. Typically what is stalling is a write; if we
kill the process, it is pointless to wait for the write to complete.

See
https://lwn.net/Articles/288056/
http://www.ibm.com/developerworks/library/l-task-killable/
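[Editor's note] The killable-sleep API referenced above replaces plain uninterruptible waits with ones that fatal signals (such as the OOM killer's SIGKILL) can break. A minimal sketch of the pattern follows; `my_mutex`, `wq`, and `done` are hypothetical names used only for illustration:

```c
/*
 * Sketch of the TASK_KILLABLE pattern: the task still counts as "D"
 * state for load purposes, but a fatal signal wakes it, so an OOM kill
 * is not stuck behind the wait.
 */
if (mutex_lock_killable(&my_mutex))
	return -EINTR;		/* killed while waiting: back out */
/* ... critical section ... */
mutex_unlock(&my_mutex);

/* The same idea for waitqueues: */
if (wait_event_killable(wq, done))
	return -ERESTARTSYS;	/* woken by a fatal signal, not by "done" */
```

The trade-off is that every caller must now handle the error return and unwind cleanly, which is why not every uninterruptible sleep in the kernel has been converted.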
* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks 2015-09-18 17:00 ` Christoph Lameter @ 2015-09-19 8:32 ` Michal Hocko -1 siblings, 0 replies; 213+ messages in thread From: Michal Hocko @ 2015-09-19 8:32 UTC (permalink / raw) To: Christoph Lameter Cc: Oleg Nesterov, Kyle Walker, akpm, rientjes, hannes, vdavydov, linux-mm, linux-kernel, Tetsuo Handa, Stanislav Kozina On Fri 18-09-15 12:00:59, Christoph Lameter wrote: [...] > Subject: Allow multiple kills from the OOM killer > > The OOM killer currently aborts if it finds a process that already has > access to the reserve memory pool for exit processing. This is done so that > the reserves are not overcommitted, but on the other hand this also allows > only one process to be oom killed at a time. That process may be stuck > in D state. This has been posted in various forms many times over the past years. I still do not think this is the right approach to dealing with the problem. You can quickly deplete memory reserves this way without making further progress (I am afraid you can even trigger this from userspace without having big privileges), so even the administrator will have no way to intervene. > Signed-off-by: Christoph Lameter <cl@linux.com> > > Index: linux/mm/oom_kill.c > =================================================================== > --- linux.orig/mm/oom_kill.c 2015-09-18 11:58:52.963946782 -0500 > +++ linux/mm/oom_kill.c 2015-09-18 11:59:42.010684778 -0500 > @@ -264,10 +264,9 @@ enum oom_scan_t oom_scan_process_thread( > * This task already has access to memory reserves and is being killed. > * Don't allow any other task to have access to the reserves. > */ > - if (test_tsk_thread_flag(task, TIF_MEMDIE)) { > - if (oc->order != -1) > - return OOM_SCAN_ABORT; > - } > + if (test_tsk_thread_flag(task, TIF_MEMDIE)) > + return OOM_SCAN_CONTINUE; > + > if (!task->mm) > return OOM_SCAN_CONTINUE; -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks 2015-09-19 8:32 ` Michal Hocko @ 2015-09-19 14:33 ` Tetsuo Handa -1 siblings, 0 replies; 213+ messages in thread From: Tetsuo Handa @ 2015-09-19 14:33 UTC (permalink / raw) To: mhocko, cl Cc: oleg, kwalker, akpm, rientjes, hannes, vdavydov, linux-mm, linux-kernel, skozina Michal Hocko wrote: > This has been posted in various forms many times over the past years. I > still do not think this is the right approach to dealing with the problem. I do not think the "GFP_NOFS can fail" patch is the right approach, because that patch easily causes messages like the ones below. Buffer I/O error on dev sda1, logical block 34661831, lost async page write XFS: possible memory allocation deadlock in kmem_alloc (mode:0x8250) XFS: possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250) XFS: possible memory allocation deadlock in kmem_zone_alloc (mode:0x8250) Adding __GFP_NOFAIL will hide these messages, but the OOM stall remains anyway. I believe choosing more OOM victims is the only way to solve OOM stalls. > You can quickly deplete memory reserves this way without making further > progress (I am afraid you can even trigger this from userspace without > having big privileges), so even the administrator will have no way to > intervene. I think that the use of ALLOC_NO_WATERMARKS via TIF_MEMDIE is the underlying cause. ALLOC_NO_WATERMARKS via TIF_MEMDIE is intended for terminating the OOM victim task as soon as possible, but it turned out that it will not work if there is an invisible lock dependency. Therefore, why not give up the "there should be only up to 1 TIF_MEMDIE task" rule? What this patch (and many others posted in various forms many times over the past years) does is give up the "there should be only up to 1 TIF_MEMDIE task" rule. I think that we need to tolerate more than one TIF_MEMDIE task and somehow manage things so that memory reserves will not deplete.
My proposal, which favors all fatal_signal_pending() tasks evenly ( http://lkml.kernel.org/r/201509102318.GHG18789.OHMSLFJOQFOtFV@I-love.SAKURA.ne.jp ), suggests that the OOM victim task is unlikely to need all of the memory reserves. In other words, the OOM victim task can likely make forward progress if some amount of memory reserves is allowed (compared to normal tasks waiting for memory). So, I think that getting rid of the "ALLOC_NO_WATERMARKS via TIF_MEMDIE" rule and replacing test_thread_flag(TIF_MEMDIE) with fatal_signal_pending(current) will handle many cases, if fatal_signal_pending() tasks are allowed to access some amount of memory reserves. And my proposal, which chooses the next OOM victim upon timeout, will handle the remaining cases without depleting memory reserves. If you still want to keep the "there should be only up to 1 TIF_MEMDIE task" rule, what alternative do you have? (I do not like panic_on_oom_timeout because it is a more data-lossy approach than choosing the next OOM victim.) ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks 2015-09-19 14:33 ` Tetsuo Handa @ 2015-09-19 15:51 ` Michal Hocko -1 siblings, 0 replies; 213+ messages in thread From: Michal Hocko @ 2015-09-19 15:51 UTC (permalink / raw) To: Tetsuo Handa Cc: cl, oleg, kwalker, akpm, rientjes, hannes, vdavydov, linux-mm, linux-kernel, skozina On Sat 19-09-15 23:33:07, Tetsuo Handa wrote: > Michal Hocko wrote: > > This has been posted in various forms many times over the past years. I > > still do not think this is the right approach to dealing with the problem. > > I do not think the "GFP_NOFS can fail" patch is the right approach, because > that patch easily causes messages like the ones below. > > Buffer I/O error on dev sda1, logical block 34661831, lost async page write > XFS: possible memory allocation deadlock in kmem_alloc (mode:0x8250) > XFS: possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250) > XFS: possible memory allocation deadlock in kmem_zone_alloc (mode:0x8250) These messages just tell you that the allocation fails repeatedly. Have a look and check the code. They are basically open-coded NOFAIL allocations. They haven't been converted to actually tell the MM layer that they cannot fail because Dave said they have a long-term plan to change this code and basically implement different failing strategies. > Adding __GFP_NOFAIL will hide these messages, but the OOM stall remains anyway. > > I believe choosing more OOM victims is the only way to solve OOM stalls. I am very well aware of your position and all the attempts to tweak different code paths to actually pass your corner case. I, however, care for the longer-term goals more. And I believe that the page allocator and the reclaim should strive to be less deadlock prone in the first place. That includes more natural semantics, and a non-failing default semantic is really error prone IMHO.
We have been through this discussion many times already and I've tried to express that this is a long-term goal with incremental steps. I really hate to do "easy" things now just to feel better about a particular case, which will kick us back a little bit later. And from my own experience I can tell you that more non-deterministic OOM behavior is something people complain about. > > You can quickly deplete memory reserves this way without making further > > progress (I am afraid you can even trigger this from userspace without > > having big privileges), so even the administrator will have no way to > > intervene. > > I think that the use of ALLOC_NO_WATERMARKS via TIF_MEMDIE is the underlying > cause. ALLOC_NO_WATERMARKS via TIF_MEMDIE is intended for terminating the > OOM victim task as soon as possible, but it turned out that it will not > work if there is an invisible lock dependency. Of course. This is a heuristic and as such it cannot ever work in 100% of situations. And it is not the first heuristic we have for the OOM killer. The last time this was all rewritten was because the OOM killer was too unreliable/non-deterministic. Reports have decreased considerably since then. > Therefore, why not give up the > "there should be only up to 1 TIF_MEMDIE task" rule? This has been explained several times. There is no guarantee this would help, and _your_ own usecase shows how you can end up with such long lock dependency chains that you can easily eat up the whole memory reserves before you can make any progress. I do agree that a hand-brake mechanism is really desirable for those who really care. > What this patch (and many others posted in various forms many times over > the past years) does is give up the "there should be only up to 1 TIF_MEMDIE > task" rule. I think that we need to tolerate more than one TIF_MEMDIE task > and somehow manage things so that memory reserves will not deplete. But those two go against each other. [...]
> If you still want to keep the "there should be only up to 1 TIF_MEMDIE task" > rule, what alternative do you have? (I do not like panic_on_oom_timeout > because it is a more data-lossy approach than choosing the next OOM victim.) I am not married to the 1 TIF_MEMDIE task thing. I just think that there is still a lot of room for other improvements. The original issue which triggered this discussion again is a good example. I completely fail to see why a writer has to be unkillable when the fs is frozen. There are others which are more complicated, of course, including the whole class represented by GFP_NOFS allocations, as you have noted. But we still have room for improvement even in the reclaim. It was suggested quite some time ago that the memory mapped by the OOM victim might be unmapped. Basically what Oleg is proposing in the other email. I didn't get to read his email properly yet, but that should certainly help to reduce the problem space. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks 2015-09-19 14:33 ` Tetsuo Handa @ 2015-09-21 23:33 ` David Rientjes -1 siblings, 0 replies; 213+ messages in thread From: David Rientjes @ 2015-09-21 23:33 UTC (permalink / raw) To: Tetsuo Handa Cc: mhocko, cl, oleg, kwalker, akpm, hannes, vdavydov, linux-mm, linux-kernel, skozina On Sat, 19 Sep 2015, Tetsuo Handa wrote: > I think that the use of ALLOC_NO_WATERMARKS via TIF_MEMDIE is the underlying > cause. ALLOC_NO_WATERMARKS via TIF_MEMDIE is intended for terminating the > OOM victim task as soon as possible, but it turned out that it will not > work if there is an invisible lock dependency. Therefore, why not give up > the "there should be only up to 1 TIF_MEMDIE task" rule? > I don't see the connection between TIF_MEMDIE and ALLOC_NO_WATERMARKS being problematic. It is simply the mechanism by which we give oom killed processes access to memory reserves if they need it. I believe you are referring only to the oom killer stalling when it finds an oom victim. > What this patch (and many others posted in various forms many times over > the past years) does is give up the "there should be only up to 1 TIF_MEMDIE > task" rule. I think that we need to tolerate more than one TIF_MEMDIE task > and somehow manage things so that memory reserves will not deplete. > Your proposal, which I mostly agree with, tries to kill additional processes so that they allocate and drop the lock that the original victim depends on. My approach, from http://marc.info/?l=linux-kernel&m=144010444913702, is the same, but without the killing. It's unnecessary to kill every process on the system that is depending on the same lock, and we can't know which processes are stalling on that lock and which are not. I think it's much easier to simply identify a situation where a process has not exited in a timely manner and then provide processes access to memory reserves without being killed.
We hope that the victim will have queued its mutex_lock() and that allocators holding the lock will drop it after successfully utilizing memory reserves. We can mitigate immediate depletion of memory reserves by requiring each allocator to reclaim (or compact) and to call the oom killer, which identifies the timeout, before it is granted access to memory reserves for a single allocation, and to do schedule_timeout_killable(1) before returning. I don't know of any alternative solution that can guarantee that memory reserves cannot be depleted, unless memory reserves are 100% of memory. ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks 2015-09-21 23:33 ` David Rientjes @ 2015-09-22 5:33 ` Tetsuo Handa -1 siblings, 0 replies; 213+ messages in thread From: Tetsuo Handa @ 2015-09-22 5:33 UTC (permalink / raw) To: rientjes Cc: mhocko, cl, oleg, kwalker, akpm, hannes, vdavydov, linux-mm, linux-kernel, skozina David Rientjes wrote: > Your proposal, which I mostly agree with, tries to kill additional > processes so that they allocate and drop the lock that the original victim > depends on. My approach, from > http://marc.info/?l=linux-kernel&m=144010444913702, is the same, but > without the killing. It's unnecessary to kill every process on the system > that is depending on the same lock, and we can't know which processes are > stalling on that lock and which are not. Would you try your approach with the program below? (My reproducers are tested on XFS on a VM with 4 CPUs / 2048MB RAM.) ---------- oom-depleter3.c start ---------- #define _GNU_SOURCE #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <sched.h> static int zero_fd = EOF; static char *buf = NULL; static unsigned long size = 0; static int dummy(void *unused) { static char buffer[4096] = { }; int fd = open("/tmp/file", O_WRONLY | O_CREAT | O_APPEND, 0600); while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer) && fsync(fd) == 0); return 0; } static int trigger(void *unused) { read(zero_fd, buf, size); /* Will cause OOM due to overcommit */ return 0; } int main(int argc, char *argv[]) { unsigned long i; zero_fd = open("/dev/zero", O_RDONLY); for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) { char *cp = realloc(buf, size); if (!cp) { size >>= 1; break; } buf = cp; } /* * Create many child threads in order to enlarge the time lag between * the OOM killer setting TIF_MEMDIE on the thread group leader and * the OOM killer sending SIGKILL to that thread.
*/ for (i = 0; i < 1000; i++) { clone(dummy, malloc(1024) + 1024, CLONE_SIGHAND | CLONE_VM, NULL); } /* Let a child thread trigger the OOM killer. */ clone(trigger, malloc(4096) + 4096, CLONE_SIGHAND | CLONE_VM, NULL); /* Deplete all memory reserves using the time lag. */ for (i = size; i; i -= 4096) buf[i - 1] = 1; return * (char *) NULL; /* Kill all threads. */ } ---------- oom-depleter3.c end ---------- uptime > 350 of http://I-love.SAKURA.ne.jp/tmp/serial-20150922-1.txt.xz shows that the memory reserves were completely depleted, and uptime > 42 of http://I-love.SAKURA.ne.jp/tmp/serial-20150922-2.txt.xz shows that the memory reserves were not used at all. Is this result what you expected? ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks 2015-09-22 5:33 ` Tetsuo Handa @ 2015-09-22 23:32 ` David Rientjes -1 siblings, 0 replies; 213+ messages in thread From: David Rientjes @ 2015-09-22 23:32 UTC (permalink / raw) To: Tetsuo Handa Cc: mhocko, cl, oleg, kwalker, akpm, hannes, vdavydov, linux-mm, linux-kernel, skozina On Tue, 22 Sep 2015, Tetsuo Handa wrote: > David Rientjes wrote: > > Your proposal, which I mostly agree with, tries to kill additional > > processes so that they allocate and drop the lock that the original victim > > depends on. My approach, from > > http://marc.info/?l=linux-kernel&m=144010444913702, is the same, but > > without the killing. It's unecessary to kill every process on the system > > that is depending on the same lock, and we can't know which processes are > > stalling on that lock and which are not. > > Would you try your approach with below program? > (My reproducers are tested on XFS on a VM with 4 CPUs / 2048MB RAM.) > > ---------- oom-depleter3.c start ---------- > #define _GNU_SOURCE > #include <stdio.h> > #include <stdlib.h> > #include <unistd.h> > #include <sys/types.h> > #include <sys/stat.h> > #include <fcntl.h> > #include <sched.h> > > static int zero_fd = EOF; > static char *buf = NULL; > static unsigned long size = 0; > > static int dummy(void *unused) > { > static char buffer[4096] = { }; > int fd = open("/tmp/file", O_WRONLY | O_CREAT | O_APPEND, 0600); > while (write(fd, buffer, sizeof(buffer) == sizeof(buffer)) && > fsync(fd) == 0); > return 0; > } > > static int trigger(void *unused) > { > read(zero_fd, buf, size); /* Will cause OOM due to overcommit */ > return 0; > } > > int main(int argc, char *argv[]) > { > unsigned long i; > zero_fd = open("/dev/zero", O_RDONLY); > for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) { > char *cp = realloc(buf, size); > if (!cp) { > size >>= 1; > break; > } > buf = cp; > } > /* > * Create many child threads in order to enlarge time lag between > 
* the OOM killer sets TIF_MEMDIE to thread group leader and > * the OOM killer sends SIGKILL to that thread. > */ > for (i = 0; i < 1000; i++) { > clone(dummy, malloc(1024) + 1024, CLONE_SIGHAND | CLONE_VM, > NULL); > } > /* Let a child thread trigger the OOM killer. */ > clone(trigger, malloc(4096)+ 4096, CLONE_SIGHAND | CLONE_VM, NULL); > /* Deplete all memory reserve using the time lag. */ > for (i = size; i; i -= 4096) > buf[i - 1] = 1; > return * (char *) NULL; /* Kill all threads. */ > } > ---------- oom-depleter3.c end ---------- > > uptime > 350 of http://I-love.SAKURA.ne.jp/tmp/serial-20150922-1.txt.xz > shows that the memory reserves completely depleted and > uptime > 42 of http://I-love.SAKURA.ne.jp/tmp/serial-20150922-2.txt.xz > shows that the memory reserves was not used at all. > Is this result what you expected? > What are the results when the kernel isn't patched at all? The trade-off being made is that we want to attempt to make forward progress when there is an excessive stall in an oom victim making its exit rather than livelock the system forever waiting for memory that can never be allocated. I struggle to understand how the approach of randomly continuing to kill more and more processes in the hope that it slows down usage of memory reserves or that we get lucky is better. ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks 2015-09-22 23:32 ` David Rientjes @ 2015-09-23 12:03 ` Kyle Walker -1 siblings, 0 replies; 213+ messages in thread From: Kyle Walker @ 2015-09-23 12:03 UTC (permalink / raw) To: David Rientjes Cc: Tetsuo Handa, mhocko, Christoph Lameter, Oleg Nesterov, akpm, Johannes Weiner, vdavydov, linux-mm, linux-kernel, Stanislav Kozina On Tue, Sep 22, 2015 at 7:32 PM, David Rientjes <rientjes@google.com> wrote: > > I struggle to understand how the approach of randomly continuing to kill > more and more processes in the hope that it slows down usage of memory > reserves or that we get lucky is better. Thank you to one and all for the feedback. I agree: rather than treating TASK_UNINTERRUPTIBLE tasks as unkillable and omitting them from the oom selection process, continuing the carnage is likely to produce even more unpredictable results. At this time, I believe Oleg's solution of zapping the process's memory while it sleeps, with the fatal signal en route, is ideal. Kyle Walker ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks 2015-09-23 12:03 ` Kyle Walker @ 2015-09-24 11:50 ` Tetsuo Handa -1 siblings, 0 replies; 213+ messages in thread From: Tetsuo Handa @ 2015-09-24 11:50 UTC (permalink / raw) To: kwalker, rientjes Cc: mhocko, cl, oleg, akpm, hannes, vdavydov, linux-mm, linux-kernel, skozina Kyle Walker wrote: > I agree: rather than treating TASK_UNINTERRUPTIBLE tasks as unkillable > and omitting them from the oom selection process, continuing the > carnage is likely to produce even more unpredictable results. At this > time, I believe Oleg's solution of zapping the process's memory > while it sleeps, with the fatal signal en route, is ideal. I cannot help thinking about the worst case. (1) If the memory zapping code successfully reclaimed some memory from the mm struct used by the OOM victim, what guarantees that the reclaimed memory is used by OOM victims (and by processes which are blocking OOM victims)? David's "global access to memory reserves" allows a local unprivileged user to deplete memory reserves; it could allow that user to deplete the reclaimed memory as well. I think that my "Favor kthread and dying threads over normal threads" ( http://lkml.kernel.org/r/1442939668-4421-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp ) would allow the reclaimed memory to be used by OOM victims and kernel threads if the reclaimed memory is added to the free list bit by bit, in a way that keeps the watermark low enough to prevent normal threads from allocating the reclaimed memory. But my patch still fails if normal threads are blocking the OOM victims or if unrelated kernel threads consume the reclaimed memory. (2) If the memory zapping code failed to reclaim enough memory from the mm struct used by the OOM victim, what mechanism can solve the OOM stalls? Some administrators set /proc/pid/oom_score_adj to -1000 for most of their enterprise processes (e.g. java), and as a consequence only trivial processes (e.g.
grep / sed) are candidates for OOM victims. Moreover, a local unprivileged user can easily fool the OOM killer using decoy tasks (which consume little memory and have /proc/pid/oom_score_adj set to 999). (3) If the memory zapping code reclaimed no memory due to ->mmap_sem contention, what mechanism can solve the OOM stalls? While we don't allocate much memory with ->mmap_sem held for writing, the task which is holding ->mmap_sem for writing can be chosen as one of the OOM victims. If such a task receives SIGKILL but TIF_MEMDIE is not set, it can form an OOM livelock unless all memory allocations with ->mmap_sem held for writing are __GFP_FS allocations and that task can reach out_of_memory() (i.e. it is not blocked by unexpected factors such as waiting for the filesystem's writeback). In the end, I think we have to consider what to do if the memory zapping code fails. ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks 2015-09-19 8:32 ` Michal Hocko @ 2015-09-19 14:44 ` Oleg Nesterov -1 siblings, 0 replies; 213+ messages in thread From: Oleg Nesterov @ 2015-09-19 14:44 UTC (permalink / raw) To: Michal Hocko Cc: Christoph Lameter, Kyle Walker, akpm, rientjes, hannes, vdavydov, linux-mm, linux-kernel, Tetsuo Handa, Stanislav Kozina On 09/19, Michal Hocko wrote: > > This has been posted in various forms many times over past years. I > still do not think this is a right approach of dealing with the problem. Agreed. But I still think it makes sense to try to kill another task if the victim refuses to die. Yes, the details are not clear to me. Oleg. ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks 2015-09-18 17:00 ` Christoph Lameter @ 2015-09-21 23:27 ` David Rientjes -1 siblings, 0 replies; 213+ messages in thread From: David Rientjes @ 2015-09-21 23:27 UTC (permalink / raw) To: Christoph Lameter Cc: Oleg Nesterov, Kyle Walker, akpm, mhocko, hannes, vdavydov, linux-mm, linux-kernel, Tetsuo Handa, Stanislav Kozina On Fri, 18 Sep 2015, Christoph Lameter wrote: > Subject: Allow multiple kills from the OOM killer > > The OOM killer currently aborts if it finds a process that already is having > access to the reserve memory pool for exit processing. This is done so that > the reserves are not overcommitted but on the other hand this also allows > only one process being oom killed at the time. That process may be stuck > in D state. > > Signed-off-by: Christoph Lameter <cl@linux.com> > > Index: linux/mm/oom_kill.c > =================================================================== > --- linux.orig/mm/oom_kill.c 2015-09-18 11:58:52.963946782 -0500 > +++ linux/mm/oom_kill.c 2015-09-18 11:59:42.010684778 -0500 > @@ -264,10 +264,9 @@ enum oom_scan_t oom_scan_process_thread( > * This task already has access to memory reserves and is being killed. > * Don't allow any other task to have access to the reserves. > */ > - if (test_tsk_thread_flag(task, TIF_MEMDIE)) { > - if (oc->order != -1) > - return OOM_SCAN_ABORT; > - } > + if (test_tsk_thread_flag(task, TIF_MEMDIE)) > + return OOM_SCAN_CONTINUE; > + > if (!task->mm) > return OOM_SCAN_CONTINUE; > If this would result in the newly chosen process being guaranteed to exit, this would be fine. Unfortunately, no such guarantee is possible. If a thread is holding a contended mutex that the victim(s) require, this serial oom killer could eventually panic the system if that thread is OOM_DISABLE. 
The solution that we have merged internally is described at http://marc.info/?l=linux-kernel&m=144010444913702 -- we provide access to memory reserves to processes that find a stalled exit in the oom killer so that they may allocate. It comes along with a test module that takes a contended mutex and ensures that forward progress is made as long as memory reserves are not depleted. We can't actually guarantee that memory reserves won't be depleted, but we (1) hope that nobody is actually allocating a lot of memory before dropping a mutex and (2) want to avoid the alternative, which is a system livelock. This will address situations such as

	allocator			oom victim
	---------			----------
	mutex_lock(lock)
	alloc_pages(GFP_KERNEL)
					mutex_lock(lock)
	mutex_unlock(lock)
					handle SIGKILL

since without a solution such as mine this results in a livelock: the GFP_KERNEL allocation stalls forever waiting for the oom victim to acquire the mutex and exit. This also works if the allocator is OOM_DISABLE. This won't handle other situations where the victim gets wedged in D state and is not allocating memory, but this is by far the more common occurrence that we have dealt with.
* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks 2015-09-18 15:41 ` Christoph Lameter @ 2015-09-19 8:25 ` Michal Hocko -1 siblings, 0 replies; 213+ messages in thread From: Michal Hocko @ 2015-09-19 8:25 UTC (permalink / raw) To: Christoph Lameter Cc: Oleg Nesterov, Kyle Walker, akpm, rientjes, hannes, vdavydov, linux-mm, linux-kernel, Tetsuo Handa, Stanislav Kozina On Fri 18-09-15 10:41:09, Christoph Lameter wrote: [...] > if (test_tsk_thread_flag(task, TIF_MEMDIE)) { > - if (oc->order != -1) > - return OOM_SCAN_ABORT; > + if (unlikely(frozen(task))) > + __thaw_task(task); TIF_MEMDIE processes will get thawed automatically and then cannot be frozen again. Have a look at mark_oom_victim. > } > if (!task->mm) > return OOM_SCAN_CONTINUE; -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks 2015-09-17 17:59 ` Kyle Walker @ 2015-09-19 8:22 ` Michal Hocko -1 siblings, 0 replies; 213+ messages in thread From: Michal Hocko @ 2015-09-19 8:22 UTC (permalink / raw) To: Kyle Walker Cc: akpm, rientjes, hannes, vdavydov, oleg, linux-mm, linux-kernel On Thu 17-09-15 13:59:43, Kyle Walker wrote: > Currently, the oom killer will attempt to kill a process that is in > TASK_UNINTERRUPTIBLE state. For tasks in this state for an exceptional > period of time, such as processes writing to a frozen filesystem during > a lengthy backup operation, this can result in a deadlock condition as > related processes memory access will stall within the page fault > handler. I am not familiar with the fs freezing code so I might be missing something important here. __sb_start_write waits for the frozen fs via wait_event, which is an uninterruptible (UN) sleep. Why can't we sleep here in an interruptible (IN) sleep and return with EINTR when interrupted? I would consider this a better behavior not only because of OOM but because having unkillable tasks in general is undesirable. AFAIU the fs might be frozen forever and the admin cannot do anything about the pending processes. > Within oom_unkillable_task(), check for processes in > TASK_UNINTERRUPTIBLE (TASK_KILLABLE omitted). The oom killer will > move on to another task. Nack to this. TASK_UNINTERRUPTIBLE should be a time-constrained/bounded state. Using it as an oom victim criterion makes the victim selection less deterministic, which is undesirable. As much as I am aware of potential issues with the current implementation, making the behavior more random doesn't really help.
> Signed-off-by: Kyle Walker <kwalker@redhat.com> > --- > mm/oom_kill.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 1ecc0bc..66f03f8 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -131,6 +131,10 @@ static bool oom_unkillable_task(struct task_struct *p, > if (memcg && !task_in_mem_cgroup(p, memcg)) > return true; > > + /* Uninterruptible tasks should not be killed unless in TASK_WAKEKILL */ > + if (p->state == TASK_UNINTERRUPTIBLE) > + return true; > + > /* p may not have freeable memory in nodemask */ > if (!has_intersects_mems_allowed(p, nodemask)) > return true; > -- > 2.4.3 -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks 2015-09-19 8:22 ` Michal Hocko @ 2015-09-21 23:08 ` David Rientjes -1 siblings, 0 replies; 213+ messages in thread From: David Rientjes @ 2015-09-21 23:08 UTC (permalink / raw) To: Michal Hocko Cc: Kyle Walker, akpm, hannes, vdavydov, oleg, linux-mm, linux-kernel On Sat, 19 Sep 2015, Michal Hocko wrote: > Nack to this. TASK_UNINTERRUPTIBLE should be time constrained/bounded > state. Using it as an oom victim criteria makes the victim selection > less deterministic which is undesirable. As much as I am aware of > potential issues with the current implementation, making the behavior > more random doesn't really help. > Agreed. We can't avoid killing a process simply because it is in D state; this isn't an indication that the process will not be able to exit, and in the worst case it could panic the system if all other processes cannot be oom killed. ^ permalink raw reply [flat|nested] 213+ messages in thread
* can't oom-kill zap the victim's memory? 2015-09-17 17:59 ` Kyle Walker @ 2015-09-19 15:03 ` Oleg Nesterov -1 siblings, 0 replies; 213+ messages in thread From: Oleg Nesterov @ 2015-09-19 15:03 UTC (permalink / raw) To: Kyle Walker, Christoph Lameter, Linus Torvalds, Michal Hocko Cc: akpm, rientjes, hannes, vdavydov, linux-mm, linux-kernel, Stanislav Kozina, Tetsuo Handa On 09/17, Kyle Walker wrote: > > Currently, the oom killer will attempt to kill a process that is in > TASK_UNINTERRUPTIBLE state. For tasks in this state for an exceptional > period of time, such as processes writing to a frozen filesystem during > a lengthy backup operation, this can result in a deadlock condition as > related processes memory access will stall within the page fault > handler. And there are other potential reasons for deadlock. Stupid idea. Can't we help the memory hog to free its memory? This is orthogonal to other improvements we can do. Please don't tell me the patch below is ugly, incomplete and suboptimal in many ways, I know ;) I am not sure it is even correct. Just to explain what I mean. Perhaps oom_unmap_func() should only zap the anonymous vmas... and there are a lot of other details which should be discussed if this can make any sense. Oleg. 
--- --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -493,6 +493,26 @@ void oom_killer_enable(void) up_write(&oom_sem); } +static struct mm_struct *oom_unmap_mm; + +static void oom_unmap_func(struct work_struct *work) +{ + struct mm_struct *mm = xchg(&oom_unmap_mm, NULL); + + if (!atomic_inc_not_zero(&mm->mm_users)) + return; + + // If this is not safe we can do use_mm() + unuse_mm() + down_read(&mm->mmap_sem); + if (mm->mmap) + zap_page_range(mm->mmap, 0, TASK_SIZE, NULL); + up_read(&mm->mmap_sem); + + mmput(mm); + mmdrop(mm); +} +static DECLARE_WORK(oom_unmap_work, oom_unmap_func); + #define K(x) ((x) << (PAGE_SHIFT-10)) /* * Must be called while holding a reference to p, which will be released upon @@ -570,8 +590,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, victim = p; } - /* mm cannot safely be dereferenced after task_unlock(victim) */ mm = victim->mm; + atomic_inc(&mm->mm_count); mark_tsk_oom_victim(victim); pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n", task_pid_nr(victim), victim->comm, K(victim->mm->total_vm), @@ -604,6 +624,10 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, rcu_read_unlock(); do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true); + if (cmpxchg(&oom_unmap_mm, NULL, mm)) + mmdrop(mm); + else + queue_work(system_unbound_wq, &oom_unmap_work); put_task_struct(victim); } #undef K ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory? 2015-09-19 15:03 ` Oleg Nesterov @ 2015-09-19 15:10 ` Oleg Nesterov -1 siblings, 0 replies; 213+ messages in thread From: Oleg Nesterov @ 2015-09-19 15:10 UTC (permalink / raw) To: Kyle Walker, Christoph Lameter, Linus Torvalds, Michal Hocko Cc: akpm, rientjes, hannes, vdavydov, linux-mm, linux-kernel, Stanislav Kozina, Tetsuo Handa (off-topic) On 09/19, Oleg Nesterov wrote: > > @@ -570,8 +590,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > victim = p; > } > > - /* mm cannot safely be dereferenced after task_unlock(victim) */ > mm = victim->mm; > + atomic_inc(&mm->mm_count); Btw, I think we need this change anyway. This is pure theoretical, but otherwise this task can exit and free its mm_struct right after task_unlock(), then this mm_struct can be reallocated and used by another task, so we can't trust the "p->mm == mm" check below. Oleg. ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory? 2015-09-19 15:03 ` Oleg Nesterov @ 2015-09-19 15:58 ` Michal Hocko -1 siblings, 0 replies; 213+ messages in thread From: Michal Hocko @ 2015-09-19 15:58 UTC (permalink / raw) To: Oleg Nesterov Cc: Kyle Walker, Christoph Lameter, Linus Torvalds, akpm, rientjes, hannes, vdavydov, linux-mm, linux-kernel, Stanislav Kozina, Tetsuo Handa On Sat 19-09-15 17:03:16, Oleg Nesterov wrote: > On 09/17, Kyle Walker wrote: > > > > Currently, the oom killer will attempt to kill a process that is in > > TASK_UNINTERRUPTIBLE state. For tasks in this state for an exceptional > > period of time, such as processes writing to a frozen filesystem during > > a lengthy backup operation, this can result in a deadlock condition as > > related processes memory access will stall within the page fault > > handler. > > And there are other potential reasons for deadlock. > > Stupid idea. Can't we help the memory hog to free its memory? This is > orthogonal to other improvements we can do. > > Please don't tell me the patch below is ugly, incomplete and suboptimal > in many ways, I know ;) I am not sure it is even correct. Just to explain > what I mean. Unmapping the memory for the oom victim has already been mentioned as a way to improve the OOM killer behavior. Nobody has implemented that yet though unfortunately. I have that on my TODO list since we have discussed it with Mel at LSF. > Perhaps oom_unmap_func() should only zap the anonymous vmas... and there > are a lot of other details which should be discussed if this can make any > sense. I have just returned from an internal conference so my head is completely cabbaged. I will have a look on Monday. From a quick look the idea is feasible. You cannot rely on the worker context because workqueues might be completely stuck at this stage. You also cannot take mmap_sem directly because that might be held already so you need a try_lock instead. 
Focusing on anonymous vmas first sounds like a good idea to me because that would be simpler I guess. > > Oleg. > --- > > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -493,6 +493,26 @@ void oom_killer_enable(void) > up_write(&oom_sem); > } > > +static struct mm_struct *oom_unmap_mm; > + > +static void oom_unmap_func(struct work_struct *work) > +{ > + struct mm_struct *mm = xchg(&oom_unmap_mm, NULL); > + > + if (!atomic_inc_not_zero(&mm->mm_users)) > + return; > + > + // If this is not safe we can do use_mm() + unuse_mm() > + down_read(&mm->mmap_sem); > + if (mm->mmap) > + zap_page_range(mm->mmap, 0, TASK_SIZE, NULL); > + up_read(&mm->mmap_sem); > + > + mmput(mm); > + mmdrop(mm); > +} > +static DECLARE_WORK(oom_unmap_work, oom_unmap_func); > + > #define K(x) ((x) << (PAGE_SHIFT-10)) > /* > * Must be called while holding a reference to p, which will be released upon > @@ -570,8 +590,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > victim = p; > } > > - /* mm cannot safely be dereferenced after task_unlock(victim) */ > mm = victim->mm; > + atomic_inc(&mm->mm_count); > mark_tsk_oom_victim(victim); > pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n", > task_pid_nr(victim), victim->comm, K(victim->mm->total_vm), > @@ -604,6 +624,10 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > rcu_read_unlock(); > > do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true); > + if (cmpxchg(&oom_unmap_mm, NULL, mm)) > + mmdrop(mm); > + else > + queue_work(system_unbound_wq, &oom_unmap_work); > put_task_struct(victim); > } > #undef K -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory? 2015-09-19 15:58 ` Michal Hocko @ 2015-09-20 13:16 ` Oleg Nesterov -1 siblings, 0 replies; 213+ messages in thread From: Oleg Nesterov @ 2015-09-20 13:16 UTC (permalink / raw) To: Michal Hocko Cc: Kyle Walker, Christoph Lameter, Linus Torvalds, akpm, rientjes, hannes, vdavydov, linux-mm, linux-kernel, Stanislav Kozina, Tetsuo Handa On 09/19, Michal Hocko wrote: > > On Sat 19-09-15 17:03:16, Oleg Nesterov wrote: > > > > Stupid idea. Can't we help the memory hog to free its memory? This is > > orthogonal to other improvements we can do. > > > > Please don't tell me the patch below is ugly, incomplete and suboptimal > > in many ways, I know ;) I am not sure it is even correct. Just to explain > > what I mean. > > Unmapping the memory for the oom victim has already been mentioned as a > way to improve the OOM killer behavior. Nobody has implemented that yet > though unfortunately. I have that on my TODO list since we have > discussed it with Mel at LSF. OK, good. So perhaps we should try to do this. > > > Perhaps oom_unmap_func() should only zap the anonymous vmas... and there > > are a lot of other details which should be discussed if this can make any > > sense. > > I have just returned from an internal conference so my head is > completely cabbaged. I will have a look on Monday. From a quick look > the idea is feasible. You cannot rely on the worker context because > workqueues might be completely stuck at this stage. Yes this is true. See another email, probably oom-kill.c needs its own kthread. And again, we should actually try to avoid queue_work or queue_kthread_work in any case. But not in the initial implementation. And the initial implementation could use workqueues, I think. In the likely case the system_unbound_wq pool should have an idle thread. > You also cannot > take mmap_sem directly because that might be held already so you need > a try_lock instead. Still can't understand this part. 
See other emails, perhaps I missed something. > Focusing on anonymous vmas first sounds like a good > idea to me because that would be simpler I guess. And safer. Oleg. ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory? 2015-09-19 15:03 ` Oleg Nesterov @ 2015-09-19 22:24 ` Linus Torvalds -1 siblings, 0 replies; 213+ messages in thread From: Linus Torvalds @ 2015-09-19 22:24 UTC (permalink / raw) To: Oleg Nesterov Cc: Kyle Walker, Christoph Lameter, Michal Hocko, Andrew Morton, David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm, Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa On Sat, Sep 19, 2015 at 8:03 AM, Oleg Nesterov <oleg@redhat.com> wrote: > + > +static void oom_unmap_func(struct work_struct *work) > +{ > + struct mm_struct *mm = xchg(&oom_unmap_mm, NULL); > + > + if (!atomic_inc_not_zero(&mm->mm_users)) > + return; > + > + // If this is not safe we can do use_mm() + unuse_mm() > + down_read(&mm->mmap_sem); I don't think this is safe. What makes you sure that we might not deadlock on the mmap_sem here? For all we know, the process that is going out of memory is in the middle of a mmap(), and already holds the mmap_sem for writing. No? So at the very least that needs to be a trylock, I think. And I'm not sure zap_page_range() is ok with the mmap_sem only held for reading. Normally our rule is that you can *populate* the page tables concurrently, but you can't tear them down. Linus ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory? 2015-09-19 22:24 ` Linus Torvalds @ 2015-09-19 23:00 ` Raymond Jennings -1 siblings, 0 replies; 213+ messages in thread From: Raymond Jennings @ 2015-09-19 23:00 UTC (permalink / raw) To: Linus Torvalds, Oleg Nesterov Cc: Kyle Walker, Christoph Lameter, Michal Hocko, Andrew Morton, David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm, Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa On 09/19/15 15:24, Linus Torvalds wrote: > On Sat, Sep 19, 2015 at 8:03 AM, Oleg Nesterov <oleg@redhat.com> wrote: >> + >> +static void oom_unmap_func(struct work_struct *work) >> +{ >> + struct mm_struct *mm = xchg(&oom_unmap_mm, NULL); >> + >> + if (!atomic_inc_not_zero(&mm->mm_users)) >> + return; >> + >> + // If this is not safe we can do use_mm() + unuse_mm() >> + down_read(&mm->mmap_sem); > I don't think this is safe. > > What makes you sure that we might not deadlock on the mmap_sem here? > For all we know, the process that is going out of memory is in the > middle of a mmap(), and already holds the mmap_sem for writing. No? Potentially stupid question that others may be asking: Is it legal to return EINTR from mmap() to let a SIGKILL from the OOM handler punch the task out of the kernel and back to userspace? (sorry for the dupe btw, new email client snuck in html and I got bounced) > So at the very least that needs to be a trylock, I think. And I'm not > sure zap_page_range() is ok with the mmap_sem only held for reading. > Normally our rule is that you can *populate* the page tables > concurrently, but you can't tear the down. > > Linus > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory? 2015-09-19 23:00 ` Raymond Jennings @ 2015-09-19 23:13 ` Linus Torvalds -1 siblings, 0 replies; 213+ messages in thread From: Linus Torvalds @ 2015-09-19 23:13 UTC (permalink / raw) To: Raymond Jennings Cc: Oleg Nesterov, Kyle Walker, Christoph Lameter, Michal Hocko, Andrew Morton, David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm, Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa On Sat, Sep 19, 2015 at 4:00 PM, Raymond Jennings <shentino@gmail.com> wrote: > > Potentially stupid question that others may be asking: Is it legal to return > EINTR from mmap() to let a SIGKILL from the OOM handler punch the task out > of the kernel and back to userspace? Yes. Note that mmap() itself seldom sleeps or allocates much memory (yeah, there's the vma itself and some minimal stuff), so it's mainly an issue for things like MAP_POPULATE etc. The more common situation is things like uninterruptible reads when a device (or network) is not responding, and we have special support for "killable" waits that act like normal uninterruptible waits but can be interrupted by deadly signals, exactly because for those cases we don't need to worry about things like POSIX return value guarantees ("all or nothing" for file reads) etc. So you do generally have to write extra code for the "killable sleep". But it's a good thing to do, if you notice that certain cases aren't responding well to oom killing because they keep on waiting. Linus ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory?
  2015-09-19 22:24 ` Linus Torvalds
@ 2015-09-20  9:33 ` Michal Hocko
  -1 siblings, 0 replies; 213+ messages in thread
From: Michal Hocko @ 2015-09-20 9:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Kyle Walker, Christoph Lameter, Andrew Morton,
	David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm,
	Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

On Sat 19-09-15 15:24:02, Linus Torvalds wrote:
> On Sat, Sep 19, 2015 at 8:03 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> > +
> > +static void oom_unmap_func(struct work_struct *work)
> > +{
> > +	struct mm_struct *mm = xchg(&oom_unmap_mm, NULL);
> > +
> > +	if (!atomic_inc_not_zero(&mm->mm_users))
> > +		return;
> > +
> > +	// If this is not safe we can do use_mm() + unuse_mm()
> > +	down_read(&mm->mmap_sem);
>
> I don't think this is safe.
>
> What makes you sure that we might not deadlock on the mmap_sem here?
> For all we know, the process that is going out of memory is in the
> middle of a mmap(), and already holds the mmap_sem for writing. No?
>
> So at the very least that needs to be a trylock, I think.

Agreed.

> And I'm not
> sure zap_page_range() is ok with the mmap_sem only held for reading.
> Normally our rule is that you can *populate* the page tables
> concurrently, but you can't tear the down

Actually mmap_sem for reading should be sufficient because we do not
alter the layout. Both MADV_DONTNEED and MADV_FREE require read mmap_sem
for example.
--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory?
  2015-09-20  9:33 ` Michal Hocko
@ 2015-09-20 13:06 ` Oleg Nesterov
  -1 siblings, 0 replies; 213+ messages in thread
From: Oleg Nesterov @ 2015-09-20 13:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linus Torvalds, Kyle Walker, Christoph Lameter, Andrew Morton,
	David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm,
	Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

On 09/20, Michal Hocko wrote:
>
> On Sat 19-09-15 15:24:02, Linus Torvalds wrote:
> > On Sat, Sep 19, 2015 at 8:03 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> > > +
> > > +static void oom_unmap_func(struct work_struct *work)
> > > +{
> > > +	struct mm_struct *mm = xchg(&oom_unmap_mm, NULL);
> > > +
> > > +	if (!atomic_inc_not_zero(&mm->mm_users))
> > > +		return;
> > > +
> > > +	// If this is not safe we can do use_mm() + unuse_mm()
> > > +	down_read(&mm->mmap_sem);
> >
> > I don't think this is safe.
> >
> > What makes you sure that we might not deadlock on the mmap_sem here?
> > For all we know, the process that is going out of memory is in the
> > middle of a mmap(), and already holds the mmap_sem for writing. No?
> >
> > So at the very least that needs to be a trylock, I think.
>
> Agreed.

Why? See my reply to Linus's email. Just in case, yes sure the
unconditional down_read() is suboptimal, but this is minor compared to
other problems we need to solve.

> > And I'm not
> > sure zap_page_range() is ok with the mmap_sem only held for reading.
> > Normally our rule is that you can *populate* the page tables
> > concurrently, but you can't tear the down
>
> Actually mmap_sem for reading should be sufficient because we do not
> alter the layout. Both MADV_DONTNEED and MADV_FREE require read mmap_sem
> for example.

Yes, but see the ->vm_flags check in madvise_dontneed().

Oleg.

^ permalink raw reply	[flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory?
  2015-09-19 22:24 ` Linus Torvalds
@ 2015-09-20 12:56 ` Oleg Nesterov
  -1 siblings, 0 replies; 213+ messages in thread
From: Oleg Nesterov @ 2015-09-20 12:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kyle Walker, Christoph Lameter, Michal Hocko, Andrew Morton,
	David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm,
	Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

On 09/19, Linus Torvalds wrote:
>
> On Sat, Sep 19, 2015 at 8:03 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> > +
> > +static void oom_unmap_func(struct work_struct *work)
> > +{
> > +	struct mm_struct *mm = xchg(&oom_unmap_mm, NULL);
> > +
> > +	if (!atomic_inc_not_zero(&mm->mm_users))
> > +		return;
> > +
> > +	// If this is not safe we can do use_mm() + unuse_mm()
> > +	down_read(&mm->mmap_sem);
>
> I don't think this is safe.
>
> What makes you sure that we might not deadlock on the mmap_sem here?
> For all we know, the process that is going out of memory is in the
> middle of a mmap(), and already holds the mmap_sem for writing. No?

In this case the workqueue thread will block. But it can not block
forever. I mean, if it can then the killed process will never exit
(exit_mm does down_read) and release its memory, so we lose anyway. But
let me repeat, this patch is obviously not complete/etc.

> So at the very least that needs to be a trylock, I think.

And we want to avoid using workqueues when the caller can do this
directly. And in this case we certainly need trylock. But this needs
some refactoring: we do not want to do this under oom_lock, otoh it
makes sense to do this from mark_oom_victim() if current && killed, and
a lot more details.

The workqueue thread has other reasons for trylock, but probably not in
the initial version of this patch. And perhaps we should use a dedicated
kthread and not use workqueues at all.

And yes, a single "mm_struct *oom_unmap_mm" is ugly, it should be the
list of mm's to unmap, but then at least we need MMF_MEMDIE.

> And I'm not
> sure zap_page_range() is ok with the mmap_sem only held for reading.
> Normally our rule is that you can *populate* the page tables
> concurrently, but you can't tear the down.

Well, according to madvise_need_mmap_write() MADV_DONTNEED does this
under down_read(). But yes, yes, this is probably not right anyway. Say,
VM_LOCKED... That is why I mentioned that perhaps this should only unmap
the anonymous pages. We can probably add a zap_details->for_oom hint.

Another question is whether it is safe to abuse the foreign mm this way.
Well, zap_page_range_single() does this, so this is probably safe. But
we can do use_mm().

Oleg.

^ permalink raw reply	[flat|nested] 213+ messages in thread
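[Oleg's "only unmap the anonymous pages" idea can be sketched as a VMA walk. A hedged, non-runnable fragment against the mm API of that era (`zap_page_range()` took a `zap_details` argument then); the skip set is illustrative, not a vetted list, and `vma_is_anonymous()` stands in for whatever anon-vma check is used.]

```c
/* Sketch: reap only what is safe to drop without the task's cooperation. */
struct vm_area_struct *vma;

for (vma = mm->mmap; vma; vma = vma->vm_next) {
	if (!vma_is_anonymous(vma))	/* file-backed pages: leave alone */
		continue;
	if (vma->vm_flags & (VM_LOCKED | VM_HUGETLB | VM_PFNMAP))
		continue;		/* special mappings: skip */
	zap_page_range(vma, vma->vm_start,
		       vma->vm_end - vma->vm_start, NULL);
}
```

[Anonymous pages are discardable here because the victim is already dead by SIGKILL: nothing will ever read them back, so zapping them can only free memory, never corrupt state visible elsewhere.]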
* Re: can't oom-kill zap the victim's memory?
  2015-09-20 12:56 ` Oleg Nesterov
@ 2015-09-20 18:05 ` Linus Torvalds
  -1 siblings, 0 replies; 213+ messages in thread
From: Linus Torvalds @ 2015-09-20 18:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kyle Walker, Christoph Lameter, Michal Hocko, Andrew Morton,
	David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm,
	Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

On Sun, Sep 20, 2015 at 5:56 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>
> In this case the workqueue thread will block.

What workqueue thread?

  pagefault_out_of_memory ->
    out_of_memory ->
      oom_kill_process

as far as I can tell, this can be called by any task. Now, that
pagefault case should only happen when the page fault comes from user
space, but we also have

  __alloc_pages_slowpath ->
    __alloc_pages_may_oom ->
      out_of_memory ->
        oom_kill_process

which can be called from just about any context (but atomic
allocations will never get here, so it can schedule etc).

So what's your point? Explain again just how do you guarantee that you
can take the mmap_sem.

               Linus

^ permalink raw reply	[flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory?
  2015-09-20 18:05 ` Linus Torvalds
@ 2015-09-20 18:21 ` Raymond Jennings
  -1 siblings, 0 replies; 213+ messages in thread
From: Raymond Jennings @ 2015-09-20 18:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Kyle Walker, Christoph Lameter, Michal Hocko,
	Andrew Morton, David Rientjes, Johannes Weiner, Vladimir Davydov,
	linux-mm, Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

On Sun, Sep 20, 2015 at 11:05 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Sun, Sep 20, 2015 at 5:56 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>
>> In this case the workqueue thread will block.
>
> What workqueue thread?
>
> pagefault_out_of_memory ->
>   out_of_memory ->
>     oom_kill_process
>
> as far as I can tell, this can be called by any task. Now, that
> pagefault case should only happen when the page fault comes from user
> space, but we also have
>
> __alloc_pages_slowpath ->
>   __alloc_pages_may_oom ->
>     out_of_memory ->
>       oom_kill_process
>
> which can be called from just about any context (but atomic
> allocations will never get here, so it can schedule etc).
>
> So what's your point? Explain again just how do you guarantee that you
> can take the mmap_sem.
>
>               Linus

Would it be a cleaner design in general to require all threads to
completely exit kernel space before being terminated? Possibly
expedited by noticing fatal signals and riding the EINTR rocket back up
the stack?

My two cents: If we do that we won't have to worry about fatally
wounded tasks slipping into a coma before they cough up any semaphores
or locks.
^ permalink raw reply	[flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory?
  2015-09-20 18:05 ` Linus Torvalds
@ 2015-09-20 19:07 ` Raymond Jennings
  -1 siblings, 0 replies; 213+ messages in thread
From: Raymond Jennings @ 2015-09-20 19:07 UTC (permalink / raw)
  To: Linus Torvalds, Oleg Nesterov
  Cc: Kyle Walker, Christoph Lameter, Michal Hocko, Andrew Morton,
	David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm,
	Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

On 09/20/15 11:05, Linus Torvalds wrote:
> On Sun, Sep 20, 2015 at 5:56 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> In this case the workqueue thread will block.
> What workqueue thread?
>
> pagefault_out_of_memory ->
>   out_of_memory ->
>     oom_kill_process
>
> as far as I can tell, this can be called by any task. Now, that
> pagefault case should only happen when the page fault comes from user
> space, but we also have
>
> __alloc_pages_slowpath ->
>   __alloc_pages_may_oom ->
>     out_of_memory ->
>       oom_kill_process
>
> which can be called from just about any context (but atomic
> allocations will never get here, so it can schedule etc).

I think in this case the oom killer should just slap a SIGKILL on the
task and then back out, and whatever needed the memory should just wait
patiently for the sacrificial lamb to commit seppuku.

Which, btw, we should IMO encourage ASAP in the context of the lamb by
having anything potentially locky or semaphory pay attention to whether
the task in question has a fatal signal pending, and if so, drop
everything and run like hell so that the task can cough up any locks or
semaphores.

> So what's your point? Explain again just how do you guarantee that you
> can take the mmap_sem.
>
>                Linus

Also, I observed that a task in the middle of dumping core doesn't
respond to signals while it's dumping, and I would guess that might be
the case even if the task receives a SIGKILL from the OOM handler.

Just a potential observation.

^ permalink raw reply	[flat|nested] 213+ messages in thread
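[The "pay attention to fatal signals and bail" behavior Raymond asks for already has kernel helpers. A hedged, non-runnable sketch of the common pattern (`fatal_signal_pending()` and `mutex_lock_killable()` are real helpers; the surrounding names are illustrative):]

```c
/* In a long-running kernel loop: bail out once SIGKILL has been queued,
 * dropping any locks on the way out so the task can reach do_exit(). */
if (fatal_signal_pending(current)) {
	ret = -EINTR;
	goto out_unlock;
}

/* When sleeping on a lock: wake up for deadly signals only. */
if (mutex_lock_killable(&some_mutex))
	return -EINTR;
```

[This is the per-sleep opt-in Linus described earlier in the thread: each wait site has to be converted and each caller has to handle the early return.]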
* Re: can't oom-kill zap the victim's memory?
  2015-09-20 19:07 ` Raymond Jennings
@ 2015-09-21 13:57 ` Oleg Nesterov
  -1 siblings, 0 replies; 213+ messages in thread
From: Oleg Nesterov @ 2015-09-21 13:57 UTC (permalink / raw)
  To: Raymond Jennings
  Cc: Linus Torvalds, Kyle Walker, Christoph Lameter, Michal Hocko,
	Andrew Morton, David Rientjes, Johannes Weiner, Vladimir Davydov,
	linux-mm, Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

On 09/20, Raymond Jennings wrote:
>
> On 09/20/15 11:05, Linus Torvalds wrote:
>>
>> which can be called from just about any context (but atomic
>> allocations will never get here, so it can schedule etc).
>
> I think in this case the oom killer should just slap a SIGKILL on the
> task and then back out, and whatever needed the memory should just wait
> patiently for the sacrificial lamb to commit seppuku.

Not sure I understand you correctly, but this is what we currently do.
The only problem is that this doesn't work sometimes.

> Also, I observed that a task in the middle of dumping core doesn't
> respond to signals while it's dumping,

How did you observe this? The coredumping is killable. Although yes, we
have problems here in oom condition. In particular with CLONE_VM tasks.

Oleg.

^ permalink raw reply	[flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory?
  2015-09-20 18:05 ` Linus Torvalds
@ 2015-09-21 13:44 ` Oleg Nesterov
  -1 siblings, 0 replies; 213+ messages in thread
From: Oleg Nesterov @ 2015-09-21 13:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kyle Walker, Christoph Lameter, Michal Hocko, Andrew Morton,
	David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm,
	Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

On 09/20, Linus Torvalds wrote:
>
> On Sun, Sep 20, 2015 at 5:56 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > In this case the workqueue thread will block.
>
> What workqueue thread?

I must have missed something. I can't understand your and Michal's
concerns.

> pagefault_out_of_memory ->
>   out_of_memory ->
>     oom_kill_process
>
> as far as I can tell, this can be called by any task. Now, that
> pagefault case should only happen when the page fault comes from user
> space, but we also have
>
> __alloc_pages_slowpath ->
>   __alloc_pages_may_oom ->
>     out_of_memory ->
>       oom_kill_process
>
> which can be called from just about any context (but atomic
> allocations will never get here, so it can schedule etc).

So yes, in general oom_kill_process() can't call oom_unmap_func()
directly. That is why the patch uses queue_work(oom_unmap_func). The
workqueue thread takes mmap_sem and frees the memory allocated by user
space.

If this can lead to deadlock somehow, then we can hit the same deadlock
when an oom-killed thread calls exit_mm().

> So what's your point?

This can help if the killed process refuses to die and (of course)
doesn't hold the mmap_sem for writing. Say, it waits for some mutex held
by the task which tries to alloc the memory and triggers oom.

> Explain again just how do you guarantee that you
> can take the mmap_sem.

This is not guaranteed, down_read(mmap_sem) can block forever. But this
means that the (killed) victim never drops mmap_sem / never exits, so we
lose anyway. We have no memory, oom-killer is blocked, etc.

Oleg.

^ permalink raw reply	[flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory?
  2015-09-21 13:44 ` Oleg Nesterov
@ 2015-09-21 14:24 ` Michal Hocko
  -1 siblings, 0 replies; 213+ messages in thread
From: Michal Hocko @ 2015-09-21 14:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Kyle Walker, Christoph Lameter, Andrew Morton,
	David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm,
	Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

On Mon 21-09-15 15:44:14, Oleg Nesterov wrote:
[...]
> So yes, in general oom_kill_process() can't call oom_unmap_func()
> directly. That is why the patch uses queue_work(oom_unmap_func). The
> workqueue thread takes mmap_sem and frees the memory allocated by user
> space.

OK, this might have been a bit confusing. I didn't mean you cannot use
mmap_sem directly from the workqueue context. You _can_ AFAICS. But I've
mentioned that you _shouldn't_ use workqueue context in the first place
because all the workers might be blocked on locks and new workers cannot
be created due to memory pressure. This has been demonstrated already
where sysrq+f couldn't trigger the OOM killer because the work item to
do so was waiting for a worker which never came...

So I think we probably need to do this in the OOM killer context (with
try_lock) or hand over to a special kernel thread. I am not sure a
special kernel thread is really worth that but maybe it will turn out to
be a better choice.
--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 213+ messages in thread
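[The "special kernel thread" option Michal weighs can be sketched with the standard kthread API. A hedged, non-runnable fragment: `kthread_run()`, `wait_event_freezable()`, and `PTR_ERR_OR_ZERO()` are real helpers, while the `oom_reaper*` and `mm_to_reap` names are assumptions made up for this sketch.]

```c
static struct task_struct *oom_reaper_th;
static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);

static int oom_reaper_fn(void *unused)
{
	while (!kthread_should_stop()) {
		/* Sleep until the OOM killer hands over a victim mm. */
		wait_event_freezable(oom_reaper_wait, mm_to_reap != NULL);
		/* take mm_to_reap, down_read_trylock(), zap, mmput() */
	}
	return 0;
}

static int __init oom_reaper_init(void)
{
	oom_reaper_th = kthread_run(oom_reaper_fn, NULL, "oom_reaper");
	return PTR_ERR_OR_ZERO(oom_reaper_th);
}
subsys_initcall(oom_reaper_init);
```

[Unlike a work item, a thread created once at boot needs no allocation at OOM time and cannot be starved by other blocked workers, which addresses the sysrq+f failure mode described above.]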
* Re: can't oom-kill zap the victim's memory? 2015-09-21 14:24 ` Michal Hocko @ 2015-09-21 15:32 ` Oleg Nesterov -1 siblings, 0 replies; 213+ messages in thread From: Oleg Nesterov @ 2015-09-21 15:32 UTC (permalink / raw) To: Michal Hocko Cc: Linus Torvalds, Kyle Walker, Christoph Lameter, Andrew Morton, David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm, Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa On 09/21, Michal Hocko wrote: > > On Mon 21-09-15 15:44:14, Oleg Nesterov wrote: > [...] > > So yes, in general oom_kill_process() can't call oom_unmap_func() directly. > > That is why the patch uses queue_work(oom_unmap_func). The workqueue thread > > takes mmap_sem and frees the memory allocated by user space. > > OK, this might have been a bit confusing. I didn't mean you cannot use > mmap_sem directly from the workqueue context. You _can_ AFAICS. But I've > mentioned that you _shouldn't_ use workqueue context in the first place > because all the workers might be blocked on locks and new workers cannot > be created due to memory pressure. Yes, yes, and I already tried to comment this part. We probably need a dedicated kernel thread, but I still think (although I am not sure) that the initial change can use a workqueue. In the likely case the system_unbound_wq pool should have an idle thread; if not - OK, this change won't help in this case. This is minor. > So I think we probably need to do this in the OOM killer context (with > try_lock) Yes, we should try to do this in the OOM killer context, and in this case (of course) we need trylock. Let me quote my previous email: And we want to avoid using workqueues when the caller can do this directly. And in this case we certainly need trylock. But this needs some refactoring: we do not want to do this under oom_lock, otoh it makes sense to do this from mark_oom_victim() if current && killed, and a lot more details. and probably this is another reason why we need MMF_MEMDIE. 
But again, I think the initial change should be simple. Oleg. ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory? 2015-09-21 15:32 ` Oleg Nesterov @ 2015-09-21 16:12 ` Michal Hocko -1 siblings, 0 replies; 213+ messages in thread From: Michal Hocko @ 2015-09-21 16:12 UTC (permalink / raw) To: Oleg Nesterov Cc: Linus Torvalds, Kyle Walker, Christoph Lameter, Andrew Morton, David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm, Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa On Mon 21-09-15 17:32:52, Oleg Nesterov wrote: > On 09/21, Michal Hocko wrote: > > > > On Mon 21-09-15 15:44:14, Oleg Nesterov wrote: > > [...] > > > So yes, in general oom_kill_process() can't call oom_unmap_func() directly. > > > That is why the patch uses queue_work(oom_unmap_func). The workqueue thread > > > takes mmap_sem and frees the memory allocated by user space. > > > > OK, this might have been a bit confusing. I didn't mean you cannot use > > mmap_sem directly from the workqueue context. You _can_ AFAICS. But I've > > mentioned that you _shouldn't_ use workqueue context in the first place > > because all the workers might be blocked on locks and new workers cannot > > be created due to memory pressure. > > Yes, yes, and I already tried to comment this part. OK then we are on the same page, good. > We probably need a > dedicated kernel thread, but I still think (although I am not sure) that > initial change can use workueue. In the likely case system_unbound_wq pool > should have an idle thread, if not - OK, this change won't help in this > case. This is minor. The point is that the implementation should be robust from the very beginning. I am not sure what you mean by the idle thread here but the rescuer can get stuck the very same way other workers. So I think that we cannot rely on WQ for a real solution here. > > So I think we probably need to do this in the OOM killer context (with > > try_lock) > > Yes we should try to do this in the OOM killer context, and in this case > (of course) we need trylock. 
Let me quote my previous email: > > And we want to avoid using workqueues when the caller can do this > directly. And in this case we certainly need trylock. But this needs > some refactoring: we do not want to do this under oom_lock, Why do you think oom_lock would be a big deal? Address space of the victim might be really large but we can back off after a batch of unmapped pages. > otoh it > makes sense to do this from mark_oom_victim() if current && killed, > and a lot more details. > > and probably this is another reason why do we need MMF_MEMDIE. But again, > I think the initial change should be simple. I definitely agree with the simplicity for the first iteration. That means only unmap private exclusive pages and release at most a few megs of them. I am still not sure about some details, e.g. a futex sitting in such memory. Wouldn't threads blow up when they see an unmapped futex page, try to page it in and it would be in an uninitialized state? Maybe this is safe because they will die anyway but I am not familiar with that code. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 213+ messages in thread
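[Editorial sketch] Michal's batching idea — free a bounded amount per attempt and back off so the oom_lock holder is not stuck zapping a huge address space in one go — could be expressed roughly as follows. The batch size, helper name and resume-position argument are hypothetical, invented for this illustration and not taken from any posted patch:

```c
/*
 * Illustrative sketch: unmap at most OOM_UNMAP_BATCH bytes of the
 * victim's private memory per attempt, remembering where to resume.
 * All names here are hypothetical.
 */
#define OOM_UNMAP_BATCH	(2UL << 20)	/* ~2MB per pass */

static bool oom_unmap_batch(struct mm_struct *mm, unsigned long *resume)
{
	struct vm_area_struct *vma;
	unsigned long done = 0;

	if (!down_read_trylock(&mm->mmap_sem))
		return false;		/* contended: retry later */
	for (vma = find_vma(mm, *resume); vma; vma = vma->vm_next) {
		if (vma->vm_flags & VM_SHARED)
			continue;	/* private anon memory only */
		zap_page_range(vma, vma->vm_start,
			       vma->vm_end - vma->vm_start, NULL);
		*resume = vma->vm_end;
		done += vma->vm_end - vma->vm_start;
		if (done >= OOM_UNMAP_BATCH)
			break;		/* back off; caller may retry */
	}
	up_read(&mm->mmap_sem);
	return done > 0;
}
```

Oleg's counterargument in his reply is that once mmap_sem is held the bound buys little; the sketch is only meant to make the two positions concrete.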
* Re: can't oom-kill zap the victim's memory? 2015-09-21 16:12 ` Michal Hocko @ 2015-09-22 16:06 ` Oleg Nesterov -1 siblings, 0 replies; 213+ messages in thread From: Oleg Nesterov @ 2015-09-22 16:06 UTC (permalink / raw) To: Michal Hocko Cc: Linus Torvalds, Kyle Walker, Christoph Lameter, Andrew Morton, David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm, Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa On 09/21, Michal Hocko wrote: > > On Mon 21-09-15 17:32:52, Oleg Nesterov wrote: > > On 09/21, Michal Hocko wrote: > > > > > > On Mon 21-09-15 15:44:14, Oleg Nesterov wrote: > > > [...] > > > > So yes, in general oom_kill_process() can't call oom_unmap_func() directly. > > > > That is why the patch uses queue_work(oom_unmap_func). The workqueue thread > > > > takes mmap_sem and frees the memory allocated by user space. > > > > > > OK, this might have been a bit confusing. I didn't mean you cannot use > > > mmap_sem directly from the workqueue context. You _can_ AFAICS. But I've > > > mentioned that you _shouldn't_ use workqueue context in the first place > > > because all the workers might be blocked on locks and new workers cannot > > > be created due to memory pressure. > > > > Yes, yes, and I already tried to comment this part. > > OK then we are on the same page, good. Yes, yes. > > We probably need a > > dedicated kernel thread, but I still think (although I am not sure) that > > initial change can use workueue. In the likely case system_unbound_wq pool > > should have an idle thread, if not - OK, this change won't help in this > > case. This is minor. > > The point is that the implementation should be robust from the very > beginning. OK, let it be a kthread from the very beginning, I won't argue. This is really minor compared to other problems. 
> > > So I think we probably need to do this in the OOM killer context (with > > > try_lock) > > > > Yes we should try to do this in the OOM killer context, and in this case > > (of course) we need trylock. Let me quote my previous email: > > > > And we want to avoid using workqueues when the caller can do this > > directly. And in this case we certainly need trylock. But this needs > > some refactoring: we do not want to do this under oom_lock, > > Why do you think oom_lock would be a big deal? I don't really know... This doesn't look sane to me, but perhaps this is just because I don't understand this code enough. And note that the caller can hold other locks we do not even know about. Most probably we should not deadlock, at least if we only unmap the anon pages, but still this doesn't look safe. But I agree, this probably needs more discussion. > Address space of the > victim might be really large but we can back off after a batch of > unmapped pages. Hmm. If we already have mmap_sem and started zap_page_range() then I do not think it makes sense to stop until we free everything we can. > I definitely agree with the simplicity for the first iteration. That > means only unmap private exclusive pages and release at most few megs of > them. See above, I am not sure this makes sense. And in any case this will complicate the initial changes, not simplify. > I am still not sure about some details, e.g. futex sitting in such > a memory. Wouldn't threads blow up when they see an unmapped futex page, > try to page it in and it would be in an uninitialized state? Maybe this > is safe But this must be safe. We do not care about userspace (assuming that all mm users have a pending SIGKILL). If this can (say) crash the kernel somehow, then we have a bug which should be fixed. Simply because userspace can exploit this bug doing MADV_DONTNEED from another thread or a CLONE_VM process. Finally. 
Whatever we do, we need to change oom_kill_process() first, and I think we should do this regardless. The "Kill all user processes sharing victim->mm" logic looks wrong and suboptimal/overcomplicated. I'll try to make some patches tomorrow if I have time... But. Can't we just remove another ->oom_score_adj check when we try to kill all mm users (the last for_each_process loop)? If yes, this all can be simplified. I guess we can't and it's a pity. Because it looks simply pointless to not kill all mm users. This just means the select_bad_process() picked the wrong task. Say, vfork(). OK, it is possible that the parent is OOM_SCORE_ADJ_MIN and the child has already updated its oom_score_adj before exec. Now if we kill the child we will only upset the parent for no reason, this won't help to free the memory. And while this is completely offtopic... why does it take task_lock() to protect ->comm? Sure, without task_lock() we can print garbage. Is it really that important? I am asking because sometimes people think that it is not safe to use ->comm lockless, but this is not true. Oleg. ^ permalink raw reply [flat|nested] 213+ messages in thread
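[Editorial sketch] For readers without the source at hand, the "Kill all user processes sharing victim->mm" tail of oom_kill_process() being criticized here looked roughly like this in mm/oom_kill.c of that era (a paraphrase, not a verbatim copy; details varied between kernel versions):

```c
/*
 * Paraphrase of the oom_kill_process() loop under discussion: kill
 * every other process sharing the victim's mm -- except, via the
 * oom_score_adj check Oleg questions, OOM-disabled ones, which can
 * leave the shared memory pinned even after the victim is killed.
 */
rcu_read_lock();
for_each_process(p) {
	if (p->mm != mm || same_thread_group(p, victim))
		continue;
	if (p->flags & PF_KTHREAD)
		continue;
	if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
		continue;	/* the check Oleg proposes to remove */
	task_lock(p);		/* nominally protects ->comm */
	pr_err("Kill process %d (%s) sharing same memory\n",
	       task_pid_nr(p), p->comm);
	task_unlock(p);
	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
}
rcu_read_unlock();
```

The task_lock()/task_unlock() pair around the pr_err() is the ->comm locking Oleg's offtopic aside questions.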
* Re: can't oom-kill zap the victim's memory? 2015-09-22 16:06 ` Oleg Nesterov @ 2015-09-22 23:04 ` David Rientjes -1 siblings, 0 replies; 213+ messages in thread From: David Rientjes @ 2015-09-22 23:04 UTC (permalink / raw) To: Oleg Nesterov Cc: Michal Hocko, Linus Torvalds, Kyle Walker, Christoph Lameter, Andrew Morton, Johannes Weiner, Vladimir Davydov, linux-mm, Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa On Tue, 22 Sep 2015, Oleg Nesterov wrote: > Finally. Whatever we do, we need to change oom_kill_process() first, > and I think we should do this regardless. The "Kill all user processes > sharing victim->mm" logic looks wrong and suboptimal/overcomplicated. > I'll try to make some patches tomorrow if I have time... > Killing all processes sharing the ->mm has been done in the past to obviously ensure that memory is eventually freed, but also to solve mm->mmap_sem livelocks where a thread is holding a contended mutex and needs a fatal signal to acquire TIF_MEMDIE if it calls into the oom killer and be able to allocate so that it may eventually drop the mutex. > But. Can't we just remove another ->oom_score_adj check when we try > to kill all mm users (the last for_each_process loop). If yes, this > all can be simplified. > For complete correctness, we would avoid killing any process that shares memory with an oom disabled thread since the oom killer shall not kill it and otherwise we do not free any memory. > I guess we can't and its a pity. Because it looks simply pointless > to not kill all mm users. This just means the select_bad_process() > picked the wrong task. > This is a side-effect of moving oom scoring to signal_struct from mm_struct. It could be improved separately by flagging mm_structs that are unkillable which would also allow for an optimization in find_lock_task_mm(). > And while this completely offtopic... why does it take task_lock() > to protect ->comm? Sure, without task_lock() we can print garbage. 
> Is it really that important? I am asking because sometime people > think that it is not safe to use ->comm lockless, but this is not > true. > This has come up a couple times in the past and, from what I recall, Andrew has said that we don't actually care since the string will always be terminated and if we race we don't actually care. There are other places in the kernel where task_lock() isn't used solely to protect ->comm. It can be removed from the oom_kill_process() loop checking for other potential victims. ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory? 2015-09-22 16:06 ` Oleg Nesterov @ 2015-09-23 20:59 ` Michal Hocko -1 siblings, 0 replies; 213+ messages in thread From: Michal Hocko @ 2015-09-23 20:59 UTC (permalink / raw) To: Oleg Nesterov Cc: Linus Torvalds, Kyle Walker, Christoph Lameter, Andrew Morton, David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm, Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa On Tue 22-09-15 18:06:08, Oleg Nesterov wrote: > On 09/21, Michal Hocko wrote: > > > > On Mon 21-09-15 17:32:52, Oleg Nesterov wrote: [...] > > > We probably need a > > > dedicated kernel thread, but I still think (although I am not sure) that > > > initial change can use workueue. In the likely case system_unbound_wq pool > > > should have an idle thread, if not - OK, this change won't help in this > > > case. This is minor. > > > > The point is that the implementation should be robust from the very > > beginning. > > OK, let it be a kthread from the very beginning, I won't argue. This > is really minor compared to other problems. I am still not sure how you want to implement that kernel thread but I am quite skeptical it would be very useful, because all the current allocations which end up in the OOM killer path cannot simply back off and drop the locks with the current allocator semantic. So they will be sitting on top of an unknown pile of locks whether you do an additional reclaim (unmap the anon memory) in the direct OOM context or loop in the allocator and wait for a kthread/workqueue to do its work. The only argument that I can see is the stack usage but I haven't seen stack overflows in the OOM path AFAIR. > > > > So I think we probably need to do this in the OOM killer context (with > > > > try_lock) > > > > > > Yes we should try to do this in the OOM killer context, and in this case > > > (of course) we need trylock. 
Let me quote my previous email: > > > > > > And we want to avoid using workqueues when the caller can do this > > > directly. And in this case we certainly need trylock. But this needs > > > some refactoring: we do not want to do this under oom_lock, > > > > Why do you think oom_lock would be a big deal? > > I don't really know... This doesn't look sane to me, but perhaps this > is just because I don't understand this code enough. Well one of the purpose of this lock is to throttle all the concurrent allocators to not step on each other toes because only one task is allowed to get killed currently. So they wouldn't be any useful anyway. > And note that the caller can held other locks we do not even know about. > Most probably we should not deadlock, at least if we only unmap the anon > pages, but still this doesn't look safe. The unmapper cannot fall back to reclaim and/or trigger the OOM so we should be indeed very careful and mark the allocation context appropriately. I can remember mmu_gather but it is only doing opportunistic allocation AFAIR. > But I agree, this probably needs more discussion. > > > Address space of the > > victim might be really large but we can back off after a batch of > > unmapped pages. > > Hmm. If we already have mmap_sem and started zap_page_range() then > I do not think it makes sense to stop until we free everything we can. Zapping a huge address space can take quite some time and we really do not have to free it all on behalf of the killer when enough memory is freed to allow for further progress and the rest can be done by the victim. If one batch doesn't seem sufficient then another retry can continue. I do not think that a limited scan would make the implementation more complicated but I will leave the decision to you of course. > > I definitely agree with the simplicity for the first iteration. That > > means only unmap private exclusive pages and release at most few megs of > > them. 
> > See above, I am not sure this makes sense. And in any case this will > complicate the initial changes, not simplify. > > > I am still not sure about some details, e.g. futex sitting in such > > a memory. Wouldn't threads blow up when they see an unmapped futex page, > > try to page it in and it would be in an uninitialized state? Maybe this > > is safe > > But this must be safe. > > We do not care about userspace (assuming that all mm users have a > pending SIGKILL). > > If this can (say) crash the kernel somehow, then we have a bug which > should be fixed. Simply because userspace can exploit this bug doing > MADV_DONTNEED from another thread or CLONE_VM process. OK, that makes perfect sense. I should have realized that an in-kernel state for a futex must not be controlled from userspace. So you are right and futexes shouldn't be a big deal. > Finally. Whatever we do, we need to change oom_kill_process() first, > and I think we should do this regardless. The "Kill all user processes > sharing victim->mm" logic looks wrong and suboptimal/overcomplicated. > I'll try to make some patches tomorrow if I have time... That would be appreciated. I do not like that part either. At least we shouldn't go over the whole list when we have a good chance that the mm is not shared with other processes. > But. Can't we just remove another ->oom_score_adj check when we try > to kill all mm users (the last for_each_process loop). If yes, this > all can be simplified. > > I guess we can't and its a pity. Because it looks simply pointless > to not kill all mm users. This just means the select_bad_process() > picked the wrong task. Yes, I am not really sure why oom_score_adj is not per-mm and why we are doing that per signal struct, to be honest. It doesn't make much sense as the mm_struct is the primary source of information for the oom victim selection. And the fact that an mm might be shared without sharing signals makes it doubly a reason to have it in mm. 
It seems David has already tried that 2ff05b2b4eac ("oom: move oom_adj value from task_struct to mm_struct") but it was later reverted by 0753ba01e126 ("mm: revert "oom: move oom_adj value""). I do not agree with the reasoning there because vfork is documented to have undefined behavior " if the process created by vfork() either modifies any data other than a variable of type pid_t used to store the return value from vfork(), or returns from the function in which vfork() was called, or calls any other function before successfully calling _exit(2) or one of the exec(3) family of functions. " Maybe we can revisit this... It would make the whole semantic much more straightforward. The current situation when you kill a task which might share the mm with OOM unkillable task is clearly suboptimal and confusing. Thanks! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory?
@ 2015-09-23 20:59 ` Michal Hocko
  0 siblings, 0 replies; 213+ messages in thread
From: Michal Hocko @ 2015-09-23 20:59 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Kyle Walker, Christoph Lameter, Andrew Morton,
	David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm,
	Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

On Tue 22-09-15 18:06:08, Oleg Nesterov wrote:
> On 09/21, Michal Hocko wrote:
> >
> > On Mon 21-09-15 17:32:52, Oleg Nesterov wrote:
[...]
> > > We probably need a dedicated kernel thread, but I still think
> > > (although I am not sure) that the initial change can use a workqueue.
> > > In the likely case the system_unbound_wq pool should have an idle
> > > thread; if not - OK, this change won't help in this case. This is
> > > minor.
> >
> > The point is that the implementation should be robust from the very
> > beginning.
>
> OK, let it be a kthread from the very beginning, I won't argue. This
> is really minor compared to other problems.

I am still not sure how you want to implement that kernel thread, but I
am quite skeptical it would be very useful, because none of the current
allocations which end up in the OOM killer path can simply back off
and drop their locks with the current allocator semantics. So they will
be sitting on top of an unknown pile of locks whether you do the
additional reclaim (unmap the anon memory) in the direct OOM context or
loop in the allocator waiting for a kthread/workqueue to do its work.
The only argument I can see is stack usage, but I haven't seen stack
overflows in the OOM path AFAIR.

> > > > So I think we probably need to do this in the OOM killer context (with
> > > > try_lock)
> > >
> > > Yes, we should try to do this in the OOM killer context, and in this
> > > case (of course) we need trylock. Let me quote my previous email:
> > >
> > >	And we want to avoid using workqueues when the caller can do this
> > >	directly. And in this case we certainly need trylock. But this needs
> > >	some refactoring: we do not want to do this under oom_lock,
> >
> > Why do you think oom_lock would be a big deal?
>
> I don't really know... This doesn't look sane to me, but perhaps this
> is just because I don't understand this code enough.

Well, one of the purposes of this lock is to throttle the concurrent
allocators so that they do not step on each other's toes, because only
one task is allowed to get killed at a time. So running them
concurrently wouldn't be of any use anyway.

> And note that the caller can hold other locks we do not even know about.
> Most probably we should not deadlock, at least if we only unmap the anon
> pages, but still this doesn't look safe.

The unmapper cannot fall back to reclaim and/or trigger the OOM killer,
so we should indeed be very careful and mark the allocation context
appropriately. I can remember mmu_gather, but it is only doing an
opportunistic allocation AFAIR.

> But I agree, this probably needs more discussion.
>
> > Address space of the victim might be really large but we can back off
> > after a batch of unmapped pages.
>
> Hmm. If we already have mmap_sem and started zap_page_range() then
> I do not think it makes sense to stop until we free everything we can.

Zapping a huge address space can take quite some time, and we really do
not have to free all of it on behalf of the killer: once enough memory
has been freed to allow further progress, the rest can be done by the
victim itself. If one batch doesn't turn out to be sufficient, another
retry can continue. I do not think that a limited scan would make the
implementation more complicated, but I will leave the decision to you,
of course.

> > I definitely agree with the simplicity for the first iteration. That
> > means only unmap private exclusive pages and release at most a few megs
> > of them.
>
> See above, I am not sure this makes sense. And in any case this will
> complicate the initial changes, not simplify.
>
> > I am still not sure about some details, e.g. a futex sitting in such
> > memory. Wouldn't threads blow up when they see an unmapped futex page,
> > try to page it in, and find it in an uninitialized state? Maybe this
> > is safe
>
> But this must be safe.
>
> We do not care about userspace (assuming that all mm users have a
> pending SIGKILL).
>
> If this can (say) crash the kernel somehow, then we have a bug which
> should be fixed. Simply because userspace can exploit this bug by doing
> MADV_DONTNEED from another thread or a CLONE_VM process.

OK, that makes perfect sense. I should have realized that the in-kernel
state of a futex must not be controllable from userspace. So you are
right and futexes shouldn't be a big deal.

> Finally. Whatever we do, we need to change oom_kill_process() first,
> and I think we should do this regardless. The "Kill all user processes
> sharing victim->mm" logic looks wrong and suboptimal/overcomplicated.
> I'll try to make some patches tomorrow if I have time...

That would be appreciated. I do not like that part either. At the very
least we shouldn't walk the whole task list when we have a good chance
that the mm is not shared with other processes.

> But. Can't we just remove another ->oom_score_adj check when we try
> to kill all mm users (the last for_each_process loop). If yes, this
> all can be simplified.
>
> I guess we can't and it's a pity. Because it looks simply pointless
> to not kill all mm users. This just means select_bad_process()
> picked the wrong task.

Yes, to be honest I am not really sure why oom_score_adj is per signal
struct rather than per mm. It doesn't make much sense, as the mm_struct
is the primary source of information for OOM victim selection. And the
fact that an mm might be shared without sharing signals makes it doubly
a reason to have it in the mm.

It seems David has already tried that in 2ff05b2b4eac ("oom: move
oom_adj value from task_struct to mm_struct"), but it was later reverted
by 0753ba01e126 ("mm: revert "oom: move oom_adj value""). I do not agree
with the reasoning there, because vfork is documented to have undefined
behavior
"
   if the process created by vfork() either modifies any data other
   than a variable of type pid_t used to store the return value
   from vfork(), or returns from the function in which vfork() was
   called, or calls any other function before successfully calling
   _exit(2) or one of the exec(3) family of functions.
"
Maybe we can revisit this... It would make the whole semantic much more
straightforward. The current situation, where you may kill a task which
shares its mm with an OOM-unkillable task, is clearly suboptimal and
confusing.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory?
  2015-09-23 20:59 ` Michal Hocko
@ 2015-09-24 21:15 ` David Rientjes
  -1 siblings, 0 replies; 213+ messages in thread
From: David Rientjes @ 2015-09-24 21:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Oleg Nesterov, Linus Torvalds, Kyle Walker, Christoph Lameter,
	Andrew Morton, Johannes Weiner, Vladimir Davydov, linux-mm,
	Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

On Wed, 23 Sep 2015, Michal Hocko wrote:

> I am still not sure how you want to implement that kernel thread, but I
> am quite skeptical it would be very useful, because none of the current
> allocations which end up in the OOM killer path can simply back off
> and drop their locks with the current allocator semantics. So they will
> be sitting on top of an unknown pile of locks whether you do the
> additional reclaim (unmap the anon memory) in the direct OOM context or
> loop in the allocator waiting for a kthread/workqueue to do its work.
> The only argument I can see is stack usage, but I haven't seen stack
> overflows in the OOM path AFAIR.
>

Which locks are you specifically interested in? We have already discussed
the usefulness of killing all threads on the system sharing the same
->mm, meaning all threads that are either holding or want to hold
mm->mmap_sem will be able to allocate into memory reserves. Any allocator
holding down_write(&mm->mmap_sem) should be able to allocate and then
drop its lock. (Are you concerned about MAP_POPULATE?)

> > Finally. Whatever we do, we need to change oom_kill_process() first,
> > and I think we should do this regardless. The "Kill all user processes
> > sharing victim->mm" logic looks wrong and suboptimal/overcomplicated.
> > I'll try to make some patches tomorrow if I have time...
>
> That would be appreciated. I do not like that part either. At the very
> least we shouldn't walk the whole task list when we have a good chance
> that the mm is not shared with other processes.
>

Heh, it's actually imperative to avoid livelocking based on
mm->mmap_sem; it's the reason the code exists. Any optimizations to that
are certainly welcome, but we definitely need to send SIGKILL to all
threads sharing the mm to make forward progress, otherwise we are going
back to the pre-2008 livelocks.

> Yes, to be honest I am not really sure why oom_score_adj is per signal
> struct rather than per mm. It doesn't make much sense, as the mm_struct
> is the primary source of information for OOM victim selection. And the
> fact that an mm might be shared without sharing signals makes it doubly
> a reason to have it in the mm.
>
> It seems David has already tried that in 2ff05b2b4eac ("oom: move
> oom_adj value from task_struct to mm_struct"), but it was later reverted
> by 0753ba01e126 ("mm: revert "oom: move oom_adj value""). I do not agree
> with the reasoning there, because vfork is documented to have undefined
> behavior
> "
>    if the process created by vfork() either modifies any data other
>    than a variable of type pid_t used to store the return value
>    from vfork(), or returns from the function in which vfork() was
>    called, or calls any other function before successfully calling
>    _exit(2) or one of the exec(3) family of functions.
> "
> Maybe we can revisit this... It would make the whole semantic much more
> straightforward. The current situation, where you may kill a task which
> shares its mm with an OOM-unkillable task, is clearly suboptimal and
> confusing.
>

How do you reconcile this with commit 28b83c5193e7 ("oom: move oom_adj
value from task_struct to signal_struct")? We also must appreciate the
real-world usecase of an oom-disabled process doing fork(), setting
/proc/child/oom_score_adj to non-disabled, and exec().

^ permalink raw reply	[flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory?
  2015-09-24 21:15 ` David Rientjes
@ 2015-09-25 9:35 ` Michal Hocko
  -1 siblings, 0 replies; 213+ messages in thread
From: Michal Hocko @ 2015-09-25 9:35 UTC (permalink / raw)
  To: David Rientjes
  Cc: Oleg Nesterov, Linus Torvalds, Kyle Walker, Christoph Lameter,
	Andrew Morton, Johannes Weiner, Vladimir Davydov, linux-mm,
	Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa

On Thu 24-09-15 14:15:34, David Rientjes wrote:
> On Wed, 23 Sep 2015, Michal Hocko wrote:
>
> > I am still not sure how you want to implement that kernel thread, but I
> > am quite skeptical it would be very useful, because none of the current
> > allocations which end up in the OOM killer path can simply back off
> > and drop their locks with the current allocator semantics. So they will
> > be sitting on top of an unknown pile of locks whether you do the
> > additional reclaim (unmap the anon memory) in the direct OOM context or
> > loop in the allocator waiting for a kthread/workqueue to do its work.
> > The only argument I can see is stack usage, but I haven't seen stack
> > overflows in the OOM path AFAIR.
> >
>
> Which locks are you specifically interested in?

Any locks they were holding before they entered the page allocator (e.g.
i_mutex is the easiest one to trigger from userspace, but mmap_sem might
be involved as well, because we do kmalloc(GFP_KERNEL) with mmap_sem held
for write). Those would stay locked until the page allocator returns,
which with the current semantics might be _never_.

> We have already discussed
> the usefulness of killing all threads on the system sharing the same
> ->mm, meaning all threads that are either holding or want to hold
> mm->mmap_sem will be able to allocate into memory reserves. Any
> allocator holding down_write(&mm->mmap_sem) should be able to allocate
> and then drop its lock. (Are you concerned about MAP_POPULATE?)

I am not sure I understand. We would have to fail the request in order
for the context which requested the memory to be able to drop the lock.
Are we talking about the same thing here? The point I've tried to make
is that an OOM unmapper running in a detached context (e.g. a kernel
thread) vs. directly in the OOM context makes no difference wrt. locks,
because the holders of those locks would loop inside the allocator
anyway, since we do not fail small allocations.

> > > Finally. Whatever we do, we need to change oom_kill_process() first,
> > > and I think we should do this regardless. The "Kill all user processes
> > > sharing victim->mm" logic looks wrong and suboptimal/overcomplicated.
> > > I'll try to make some patches tomorrow if I have time...
> >
> > That would be appreciated. I do not like that part either. At the very
> > least we shouldn't walk the whole task list when we have a good chance
> > that the mm is not shared with other processes.
> >
>
> Heh, it's actually imperative to avoid livelocking based on
> mm->mmap_sem; it's the reason the code exists. Any optimizations to that
> are certainly welcome, but we definitely need to send SIGKILL to all
> threads sharing the mm to make forward progress, otherwise we are going
> back to the pre-2008 livelocks.

Yes, but the mm is not shared between processes most of the time.
CLONE_VM without CLONE_THREAD is more of a corner case, yet we have to
crawl all the task_structs on _each_ OOM killer invocation. Yes, this is
an extreme slow path, but it still might take quite some unnecessary
time.

> > Yes, to be honest I am not really sure why oom_score_adj is per signal
> > struct rather than per mm. It doesn't make much sense, as the mm_struct
> > is the primary source of information for OOM victim selection. And the
> > fact that an mm might be shared without sharing signals makes it doubly
> > a reason to have it in the mm.
> >
> > It seems David has already tried that in 2ff05b2b4eac ("oom: move
> > oom_adj value from task_struct to mm_struct"), but it was later
> > reverted by 0753ba01e126 ("mm: revert "oom: move oom_adj value""). I do
> > not agree with the reasoning there, because vfork is documented to have
> > undefined behavior
> > "
> >    if the process created by vfork() either modifies any data other
> >    than a variable of type pid_t used to store the return value
> >    from vfork(), or returns from the function in which vfork() was
> >    called, or calls any other function before successfully calling
> >    _exit(2) or one of the exec(3) family of functions.
> > "
> > Maybe we can revisit this... It would make the whole semantic much more
> > straightforward. The current situation, where you may kill a task which
> > shares its mm with an OOM-unkillable task, is clearly suboptimal and
> > confusing.
> >
>
> How do you reconcile this with commit 28b83c5193e7 ("oom: move oom_adj
> value from task_struct to signal_struct")?

If oom_score_adj is per mm, then all the threads and processes which
share the mm share the same value. That would naturally extend the
per-process semantics to all tasks sharing the address space, and so
would be in line with the above commit.

> We also must appreciate the
> real-world usecase of an oom-disabled process doing fork(), setting
> /proc/child/oom_score_adj to non-disabled, and exec().

I guess you meant the vfork mentioned in 0753ba01e126. I am not sure
this is a valid use of set_oom_adj, as the documentation explicitly
states that it leads to undefined behavior. But if we really want to
support this particular case, and I can see a reason we would, then we
can work around it and store the oom_score_adj temporarily in the
task_struct and move it to the mm_struct after exec. Not nice for sure,
but then this usage is a clear violation of the vfork semantics.

The per-mm oom_score_adj has the better semantics, but if there is a
general consensus that an inconsistent value among processes sharing the
same mm is a configuration bug, I can live with that. It surely makes
the code uglier and more subtle, though.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory?
  2015-09-25 9:35 ` Michal Hocko
@ 2015-09-25 16:14 ` Tetsuo Handa
  -1 siblings, 0 replies; 213+ messages in thread
From: Tetsuo Handa @ 2015-09-25 16:14 UTC (permalink / raw)
  To: mhocko, rientjes
  Cc: oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov, linux-mm,
	linux-kernel, skozina

Michal Hocko wrote:
> On Thu 24-09-15 14:15:34, David Rientjes wrote:
> > > > Finally. Whatever we do, we need to change oom_kill_process() first,
> > > > and I think we should do this regardless. The "Kill all user processes
> > > > sharing victim->mm" logic looks wrong and suboptimal/overcomplicated.
> > > > I'll try to make some patches tomorrow if I have time...
> > >
> > > That would be appreciated. I do not like that part either. At the very
> > > least we shouldn't walk the whole task list when we have a good chance
> > > that the mm is not shared with other processes.
> > >
> >
> > Heh, it's actually imperative to avoid livelocking based on
> > mm->mmap_sem; it's the reason the code exists. Any optimizations to
> > that are certainly welcome, but we definitely need to send SIGKILL to
> > all threads sharing the mm to make forward progress, otherwise we are
> > going back to the pre-2008 livelocks.
>
> Yes, but the mm is not shared between processes most of the time.
> CLONE_VM without CLONE_THREAD is more of a corner case, yet we have to
> crawl all the task_structs on _each_ OOM killer invocation. Yes, this is
> an extreme slow path, but it still might take quite some unnecessary
> time.

Excuse me, but thinking about the CLONE_VM without CLONE_THREAD case...
Isn't there a possibility of hitting livelocks at

	/*
	 * If current has a pending SIGKILL or is exiting, then automatically
	 * select it.  The goal is to allow it to allocate so that it may
	 * quickly exit and free its memory.
	 *
	 * But don't select if current has already released its mm and cleared
	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
	 */
	if (current->mm &&
	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
		mark_oom_victim(current);
		return true;
	}

if the current thread receives SIGKILL just before reaching here, given
that we don't send SIGKILL to all threads sharing the mm? Hopefully the
current thread is not holding inode->i_mutex, because reaching here
(i.e. calling out_of_memory()) suggests that we are doing a GFP_KERNEL
allocation. But it could be a !__GFP_FS && __GFP_NOFAIL allocation, or
different locks could be contended by another thread sharing the mm?

I don't like the "That thread will now get access to memory reserves
since it has a pending fatal signal." line in the comments for the "Kill
all user processes sharing victim->mm" logic. That thread won't get
access to memory reserves unless it can call out_of_memory() (i.e. it is
doing a __GFP_FS or __GFP_NOFAIL allocation). Since I can observe that
such a thread may be doing a !__GFP_FS allocation, I think that this
comment needs to be updated.

^ permalink raw reply	[flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory?
  2015-09-25 16:14 ` Tetsuo Handa
@ 2015-09-28 16:18 ` Tetsuo Handa
  -1 siblings, 0 replies; 213+ messages in thread
From: Tetsuo Handa @ 2015-09-28 16:18 UTC (permalink / raw)
  To: mhocko, rientjes
  Cc: oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov, linux-mm, linux-kernel, skozina

Michal Hocko wrote:
> The point I've tried to made is that oom unmapper running in a detached
> context (e.g. kernel thread) vs. directly in the oom context doesn't
> make any difference wrt. lock because the holders of the lock would loop
> inside the allocator anyway because we do not fail small allocations.

We tried to allow small allocations to fail. It resulted in unstable system with obscure bugs. We tried to allow small !__GFP_FS allocations to fail. It failed to fail by effectively __GFP_NOFAIL allocations. We are now trying to allow zapping OOM victim's mm. Michal is already skeptical about this approach due to lock dependency.

We already spent 9 months on this OOM livelock. No silver bullet yet. Proposed approaches are too drastic to backport for existing users. I think we are out of bullet. Until we complete adding/testing __GFP_NORETRY (or __GFP_KILLABLE) to most of callsites, timeout based workaround will be the only bullet we can use. Michal's panic_on_oom_timeout and David's "global access to memory reserves" will be acceptable for some users if these approaches are used as opt-in. Likewise, my memdie_task_skip_secs / memdie_task_panic_secs will be acceptable for those who want to retry a bit more rather than panic on accidental livelock if this approach is used as opt-in.

Tetsuo Handa wrote:
> Excuse me, but thinking about CLONE_VM without CLONE_THREAD case...
> Isn't there possibility of hitting livelocks at
>
> /*
>  * If current has a pending SIGKILL or is exiting, then automatically
>  * select it. The goal is to allow it to allocate so that it may
>  * quickly exit and free its memory.
>  *
>  * But don't select if current has already released its mm and cleared
>  * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
>  */
> if (current->mm &&
>     (fatal_signal_pending(current) || task_will_free_mem(current))) {
> 	mark_oom_victim(current);
> 	return true;
> }
>
> if current thread receives SIGKILL just before reaching here, for we don't
> send SIGKILL to all threads sharing the mm?

Seems that CLONE_VM without CLONE_THREAD is irrelevant here. We have sequences like

  Do a GFP_KERNEL allocation.
  Hold a lock.
  Do a GFP_NOFS allocation.
  Release a lock.

where an example is seen in VFS operations which receive a pathname from user space using getname() and then call VFS functions, where filesystem code takes locks which can contend with other threads.

------------------------------------------------------------
diff --git a/fs/namei.c b/fs/namei.c
index d68c21f..d51c333 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -4005,6 +4005,8 @@ int vfs_symlink(struct inode *dir, struct dentry *dentry, const char *oldname)
 	if (error)
 		return error;
 
+	if (fatal_signal_pending(current))
+		printk(KERN_INFO "Calling symlink with SIGKILL pending\n");
 	error = dir->i_op->symlink(dir, dentry, oldname);
 	if (!error)
 		fsnotify_create(dir, dentry);
@@ -4021,6 +4023,10 @@ SYSCALL_DEFINE3(symlinkat, const char __user *, oldname,
 	struct path path;
 	unsigned int lookup_flags = 0;
 
+	if (!strcmp(current->comm, "a.out")) {
+		printk(KERN_INFO "Sending SIGKILL to current thread\n");
+		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, current, true);
+	}
 	from = getname(oldname);
 	if (IS_ERR(from))
 		return PTR_ERR(from);
diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
index 996481e..2b6faa5 100644
--- a/fs/xfs/xfs_symlink.c
+++ b/fs/xfs/xfs_symlink.c
@@ -240,6 +240,8 @@ xfs_symlink(
 	if (error)
 		goto out_trans_cancel;
 
+	if (fatal_signal_pending(current))
+		printk(KERN_INFO "Calling xfs_ilock() with SIGKILL pending\n");
 	xfs_ilock(dp, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL | XFS_IOLOCK_PARENT |
 		  XFS_ILOCK_PARENT);
 	unlock_dp_on_error = true;
------------------------------------------------------------

[ 119.534976] Sending SIGKILL to current thread
[ 119.535898] Calling symlink with SIGKILL pending
[ 119.536870] Calling xfs_ilock() with SIGKILL pending

Any program can potentially hit this silent livelock. We can't predict what locks the OOM victim threads will depend on after TIF_MEMDIE was set by the OOM killer. Therefore, I think that the fact that TIF_MEMDIE disables the OOM killer indefinitely is one of the possible causes of silent hangup troubles.

Michal Hocko wrote:
> I really hate to do "easy" things now just to feel better about
> particular case which will kick us back little bit later. And from my
> own experience I can tell you that a more non-deterministic OOM behavior
> is thing people complain about.

I believe that not waiting for a TIF_MEMDIE thread indefinitely is the first choice we can propose people to try. From my own experience I can tell you that some customers are really sensitive about bugs which halt their systems (e.g. https://access.redhat.com/solutions/68466 ). An opt-in version of the TIF_MEMDIE timeout should be acceptable for people who prefer avoiding a silent hangup over non-deterministic OOM behavior, if they are told the truth about the current memory allocator's behavior.

^ permalink raw reply related [flat|nested] 213+ messages in thread
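The CLONE_VM-without-CLONE_THREAD situation discussed in this subthread can be demonstrated from userspace. The sketch below is illustrative only (Linux-specific, not from any posted patch): the child is a distinct process from the OOM killer's point of view, yet shares the parent's mm, which is why killing only one sharer cannot free the memory.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared_value; /* lives in the mm that CLONE_VM shares */

static int child_fn(void *arg)
{
	/* A separate process (own PID, own signal handling, not a
	 * CLONE_THREAD sibling), but running in the parent's mm. */
	(void)arg;
	shared_value = 42;
	return 0;
}

/* Returns the value the child wrote, as observed by the parent.
 * Visibility of the write proves the two tasks share one mm even
 * though the kernel sees two distinct processes. */
static int run_clone_demo(void)
{
	enum { STACK_SZ = 64 * 1024 };
	char *stack = malloc(STACK_SZ);
	pid_t pid;

	if (!stack)
		return -1;
	/* CLONE_VM without CLONE_THREAD: share the address space,
	 * but do not join the parent's thread group. */
	pid = clone(child_fn, stack + STACK_SZ, CLONE_VM | SIGCHLD, NULL);
	if (pid < 0) {
		free(stack);
		return -1;
	}
	waitpid(pid, NULL, 0);
	free(stack);
	return shared_value;
}
```

run_clone_demo() returns 42: the child's store is directly visible to the parent, so an OOM kill of either task alone reclaims nothing until every task sharing the mm exits.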
* Re: can't oom-kill zap the victim's memory?
  2015-09-28 16:18 ` Tetsuo Handa
@ 2015-09-28 22:28 ` David Rientjes
  -1 siblings, 0 replies; 213+ messages in thread
From: David Rientjes @ 2015-09-28 22:28 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov, linux-mm, linux-kernel, skozina

On Tue, 29 Sep 2015, Tetsuo Handa wrote:

> > The point I've tried to made is that oom unmapper running in a detached
> > context (e.g. kernel thread) vs. directly in the oom context doesn't
> > make any difference wrt. lock because the holders of the lock would loop
> > inside the allocator anyway because we do not fail small allocations.
>
> We tried to allow small allocations to fail. It resulted in unstable system
> with obscure bugs.
>

These are helpful to identify regardless of the outcome of this discussion. I'm not sure where the best place to report them would be, or whether it's even feasible to dig through looking for possibilities, but I think it would be interesting to see which callers are relying on internal page allocator implementation details to work properly, since it may uncover bugs that would occur later if those details were changed.

^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory?
  2015-09-28 16:18 ` Tetsuo Handa
@ 2015-10-02 12:36 ` Michal Hocko
  -1 siblings, 0 replies; 213+ messages in thread
From: Michal Hocko @ 2015-10-02 12:36 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov, linux-mm, linux-kernel, skozina

On Tue 29-09-15 01:18:00, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > The point I've tried to made is that oom unmapper running in a detached
> > context (e.g. kernel thread) vs. directly in the oom context doesn't
> > make any difference wrt. lock because the holders of the lock would loop
> > inside the allocator anyway because we do not fail small allocations.
>
> We tried to allow small allocations to fail. It resulted in unstable system
> with obscure bugs.

Have they been reported/fixed? All kernel paths doing an allocation are _supposed_ to check and handle ENOMEM. If they are not then they are buggy and should be fixed.

> We tried to allow small !__GFP_FS allocations to fail. It failed to fail by
> effectively __GFP_NOFAIL allocations.

What do you mean by that? An opencoded __GFP_NOFAIL?

> We are now trying to allow zapping OOM victim's mm. Michal is already
> skeptical about this approach due to lock dependency.

I am not sure where this came from. I am all for this approach. It will not solve the problem completely for sure, but it can help in many cases already.

> We already spent 9 months on this OOM livelock. No silver bullet yet.
> Proposed approaches are too drastic to backport for existing users.
> I think we are out of bullet.

Not at all. We have had this problem basically forever, and we have a lot of legacy issues to care about. But nobody could reasonably expect this to be solved in a short time period.

> Until we complete adding/testing __GFP_NORETRY (or __GFP_KILLABLE) to most
> of callsites,

This is simply not doable. There are thousands of allocation sites all over the kernel.

> timeout based workaround will be the only bullet we can use.

Those are a last resort which only papers over real bugs which should be fixed. I would agree with your urging if this were something that could easily happen on a _properly_ configured system. A system which can blow up into an OOM storm is far from being configured properly. If you have untrusted users running on your system, you had better put them into a highly restricted environment and limit them as much as possible.

I can completely understand your frustration about the pace of the progress here, but this is nothing new and we should strive for a long-term vision which would be much less fragile than what we have right now. No timeout-based solution is a step in that direction.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 213+ messages in thread
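For concreteness, the opt-in timeout knob debated back and forth in this subthread would, in its simplest form, reduce to a predicate like the one below. This is a hedged sketch of the idea behind proposals such as panic_on_oom_timeout and memdie_task_skip_secs; the struct, field, and function names are hypothetical and not taken from any posted patch.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical per-victim bookkeeping: when TIF_MEMDIE was set. */
struct oom_victim {
	time_t memdie_since;
};

/* Decide whether to stop waiting for this victim: true once it has
 * held TIF_MEMDIE for at least timeout_secs seconds. A timeout of 0
 * keeps today's behavior (wait forever), matching the opt-in idea. */
static bool memdie_timed_out(const struct oom_victim *v,
			     time_t now, time_t timeout_secs)
{
	if (timeout_secs == 0)	/* feature disabled by default */
		return false;
	return now - v->memdie_since >= timeout_secs;
}
```

What the caller then does on a true result (skip the victim and select another, or panic) is exactly the policy question the thread disagrees on; the check itself is the cheap part.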
* Re: can't oom-kill zap the victim's memory?
  2015-10-02 12:36 ` Michal Hocko
@ 2015-10-02 19:01 ` Linus Torvalds
  -1 siblings, 0 replies; 213+ messages in thread
From: Linus Torvalds @ 2015-10-02 19:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Kyle Walker, Christoph Lameter, Andrew Morton, Johannes Weiner, Vladimir Davydov, linux-mm, Linux Kernel Mailing List, Stanislav Kozina

On Fri, Oct 2, 2015 at 8:36 AM, Michal Hocko <mhocko@kernel.org> wrote:
>
> Have they been reported/fixed? All kernel paths doing an allocation are
> _supposed_ to check and handle ENOMEM. If they are not then they are
> buggy and should be fixed.

No. Stop this theoretical idiocy.

We've tried it. I objected before people tried it, and it turns out that it was a horrible idea.

Small kernel allocations should basically never fail, because we end up needing memory for random things, and if a kmalloc() fails it's because some application is using too much memory, and the application should be killed. Never should the kernel allocation fail. It really is that simple. If we are out of memory, that does not mean that we should start failing random kernel things.

So this "people should check for allocation failures" is bullshit. It's a computer science myth. It's simply not true in all cases.

Kernel allocators that know that they do large allocations (ie bigger than a few pages) need to be able to handle the failure, but not the general case. Also, kernel allocators that know they have a good fallback (eg they try a large allocation first but can fall back to a smaller one) should use __GFP_NORETRY, but again, that does *not* in any way mean that general kernel allocations should randomly fail.

So no. The answer is ABSOLUTELY NOT "everybody should check allocation failure". Get over it. I refuse to go through that circus again. It's stupid.

                 Linus

^ permalink raw reply [flat|nested] 213+ messages in thread
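The "good fallback" case Linus describes can be sketched in userspace terms as below; in the kernel, the optimistic attempts would pass __GFP_NORETRY so that failure is cheap and never triggers the OOM killer. The function name and structure here are illustrative, not from any kernel source.

```c
#include <stddef.h>
#include <stdlib.h>

/* Userspace analogue of "try a large allocation first, fall back to
 * a smaller one": attempt the largest buffer, halving the request on
 * failure until a minimum useful size, and report what we got. */
static void *alloc_with_fallback(size_t want, size_t min, size_t *got)
{
	size_t len;

	for (len = want; len >= min; len /= 2) {
		void *p = malloc(len);	/* kernel: kmalloc(len, GFP_KERNEL | __GFP_NORETRY) */
		if (p) {
			*got = len;
			return p;
		}
	}
	*got = 0;
	return NULL;	/* only now does the caller see a hard failure */
}
```

The point of __GFP_NORETRY in the kernel version is exactly that the caller *has* a plan B, so the allocator should fail fast instead of reclaiming aggressively on its behalf.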
* Re: can't oom-kill zap the victim's memory?
  2015-10-02 19:01 ` Linus Torvalds
@ 2015-10-05 14:44 ` Michal Hocko
  -1 siblings, 0 replies; 213+ messages in thread
From: Michal Hocko @ 2015-10-05 14:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Kyle Walker, Christoph Lameter, Andrew Morton, Johannes Weiner, Vladimir Davydov, linux-mm, Linux Kernel Mailing List, Stanislav Kozina

On Fri 02-10-15 15:01:06, Linus Torvalds wrote:
> On Fri, Oct 2, 2015 at 8:36 AM, Michal Hocko <mhocko@kernel.org> wrote:
> >
> > Have they been reported/fixed? All kernel paths doing an allocation are
> > _supposed_ to check and handle ENOMEM. If they are not then they are
> > buggy and should be fixed.
>
> No. Stop this theoretical idiocy.
>
> We've tried it. I objected before people tried it, and it turns out
> that it was a horrible idea.
>
> Small kernel allocations should basically never fail, because we end
> up needing memory for random things, and if a kmalloc() fails it's
> because some application is using too much memory, and the application
> should be killed. Never should the kernel allocation fail. It really
> is that simple. If we are out of memory, that does not mean that we
> should start failing random kernel things.

But you do realize that killing a task as a memory reclaim technique is not 100% reliable, right? Any task might be blocked in an uninterruptible context (e.g. a mutex), waiting for a completion which depends on the allocation's success. The page allocator (resp. the OOM killer) is not aware of these dependencies, and I am really skeptical it ever will be, because dependency tracking is way too expensive. So killing a task doesn't guarantee forward progress.

So I can see basically only a few ways out of this deadlock situation. Either we face the reality and allow small allocations (without __GFP_NOFAIL) to fail after all attempts to reclaim memory have failed (so after even the OOM killer hasn't made any progress). Or we can start killing other tasks, but this might end up in the same state, and the time to resolve the problem might be basically unbounded (it is trivial to construct loads where hundreds of tasks are bashing against a single i_mutex, all of them depending on an allocation...). Or we can panic/reboot the system if the OOM situation cannot be solved within a selected timeout.

There are other ways to micro-optimize the current implementation by playing with memory reserves, but all that just postpones the final disaster; there is still a point of no further progress that we have to deal with somehow.

> So this "people should check for allocation failures" is bullshit.
> It's a computer science myth. It's simply not true in all cases.

Sure, it is not true in _all_ cases. Paths which cannot fail can use __GFP_NOFAIL for that purpose. The point is that most allocations _can_ handle the failure. People are taught to check for allocation failures. We even have scripts/coccinelle/null/kmerr.cocci which helps to detect slab allocator users to some degree.

> Kernel allocators that know that they do large allocations (ie bigger
> than a few pages) need to be able to handle the failure, but not the
> general case. Also, kernel allocators that know they have a good
> fallback (eg they try a large allocation first but can fall back to a
> smaller one) should use __GFP_NORETRY, but again, that does *not* in
> any way mean that general kernel allocations should randomly fail.
>
> So no. The answer is ABSOLUTELY NOT "everybody should check allocation
> failure". Get over it. I refuse to go through that circus again. It's
> stupid.
>
>                  Linus

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 213+ messages in thread
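The "check and handle ENOMEM" discipline Michal is defending is the standard unwind pattern: every allocation site checks the result, undoes any partial work, and propagates the error upward. A minimal userspace sketch (the struct and function names are illustrative, not from the kernel):

```c
#include <errno.h>
#include <stdlib.h>

/* A two-part object whose initialization can fail partway through. */
struct pair {
	int *a;
	int *b;
};

/* On any allocation failure: free what was already allocated,
 * leave the object in a clean state, and return -ENOMEM so the
 * caller can cope. This is the shape coccinelle-style checkers
 * look for at slab allocation sites. */
static int pair_init(struct pair *p, size_t n)
{
	p->a = calloc(n, sizeof(*p->a));
	if (!p->a)
		return -ENOMEM;

	p->b = calloc(n, sizeof(*p->b));
	if (!p->b) {
		free(p->a);	/* unwind the first allocation */
		p->a = NULL;
		return -ENOMEM;
	}
	return 0;
}

static void pair_destroy(struct pair *p)
{
	free(p->a);
	free(p->b);
	p->a = p->b = NULL;
}
```

Linus's counterpoint is not that this pattern is wrong where it exists, but that for small allocations deep in arbitrary call chains it cannot be relied upon everywhere, so the allocator retries instead of returning NULL.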
* Re: can't oom-kill zap the victim's memory? 2015-10-05 14:44 ` Michal Hocko @ 2015-10-07 5:16 ` Vlastimil Babka -1 siblings, 0 replies; 213+ messages in thread From: Vlastimil Babka @ 2015-10-07 5:16 UTC (permalink / raw) To: Michal Hocko, Linus Torvalds Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Kyle Walker, Christoph Lameter, Andrew Morton, Johannes Weiner, Vladimir Davydov, linux-mm, Linux Kernel Mailing List, Stanislav Kozina On 5.10.2015 16:44, Michal Hocko wrote: > So I can see basically only a few ways out of this deadlock situation. > Either we face the reality and allow small allocations (without > __GFP_NOFAIL) to fail after all attempts to reclaim memory have failed > (so after even OOM killer hasn't made any progress). Note that small allocations already *can* fail if they are done in the context of a task selected as OOM victim (i.e. TIF_MEMDIE). And yeah, I've seen a case when they failed in code that "handled" the allocation failure with a BUG_ON(!page). ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory? 2015-10-07 5:16 ` Vlastimil Babka @ 2015-10-07 10:43 ` Tetsuo Handa -1 siblings, 0 replies; 213+ messages in thread From: Tetsuo Handa @ 2015-10-07 10:43 UTC (permalink / raw) To: vbabka Cc: mhocko, torvalds, rientjes, oleg, kwalker, cl, akpm, hannes, vdavydov, linux-mm, linux-kernel, skozina Vlastimil Babka wrote: > On 5.10.2015 16:44, Michal Hocko wrote: > > So I can see basically only a few ways out of this deadlock situation. > > Either we face the reality and allow small allocations (without > > __GFP_NOFAIL) to fail after all attempts to reclaim memory have failed > > (so after even OOM killer hasn't made any progress). > > Note that small allocations already *can* fail if they are done in the context > of a task selected as OOM victim (i.e. TIF_MEMDIE). And yeah, I've seen a case > when they failed in code that "handled" the allocation failure with a > BUG_ON(!page). > Did you hit the race described below? http://lkml.kernel.org/r/201508272249.HDH81838.FtQOLMFFOVSJOH@I-love.SAKURA.ne.jp Where was the BUG_ON(!page)? Maybe it is a candidate for adding __GFP_NOFAIL. ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory? 2015-10-07 10:43 ` Tetsuo Handa @ 2015-10-08 9:40 ` Vlastimil Babka -1 siblings, 0 replies; 213+ messages in thread From: Vlastimil Babka @ 2015-10-08 9:40 UTC (permalink / raw) To: Tetsuo Handa Cc: mhocko, torvalds, rientjes, oleg, kwalker, cl, akpm, hannes, vdavydov, linux-mm, linux-kernel, skozina On 10/07/2015 12:43 PM, Tetsuo Handa wrote: > Vlastimil Babka wrote: >> On 5.10.2015 16:44, Michal Hocko wrote: >>> So I can see basically only a few ways out of this deadlock situation. >>> Either we face the reality and allow small allocations (without >>> __GFP_NOFAIL) to fail after all attempts to reclaim memory have failed >>> (so after even OOM killer hasn't made any progress). >> >> Note that small allocations already *can* fail if they are done in the context >> of a task selected as OOM victim (i.e. TIF_MEMDIE). And yeah, I've seen a case >> when they failed in code that "handled" the allocation failure with a >> BUG_ON(!page). >> > Did you hit the race described below? I don't know, I don't even have direct evidence of TIF_MEMDIE being set, but OOMs were happening all over the place, and I haven't found another reason why the allocation would not be too-small-to-fail otherwise. > http://lkml.kernel.org/r/201508272249.HDH81838.FtQOLMFFOVSJOH@I-love.SAKURA.ne.jp > > Where was the BUG_ON(!page)? Maybe it is a candidate for adding __GFP_NOFAIL. Yes, I suggested so: http://marc.info/?l=linux-kernel&m=144181523115244&w=2 ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory? 2015-10-02 19:01 ` Linus Torvalds @ 2015-10-06 7:55 ` Eric W. Biederman -1 siblings, 0 replies; 213+ messages in thread From: Eric W. Biederman @ 2015-10-06 7:55 UTC (permalink / raw) To: Linus Torvalds Cc: Michal Hocko, Tetsuo Handa, David Rientjes, Oleg Nesterov, Kyle Walker, Christoph Lameter, Andrew Morton, Johannes Weiner, Vladimir Davydov, linux-mm, Linux Kernel Mailing List, Stanislav Kozina Linus Torvalds <torvalds@linux-foundation.org> writes: > On Fri, Oct 2, 2015 at 8:36 AM, Michal Hocko <mhocko@kernel.org> wrote: >> >> Have they been reported/fixed? All kernel paths doing an allocation are >> _supposed_ to check and handle ENOMEM. If they are not then they are >> buggy and should be fixed. > > No. Stop this theoretical idiocy. > > We've tried it. I objected before people tried it, and it turns out > that it was a horrible idea. > > Small kernel allocations should basically never fail, because we end > up needing memory for random things, and if a kmalloc() fails it's > because some application is using too much memory, and the application > should be killed. Never should the kernel allocation fail. It really > is that simple. If we are out of memory, that does not mean that we > should start failing random kernel things. > > So this "people should check for allocation failures" is bullshit. > It's a computer science myth. It's simply not true in all cases. > > Kernel allocators that know that they do large allocations (ie bigger > than a few pages) need to be able to handle the failure, but not the > general case. Also, kernel allocators that know they have a good > fallback (eg they try a large allocation first but can fall back to a > smaller one) should use __GFP_NORETRY, but again, that does *not* in > any way mean that general kernel allocations should randomly fail. > > So no. The answer is ABSOLUTELY NOT "everybody should check allocation > failure". Get over it. 
> I refuse to go through that circus again. It's stupid. Not to take away from your point about very small allocations. However, assuming allocations larger than a page will always succeed is downright dangerous. Last time this issue rose up and bit me I sat down and did the math, and it is ugly. You have to have 50% of the memory free to guarantee that an order 1 allocation will succeed. So quite frankly I think it is only safe to require order 0 allocations to succeed. Larger allocations do fail in practice, and it causes real problems on real workloads when we try and loop forever waiting for something that will never come. My analysis from when it bit me. commit 96c7a2ff21501691587e1ae969b83cbec8b78e08 Author: Eric W. Biederman <ebiederm@xmission.com> Date: Mon Feb 10 14:25:41 2014 -0800 fs/file.c:fdtable: avoid triggering OOMs from alloc_fdmem Recently due to a spike in connections per second memcached on 3 separate boxes triggered the OOM killer from accept. At the time the OOM killer was triggered there was 4GB out of 36GB free in zone 1. The problem was that alloc_fdtable was allocating an order 3 page (32KiB) to hold a bitmap, and there was sufficient fragmentation that the largest page available was 8KiB. I find the logic that PAGE_ALLOC_COSTLY_ORDER can't fail pretty dubious but I do agree that order 3 allocations are very likely to succeed. There are always pathologies where order > 0 allocations can fail when there are copious amounts of free memory available. Using the pigeonhole principle it is easy to show that it requires 1 page more than 50% of the pages being free to guarantee an order 1 (8KiB) allocation will succeed, 1 page more than 75% of the pages being free to guarantee an order 2 (16KiB) allocation will succeed, and 1 page more than 87.5% of the pages being free to guarantee an order 3 allocation will succeed. 
A server churning memory with a lot of small requests and replies like memcached is a common case that, if anything, will skew the odds against large pages being available. Therefore let's not give external applications a practical way to kill Linux server applications, and specify __GFP_NORETRY to the kmalloc in alloc_fdmem. Unless I am misreading the code, by the time the code reaches should_alloc_retry in __alloc_pages_slowpath (where __GFP_NORETRY becomes significant) we have already tried everything reasonable to allocate a page and the only thing left to do is wait. So not waiting and falling back to vmalloc immediately seems like the reasonable thing to do even if there wasn't a chance of triggering the OOM killer. Eric ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory? 2015-10-06 7:55 ` Eric W. Biederman @ 2015-10-06 8:49 ` Linus Torvalds -1 siblings, 0 replies; 213+ messages in thread From: Linus Torvalds @ 2015-10-06 8:49 UTC (permalink / raw) To: Eric W. Biederman Cc: Michal Hocko, Tetsuo Handa, David Rientjes, Oleg Nesterov, Kyle Walker, Christoph Lameter, Andrew Morton, Johannes Weiner, Vladimir Davydov, linux-mm, Linux Kernel Mailing List, Stanislav Kozina On Tue, Oct 6, 2015 at 8:55 AM, Eric W. Biederman <ebiederm@xmission.com> wrote: > > Not to take away from your point about very small allocations. However > assuming allocations larger than a page will always succeed is down > right dangerous. We've required retrying for *at least* order-1 allocations. Exactly because things like fork() etc have wanted them, and: - as you say, you can be unlucky even with reasonable amounts of free memory - the page-out code is approximate and doesn't guarantee that you get buddy coalescing - just failing after a couple of loops has been known to result in fork() and similar friends returning -EAGAIN and breaking user space. Really. Stop this idiocy. We have gone through this before. It's a disaster. The basic fact remains: kernel allocations are so important that rather than fail, you should kill user space. Only kernel allocations that *explicitly* know that they have fallback code should fail, and they should just do the __GFP_NORETRY. So the rule ends up being that we retry the memory freeing loop for small allocations (where "small" is something like "order 2 or less") So really. If you find some particular case that is painful because it wants an order-1 or order-2 allocation, then you do this: - do the allocation with GFP_NORETRY - have a fallback that uses vmalloc or just is able to make the buffer even smaller. But by default we will continue to make small orders retry. As mentioned, we have tried the alternatives. It doesn't work. 
Linus ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory? 2015-10-06 8:49 ` Linus Torvalds @ 2015-10-06 8:55 ` Linus Torvalds -1 siblings, 0 replies; 213+ messages in thread From: Linus Torvalds @ 2015-10-06 8:55 UTC (permalink / raw) To: Eric W. Biederman Cc: Michal Hocko, Tetsuo Handa, David Rientjes, Oleg Nesterov, Kyle Walker, Christoph Lameter, Andrew Morton, Johannes Weiner, Vladimir Davydov, linux-mm, Linux Kernel Mailing List, Stanislav Kozina On Tue, Oct 6, 2015 at 9:49 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > The basic fact remains: kernel allocations are so important that > rather than fail, you should kill user space. Only kernel allocations > that *explicitly* know that they have fallback code should fail, and > they should just do the __GFP_NORETRY. To be clear: "big" orders (I forget if the limit is at order-3 or order-4) do fail much more aggressively. But no, we do not limit retry to just order-0, because even small kmalloc sizes tend to often do order-1 or order-2 just because of memory packing issues (ie trying to pack into a single page wastes too much memory if the allocation sizes don't come out right). So no, order-0 isn't special. 1/2 are rather important too. [ Checking /proc/slabinfo: it looks like several slabs are order-3, for things like files_cache, signal_cache and sighand_cache for me at least. So I think it's up to order-3 that we basically need to consider "we'll need to shrink user space aggressively unless we have an explicit fallback for the allocation" ] Linus ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory? 2015-10-06 8:55 ` Linus Torvalds @ 2015-10-06 14:52 ` Eric W. Biederman -1 siblings, 0 replies; 213+ messages in thread From: Eric W. Biederman @ 2015-10-06 14:52 UTC (permalink / raw) To: Linus Torvalds Cc: Michal Hocko, Tetsuo Handa, David Rientjes, Oleg Nesterov, Kyle Walker, Christoph Lameter, Andrew Morton, Johannes Weiner, Vladimir Davydov, linux-mm, Linux Kernel Mailing List, Stanislav Kozina Linus Torvalds <torvalds@linux-foundation.org> writes: > On Tue, Oct 6, 2015 at 9:49 AM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> The basic fact remains: kernel allocations are so important that >> rather than fail, you should kill user space. Only kernel allocations >> that *explicitly* know that they have fallback code should fail, and >> they should just do the __GFP_NORETRY. If you have reached the point of killing userspace you might as well panic the box. Userspace will recover more cleanly and more quickly. The oom-killer is like an oops. Nice for debugging but not something you want on a production workload. > To be clear: "big" orders (I forget if the limit is at order-3 or > order-4) do fail much more aggressively. But no, we do not limit retry > to just order-0, because even small kmalloc sizes tend to often do > order-1 or order-2 just because of memory packing issues (ie trying to > pack into a single page wastes too much memory if the allocation sizes > don't come out right). I am not asking that we limit retry to just order-0 pages. I am asking that we limit the oom-killer on failure to just order-0 pages. > So no, order-0 isn't special. 1/2 are rather important too. That is a justification for retrying. That is not a justification for killing the box. > [ Checking /proc/slabinfo: it looks like several slabs are order-3, > for things like files_cache, signal_cache and sighand_cache for me at > least. 
So I think it's up to order-3 that we basically need to > consider "we'll need to shrink user space aggressively unless we have > an explicit fallback for the allocation" ] What I know is that order-3 is definitely too big. I had 4G of RAM free. I needed 16K to expand the fd table. The box died. That is not good. We have static checkers now; failure to check and handle errors tends to be caught. So yes, for the rare case of order-[123] allocations failing we should return the failure to the caller. The kernel can handle it. Userspace can handle just about anything better than random processes dying. Eric ^ permalink raw reply [flat|nested] 213+ messages in thread
* Can't we use timeout based OOM warning/killing? 2015-10-02 12:36 ` Michal Hocko @ 2015-10-03 6:02 ` Tetsuo Handa -1 siblings, 0 replies; 213+ messages in thread From: Tetsuo Handa @ 2015-10-03 6:02 UTC (permalink / raw) To: mhocko Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov, linux-mm, linux-kernel, skozina Michal Hocko wrote: > On Tue 29-09-15 01:18:00, Tetsuo Handa wrote: > > Michal Hocko wrote: > > > The point I've tried to made is that oom unmapper running in a detached > > > context (e.g. kernel thread) vs. directly in the oom context doesn't > > > make any difference wrt. lock because the holders of the lock would loop > > > inside the allocator anyway because we do not fail small allocations. > > > > We tried to allow small allocations to fail. It resulted in unstable system > > with obscure bugs. > > Have they been reported/fixed? All kernel paths doing an allocation are > _supposed_ to check and handle ENOMEM. If they are not then they are > buggy and should be fixed. > Kernel developers are not interested in testing OOM cases. I proposed a SystemTap-based mandatory memory allocation failure injection for testing OOM cases, but there was no response. Most of memory allocation failure paths in the kernel remain untested. Unless you persuade all kernel developers to test OOM cases and add a gfp flag which bypasses memory allocation failure injection test (e.g. __GFP_FITv1_PASSED) and change any !__GFP_FITv1_PASSED && !__GFP_NOFAIL allocations always fail, we can't check that "all kernel paths doing an allocation are _supposed_ to check and handle ENOMEM". > > We tried to allow small !__GFP_FS allocations to fail. It failed to fail by > > effectively __GFP_NOFAIL allocations. > > What do you mean by that? An opencoded __GFP_NOFAIL? > Yes. XFS livelock is an example I can trivially reproduce. Loss of reliability of buffered write()s is another example. 
[ 1721.405074] buffer_io_error: 36 callbacks suppressed
[ 1721.406263] Buffer I/O error on dev sda1, logical block 34652401, lost async page write
[ 1721.406996] Buffer I/O error on dev sda1, logical block 34650278, lost async page write
[ 1721.407125] Buffer I/O error on dev sda1, logical block 34652330, lost async page write
[ 1721.407197] Buffer I/O error on dev sda1, logical block 34653485, lost async page write
[ 1721.407203] Buffer I/O error on dev sda1, logical block 34652398, lost async page write
[ 1721.407232] Buffer I/O error on dev sda1, logical block 34650494, lost async page write
[ 1721.407356] Buffer I/O error on dev sda1, logical block 34652361, lost async page write
[ 1721.407386] Buffer I/O error on dev sda1, logical block 34653484, lost async page write
[ 1721.407481] Buffer I/O error on dev sda1, logical block 34652396, lost async page write
[ 1721.407504] Buffer I/O error on dev sda1, logical block 34650291, lost async page write
[ 1723.369963] XFS: a.out(8241) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
[ 1723.810033] XFS: a.out(7788) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
[ 1725.434057] XFS: a.out(8171) possible memory allocation deadlock in kmem_alloc (mode:0x8250)
[ 1725.448049] XFS: a.out(7810) possible memory allocation deadlock in kmem_alloc (mode:0x8250)
[ 1725.470757] XFS: a.out(8122) possible memory allocation deadlock in kmem_alloc (mode:0x8250)
[ 1725.474061] XFS: a.out(7881) possible memory allocation deadlock in kmem_alloc (mode:0x8250)
[ 1725.586610] XFS: a.out(8241) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
[ 1726.026702] XFS: a.out(7770) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
[ 1726.043988] XFS: a.out(7788) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
[ 1727.682001] XFS: a.out(8122) possible memory allocation deadlock in kmem_alloc (mode:0x8250)
[ 1727.688661] XFS: a.out(8171) possible memory allocation deadlock in kmem_alloc (mode:0x8250)
[ 1727.785214] XFS: a.out(8241) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
[ 1728.226640] XFS: a.out(7770) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
[ 1728.290648] XFS: a.out(7788) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
[ 1729.930028] XFS: a.out(8171) possible memory allocation deadlock in kmem_alloc (mode:0x8250)

> > We are now trying to allow zapping OOM victim's mm. Michal is already
> > skeptical about this approach due to lock dependency.
>
> I am not sure where this came from. I am all for this approach. It will
> not solve the problem completely for sure but it can help in many cases
> already.

Sorry. This was my misunderstanding. But I still think that we need to be prepared for cases where the approach of zapping the OOM victim's mm fails. ( http://lkml.kernel.org/r/201509242050.EHE95837.FVFOOtMQHLJOFS@I-love.SAKURA.ne.jp )

> > We already spent 9 months on this OOM livelock. No silver bullet yet.
> > Proposed approaches are too drastic to backport for existing users.
> > I think we are out of bullet.
>
> Not at all. We have this problem since ever basically. And we have a lot
> of legacy issues to care about. But nobody could reasonably expect this
> will be solved in a short time period.

What people generally imagine with the OOM killer is that it is invoked when the system is out of memory. But we know that there are many possible cases where OOM killer messages are not printed. We have made no effort to break people free of the belief that the OOM killer is always invoked when the system is out of memory, nor to provide a means to warn of an OOM situation, even after we recognized the "too small to fail" memory-allocation rule ( https://lwn.net/Articles/627419/ ) 9 months ago.
> > Until we complete adding/testing __GFP_NORETRY (or __GFP_KILLABLE) to most
> > of callsites,
>
> This is simply not doable. There are thousand of allocation sites all
> over the kernel.

But changing the default behavior (i.e. implicitly behaving like __GFP_NORETRY inside the memory allocator unless __GFP_NOFAIL is passed) is also not doable. You would need ACKs from thousands of allocation sites all over the kernel, and that is not realistic.

An example: I proposed a patch which changes the default behavior in XFS and got feedback ( http://marc.info/?l=linux-mm&m=144279862227010 ) that fundamentally changing the allocation behavior of the filesystem requires some indication of testing and characterization of how the change impacts the low-memory balance and performance of the filesystem. You would need ACKs from all filesystem developers.

Another example: I don't like that permission checks for access requests from user space start failing with ENOMEM when memory is tight. It is not acceptable that access requests by critical processes fail because of an inconsequential process's memory consumption. ( https://www.mail-archive.com/tomoyo-users-en@lists.osdn.me/msg00008.html )

This problem is not limited to permission checks. If a process has passed the point of no return in an execve() operation, any memory allocation failure before the point where ENOMEM errors can be handled (e.g. failing to load shared libraries before the new program's main() function is called) kills the process. If that process is the global init process, the system panics. Even though we mean simply to enforce that "all kernel paths doing an allocation are _supposed_ to check and handle ENOMEM", there is a window in which a memory allocation failure results in an unrecoverable failure in user space. We depend on /proc/$pid/oom_score_adj to protect critical processes from inconsequential ones.
I'm happy to give up a memory allocation upon SIGKILL, but I'm not happy to give up upon ENOMEM without making an effort to resolve the OOM situation.

> > timeout based workaround will be the only bullet we can use.
>
> Those are the last resort which only paper over real bugs which should
> be fixed. I would agree with your urging if this was something that can
> easily happen on a _properly_ configured system. System which can blow
> into an OOM storm is far from being configured properly. If you have an
> untrusted users running on your system you should better put them into a
> highly restricted environment and limit as much as possible.

People are reporting hang-up problems, and I suspect that some of them are caused by silent OOM. I showed you that there are many possible paths which can lead to a silent hang-up, but we are forcing people to use kernels without a means to find out what is happening. Therefore, "there is no report" does not mean "we are not hitting OOM livelock problems". Without a means to find out what is happening, we will "overlook real bugs" before we "paper over real bugs". Such a means is expected to work without knowledge of the tracepoint functionality, to run without allocating memory, to dump its output without administrator intervention, and to complete before watchdog timers reset the machine.

> I can completely understand your frustration about the pace of the
> progress here but this is nothing new and we should strive for long term
> vision which would be much less fragile than what we have right now. No
> timeout based solution is the way in that direction.

Can we stop setting TIF_MEMDIE on only one randomly chosen task and then staying silent forever in the hope that the task makes a quick exit? As long as small allocations do not fail, this TIF_MEMDIE logic is prone to livelock.
We won't be able to make small allocations fail in the near future (as Linus said at http://lkml.kernel.org/r/CA+55aFw=OLSdh-5Ut2vjy=4Yf1fTXqpzoDHdF7XnT5gDHs6sYA@mail.gmail.com and as I said in this post). Like I said at http://lkml.kernel.org/r/201510012113.HEA98301.SVFQOFtFOHLMOJ@I-love.SAKURA.ne.jp , can't we start by adding a means to emit some diagnostic kernel messages automatically?

^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: Can't we use timeout based OOM warning/killing?
  2015-10-03  6:02 ` Tetsuo Handa
@ 2015-10-06 14:51 ` Tetsuo Handa
  0 siblings, 0 replies; 213+ messages in thread
From: Tetsuo Handa @ 2015-10-06 14:51 UTC (permalink / raw)
To: mhocko
Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov, linux-mm, linux-kernel, skozina

Tetsuo Handa wrote:
> Sorry. This was my misunderstanding. But I still think that we need to be
> prepared for cases where zapping OOM victim's mm approach fails.
> ( http://lkml.kernel.org/r/201509242050.EHE95837.FVFOOtMQHLJOFS@I-love.SAKURA.ne.jp )

I tested how easy or difficult it is to make the approach of zapping the OOM victim's mm fail. The result: it is not difficult to make it fail.

---------- Reproducer start ----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/mman.h>

static int reader(void *unused)
{
	char c;
	int fd = open("/proc/self/cmdline", O_RDONLY);
	while (pread(fd, &c, 1, 0) == 1);
	return 0;
}

static int writer(void *unused)
{
	const int fd = open("/proc/self/exe", O_RDONLY);
	static void *ptr[10000];
	int i;
	sleep(2);
	while (1) {
		for (i = 0; i < 10000; i++)
			ptr[i] = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
		for (i = 0; i < 10000; i++)
			munmap(ptr[i], 4096);
	}
	return 0;
}

int main(int argc, char *argv[])
{
	int zero_fd = open("/dev/zero", O_RDONLY);
	char *buf = NULL;
	unsigned long size = 0;
	int i;
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	for (i = 0; i < 100; i++) {
		clone(reader, malloc(1024) + 1024,
		      CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL);
	}
	clone(writer, malloc(1024) + 1024,
	      CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL);
	read(zero_fd, buf, size); /* Will cause OOM due to overcommit */
	return * (char *) NULL; /* Kill all threads. */
}
---------- Reproducer end ----------

(I wrote this program to mimic a problem where a customer's system hung up with a lot of ps processes blocked reading /proc/pid/ entries due to an unkillable down_read(&mm->mmap_sem) in __access_remote_vm(). I could not identify which function was holding mmap_sem for writing, though.)

Uptime > 429 of http://I-love.SAKURA.ne.jp/tmp/serial-20151006.txt.xz shows an OOM livelock in which:

(1) the thread group leader is blocked at down_read(&mm->mmap_sem) in exit_mm() called from do_exit();
(2) the writer thread is blocked at down_write(&mm->mmap_sem) in vm_mmap_pgoff() called from SyS_mmap_pgoff() called from SyS_mmap();
(3) many reader threads are blocking the writer thread via down_read(&mm->mmap_sem) called from proc_pid_cmdline_read();
(4) while the thread group leader is blocked at down_read(&mm->mmap_sem), some of the reader threads are trying to allocate memory via page faults.

So, zapping the first OOM victim's mm might fail by chance.

^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: Can't we use timeout based OOM warning/killing?
  2015-10-06 14:51 ` Tetsuo Handa
@ 2015-10-12  6:43 ` Tetsuo Handa
  0 siblings, 0 replies; 213+ messages in thread
From: Tetsuo Handa @ 2015-10-12 6:43 UTC (permalink / raw)
To: mhocko
Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov, linux-mm, linux-kernel, skozina

Tetsuo Handa wrote:
> So, zapping the first OOM victim's mm might fail by chance.

I retested with a slightly different version.

---------- Reproducer start ----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/mman.h>

static int writer(void *unused)
{
	const int fd = open("/proc/self/exe", O_RDONLY);
	while (1) {
		void *ptr = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
		munmap(ptr, 4096);
	}
	return 0;
}

int main(int argc, char *argv[])
{
	char buffer[128] = { };
	const pid_t pid = fork();
	if (pid == 0) {
		/* down_write(&mm->mmap_sem) requester which is chosen as an OOM victim. */
		int i;
		for (i = 0; i < 9; i++)
			clone(writer, malloc(1024) + 1024,
			      CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL);
		writer(NULL);
	}
	snprintf(buffer, sizeof(buffer) - 1, "/proc/%u/stat", pid);
	if (fork() == 0) {
		/* down_read(&mm->mmap_sem) requester. */
		const int fd = open(buffer, O_RDONLY);
		while (pread(fd, buffer, sizeof(buffer), 0) > 0);
		_exit(0);
	} else {
		/* A dummy process for invoking the OOM killer. */
		char *buf = NULL;
		unsigned long size = 0;
		const int fd = open("/dev/zero", O_RDONLY);
		for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
			char *cp = realloc(buf, size);
			if (!cp) {
				size >>= 1;
				break;
			}
			buf = cp;
		}
		read(fd, buf, size); /* Will cause OOM due to overcommit */
		return 0;
	}
}
---------- Reproducer end ----------

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20151012.txt.xz .

Uptime between 101 and 300 is a silent hang-up (i.e. no OOM killer messages, no SIGKILL-pending tasks, no TIF_MEMDIE tasks) which I resolved using SysRq-f at uptime = 289. I don't know the reason for this silent hang-up, but the memory-unzapping kernel thread will not help because there is no OOM victim.

----------
[  101.438951] MemAlloc-Info: 10 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[  111.817922] MemAlloc-Info: 12 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[  122.281828] MemAlloc-Info: 13 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[  132.793724] MemAlloc-Info: 14 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[  143.336154] MemAlloc-Info: 16 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[  289.343187] sysrq: SysRq : Manual OOM execution
(...snipped...)
[  292.065650] MemAlloc-Info: 16 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[  302.590736] kworker/3:2 invoked oom-killer: gfp_mask=0x24000c0, order=-1, oom_score_adj=0
(...snipped...)
[  302.690047] MemAlloc-Info: 4 stalling task, 0 dying task, 0 victim task.
----------

Uptime between 379 and 605 is a mmap_sem livelock after the OOM killer was invoked.
---------- [ 380.039897] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name [ 380.042500] [ 467] 0 467 14047 1815 28 3 0 0 systemd-journal [ 380.045055] [ 482] 0 482 10413 259 23 3 0 -1000 systemd-udevd [ 380.047637] [ 504] 0 504 12795 119 25 3 0 -1000 auditd [ 380.050127] [ 1244] 0 1244 82428 4257 81 3 0 0 firewalld [ 380.052536] [ 1247] 70 1247 6988 61 21 3 0 0 avahi-daemon [ 380.055028] [ 1250] 0 1250 54104 1372 42 4 0 0 rsyslogd [ 380.057505] [ 1251] 0 1251 137547 2620 91 3 0 0 tuned [ 380.059996] [ 1255] 0 1255 4823 77 15 3 0 0 irqbalance [ 380.062552] [ 1256] 0 1256 1095 37 8 3 0 0 rngd [ 380.065020] [ 1259] 0 1259 53626 441 60 3 0 0 abrtd [ 380.067383] [ 1260] 0 1260 53001 341 58 5 0 0 abrt-watch-log [ 380.069965] [ 1265] 0 1265 8673 83 21 3 0 0 systemd-logind [ 380.072554] [ 1266] 81 1266 6663 117 18 3 0 -900 dbus-daemon [ 380.075122] [ 1272] 0 1272 31577 154 21 3 0 0 crond [ 380.077544] [ 1314] 70 1314 6988 57 19 3 0 0 avahi-daemon [ 380.080013] [ 1427] 0 1427 46741 225 44 3 0 0 vmtoolsd [ 380.082478] [ 1969] 0 1969 25942 3100 48 3 0 0 dhclient [ 380.084969] [ 1990] 999 1990 128626 1929 50 4 0 0 polkitd [ 380.087516] [ 2073] 0 2073 20629 214 45 3 0 -1000 sshd [ 380.090065] [ 2201] 0 2201 7320 68 21 3 0 0 xinetd [ 380.092465] [ 3215] 0 3215 22773 257 44 3 0 0 master [ 380.094879] [ 3217] 89 3217 22816 249 45 3 0 0 qmgr [ 380.097304] [ 3249] 0 3249 75245 315 97 3 0 0 nmbd [ 380.099666] [ 3259] 0 3259 92963 486 131 5 0 0 smbd [ 380.101956] [ 3282] 0 3282 27503 30 12 3 0 0 agetty [ 380.104277] [ 3283] 0 3283 21788 154 49 3 0 0 login [ 380.106574] [ 3286] 0 3286 92963 486 126 5 0 0 smbd [ 380.108835] [ 3296] 1000 3296 28864 117 13 3 0 0 bash [ 380.111073] [ 3374] 89 3374 22799 249 46 3 0 0 pickup [ 380.113298] [ 3378] 89 3378 22836 252 45 3 0 0 cleanup [ 380.115555] [ 3385] 89 3385 22800 248 44 3 0 0 trivial-rewrite [ 380.117811] [ 3392] 0 3392 22825 265 48 3 0 0 local [ 380.119995] [ 3393] 0 3393 30828 59 17 3 0 0 anacron [ 380.122183] [ 
* Re: Can't we use timeout based OOM warning/killing?
@ 2015-10-12  6:43 ` Tetsuo Handa
  0 siblings, 0 replies; 213+ messages in thread
From: Tetsuo Handa @ 2015-10-12  6:43 UTC (permalink / raw)
To: mhocko
Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov, linux-mm, linux-kernel, skozina

Tetsuo Handa wrote:
> So, zapping the first OOM victim's mm might fail by chance.

I retested with a slightly different version.

---------- Reproducer start ----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/mman.h>

static int writer(void *unused)
{
	const int fd = open("/proc/self/exe", O_RDONLY);
	while (1) {
		void *ptr = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
		munmap(ptr, 4096);
	}
	return 0;
}

int main(int argc, char *argv[])
{
	char buffer[128] = { };
	const pid_t pid = fork();
	if (pid == 0) {
		/* down_write(&mm->mmap_sem) requester which is chosen as an OOM victim. */
		int i;
		for (i = 0; i < 9; i++)
			clone(writer, malloc(1024) + 1024,
			      CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL);
		writer(NULL);
	}
	snprintf(buffer, sizeof(buffer) - 1, "/proc/%u/stat", pid);
	if (fork() == 0) {
		/* down_read(&mm->mmap_sem) requester. */
		const int fd = open(buffer, O_RDONLY);
		while (pread(fd, buffer, sizeof(buffer), 0) > 0);
		_exit(0);
	} else {
		/* A dummy process for invoking the OOM killer. */
		char *buf = NULL;
		unsigned long size = 0;
		const int fd = open("/dev/zero", O_RDONLY);
		for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
			char *cp = realloc(buf, size);
			if (!cp) {
				size >>= 1;
				break;
			}
			buf = cp;
		}
		read(fd, buf, size); /* Will cause OOM due to overcommit */
		return 0;
	}
}
---------- Reproducer end ----------

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20151012.txt.xz .

Uptime between 101 and 300 is a silent hang up (i.e.
no OOM killer messages, no SIGKILL pending tasks, no TIF_MEMDIE tasks) which I solved using SysRq-f at uptime = 289. I don't know the reason of this silent hang up, but the memory unzapping kernel thread will not help because there is no OOM victim.

----------
[  101.438951] MemAlloc-Info: 10 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[  111.817922] MemAlloc-Info: 12 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[  122.281828] MemAlloc-Info: 13 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[  132.793724] MemAlloc-Info: 14 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[  143.336154] MemAlloc-Info: 16 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[  289.343187] sysrq: SysRq : Manual OOM execution
(...snipped...)
[  292.065650] MemAlloc-Info: 16 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[  302.590736] kworker/3:2 invoked oom-killer: gfp_mask=0x24000c0, order=-1, oom_score_adj=0
(...snipped...)
[  302.690047] MemAlloc-Info: 4 stalling task, 0 dying task, 0 victim task.
----------

Uptime between 379 and 605 is a mmap_sem livelock after the OOM killer was invoked.
---------- [ 380.039897] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name [ 380.042500] [ 467] 0 467 14047 1815 28 3 0 0 systemd-journal [ 380.045055] [ 482] 0 482 10413 259 23 3 0 -1000 systemd-udevd [ 380.047637] [ 504] 0 504 12795 119 25 3 0 -1000 auditd [ 380.050127] [ 1244] 0 1244 82428 4257 81 3 0 0 firewalld [ 380.052536] [ 1247] 70 1247 6988 61 21 3 0 0 avahi-daemon [ 380.055028] [ 1250] 0 1250 54104 1372 42 4 0 0 rsyslogd [ 380.057505] [ 1251] 0 1251 137547 2620 91 3 0 0 tuned [ 380.059996] [ 1255] 0 1255 4823 77 15 3 0 0 irqbalance [ 380.062552] [ 1256] 0 1256 1095 37 8 3 0 0 rngd [ 380.065020] [ 1259] 0 1259 53626 441 60 3 0 0 abrtd [ 380.067383] [ 1260] 0 1260 53001 341 58 5 0 0 abrt-watch-log [ 380.069965] [ 1265] 0 1265 8673 83 21 3 0 0 systemd-logind [ 380.072554] [ 1266] 81 1266 6663 117 18 3 0 -900 dbus-daemon [ 380.075122] [ 1272] 0 1272 31577 154 21 3 0 0 crond [ 380.077544] [ 1314] 70 1314 6988 57 19 3 0 0 avahi-daemon [ 380.080013] [ 1427] 0 1427 46741 225 44 3 0 0 vmtoolsd [ 380.082478] [ 1969] 0 1969 25942 3100 48 3 0 0 dhclient [ 380.084969] [ 1990] 999 1990 128626 1929 50 4 0 0 polkitd [ 380.087516] [ 2073] 0 2073 20629 214 45 3 0 -1000 sshd [ 380.090065] [ 2201] 0 2201 7320 68 21 3 0 0 xinetd [ 380.092465] [ 3215] 0 3215 22773 257 44 3 0 0 master [ 380.094879] [ 3217] 89 3217 22816 249 45 3 0 0 qmgr [ 380.097304] [ 3249] 0 3249 75245 315 97 3 0 0 nmbd [ 380.099666] [ 3259] 0 3259 92963 486 131 5 0 0 smbd [ 380.101956] [ 3282] 0 3282 27503 30 12 3 0 0 agetty [ 380.104277] [ 3283] 0 3283 21788 154 49 3 0 0 login [ 380.106574] [ 3286] 0 3286 92963 486 126 5 0 0 smbd [ 380.108835] [ 3296] 1000 3296 28864 117 13 3 0 0 bash [ 380.111073] [ 3374] 89 3374 22799 249 46 3 0 0 pickup [ 380.113298] [ 3378] 89 3378 22836 252 45 3 0 0 cleanup [ 380.115555] [ 3385] 89 3385 22800 248 44 3 0 0 trivial-rewrite [ 380.117811] [ 3392] 0 3392 22825 265 48 3 0 0 local [ 380.119995] [ 3393] 0 3393 30828 59 17 3 0 0 anacron [ 380.122183] [ 
3417] 1000 3417 541715 397587 787 6 0 0 a.out [ 380.124315] [ 3418] 1000 3418 1081 24 8 3 0 0 a.out [ 380.126410] [ 3419] 1000 3419 1042 21 7 3 0 0 a.out [ 380.128535] Out of memory: Kill process 3417 (a.out) score 890 or sacrifice child [ 380.130392] Killed process 3418 (a.out) total-vm:4324kB, anon-rss:96kB, file-rss:0kB [ 392.704028] MemAlloc-Info: 7 stalling task, 10 dying task, 1 victim task. (...snipped...) [ 601.129977] a.out R running task 0 3417 3296 0x00000080 [ 601.131899] ffff8800774dba10 ffffffff8112b174 0000000000000100 0000000000000000 [ 601.134026] 0000000000000000 0000000000000000 00000000a23cb49d 0000000000000000 [ 601.136076] ffff880077603200 00000000024280ca 0000000000000000 ffff880077603200 [ 601.138090] Call Trace: [ 601.139145] [<ffffffff8112b174>] ? try_to_free_pages+0x94/0xc0 [ 601.140831] [<ffffffff8111a8c4>] ? out_of_memory+0x2f4/0x460 [ 601.142489] [<ffffffff8111fa63>] ? __alloc_pages_nodemask+0x613/0xc30 [ 601.144328] [<ffffffff81161c40>] ? alloc_pages_vma+0xb0/0x200 [ 601.145994] [<ffffffff81143056>] ? handle_mm_fault+0xfa6/0x1370 [ 601.147677] [<ffffffff8162f557>] ? native_iret+0x7/0x7 [ 601.149258] [<ffffffff81058217>] ? __do_page_fault+0x177/0x400 [ 601.150966] [<ffffffff810584d0>] ? do_page_fault+0x30/0x80 [ 601.152625] [<ffffffff81630518>] ? page_fault+0x28/0x30 [ 601.154159] [<ffffffff813230c0>] ? __clear_user+0x20/0x50 [ 601.155723] [<ffffffff81327a68>] ? iov_iter_zero+0x68/0x250 [ 601.157329] [<ffffffff813fc6c8>] ? read_iter_zero+0x38/0xa0 [ 601.158923] [<ffffffff81187f04>] ? __vfs_read+0xc4/0xf0 [ 601.160453] [<ffffffff8118868a>] ? vfs_read+0x7a/0x120 [ 601.161961] [<ffffffff811893a0>] ? SyS_read+0x50/0xc0 [ 601.163513] [<ffffffff8162e9ee>] ? 
entry_SYSCALL_64_fastpath+0x12/0x71 [ 601.165254] a.out D ffff8800777b7e08 0 3418 3417 0x00100084 [ 601.167118] ffff8800777b7e08 ffff880077606400 ffff8800777b8000 ffff880036032e00 [ 601.169137] ffff880036032de8 ffffffff00000000 ffffffff00000001 ffff8800777b7e20 [ 601.171159] ffffffff8162a570 ffff880077606400 ffff8800777b7ea8 ffffffff8162d8eb [ 601.173183] Call Trace: [ 601.174193] [<ffffffff8162a570>] schedule+0x30/0x80 [ 601.175661] [<ffffffff8162d8eb>] rwsem_down_write_failed+0x1fb/0x350 [ 601.177388] [<ffffffff81322f64>] ? call_rwsem_down_read_failed+0x14/0x30 [ 601.179194] [<ffffffff81322f93>] call_rwsem_down_write_failed+0x13/0x20 [ 601.180971] [<ffffffff8162d05f>] ? down_write+0x1f/0x30 [ 601.182509] [<ffffffff81147abe>] vm_munmap+0x2e/0x60 [ 601.183992] [<ffffffff811489fd>] SyS_munmap+0x1d/0x30 [ 601.185485] [<ffffffff8162e9ee>] entry_SYSCALL_64_fastpath+0x12/0x71 [ 601.187224] a.out D ffff88007c60fdf0 0 3420 3417 0x00000084 [ 601.189130] ffff88007c60fdf0 ffff880078e15780 ffff88007c610000 ffff880036032de8 [ 601.191158] ffff880036032e00 ffff88007c60ff58 ffff880078e15780 ffff88007c60fe08 [ 601.193180] ffffffff8162a570 ffff880078e15780 ffff88007c60fe68 ffffffff8162d698 [ 601.195217] Call Trace: [ 601.196226] [<ffffffff8162a570>] schedule+0x30/0x80 [ 601.197683] [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150 [ 601.199407] [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30 [ 601.201192] [<ffffffff8162d032>] ? 
down_read+0x12/0x20 [ 601.202711] [<ffffffff810583f7>] __do_page_fault+0x357/0x400 [ 601.204328] [<ffffffff810584d0>] do_page_fault+0x30/0x80 [ 601.205874] [<ffffffff81630518>] page_fault+0x28/0x30 [ 601.207376] a.out D ffff88007c24fdf0 0 3421 3417 0x00000084 [ 601.209286] ffff88007c24fdf0 ffff880078e13200 ffff88007c250000 ffff880036032de8 [ 601.211316] ffff880036032e00 ffff88007c24ff58 ffff880078e13200 ffff88007c24fe08 [ 601.213335] ffffffff8162a570 ffff880078e13200 ffff88007c24fe68 ffffffff8162d698 [ 601.215356] Call Trace: [ 601.216377] [<ffffffff8162a570>] schedule+0x30/0x80 [ 601.217831] [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150 [ 601.219529] [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30 [ 601.221296] [<ffffffff8162d032>] ? down_read+0x12/0x20 [ 601.222802] [<ffffffff810583f7>] __do_page_fault+0x357/0x400 [ 601.224403] [<ffffffff810584d0>] do_page_fault+0x30/0x80 [ 601.225958] [<ffffffff81630518>] page_fault+0x28/0x30 [ 601.227453] a.out D ffff88007823bdf0 0 3422 3417 0x00000084 [ 601.229348] ffff88007823bdf0 ffff880078e10000 ffff88007823c000 ffff880036032de8 [ 601.231395] ffff880036032e00 ffff88007823bf58 ffff880078e10000 ffff88007823be08 [ 601.233427] ffffffff8162a570 ffff880078e10000 ffff88007823be68 ffffffff8162d698 [ 601.235472] Call Trace: [ 601.236504] [<ffffffff8162a570>] schedule+0x30/0x80 [ 601.237989] [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150 [ 601.239720] [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30 [ 601.241583] [<ffffffff8162d032>] ? 
down_read+0x12/0x20 [ 601.243144] [<ffffffff810583f7>] __do_page_fault+0x357/0x400 [ 601.244777] [<ffffffff810584d0>] do_page_fault+0x30/0x80 [ 601.246307] [<ffffffff81630518>] page_fault+0x28/0x30 [ 601.247823] a.out D ffff88007c483df0 0 3423 3417 0x00000084 [ 601.249719] ffff88007c483df0 ffff880078e13e80 ffff88007c484000 ffff880036032de8 [ 601.251765] ffff880036032e00 ffff88007c483f58 ffff880078e13e80 ffff88007c483e08 [ 601.253808] ffffffff8162a570 ffff880078e13e80 ffff88007c483e68 ffffffff8162d698 [ 601.255831] Call Trace: [ 601.256850] [<ffffffff8162a570>] schedule+0x30/0x80 [ 601.258286] [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150 [ 601.260005] [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30 [ 601.261803] [<ffffffff8162d032>] ? down_read+0x12/0x20 [ 601.263329] [<ffffffff810583f7>] __do_page_fault+0x357/0x400 [ 601.264936] [<ffffffff810584d0>] do_page_fault+0x30/0x80 [ 601.266504] [<ffffffff81630518>] page_fault+0x28/0x30 [ 601.268019] a.out D ffff880035893e08 0 3424 3417 0x00000084 [ 601.269940] ffff880035893e08 ffff880078e17080 ffff880035894000 ffff880036032e00 [ 601.271945] ffff880036032de8 ffffffff00000000 ffffffff00000001 ffff880035893e20 [ 601.273954] ffffffff8162a570 ffff880078e17080 ffff880035893ea8 ffffffff8162d8eb [ 601.276000] Call Trace: [ 601.277007] [<ffffffff8162a570>] schedule+0x30/0x80 [ 601.278497] [<ffffffff8162d8eb>] rwsem_down_write_failed+0x1fb/0x350 [ 601.280240] [<ffffffff81322f64>] ? call_rwsem_down_read_failed+0x14/0x30 [ 601.282058] [<ffffffff81322f93>] call_rwsem_down_write_failed+0x13/0x20 [ 601.283872] [<ffffffff8162d05f>] ? 
down_write+0x1f/0x30 [ 601.285403] [<ffffffff81147abe>] vm_munmap+0x2e/0x60 [ 601.286924] [<ffffffff811489fd>] SyS_munmap+0x1d/0x30 [ 601.288435] [<ffffffff8162e9ee>] entry_SYSCALL_64_fastpath+0x12/0x71 [ 601.290184] a.out D ffff8800353b7df0 0 3425 3417 0x00000084 [ 601.292108] ffff8800353b7df0 ffff880078e10c80 ffff8800353b8000 ffff880036032de8 [ 601.294165] ffff880036032e00 ffff8800353b7f58 ffff880078e10c80 ffff8800353b7e08 [ 601.296206] ffffffff8162a570 ffff880078e10c80 ffff8800353b7e68 ffffffff8162d698 [ 601.298267] Call Trace: [ 601.299300] [<ffffffff8162a570>] schedule+0x30/0x80 [ 601.300755] [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150 [ 601.302437] [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30 [ 601.304221] [<ffffffff8162d032>] ? down_read+0x12/0x20 [ 601.305764] [<ffffffff810583f7>] __do_page_fault+0x357/0x400 [ 601.307389] [<ffffffff810584d0>] do_page_fault+0x30/0x80 [ 601.308968] [<ffffffff81630518>] page_fault+0x28/0x30 [ 601.310488] a.out D ffff88007cf87df0 0 3426 3417 0x00000084 [ 601.312380] ffff88007cf87df0 ffff880078e16400 ffff88007cf88000 ffff880036032de8 [ 601.314414] ffff880036032e00 ffff88007cf87f58 ffff880078e16400 ffff88007cf87e08 [ 601.316443] ffffffff8162a570 ffff880078e16400 ffff88007cf87e68 ffffffff8162d698 [ 601.318490] Call Trace: [ 601.319536] [<ffffffff8162a570>] schedule+0x30/0x80 [ 601.321036] [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150 [ 601.322763] [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30 [ 601.324504] [<ffffffff8162d032>] ? 
down_read+0x12/0x20 [ 601.326071] [<ffffffff810583f7>] __do_page_fault+0x357/0x400 [ 601.327715] [<ffffffff810584d0>] do_page_fault+0x30/0x80 [ 601.329287] [<ffffffff81630518>] page_fault+0x28/0x30 [ 601.330761] a.out D ffff8800792dfdf0 0 3427 3417 0x00000084 [ 601.332705] ffff8800792dfdf0 ffff880078e12580 ffff8800792e0000 ffff880036032de8 [ 601.334699] ffff880036032e00 ffff8800792dff58 ffff880078e12580 ffff8800792dfe08 [ 601.336750] ffffffff8162a570 ffff880078e12580 ffff8800792dfe68 ffffffff8162d698 [ 601.338794] Call Trace: [ 601.339781] [<ffffffff8162a570>] schedule+0x30/0x80 [ 601.341280] [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150 [ 601.343009] [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30 [ 601.344813] [<ffffffff8162d032>] ? down_read+0x12/0x20 [ 601.346361] [<ffffffff810583f7>] __do_page_fault+0x357/0x400 [ 601.347990] [<ffffffff810584d0>] do_page_fault+0x30/0x80 [ 601.349521] [<ffffffff81630518>] page_fault+0x28/0x30 [ 601.351044] a.out D ffff88007743faa8 0 3428 3417 0x00000084 [ 601.352942] ffff88007743faa8 ffff88007bda6400 ffff880077440000 ffff88007743fae0 [ 601.354990] ffff88007fccdfc0 00000001000484e5 0000000000000000 ffff88007743fac0 [ 601.357024] ffffffff8162a570 ffff88007fccdfc0 ffff88007743fb40 ffffffff8162dbed [ 601.359075] Call Trace: [ 601.360096] [<ffffffff8162a570>] schedule+0x30/0x80 [ 601.361540] [<ffffffff8162dbed>] schedule_timeout+0x11d/0x1c0 [ 601.363190] [<ffffffff810c7e00>] ? 
cascade+0x90/0x90 [ 601.364697] [<ffffffff8162dce9>] schedule_timeout_uninterruptible+0x19/0x20 [ 601.366574] [<ffffffff8111fc9d>] __alloc_pages_nodemask+0x84d/0xc30 [ 601.368332] [<ffffffff811609a7>] alloc_pages_current+0x87/0x110 [ 601.370002] [<ffffffff811166cf>] __page_cache_alloc+0xaf/0xc0 [ 601.371606] [<ffffffff81119225>] filemap_fault+0x1e5/0x420 [ 601.373203] [<ffffffff81244f39>] xfs_filemap_fault+0x39/0x60 [ 601.374798] [<ffffffff8113d5e7>] __do_fault+0x47/0xd0 [ 601.376315] [<ffffffff81142ec5>] handle_mm_fault+0xe15/0x1370 [ 601.377938] [<ffffffff81322f64>] ? call_rwsem_down_read_failed+0x14/0x30 [ 601.379707] [<ffffffff81058217>] __do_page_fault+0x177/0x400 [ 601.381320] [<ffffffff810584d0>] do_page_fault+0x30/0x80 [ 601.382831] [<ffffffff81630518>] page_fault+0x28/0x30 [ 601.384337] a.out R running task 0 3419 3417 0x00000080 [ 601.386257] 00000000f80745e8 ffff880034ab4400 ffff8800776d3f18 ffff8800776d3f18 [ 601.388287] 0000000000000080 0000000000000000 ffff8800776d3ec8 ffffffff81187e72 [ 601.390341] ffff880034ab4400 ffff880034ab4410 0000000000020000 0000000000000000 [ 601.392366] Call Trace: [ 601.393388] [<ffffffff81187e72>] ? __vfs_read+0x32/0xf0 [ 601.394952] [<ffffffff81290aa9>] ? security_file_permission+0xa9/0xc0 [ 601.396745] [<ffffffff8118858d>] ? rw_verify_area+0x4d/0xd0 [ 601.398359] [<ffffffff8118868a>] ? vfs_read+0x7a/0x120 [ 601.399897] [<ffffffff81189560>] ? SyS_pread64+0x90/0xb0 [ 601.401429] [<ffffffff8162e9ee>] ? entry_SYSCALL_64_fastpath+0x12/0x71
----------

I think I noticed three problems from this reproducer.

(1) While the likelihood of hitting the mmap_sem livelock depends on how frequently down_read(&mm->mmap_sem) tasks and down_write(&mm->mmap_sem) tasks contend on the OOM victim's mm, we can hit the mmap_sem livelock even with only one down_read(&mm->mmap_sem) task. On systems where processes are monitored via the /proc/pid/ interface, we can hit this mmap_sem livelock by chance.

(2) The OOM killer tries to kill a child process of the memory hog, but the child process is not always consuming a lot of memory. The memory unzapping kernel thread might not be able to reclaim enough memory unless we choose subsequent OOM victims when the first OOM victim task gets stuck in the mmap_sem livelock.

(3) I don't know the reason, but I can observe that (when many tasks have received SIGKILL from the OOM killer) many of the dying tasks participate in a memory allocation competition via page_fault() which cannot make forward progress, because dying tasks without TIF_MEMDIE are not allowed to access the memory reserves.

^ permalink raw reply	[flat|nested] 213+ messages in thread
* Silent hang up caused by pages being not scanned?
  2015-10-12  6:43 ` Tetsuo Handa
@ 2015-10-12 15:25 ` Tetsuo Handa
  1 sibling, 0 replies; 213+ messages in thread
From: Tetsuo Handa @ 2015-10-12 15:25 UTC (permalink / raw)
To: mhocko
Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov, linux-mm, linux-kernel, skozina

Tetsuo Handa wrote:
> Uptime between 101 and 300 is a silent hang up (i.e. no OOM killer messages,
> no SIGKILL pending tasks, no TIF_MEMDIE tasks) which I solved using SysRq-f
> at uptime = 289. I don't know the reason of this silent hang up, but the
> memory unzapping kernel thread will not help because there is no OOM victim.
>
> ----------
> [  101.438951] MemAlloc-Info: 10 stalling task, 0 dying task, 0 victim task.
> (...snipped...)
> [  111.817922] MemAlloc-Info: 12 stalling task, 0 dying task, 0 victim task.
> (...snipped...)
> [  122.281828] MemAlloc-Info: 13 stalling task, 0 dying task, 0 victim task.
> (...snipped...)
> [  132.793724] MemAlloc-Info: 14 stalling task, 0 dying task, 0 victim task.
> (...snipped...)
> [  143.336154] MemAlloc-Info: 16 stalling task, 0 dying task, 0 victim task.
> (...snipped...)
> [  289.343187] sysrq: SysRq : Manual OOM execution
> (...snipped...)
> [  292.065650] MemAlloc-Info: 16 stalling task, 0 dying task, 0 victim task.
> (...snipped...)
> [  302.590736] kworker/3:2 invoked oom-killer: gfp_mask=0x24000c0, order=-1, oom_score_adj=0
> (...snipped...)
> [  302.690047] MemAlloc-Info: 4 stalling task, 0 dying task, 0 victim task.
> ----------

I examined this hang up using an additional debug printk() patch. It showed that when this silent hang up occurs, zone_reclaimable(), called from shrink_zones() during a __GFP_FS memory allocation request, keeps returning true forever. Since the __GFP_FS memory allocation request can then never call out_of_memory() due to did_some_progress > 0, the system silently hangs up with 100% CPU usage.
---------- diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 0473eec..fda0bb5 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2821,6 +2821,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, } #endif /* CONFIG_COMPACTION */ +pid_t dump_target_pid; + /* Perform direct synchronous page reclaim */ static int __perform_reclaim(gfp_t gfp_mask, unsigned int order, @@ -2847,6 +2849,9 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order, cond_resched(); + if (dump_target_pid == current->pid) + printk(KERN_INFO "__perform_reclaim returned %u at line %u\n", + progress, __LINE__); return progress; } @@ -3007,6 +3012,7 @@ static int malloc_watchdog(void *unused) unsigned int memdie_pending; unsigned int stalling_tasks; u8 index; + pid_t pid; not_stalling: /* Healty case. */ /* @@ -3025,12 +3031,16 @@ static int malloc_watchdog(void *unused) * and stop_memalloc_timer() within timeout duration. */ if (likely(!memalloc_counter[index])) + { + dump_target_pid = 0; goto not_stalling; + } maybe_stalling: /* Maybe something is wrong. Let's check. */ /* First, report whether there are SIGKILL tasks and/or OOM victims. */ sigkill_pending = 0; memdie_pending = 0; stalling_tasks = 0; + pid = 0; preempt_disable(); rcu_read_lock(); for_each_process_thread(g, p) { @@ -3062,8 +3072,11 @@ static int malloc_watchdog(void *unused) (fatal_signal_pending(p) ? "-dying" : ""), p->comm, p->pid, m->gfp, m->order, spent); show_stack(p, NULL); + if (!pid && (m->gfp & __GFP_FS)) + pid = p->pid; } spin_unlock(&memalloc_list_lock); + dump_target_pid = -pid; /* Wait until next timeout duration. 
*/ schedule_timeout_interruptible(timeout); if (memalloc_counter[index]) @@ -3155,6 +3168,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, goto nopage; retry: + if (dump_target_pid == -current->pid) + dump_target_pid = -dump_target_pid; + if (gfp_mask & __GFP_KSWAPD_RECLAIM) wake_all_kswapds(order, ac); @@ -3280,6 +3296,11 @@ retry: goto noretry; /* Keep reclaiming pages as long as there is reasonable progress */ + if (dump_target_pid == current->pid) { + printk(KERN_INFO "did_some_progress=%lu at line %u\n", + did_some_progress, __LINE__); + dump_target_pid = 0; + } pages_reclaimed += did_some_progress; if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) || ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) { diff --git a/mm/vmscan.c b/mm/vmscan.c index 27d580b..cb0c22e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2527,6 +2527,8 @@ static inline bool compaction_ready(struct zone *zone, int order) return watermark_ok; } +extern pid_t dump_target_pid; + /* * This is the direct reclaim path, for page-allocating processes. 
We only * try to reclaim pages from zones which will satisfy the caller's allocation @@ -2619,16 +2621,41 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc) sc->nr_reclaimed += nr_soft_reclaimed; sc->nr_scanned += nr_soft_scanned; if (nr_soft_reclaimed) + { + if (dump_target_pid == current->pid) + printk(KERN_INFO "nr_soft_reclaimed=%lu at line %u\n", + nr_soft_reclaimed, __LINE__); reclaimable = true; + } /* need some check for avoid more shrink_zone() */ } if (shrink_zone(zone, sc, zone_idx(zone) == classzone_idx)) + { + if (dump_target_pid == current->pid) + printk(KERN_INFO "shrink_zone returned 1 at line %u\n", + __LINE__); reclaimable = true; + } if (global_reclaim(sc) && !reclaimable && zone_reclaimable(zone)) + { + if (dump_target_pid == current->pid) { + printk(KERN_INFO "zone_reclaimable returned 1 at line %u\n", + __LINE__); + printk(KERN_INFO "(ACTIVE_FILE=%lu+INACTIVE_FILE=%lu", + zone_page_state(zone, NR_ACTIVE_FILE), + zone_page_state(zone, NR_INACTIVE_FILE)); + if (get_nr_swap_pages() > 0) + printk(KERN_CONT "+ACTIVE_ANON=%lu+INACTIVE_ANON=%lu", + zone_page_state(zone, NR_ACTIVE_ANON), + zone_page_state(zone, NR_INACTIVE_ANON)); + printk(KERN_CONT ") * 6 > PAGES_SCANNED=%lu\n", + zone_page_state(zone, NR_PAGES_SCANNED)); + } reclaimable = true; + } } /* @@ -2674,6 +2701,9 @@ retry: sc->priority); sc->nr_scanned = 0; zones_reclaimable = shrink_zones(zonelist, sc); + if (dump_target_pid == current->pid) + printk(KERN_INFO "shrink_zones returned %u at line %u\n", + zones_reclaimable, __LINE__); total_scanned += sc->nr_scanned; if (sc->nr_reclaimed >= sc->nr_to_reclaim) @@ -2707,11 +2737,21 @@ retry: delayacct_freepages_end(); if (sc->nr_reclaimed) + { + if (dump_target_pid == current->pid) + printk(KERN_INFO "sc->nr_reclaimed=%lu at line %u\n", + sc->nr_reclaimed, __LINE__); return sc->nr_reclaimed; + } /* Aborted reclaim to try compaction? 
don't OOM, then */
 	if (sc->compaction_ready)
+	{
+		if (dump_target_pid == current->pid)
+			printk(KERN_INFO "sc->compaction_ready=%u at line %u\n",
+			       sc->compaction_ready, __LINE__);
 		return 1;
+	}
 
 	/* Untapped cgroup reserves? Don't OOM, retry. */
 	if (!sc->may_thrash) {
@@ -2720,6 +2760,9 @@ retry:
 		goto retry;
 	}
 
+	if (dump_target_pid == current->pid)
+		printk(KERN_INFO "zones_reclaimable=%u at line %u\n",
+		       zones_reclaimable, __LINE__);
 	/* Any of the zones still reclaimable? Don't OOM. */
 	if (zones_reclaimable)
 		return 1;
@@ -2875,7 +2918,12 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	 * point.
 	 */
 	if (throttle_direct_reclaim(gfp_mask, zonelist, nodemask))
+	{
+		if (dump_target_pid == current->pid)
+			printk(KERN_INFO "throttle_direct_reclaim returned 1 at line %u\n",
+			       __LINE__);
 		return 1;
+	}
 
 	trace_mm_vmscan_direct_reclaim_begin(order,
 				sc.may_writepage,
@@ -2885,6 +2933,9 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 
 	trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
 
+	if (dump_target_pid == current->pid)
+		printk(KERN_INFO "do_try_to_free_pages returned %lu at line %u\n",
+		       nr_reclaimed, __LINE__);
 	return nr_reclaimed;
 }
----------

What is strange is that the values printed by this debug printk() patch did not
change as time went by. Thus, I think this is not a problem of insufficient CPU
time for scanning pages. I suspect there is a bug where nobody is actually
scanning pages.

----------
[ 66.821450] zone_reclaimable returned 1 at line 2646
[ 66.823020] (ACTIVE_FILE=26+INACTIVE_FILE=10) * 6 > PAGES_SCANNED=32
[ 66.824935] shrink_zones returned 1 at line 2706
[ 66.826392] zones_reclaimable=1 at line 2765
[ 66.827865] do_try_to_free_pages returned 1 at line 2938
[ 67.102322] __perform_reclaim returned 1 at line 2854
[ 67.103968] did_some_progress=1 at line 3301
(...snipped...)
[ 281.439977] zone_reclaimable returned 1 at line 2646
[ 281.439977] (ACTIVE_FILE=26+INACTIVE_FILE=10) * 6 > PAGES_SCANNED=32
[ 281.439978] shrink_zones returned 1 at line 2706
[ 281.439978] zones_reclaimable=1 at line 2765
[ 281.439979] do_try_to_free_pages returned 1 at line 2938
[ 281.439979] __perform_reclaim returned 1 at line 2854
[ 281.439980] did_some_progress=1 at line 3301
----------

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20151013.txt.xz

^ permalink raw reply related	[flat|nested] 213+ messages in thread
* Re: Silent hang up caused by pages being not scanned?
  2015-10-12 15:25 ` Tetsuo Handa
@ 2015-10-12 21:23   ` Linus Torvalds
  -1 siblings, 0 replies; 213+ messages in thread
From: Linus Torvalds @ 2015-10-12 21:23 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Michal Hocko, David Rientjes, Oleg Nesterov, Kyle Walker,
	Christoph Lameter, Andrew Morton, Johannes Weiner,
	Vladimir Davydov, linux-mm, Linux Kernel Mailing List,
	Stanislav Kozina

On Mon, Oct 12, 2015 at 8:25 AM, Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> I examined this hang up using an additional debug printk() patch. And it was
> observed that when this silent hang up occurs, zone_reclaimable() called from
> shrink_zones() called from a __GFP_FS memory allocation request is returning
> true forever. Since the __GFP_FS memory allocation request can never call
> out_of_memory() due to did_some_progress > 0, the system will silently hang up
> with 100% CPU usage.

I wouldn't blame the zones_reclaimable() logic itself, but yeah, that looks bad.

So the do_try_to_free_pages() logic that does that

    /* Any of the zones still reclaimable? Don't OOM. */
    if (zones_reclaimable)
        return 1;

is rather dubious. The history of that odd line is pretty dubious too:
it used to be that we would return success if "shrink_zones()"
succeeded or if "nr_reclaimed" was non-zero, but that "shrink_zones()"
logic got rewritten, and I don't think the current situation is all
that sane.

And returning 1 there is actively misleading to callers, since it
makes them think that it made progress.

So I think you should look at what happens if you just remove that
illogical and misleading return value.

HOWEVER.

I think that it's very true that we have then tuned all our *other*
heuristics for taking this thing into account, so I suspect that we'll
find that we'll need to tweak other places. But this crazy "let's say
that we made progress even when we didn't" thing looks just wrong.
In particular, I think that you'll find that you will have to change
the heuristics in __alloc_pages_slowpath() where we currently do

    if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) || ..

when the "did_some_progress" logic changes that radically. Because
while the current return value looks insane, all the other testing and
tweaking has been done with that very odd return value in place.

              Linus

^ permalink raw reply	[flat|nested] 213+ messages in thread
* Re: Silent hang up caused by pages being not scanned?
  2015-10-12 21:23   ` Linus Torvalds
@ 2015-10-13 12:21     ` Tetsuo Handa
  -1 siblings, 0 replies; 213+ messages in thread
From: Tetsuo Handa @ 2015-10-13 12:21 UTC (permalink / raw)
  To: torvalds
  Cc: mhocko, rientjes, oleg, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

Linus Torvalds wrote:
> On Mon, Oct 12, 2015 at 8:25 AM, Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
> >
> > I examined this hang up using an additional debug printk() patch. And it was
> > observed that when this silent hang up occurs, zone_reclaimable() called from
> > shrink_zones() called from a __GFP_FS memory allocation request is returning
> > true forever. Since the __GFP_FS memory allocation request can never call
> > out_of_memory() due to did_some_progress > 0, the system will silently hang up
> > with 100% CPU usage.
>
> I wouldn't blame the zones_reclaimable() logic itself, but yeah, that looks bad.
>

I compared "hang up after the OOM killer is invoked" and "hang up before
the OOM killer is invoked" by always printing the values.

 			}
 			reclaimable = true;
 		}
+		else if (dump_target_pid == current->pid) {
+			printk(KERN_INFO "(ACTIVE_FILE=%lu+INACTIVE_FILE=%lu",
+			       zone_page_state(zone, NR_ACTIVE_FILE),
+			       zone_page_state(zone, NR_INACTIVE_FILE));
+			if (get_nr_swap_pages() > 0)
+				printk(KERN_CONT "+ACTIVE_ANON=%lu+INACTIVE_ANON=%lu",
+				       zone_page_state(zone, NR_ACTIVE_ANON),
+				       zone_page_state(zone, NR_INACTIVE_ANON));
+			printk(KERN_CONT ") * 6 > PAGES_SCANNED=%lu\n",
+			       zone_page_state(zone, NR_PAGES_SCANNED));
+		}
 	}

 	/*

For the former case, most of the trials showed that
(ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 . Sometimes
PAGES_SCANNED > 0 (as grep'ed below), but ACTIVE_FILE and INACTIVE_FILE
seem to be always 0.
---------- [ 195.905057] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2 [ 195.927430] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2 [ 206.317088] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2 [ 206.338007] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2 [ 216.723776] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2 [ 216.744618] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2 [ 227.129653] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2 [ 227.151238] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2 [ 237.650232] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2 [ 237.671343] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2 [ 277.980310] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=9 [ 278.001481] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=9 [ 288.339220] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=9 [ 288.361908] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=9 [ 298.682988] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=9 [ 298.704055] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=9 [ 350.368952] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 350.389770] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 360.724821] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 360.746100] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 845.231887] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=27 [ 845.233770] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 845.253196] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=27 [ 845.254910] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 1397.628073] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2 [ 1397.649165] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2 [ 1408.207041] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2 [ 1408.228762] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2 ---------- For the latter case, most of output showed that ACTIVE_FILE + 
INACTIVE_FILE > 0. ---------- [ 142.647201] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 142.648883] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 142.842868] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 142.955817] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 143.086363] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 143.231120] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 143.359238] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 143.473342] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 143.618103] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 143.746210] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 143.908162] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 144.035415] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 144.161926] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 144.306435] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 144.434265] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 144.436099] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 144.643374] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 144.773239] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 144.902309] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 145.046154] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 145.185410] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 145.317218] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 145.460304] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 145.654212] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 145.817362] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 145.945136] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 146.086303] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 146.242127] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 153.489868] 
(ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 153.491593] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 153.674246] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 153.839478] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 154.003234] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5 [ 154.155085] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 154.322187] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 154.447355] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 154.653150] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 154.782216] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 154.939439] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 155.105921] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 155.278386] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 155.440832] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 155.623970] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 155.625766] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 155.831074] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 155.996903] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 156.139137] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 156.318492] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 156.484300] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 156.667411] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 156.817246] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 157.012323] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 157.159483] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 157.323193] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 157.488399] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 157.654198] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 164.339172] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 164.340896] 
(ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 164.583026] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 164.797386] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 164.965110] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 165.124935] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 165.431304] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 165.700317] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 165.862071] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 166.029257] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 166.198312] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 166.356224] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 166.559302] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 166.684486] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 166.898551] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 166.900496] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 167.175960] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 167.324390] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 167.526150] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 167.693365] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 167.878407] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 168.061503] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 168.225306] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 168.416398] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 168.617395] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 168.783201] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 168.989053] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 169.196126] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 175.361136] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 175.362865] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 175.626817] 
(ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 175.797361] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 176.006389] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 176.211479] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 176.433890] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 176.630951] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 176.855509] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 177.049814] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 177.258218] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 177.455404] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 177.665085] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 177.874173] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 178.057217] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 178.059056] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 178.350935] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 178.559404] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 178.782483] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 178.982803] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 179.203930] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 179.428321] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 179.611349] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 179.851164] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 180.034220] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 180.279197] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 180.455284] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 180.811445] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 186.368405] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 186.370115] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 186.614733] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 186.845695] 
(ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 187.024274] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 187.211389] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 187.427147] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 187.552333] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 187.734117] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 187.935811] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 188.138296] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 188.354041] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 188.559245] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 188.641776] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 188.716434] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 188.718199] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 189.015952] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 189.218976] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 189.440131] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 189.659238] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 189.882360] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 190.087342] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 190.314442] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 190.408926] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 190.631240] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 190.850326] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 191.067488] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 [ 191.283243] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
----------

So, something is preventing ACTIVE_FILE and INACTIVE_FILE from becoming 0?
I also tried the change below, but the result was the same. Therefore, this
problem seems to be independent of "!__GFP_FS allocations do not fail".
(Complete log with the change below (uptime > 101) is at
http://I-love.SAKURA.ne.jp/tmp/serial-20151013-2.txt.xz . )
----------
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2736,7 +2736,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 			 * and the OOM killer can't be invoked, but
 			 * keep looping as per tradition.
 			 */
-			*did_some_progress = 1;
 			goto out;
 		}
 		if (pm_suspended_storage())
----------
----------
[ 102.719555] (ACTIVE_FILE=3+INACTIVE_FILE=3) * 6 > PAGES_SCANNED=19
[ 102.721234] (ACTIVE_FILE=1+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[ 102.722908] shrink_zones returned 1 at line 2717
----------

> So the do_try_to_free_pages() logic that does that
>
>     /* Any of the zones still reclaimable? Don't OOM. */
>     if (zones_reclaimable)
>         return 1;
>
> is rather dubious. The history of that odd line is pretty dubious too:
> it used to be that we would return success if "shrink_zones()"
> succeeded or if "nr_reclaimed" was non-zero, but that "shrink_zones()"
> logic got rewritten, and I don't think the current situation is all
> that sane.
>
> And returning 1 there is actively misleading to callers, since it
> makes them think that it made progress.
>
> So I think you should look at what happens if you just remove that
> illogical and misleading return value.
>

If I remove

    /* Any of the zones still reclaimable? Don't OOM. */
    if (zones_reclaimable)
        return 1;

the OOM killer is invoked even when there is plenty of memory that can be
reclaimed after being written to disk. This is definitely premature
invocation of the OOM killer.
$ cat < /dev/zero > /tmp/log & sleep 10; ./a.out
---------- When there is a lot of data to write ----------
[ 489.952827] Mem-Info:
[ 489.953840] active_anon:328227 inactive_anon:3033 isolated_anon:26
[ 489.953840]  active_file:2309 inactive_file:80915 isolated_file:0
[ 489.953840]  unevictable:0 dirty:53 writeback:80874 unstable:0
[ 489.953840]  slab_reclaimable:4975 slab_unreclaimable:4256
[ 489.953840]  mapped:2973 shmem:4192 pagetables:1939 bounce:0
[ 489.953840]  free:12963 free_pcp:60 free_cma:0
[ 489.963395] Node 0 DMA free:7300kB min:400kB low:500kB high:600kB active_anon:5728kB inactive_anon:88kB active_file:140kB inactive_file:1276kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:1300kB mapped:140kB shmem:160kB slab_reclaimable:256kB slab_unreclaimable:180kB kernel_stack:64kB pagetables:180kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9768 all_unreclaimable? yes
[ 489.974035] lowmem_reserve[]: 0 1729 1729 1729
[ 489.975813] Node 0 DMA32 free:44552kB min:44652kB low:55812kB high:66976kB active_anon:1307180kB inactive_anon:12044kB active_file:9096kB inactive_file:322384kB unevictable:0kB isolated(anon):104kB isolated(file):0kB present:2080640kB managed:1774264kB mlocked:0kB dirty:216kB writeback:322196kB mapped:11752kB shmem:16608kB slab_reclaimable:19644kB slab_unreclaimable:16844kB kernel_stack:3584kB pagetables:7576kB unstable:0kB bounce:0kB free_pcp:240kB local_pcp:120kB free_cma:0kB writeback_tmp:0kB pages_scanned:2419896 all_unreclaimable? yes
[ 489.988452] lowmem_reserve[]: 0 0 0 0
[ 489.990043] Node 0 DMA: 2*4kB (UE) 1*8kB (M) 4*16kB (UME) 1*32kB (E) 2*64kB (UE) 3*128kB (UME) 2*256kB (UM) 2*512kB (ME) 1*1024kB (E) 2*2048kB (ME) 0*4096kB = 7280kB
[ 489.995142] Node 0 DMA32: 578*4kB (UME) 726*8kB (UE) 447*16kB (UE) 253*32kB (UME) 155*64kB (UME) 42*128kB (UME) 3*256kB (UME) 2*512kB (UM) 4*1024kB (U) 0*2048kB 0*4096kB = 44552kB
[ 490.000511] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 490.002914] 87434 total pagecache pages
[ 490.004612] 0 pages in swap cache
[ 490.006138] Swap cache stats: add 0, delete 0, find 0/0
[ 490.007976] Free swap = 0kB
[ 490.009329] Total swap = 0kB
[ 490.011033] 524157 pages RAM
[ 490.012352] 0 pages HighMem/MovableOnly
[ 490.013903] 76615 pages reserved
[ 490.015260] 0 pages hwpoisoned
---------- When there is a lot of data to write ----------

$ ./a.out
---------- When there is no data to write ----------
[ 792.359024] Mem-Info:
[ 792.360001] active_anon:413751 inactive_anon:6226 isolated_anon:0
[ 792.360001]  active_file:0 inactive_file:0 isolated_file:0
[ 792.360001]  unevictable:0 dirty:0 writeback:0 unstable:0
[ 792.360001]  slab_reclaimable:1243 slab_unreclaimable:3638
[ 792.360001]  mapped:104 shmem:6236 pagetables:1033 bounce:0
[ 792.360001]  free:12965 free_pcp:126 free_cma:0
[ 792.368559] Node 0 DMA free:7292kB min:400kB low:500kB high:600kB active_anon:7040kB inactive_anon:160kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:160kB slab_reclaimable:24kB slab_unreclaimable:172kB kernel_stack:64kB pagetables:460kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:8 all_unreclaimable? yes
[ 792.378240] lowmem_reserve[]: 0 1729 1729 1729
[ 792.379834] Node 0 DMA32 free:44568kB min:44652kB low:55812kB high:66976kB active_anon:1647964kB inactive_anon:24744kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1774264kB mlocked:0kB dirty:0kB writeback:0kB mapped:416kB shmem:24784kB slab_reclaimable:4948kB slab_unreclaimable:14380kB kernel_stack:3104kB pagetables:3672kB unstable:0kB bounce:0kB free_pcp:504kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:8 all_unreclaimable? yes
[ 792.390085] lowmem_reserve[]: 0 0 0 0
[ 792.391643] Node 0 DMA: 3*4kB (UE) 0*8kB 3*16kB (UE) 24*32kB (ME) 11*64kB (UME) 5*128kB (UM) 2*256kB (ME) 3*512kB (ME) 1*1024kB (E) 1*2048kB (E) 0*4096kB = 7292kB
[ 792.396201] Node 0 DMA32: 242*4kB (UME) 386*8kB (UME) 397*16kB (UME) 199*32kB (UE) 105*64kB (UME) 37*128kB (UME) 24*256kB (UME) 20*512kB (UME) 0*1024kB 0*2048kB 0*4096kB = 44616kB
[ 792.401136] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 792.403356] 6250 total pagecache pages
[ 792.404803] 0 pages in swap cache
[ 792.406208] Swap cache stats: add 0, delete 0, find 0/0
[ 792.407896] Free swap = 0kB
[ 792.409172] Total swap = 0kB
[ 792.410460] 524157 pages RAM
[ 792.411752] 0 pages HighMem/MovableOnly
[ 792.413106] 76615 pages reserved
[ 792.414493] 0 pages hwpoisoned
---------- When there is no data to write ----------

> HOWEVER.
>
> I think that it's very true that we have then tuned all our *other*
> heuristics for taking this thing into account, so I suspect that we'll
> find that we'll need to tweak other places. But this crazy "let's say
> that we made progress even when we didn't" thing looks just wrong.
>
> In particular, I think that you'll find that you will have to change
> the heuristics in __alloc_pages_slowpath() where we currently do
>
>         if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) || ..
>
> when the "did_some_progress" logic changes that radically.
>

Yes. But we can't simply do

	if (order <= PAGE_ALLOC_COSTLY_ORDER || ..

because we won't be able to call out_of_memory(), can we?

> Because while the current return value looks insane, all the other
> testing and tweaking has been done with that very odd return value in
> place.
>
>                  Linus
>

Well, did I encounter a difficult-to-fix problem?

^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: Silent hang up caused by pages being not scanned?
2015-10-13 12:21 ` Tetsuo Handa
@ 2015-10-13 16:37 ` Linus Torvalds
-1 siblings, 0 replies; 213+ messages in thread
From: Linus Torvalds @ 2015-10-13 16:37 UTC (permalink / raw)
To: Tetsuo Handa
Cc: Michal Hocko, David Rientjes, Oleg Nesterov, Kyle Walker, Christoph Lameter, Andrew Morton, Johannes Weiner, Vladimir Davydov, linux-mm, Linux Kernel Mailing List, Stanislav Kozina

On Tue, Oct 13, 2015 at 5:21 AM, Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> If I remove
>
>         /* Any of the zones still reclaimable? Don't OOM. */
>         if (zones_reclaimable)
>                 return 1;
>
> the OOM killer is invoked even when there is plenty of memory that can be
> reclaimed once it has been written to disk. This is definitely a premature
> invocation of the OOM killer.

Right. The rest of the code knows that the return value right now
means "there is no memory at all" rather than "I made progress".

> Yes. But we can't simply do
>
>         if (order <= PAGE_ALLOC_COSTLY_ORDER || ..
>
> because we won't be able to call out_of_memory(), can we?

So I think that whole thing is kind of senseless. Not just that
particular conditional, but what it *does* too.

What can easily happen is that we are a blocking allocation, but
because we're __GFP_FS or something, the code doesn't actually start
writing anything out. Nor is anything congested. So the thing just
loops.

And looping is stupid, because we may not be able to actually free
anything exactly because of limitations like __GFP_FS.

So

 (a) the looping condition is senseless

 (b) what we do when looping is senseless

and we actually do try to wake up kswapd in the loop, but we never
*wait* for it, so that's largely pointless too.

So *of*course* the direct reclaim code has to set "I made progress",
because if it doesn't lie and say so, then the code will randomly not
loop, and will oom, and things go to hell.

But I hate the "let's tweak the zone_reclaimable" idea, because it
doesn't actually fix anything. It just perpetuates this "the code
doesn't make sense, so let's add *more* senseless heuristics to this
whole loop".

So instead of that senseless thing, how about trying something
*sensible*? Make the code do something that we can actually explain as
making sense.

I'd suggest something like:

 - add a "retry count"

 - if direct reclaim made no progress, or made less progress than the target:

        if (order > PAGE_ALLOC_COSTLY_ORDER)
                goto noretry;

 - regardless of whether we made progress or not:

        if (retry count < X)
                goto retry;

        if (retry count < 2*X)
                yield/sleep 10ms/wait-for-kswapd and then goto retry

where 'X' is something sane that limits our CPU use, but also
guarantees that we don't end up waiting *too* long (if a single
allocation takes more than a big fraction of a second, we should
probably stop trying).

The whole time-based thing might even be explicit. There's nothing
wrong with doing something like

        unsigned long timeout = jiffies + HZ/4;

at the top of the function, and making the whole retry logic actually
say something like

        if (time_after(timeout, jiffies))
                goto noretry;

(or make *that* trigger the oom logic, or whatever).

Now, I realize the above suggestions are big changes, and they'll
likely break things and we'll still need to tweak things, but dammit,
wouldn't that be better than just randomly tweaking the insane
zone_reclaimable logic?

                 Linus

^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: Silent hang up caused by pages being not scanned?
  2015-10-13 16:37 ` Linus Torvalds
@ 2015-10-14 12:21 ` Tetsuo Handa
  0 siblings, 0 replies; 213+ messages in thread
From: Tetsuo Handa @ 2015-10-14 12:21 UTC (permalink / raw)
  To: torvalds
  Cc: mhocko, rientjes, oleg, kwalker, cl, akpm, hannes, vdavydov, linux-mm, linux-kernel, skozina

Linus Torvalds wrote:
> On Tue, Oct 13, 2015 at 5:21 AM, Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
> >
> > If I remove
> >
> >	/* Any of the zones still reclaimable? Don't OOM. */
> >	if (zones_reclaimable)
> >		return 1;
> >
> > the OOM killer is invoked even when there are so much memory which can be
> > reclaimed after written to disk. This is definitely premature invocation of
> > the OOM killer.
>
> Right. The rest of the code knows that the return value right now
> means "there is no memory at all" rather than "I made progress".
>
> > Yes. But we can't simply do
> >
> >	if (order <= PAGE_ALLOC_COSTLY_ORDER || ..
> >
> > because we won't be able to call out_of_memory(), can we?
>
> So I think that whole thing is kind of senseless. Not just that
> particular conditional, but what it *does* too.
>
> What can easily happen is that we are a blocking allocation, but
> because we're __GFP_FS or something, the code doesn't actually start
> writing anything out. Nor is anything congested. So the thing just
> loops.

congestion_wait() sounds like a source of silent hang up.
http://lkml.kernel.org/r/201406052145.CIB35534.OQLVMSJFOHtFOF@I-love.SAKURA.ne.jp

> And looping is stupid, because we may be not able to actually free
> anything exactly because of limitations like __GFP_FS.
>
> So
>
>  (a) the looping condition is senseless
>
>  (b) what we do when looping is senseless
>
> and we actually do try to wake up kswapd in the loop, but we never
> *wait* for it, so that's largely pointless too.

Aren't we waiting for kswapd forever? In other words, we never check whether kswapd can make some progress.
http://lkml.kernel.org/r/20150812091104.GA14940@dhcp22.suse.cz

> So *of*course* the direct reclaim code has to set "I made progress",
> because if it doesn't lie and say so, then the code will randomly not
> loop, and will oom, and things go to hell.
>
> But I hate the "let's tweak the zone_reclaimable" idea, because it
> doesn't actually fix anything. It just perpetuates this "the code
> doesn't make sense, so let's add *more* senseless heuristics to this
> whole loop".

I also don't think that tweaking the current reclaim logic solves the bugs which bothered me via unexplained hangups / reboots. To me, the current memory allocator is so puzzling that it is as if

  if (there_is_much_free_memory() == TRUE) goto OK;
  if (do_some_heuristic1() == SUCCESS) goto OK;
  if (do_some_heuristic2() == SUCCESS) goto OK;
  if (do_some_heuristic3() == SUCCESS) goto OK;
  (...snipped...)
  if (do_some_heuristicN() == SUCCESS) goto OK;
  while (1);

and we don't know how many heuristics we need to add in order to avoid reaching the "while (1);". (We are reaching the "while (1);" before "if (out_of_memory() == SUCCESS) goto OK;" is called.)

> So instead of that senseless thing, how about trying something
> *sensible*. Make the code do something that we can actually explain as
> making sense.
>
> I'd suggest something like:
>
>  - add a "retry count"
>
>  - if direct reclaim made no progress, or made less progress than the target:
>
>        if (order > PAGE_ALLOC_COSTLY_ORDER) goto noretry;

Yes.

>  - regardless of whether we made progress or not:
>
>        if (retry count < X) goto retry;
>
>        if (retry count < 2*X) yield/sleep 10ms/wait-for-kswapd and then goto retry

I tried sleeping to reduce CPU usage and to allow reporting via SysRq-w.
http://lkml.kernel.org/r/201411231353.BDE90173.FQOMJtHOLVFOFS@I-love.SAKURA.ne.jp

I complained at http://lkml.kernel.org/r/201502162023.GGE26089.tJOOFQMFFHLOVS@I-love.SAKURA.ne.jp

| Oh, why every thread trying to allocate memory has to repeat
| the loop that might defer somebody who can make progress if CPU time was
| given? I wish only somebody like kswapd repeats the loop on behalf of all
| threads waiting at memory allocation slowpath...

Direct reclaim can defer termination upon SIGKILL if blocked at an unkillable lock. If performance were not a problem, would direct reclaim be mandatory? Of course, performance is the problem. Thus we would try direct reclaim at least once. But I wish the memory allocation logic were as simple as

  (1) If there is enough free memory, allocate it.

  (2) If there is not enough free memory, join the waitqueue list via
      wait_event_timeout(waiter, memory_reclaimed, timeout) and wait for
      reclaiming kernel threads (e.g. kswapd) to wake the waiters up. If the
      caller is willing to give up upon SIGKILL (e.g. __GFP_KILLABLE), then
      wait_event_killable_timeout(waiter, memory_reclaimed, timeout) and
      return NULL upon SIGKILL.

  (3) Whenever reclaiming kernel threads reclaimed memory and there are
      waiters, wake the waiters up.

  (4) If reclaiming kernel threads cannot reclaim memory, the caller will wake
      up due to timeout, and invoke the OOM killer unless the caller does not
      want it (e.g. __GFP_NO_OOMKILL).

> where 'X' is something sane that limits our CPU use, but also
> guarantees that we don't end up waiting *too* long (if a single
> allocation takes more than a big fraction of a second, we should
> probably stop trying).

Isn't a second too short for waiting for swapping / writeback?

> The whole time-based thing might even be explicit. There's nothing
> wrong with doing something like
>
>     unsigned long timeout = jiffies + HZ/4;
>
> at the top of the function, and making the whole retry logic actually
> say something like
>
>     if (time_after(jiffies, timeout)) goto noretry;
>
> (or make *that* trigger the oom logic, or whatever).

I prefer the time-based approach, because my customer's usage (where the watchdog timeout is configured to 10 seconds) requires kernel messages (maybe OOM killer messages) to be printed within a few seconds.

> Now, I realize the above suggestions are big changes, and they'll
> likely break things and we'll still need to tweak things, but dammit,
> wouldn't that be better than just randomly tweaking the insane
> zone_reclaimable logic?
>
>                      Linus

Yes, these will be big changes. But they will be better than living with "no means for understanding what was happening are available" vs. "really interesting things are observed if means are available".

^ permalink raw reply [flat|nested] 213+ messages in thread
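The waitqueue-based scheme in steps (1)-(4) can be modeled in user space. Everything below (`struct mm_model`, `reclaim_tick()`, the batch size, the tick-based timeout) is a hypothetical, single-threaded stand-in for kswapd and the waitqueue, not an implementation of the proposal:

```c
#include <assert.h>
#include <stdbool.h>

enum alloc_result { ALLOC_OK, ALLOC_OOM };

struct mm_model {
    long free_pages;
    long reclaimable_pages;     /* what "kswapd" can still free */
};

/* (3) one round of background reclaim: move pages to the free list.
 * In the real proposal this is where the waitqueue would be woken. */
static void reclaim_tick(struct mm_model *mm, long batch)
{
    long n = mm->reclaimable_pages < batch ? mm->reclaimable_pages : batch;
    mm->reclaimable_pages -= n;
    mm->free_pages += n;
}

/* (1), (2), (4): allocate, waiting up to 'timeout' reclaim ticks;
 * each tick stands in for one wait_event_timeout() round. */
static enum alloc_result alloc_pages_model(struct mm_model *mm,
                                           long nr, int timeout)
{
    for (int tick = 0; ; tick++) {
        if (mm->free_pages >= nr) {     /* (1) enough free memory */
            mm->free_pages -= nr;
            return ALLOC_OK;
        }
        if (tick >= timeout)            /* (4) reclaim made no progress */
            return ALLOC_OOM;           /* caller would invoke OOM killer */
        reclaim_tick(mm, 32);           /* (2) wait for background reclaim */
    }
}
```

The design point being modeled: the allocating "thread" never reclaims on its own, so it can always be woken (or killed) while waiting, and the OOM decision falls out of a plain timeout instead of a reclaim-progress heuristic.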
* Re: Silent hang up caused by pages being not scanned?
  2015-10-13 16:37 ` Linus Torvalds
@ 2015-10-15 13:14 ` Michal Hocko
  0 siblings, 0 replies; 213+ messages in thread
From: Michal Hocko @ 2015-10-15 13:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Kyle Walker, Christoph Lameter, Andrew Morton, Johannes Weiner, Vladimir Davydov, linux-mm, Linux Kernel Mailing List, Stanislav Kozina, Mel Gorman, Rik van Riel

[CC Mel and Rik as well - this has diverged from the original thread considerably but the current topic started here: http://lkml.kernel.org/r/201510130025.EJF21331.FFOQJtVOMLFHSO%40I-love.SAKURA.ne.jp ]

On Tue 13-10-15 09:37:06, Linus Torvalds wrote:
> So instead of that senseless thing, how about trying something
> *sensible*. Make the code do something that we can actually explain as
> making sense.

I do agree that zone_reclaimable is a subtle and hackish way to wait for the writeback/kswapd to clean up pages which cannot be reclaimed from the direct reclaim.

> I'd suggest something like:
>
>  - add a "retry count"
>
>  - if direct reclaim made no progress, or made less progress than the target:
>
>        if (order > PAGE_ALLOC_COSTLY_ORDER) goto noretry;
>
>  - regardless of whether we made progress or not:
>
>        if (retry count < X) goto retry;
>
>        if (retry count < 2*X) yield/sleep 10ms/wait-for-kswapd and then goto retry

This will certainly cap the reclaim retries, but there are risks with this approach afaics. First of all, other allocators might piggyback on the current reclaimer and push it to the OOM killer even when we are not really OOM. Maybe this is possible currently as well, but it is less likely because NR_PAGES_SCANNED is reset on a freed page, which allows the reclaimer another round.

I am also not sure it would help with pathological cases like the one discussed here. If you have only a small amount of reclaimable memory on the LRU lists then you scan them quite quickly, which will consume retries.
Maybe a sufficient timeout can help, but I am afraid we can still hit the OOM prematurely because a large part of the memory is still under writeback (which might be a slow device - e.g. a USB stick).

We used to have this kind of problem in memcg reclaim. We do not have (resp. didn't have until recently with CONFIG_CGROUP_WRITEBACK) dirty memory throttling for memory cgroups, so the LRU can become full of dirty data really quickly, and that led to the memcg OOM killer. We are not doing zone_reclaimable and other heuristics there, so we had to explicitly wait_on_page_writeback in the reclaim to prevent premature OOM killing. An ugly hack, but the only thing that worked reliably. Time-based solutions were tried and failed with different workloads, and quite randomly depending on the load/storage.

> where 'X' is something sane that limits our CPU use, but also
> guarantees that we don't end up waiting *too* long (if a single
> allocation takes more than a big fraction of a second, we should
> probably stop trying).
>
> The whole time-based thing might even be explicit. There's nothing
> wrong with doing something like
>
>     unsigned long timeout = jiffies + HZ/4;
>
> at the top of the function, and making the whole retry logic actually
> say something like
>
>     if (time_after(jiffies, timeout)) goto noretry;
>
> (or make *that* trigger the oom logic, or whatever).
>
> Now, I realize the above suggestions are big changes, and they'll
> likely break things and we'll still need to tweak things, but dammit,
> wouldn't that be better than just randomly tweaking the insane
> zone_reclaimable logic?

Yes, zone_reclaimable is subtle, and imho it is used even at the wrong level. We should decide whether we are really OOM at __alloc_pages_slowpath. We definitely need a big-picture logic to tell us when it makes sense to drop the ball and trigger the OOM killer or fail the allocation request. E.g.
free + reclaimable + writeback < min_wmark on all usable zones for more than X rounds of direct reclaim without any progress is a sufficient signal to go OOM. Costly/noretry allocations can fail earlier, of course. This is obviously a half-baked idea which needs much more consideration; all I am trying to say is that we need a high-level metric to detect the OOM condition.

--
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 213+ messages in thread
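The "free + reclaimable + writeback < min_wmark for more than X rounds" criterion above can be written out as a small model. The names, the struct, and the value of X below are invented for illustration; they are not the kernel's data structures:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NO_PROGRESS_ROUNDS 16   /* stand-in for "X rounds" */

struct zone_state {
    long free, reclaimable, writeback, min_wmark;
};

/* A zone still holds out hope if everything that could plausibly be
 * freed (including pages currently under writeback) would lift it back
 * over the min watermark. */
static bool zone_has_hope(const struct zone_state *z)
{
    return z->free + z->reclaimable + z->writeback >= z->min_wmark;
}

/* Declare OOM only when no usable zone has hope AND direct reclaim has
 * made no progress for more than X consecutive rounds. */
static bool should_oom(const struct zone_state *zones, int nr_zones,
                       int rounds_without_progress)
{
    if (rounds_without_progress <= MAX_NO_PROGRESS_ROUNDS)
        return false;
    for (int i = 0; i < nr_zones; i++)
        if (zone_has_hope(&zones[i]))
            return false;
    return true;
}
```

Note how this keeps retrying while a zone is dominated by writeback pages - exactly the slow-USB-stick case described above - and only converges to OOM once even the optimistic estimate cannot reach the watermark.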
* Re: Silent hang up caused by pages being not scanned?
  2015-10-15 13:14 ` Michal Hocko
@ 2015-10-16 15:57 ` Michal Hocko
  0 siblings, 0 replies; 213+ messages in thread
From: Michal Hocko @ 2015-10-16 15:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Kyle Walker, Christoph Lameter, Andrew Morton, Johannes Weiner, Vladimir Davydov, linux-mm, Linux Kernel Mailing List, Stanislav Kozina, Mel Gorman, Rik van Riel

On Thu 15-10-15 15:14:09, Michal Hocko wrote:
> On Tue 13-10-15 09:37:06, Linus Torvalds wrote:
[...]
> > Now, I realize the above suggestions are big changes, and they'll
> > likely break things and we'll still need to tweak things, but dammit,
> > wouldn't that be better than just randomly tweaking the insane
> > zone_reclaimable logic?
>
> Yes zone_reclaimable is subtle and imho it is used even at the
> wrong level. We should decide whether we are really OOM at
> __alloc_pages_slowpath. We definitely need a big picture logic to tell
> us when it makes sense to drop the ball and trigger OOM killer or fail
> the allocation request.
>
> E.g. free + reclaimable + writeback < min_wmark on all usable zones for
> more than X rounds of direct reclaim without any progress is
> a sufficient signal to go OOM. Costly/noretry allocations can fail earlier
> of course. This is obviously a half baked idea which needs much more
> consideration all I am trying to say is that we need a high level metric
> to tell OOM condition.

OK so here is what I am playing with currently. It is not complete yet.
Anyway, I have tested it with 2 scenarios on a swapless system with 2G of RAM, both of which do:

$ cat writer.sh
#!/bin/sh
size=$((1<<30))
block=$((4<<10))
writer()
{
	(
	while true
	do
		dd if=/dev/zero of=/mnt/data/file.$1 bs=$block count=$(($size/$block))
		rm /mnt/data/file.$1
		sync
	done
	) &
}
writer 1
writer 2
sleep 10s # allow to accumulate enough dirty pages

1) massive OOM

start 100 memeaters, each 80M, run in parallel (anon private MAP_POPULATE mapping). This will trigger many OOM killers and the overall count is what I was interested in. The test is considered finished when we get to a steady state - writers can make progress and there is no more OOM killing for some time.

$ grep "invoked oom-killer" base-run-oom.log | wc -l
78
$ grep "invoked oom-killer" test-run-oom.log | wc -l
63

So it looks like we have triggered fewer OOM kills with the patch applied. I haven't checked those too closely, but it seems like at least two instances might not have triggered with the current implementation because the DMA32 zone is considered reclaimable. But this check is inherently racy so we cannot be sure.

$ grep "DMA32.*all_unreclaimable? no" test2-run-oom.log | wc -l
2

2) almost-OOM situation

invoke 10 memeaters in parallel and try to fill up all the memory without triggering the OOM killer. This is quite hard and it required a lot of tuning.
I've ended up with:

#!/bin/sh
pkill mem_eater
sync
echo 3 > /proc/sys/vm/drop_caches
sync
size=$(awk '/MemFree/{printf "%dK", ($2/10)-(16*1024)}' /proc/meminfo)
sh writer.sh &
sleep 10s
for i in $(seq 10)
do
	memcg_test/tools/mem_eater $size &
done
wait

and this one doesn't hit the OOM killer with the original implementation while it hits it with the patch applied:

[   32.727001] DMA32 free:5428kB min:5532kB low:6912kB high:8296kB active_anon:1802520kB inactive_anon:204kB active_file:6692kB inactive_file:137184kB unevictable:0kB isolated(anon):136kB isolated(file):32kB present:2080640kB managed:1997880kB mlocked:0kB dirty:0kB writeback:137168kB mapped:6408kB shmem:204kB slab_reclaimable:20472kB slab_unreclaimable:13276kB kernel_stack:1456kB pagetables:4756kB unstable:0kB bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:948764 all_unreclaimable? yes

There is a lot of memory in the writeback, but all_unreclaimable is yes, so who knows - maybe it is just a coincidence we haven't triggered OOM in the original kernel.

Anyway, the two implementations will be hard to compare because the workloads are very different, but I think something like below should be more readable and deterministic than what we have right now. It will need some more tuning for sure and I will be playing with it some more. I would just like to hear opinions whether this approach makes sense. If yes I will post it separately in a new thread for a wider discussion. This email thread seems to be full of detours already.

---
>From e8620185cc1139cd47cee64a7e6b96e9a7c92d25 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 16 Oct 2015 14:34:50 +0200
Subject: [PATCH] mm, oom: refactor oom detection

__alloc_pages_slowpath has traditionally relied on the direct reclaim and did_some_progress as an indicator that it makes sense to retry allocation rather than declaring OOM.
shrink_zones had to rely on zone_reclaimable if shrink_zone didn't make any progress, to prevent premature OOM killer invocation - the LRU might be full of dirty or writeback pages and direct reclaim cannot clean those up.

zone_reclaimable allows rescanning the reclaimable lists several times and restarting if a page is freed. This is really subtle behavior and it might lead to a livelock when a single freed page keeps the allocator looping but the current task will not be able to allocate that single page. The OOM killer would be more appropriate than looping without any progress for an unbounded amount of time.

This patch changes the OOM detection logic and pulls it out from shrink_zone, which is too low a level to be appropriate for any high-level decisions such as OOM, which is a per-zonelist property. It is __alloc_pages_slowpath which knows how many attempts have been made and what the progress was so far, therefore it is more appropriate to implement this logic there.

The new heuristic tries to be more deterministic and easier to follow. Retrying makes sense only if the currently reclaimable memory (pages on reclaimable LRUs + writeback pages) + free pages would allow the current allocation request to succeed (as per __zone_watermark_ok) for at least one zone. This alone wouldn't be sufficient, because the writeback might get stuck and reclaimable pages might be pinned for a really long time or even depend on the current allocation context. Therefore there is a feedback mechanism implemented which reduces the reclaim target after each reclaim round without any progress. This means that we should eventually converge to only NR_FREE_PAGES, fail the wmark check, and proceed to OOM. The nice aspect of this approach is that it works independently of the allocation order. The backoff is simple and linear with 1/16 of the reclaimable pages for each round without any progress. We are optimistic and reset the counter after successful reclaim rounds.

TODO: what are we going to do with __GFP_REPEAT?
We should try harder but how much harder? Do we even need it? Opportunistic high order allocations can use __GFP_NORETRY...

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/swap.h |  1 +
 mm/page_alloc.c      | 66 +++++++++++++++++++++++++++++++++++++++++++++-------
 mm/vmscan.c          | 10 +-------
 3 files changed, 59 insertions(+), 18 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 9c7c4b418498..8298e1dc20f9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -317,6 +317,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
 					struct vm_area_struct *vma);
 
 /* linux/mm/vmscan.c */
+extern unsigned long zone_reclaimable_pages(struct zone *zone);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
 extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c73913648357..ae927c762917 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2972,6 +2972,13 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
 	return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == GFP_TRANSHUGE;
 }
 
+/*
+ * Number of backoff steps for potentially reclaimable pages if the direct reclaim
+ * cannot make any progress. Each step will reduce 1/MAX_STALL_BACKOFF of the
+ * reclaimable memory.
+ */
+#define MAX_STALL_BACKOFF 16
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 						struct alloc_context *ac)
@@ -2979,11 +2986,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
 	struct page *page = NULL;
 	int alloc_flags;
-	unsigned long pages_reclaimed = 0;
 	unsigned long did_some_progress;
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
 	bool deferred_compaction = false;
 	int contended_compaction = COMPACT_CONTENDED_NONE;
+	struct zone *zone;
+	struct zoneref *z;
+	int stall_backoff = 0;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3135,23 +3144,62 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (gfp_mask & __GFP_NORETRY)
 		goto noretry;
 
-	/* Keep reclaiming pages as long as there is reasonable progress */
-	pages_reclaimed += did_some_progress;
-	if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) ||
-	    ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
-		/* Wait for some write requests to complete then retry */
-		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
-		goto retry;
+	/*
+	 * Be optimistic and consider all pages on reclaimable LRUs + those
+	 * currently on writeback as usable but make sure we converge to
+	 * OOM if we cannot make any progress after multiple consecutive
+	 * attempts.
+	 */
+	if (did_some_progress)
+		stall_backoff = 0;
+	else
+		stall_backoff = min(stall_backoff+1, MAX_STALL_BACKOFF);
+
+	/*
+	 * Keep reclaiming pages while there is a chance this will lead somewhere.
+	 * If none of the target zones can satisfy our allocation request even
+	 * if all reclaimable pages are considered then we are screwed and have
+	 * to go OOM.
+	 */
+	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx, ac->nodemask) {
+		unsigned long free = zone_page_state(zone, NR_FREE_PAGES);
+		unsigned long writeback = zone_page_state(zone, NR_WRITEBACK);
+		unsigned long reclaimable;
+		unsigned long target;
+
+		reclaimable = zone_reclaimable_pages(zone) +
+			      zone_page_state(zone, NR_ISOLATED_FILE) +
+			      zone_page_state(zone, NR_ISOLATED_ANON);
+		target = reclaimable + writeback;
+		target -= stall_backoff * (1 + target/MAX_STALL_BACKOFF);
+		target += free;
+
+		/*
+		 * Would the allocation succeed if we reclaimed the whole target?
+		 * We might over-account here because some pages under writeback
+		 * might be on the LRU as well but that shouldn't confuse us too
+		 * much.
+		 */
+		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
+				ac->high_zoneidx, alloc_flags, target)) {
+			/* Wait for some write requests to complete then retry */
+			wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
+			goto retry;
+		}
 	}
 
+	/* TODO what about GFP_REPEAT */
+
 	/* Reclaim has failed us, start killing things */
 	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
 	if (page)
 		goto got_pg;
 
 	/* Retry as long as the OOM killer is making progress */
-	if (did_some_progress)
+	if (did_some_progress) {
+		stall_backoff = 0;
 		goto retry;
+	}
 
 noretry:
 	/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c88d74ad9304..bc14217acd47 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -193,7 +193,7 @@ static bool sane_reclaim(struct scan_control *sc)
 }
 #endif
 
-static unsigned long zone_reclaimable_pages(struct zone *zone)
+unsigned long zone_reclaimable_pages(struct zone *zone)
 {
 	unsigned long nr;
 
@@ -2639,10 +2639,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			if (shrink_zone(zone, sc, zone_idx(zone) == classzone_idx))
 				reclaimable = true;
-
-			if (global_reclaim(sc) &&
-			    !reclaimable && zone_reclaimable(zone))
-				reclaimable = true;
 	}
 
 	/*
@@ -2734,10 +2730,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		goto retry;
 	}
 
-	/* Any of the zones still reclaimable? Don't OOM. */
-	if (zones_reclaimable)
-		return 1;
-
 	return 0;
 }
-- 
2.6.1

--
Michal Hocko
SUSE Labs

^ permalink raw reply related [flat|nested] 213+ messages in thread
* Re: Silent hang up caused by pages being not scanned?
@ 2015-10-16 15:57 ` Michal Hocko
  0 siblings, 0 replies; 213+ messages in thread
From: Michal Hocko @ 2015-10-16 15:57 UTC (permalink / raw)
To: Linus Torvalds
Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Kyle Walker,
	Christoph Lameter, Andrew Morton, Johannes Weiner, Vladimir Davydov,
	linux-mm, Linux Kernel Mailing List, Stanislav Kozina, Mel Gorman,
	Rik van Riel

On Thu 15-10-15 15:14:09, Michal Hocko wrote:
> On Tue 13-10-15 09:37:06, Linus Torvalds wrote:
[...]
> > Now, I realize the above suggestions are big changes, and they'll
> > likely break things and we'll still need to tweak things, but dammit,
> > wouldn't that be better than just randomly tweaking the insane
> > zone_reclaimable logic?
>
> Yes zone_reclaimable is subtle and imho it is used even at the
> wrong level. We should decide whether we are really OOM at
> __alloc_pages_slowpath. We definitely need a big picture logic to tell
> us when it makes sense to drop the ball and trigger OOM killer or fail
> the allocation request.
>
> E.g. free + reclaimable + writeback < min_wmark on all usable zones for
> more than X rounds of direct reclaim without any progress is
> a sufficient signal to go OOM. Costly/noretry allocations can fail earlier
> of course. This is obviously a half baked idea which needs much more
> consideration all I am trying to say is that we need a high level metric
> to tell OOM condition.

OK so here is what I am playing with currently. It is not complete yet.
Anyway I have tested it with 2 scenarios on a swapless system with 2G of
RAM. Both do:

$ cat writer.sh
#!/bin/sh
size=$((1<<30))
block=$((4<<10))

writer()
{
	(
	while true
	do
		dd if=/dev/zero of=/mnt/data/file.$1 bs=$block count=$(($size/$block))
		rm /mnt/data/file.$1
		sync
	done
	) &
}

writer 1
writer 2

sleep 10s # allow to accumulate enough dirty pages

1) massive OOM
start 100 memeaters, each 80M, run in parallel (anon private
MAP_POPULATE mapping).
This will trigger many OOM killers and the overall count is what I was
interested in. The test is considered finished when we get a steady
state - writers can make progress and there is no more OOM killing for
some time.

$ grep "invoked oom-killer" base-run-oom.log | wc -l
78
$ grep "invoked oom-killer" test-run-oom.log | wc -l
63

So it looks like we have triggered less OOM killing with the patch
applied. I haven't checked those too closely but it seems like at least
two instances might not have triggered with the current implementation
because the DMA32 zone is considered reclaimable. But this check is
inherently racy so we cannot be sure.

$ grep "DMA32.*all_unreclaimable? no" test2-run-oom.log | wc -l
2

2) almost OOM situation
invoke 10 memeaters in parallel and try to fill up all the memory
without triggering the OOM killer. This is quite hard and it required a
lot of tuning. I've ended up with:

#!/bin/sh
pkill mem_eater
sync
echo 3 > /proc/sys/vm/drop_caches
sync
size=$(awk '/MemFree/{printf "%dK", ($2/10)-(16*1024)}' /proc/meminfo)
sh writer.sh &
sleep 10s
for i in $(seq 10)
do
	memcg_test/tools/mem_eater $size &
done
wait

and this one doesn't hit the OOM killer with the original implementation
while it hits it with the patch applied:

[   32.727001] DMA32 free:5428kB min:5532kB low:6912kB high:8296kB active_anon:1802520kB inactive_anon:204kB active_file:6692kB inactive_file:137184kB unevictable:0kB isolated(anon):136kB isolated(file):32kB present:2080640kB managed:1997880kB mlocked:0kB dirty:0kB writeback:137168kB mapped:6408kB shmem:204kB slab_reclaimable:20472kB slab_unreclaimable:13276kB kernel_stack:1456kB pagetables:4756kB unstable:0kB bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:948764 all_unreclaimable? yes

There is a lot of memory in the writeback but all_unreclaimable is yes,
so who knows, maybe it is just a coincidence we haven't triggered OOM in
the original kernel.
Anyway, the two implementations will be hard to compare because the
workloads are very different, but I think something like below should be
more readable and deterministic than what we have right now. It will
need some more tuning for sure and I will be playing with it some more.
I would just like to hear opinions whether this approach makes sense. If
yes, I will post it separately in a new thread for a wider discussion.
This email thread seems to be full of detours already.
---

^ permalink raw reply	[flat|nested] 213+ messages in thread
* Re: Silent hang up caused by pages being not scanned?
  2015-10-16 15:57 ` Michal Hocko
@ 2015-10-16 18:34   ` Linus Torvalds
  -1 siblings, 0 replies; 213+ messages in thread
From: Linus Torvalds @ 2015-10-16 18:34 UTC (permalink / raw)
To: Michal Hocko
Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Kyle Walker,
	Christoph Lameter, Andrew Morton, Johannes Weiner, Vladimir Davydov,
	linux-mm, Linux Kernel Mailing List, Stanislav Kozina, Mel Gorman,
	Rik van Riel

On Fri, Oct 16, 2015 at 8:57 AM, Michal Hocko <mhocko@kernel.org> wrote:
>
> OK so here is what I am playing with currently. It is not complete
> yet.

So this looks like it's going in a reasonable direction. However:

> +		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
> +				ac->high_zoneidx, alloc_flags, target)) {
> +			/* Wait for some write requests to complete then retry */
> +			wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
> +			goto retry;
> +		}

I still think we should at least spend some time re-thinking that
"wait_iff_congested()" thing. We may not actually be congested, but
might be unable to write anything out because of our allocation flags
(ie not allowed to recurse into the filesystems), so we might be in
the situation that we have a lot of dirty pages that we can't directly
do anything about.

Now, we will have woken kswapd, so something *will* hopefully be done
about them eventually, but at no time do we actually really wait for
it. We'll just busy-loop.

So at a minimum, I think we should yield to kswapd. We do do that
"cond_resched()" in wait_iff_congested(), but I'm not entirely
convinced that is at all enough to wait for kswapd to *do* something.

So before we really decide to see if we should oom, I think we should
have at least one forced io_schedule_timeout(), whether we're congested
or not. And yes, as Tetsuo Handa said, any kind of short wait might be
too short for IO to really complete, but *something* will have
completed. Unless we're so far up the creek that we really should just
oom.

But I suspect we'll have to just try things out and tweak it. This
patch looks like a reasonable starting point to me.

Tetsuo, mind trying it out and maybe tweaking it a bit for the load
you have? Does it seem to improve on your situation?

                 Linus

^ permalink raw reply	[flat|nested] 213+ messages in thread
* Re: Silent hang up caused by pages being not scanned? 2015-10-16 18:34 ` Linus Torvalds @ 2015-10-16 18:49 ` Tetsuo Handa -1 siblings, 0 replies; 213+ messages in thread From: Tetsuo Handa @ 2015-10-16 18:49 UTC (permalink / raw) To: torvalds, mhocko Cc: rientjes, oleg, kwalker, cl, akpm, hannes, vdavydov, linux-mm, linux-kernel, skozina, mgorman, riel Linus Torvalds wrote: > Tetsuo, mind trying it out and maybe tweaking it a bit for the load > you have? Does it seem to improve on your situation? Yes, I already tried it and just replied to Michal. I tested for one hour using various memory stressing programs. As far as I tested, I did not hit silent hang up ( MemAlloc-Info: X stalling task, 0 dying task, 0 victim task. where X > 0). ---------------------------------------- [ 134.510993] Mem-Info: [ 134.511940] active_anon:408777 inactive_anon:2088 isolated_anon:24 [ 134.511940] active_file:15 inactive_file:24 isolated_file:0 [ 134.511940] unevictable:0 dirty:4 writeback:1 unstable:0 [ 134.511940] slab_reclaimable:3109 slab_unreclaimable:5594 [ 134.511940] mapped:679 shmem:2156 pagetables:2077 bounce:0 [ 134.511940] free:12911 free_pcp:31 free_cma:0 [ 134.521256] Node 0 DMA free:7256kB min:400kB low:500kB high:600kB active_anon:6560kB inactive_anon:180kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:80kB shmem:184kB slab_reclaimable:236kB slab_unreclaimable:296kB kernel_stack:48kB pagetables:556kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? 
yes [ 134.532779] lowmem_reserve[]: 0 1714 1714 1714 [ 134.534455] Node 0 DMA32 free:44388kB min:44652kB low:55812kB high:66976kB active_anon:1628548kB inactive_anon:8172kB active_file:60kB inactive_file:96kB unevictable:0kB isolated(anon):96kB isolated(file):0kB present:2080640kB managed:1759252kB mlocked:0kB dirty:16kB writeback:4kB mapped:2636kB shmem:8440kB slab_reclaimable:12200kB slab_unreclaimable:22080kB kernel_stack:3584kB pagetables:7752kB unstable:0kB bounce:0kB free_pcp:240kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1016 all_unreclaimable? yes [ 134.545830] lowmem_reserve[]: 0 0 0 0 [ 134.547404] Node 0 DMA: 16*4kB (UME) 16*8kB (UME) 10*16kB (UME) 6*32kB (UME) 1*64kB (M) 2*128kB (UE) 1*256kB (M) 2*512kB (UE) 3*1024kB (UME) 1*2048kB (U) 0*4096kB = 7264kB [ 134.552766] Node 0 DMA32: 1158*4kB (UME) 638*8kB (UE) 244*16kB (UME) 163*32kB (UE) 73*64kB (UE) 34*128kB (UME) 17*256kB (UME) 10*512kB (UME) 7*1024kB (UM) 0*2048kB 0*4096kB = 44520kB [ 134.558111] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 134.560358] 2195 total pagecache pages [ 134.562043] 0 pages in swap cache [ 134.563604] Swap cache stats: add 0, delete 0, find 0/0 [ 134.565441] Free swap = 0kB [ 134.567015] Total swap = 0kB [ 134.568628] 524157 pages RAM [ 134.570034] 0 pages HighMem/MovableOnly [ 134.571681] 80368 pages reserved [ 134.573467] 0 pages hwpoisoned ---------------------------------------- Only problem I felt is that the ratio of inactive_file/writeback (shown below) was high (compared to shown above) when I did $ cat < /dev/zero > /tmp/file1 & cat < /dev/zero > /tmp/file2 & cat < /dev/zero > /tmp/file3 & sleep 10; ./a.out; killall cat but I think this patch is better than current code. 
---------------------------------------- [ 1135.909600] Mem-Info: [ 1135.910686] active_anon:321011 inactive_anon:4664 isolated_anon:0 [ 1135.910686] active_file:3170 inactive_file:78035 isolated_file:512 [ 1135.910686] unevictable:0 dirty:0 writeback:78618 unstable:0 [ 1135.910686] slab_reclaimable:5739 slab_unreclaimable:6170 [ 1135.910686] mapped:4666 shmem:8300 pagetables:1966 bounce:0 [ 1135.910686] free:12938 free_pcp:0 free_cma:0 [ 1135.925255] Node 0 DMA free:7232kB min:400kB low:500kB high:600kB active_anon:5852kB inactive_anon:196kB active_file:120kB inactive_file:980kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:968kB mapped:248kB shmem:388kB slab_reclaimable:316kB slab_unreclaimable:272kB kernel_stack:64kB pagetables:100kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:7444 all_unreclaimable? yes [ 1135.936728] lowmem_reserve[]: 0 1714 1714 1714 [ 1135.938486] Node 0 DMA32 free:44520kB min:44652kB low:55812kB high:66976kB active_anon:1278192kB inactive_anon:18460kB active_file:12560kB inactive_file:313176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1759252kB mlocked:0kB dirty:0kB writeback:313504kB mapped:18416kB shmem:32812kB slab_reclaimable:22640kB slab_unreclaimable:24408kB kernel_stack:4240kB pagetables:7764kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2957668 all_unreclaimable? 
yes [ 1135.950355] lowmem_reserve[]: 0 0 0 0 [ 1135.952011] Node 0 DMA: 7*4kB (U) 14*8kB (UM) 13*16kB (UM) 6*32kB (UME) 1*64kB (M) 4*128kB (UME) 2*256kB (UM) 3*512kB (UME) 2*1024kB (UE) 1*2048kB (M) 0*4096kB = 7260kB [ 1135.957169] Node 0 DMA32: 241*4kB (UE) 929*8kB (UE) 496*16kB (UME) 277*32kB (UE) 135*64kB (UME) 17*128kB (UME) 3*256kB (E) 16*512kB (ME) 0*1024kB 0*2048kB 0*4096kB = 44972kB [ 1135.963047] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 1135.965472] 90009 total pagecache pages [ 1135.967078] 0 pages in swap cache [ 1135.968581] Swap cache stats: add 0, delete 0, find 0/0 [ 1135.970424] Free swap = 0kB [ 1135.971828] Total swap = 0kB [ 1135.973248] 524157 pages RAM [ 1135.974655] 0 pages HighMem/MovableOnly [ 1135.976230] 80368 pages reserved [ 1135.977745] 0 pages hwpoisoned ---------------------------------------- I can still hit OOM livelock ( MemAlloc-Info: X stalling task, Y dying task, Z victim task. where X > 0 && Y > 0). ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: Silent hang up caused by pages being not scanned?
  2015-10-16 18:49 ` Tetsuo Handa
@ 2015-10-19 12:57   ` Michal Hocko
  -1 siblings, 0 replies; 213+ messages in thread
From: Michal Hocko @ 2015-10-19 12:57 UTC (permalink / raw)
To: Tetsuo Handa
Cc: torvalds, rientjes, oleg, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina, mgorman, riel

On Sat 17-10-15 03:49:39, Tetsuo Handa wrote:
> Linus Torvalds wrote:
> > Tetsuo, mind trying it out and maybe tweaking it a bit for the load
> > you have? Does it seem to improve on your situation?
>
> Yes, I already tried it and just replied to Michal.
>
> I tested for one hour using various memory stressing programs.
> As far as I tested, I did not hit silent hang up (

Thank you for your testing!

[...]
> Only problem I felt is that the ratio of inactive_file/writeback
> (shown below) was high (compared to shown above) when I did

Yes, this is the lack of congestion on the bdi as Linus expected.
Another patch I've just posted should help in that regard. At least it
seems to help in my testing.

[...]
> I can still hit OOM livelock (
>
> MemAlloc-Info: X stalling task, Y dying task, Z victim task.
>
> where X > 0 && Y > 0).

This seems a separate issue, though.
--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 213+ messages in thread
* Re: Silent hang up caused by pages being not scanned?
  2015-10-16 18:34 ` Linus Torvalds
@ 2015-10-19 12:53   ` Michal Hocko
  -1 siblings, 0 replies; 213+ messages in thread
From: Michal Hocko @ 2015-10-19 12:53 UTC (permalink / raw)
To: Linus Torvalds
Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Kyle Walker,
	Christoph Lameter, Andrew Morton, Johannes Weiner, Vladimir Davydov,
	linux-mm, Linux Kernel Mailing List, Stanislav Kozina, Mel Gorman,
	Rik van Riel

On Fri 16-10-15 11:34:48, Linus Torvalds wrote:
> On Fri, Oct 16, 2015 at 8:57 AM, Michal Hocko <mhocko@kernel.org> wrote:
> >
> > OK so here is what I am playing with currently. It is not complete
> > yet.
>
> So this looks like it's going in a reasonable direction. However:
>
> > +		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
> > +				ac->high_zoneidx, alloc_flags, target)) {
> > +			/* Wait for some write requests to complete then retry */
> > +			wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
> > +			goto retry;
> > +		}
>
> I still think we should at least spend some time re-thinking that
> "wait_iff_congested()" thing.

You are right. I thought we would be congested most of the time because
of the heavy IO but a quick test has shown that the zone is marked
congested but the nr_wb_congested is zero all the time. That is most
probably because the IO is throttled severely by the lack of memory as
well.

> We may not actually be congested, but
> might be unable to write anything out because of our allocation flags
> (ie not allowed to recurse into the filesystems), so we might be in
> the situation that we have a lot of dirty pages that we can't directly
> do anything about.
>
> Now, we will have woken kswapd, so something *will* hopefully be done
> about them eventually, but at no time do we actually really wait for
> it. We'll just busy-loop.
>
> So at a minimum, I think we should yield to kswapd. We do do that
> "cond_resched()" in wait_iff_congested(), but I'm not entirely
> convinced that is at all enough to wait for kswapd to *do* something.

I went with congestion_wait which is what we used to do in the past
before wait_iff_congested has been introduced. The primary reason for
the change was that congestion_wait used to cause unhealthy stalls in
the direct reclaim where the bdi wasn't really congested and so we were
sleeping for the full timeout.

Now I think we can do better even with congestion_wait. We do not have
to wait when we did_some_progress so we won't affect a regular direct
reclaim path and we can reduce sleeping to:

	dirty+writeback > reclaimable/2

This is a good signal that the reason for no progress is most likely the
stale IO and we need to wait even if the bdi itself is not congested. We
can also increase the timeout to HZ/10 because this is an extreme slow
path - we are not making any progress and stalling is better than OOM.

This is a diff on top of the previous patch. I even think that this part
would deserve a separate patch for a better bisect-ability. My testing
shows that close-to-oom behaves better (I can use more memory for
memeaters without hitting OOM).

What do you think?
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e28028681c59..fed1bb7ea43a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3188,8 +3187,21 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 */
 	if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
 			ac->high_zoneidx, alloc_flags, target)) {
-		/* Wait for some write requests to complete then retry */
-		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
+		unsigned long writeback = zone_page_state(zone, NR_WRITEBACK),
+			dirty = zone_page_state(zone, NR_FILE_DIRTY);
+
+		if (did_some_progress)
+			goto retry;
+
+		/*
+		 * If we didn't make any progress and have a lot of
+		 * dirty + writeback pages then we should wait for
+		 * an IO to complete to slow down the reclaim and
+		 * prevent from pre mature OOM
+		 */
+		if (2*(writeback + dirty) > reclaimable)
+			congestion_wait(BLK_RW_ASYNC, HZ/10);
+		else
+			cond_resched();
 		goto retry;
 	}
 }
--
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 213+ messages in thread
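The throttling condition in the diff is self-contained enough to check on paper. A userspace sketch of just that decision follows; the function name and sample numbers are hypothetical, with plain integers standing in for the zone counters.

```c
#include <stdbool.h>

/* Mirror of the "2*(writeback + dirty) > reclaimable" test above:
 * sleep in congestion_wait() only when at least half of the
 * reclaimable pages sit in dirty or writeback state - a strong hint
 * that reclaim is stuck behind IO rather than short of scannable
 * pages. */
static bool should_wait_for_io(unsigned long dirty, unsigned long writeback,
			       unsigned long reclaimable)
{
	return 2 * (writeback + dirty) > reclaimable;
}
```

With no reclaim progress and half or more of the LRU pages in flight, the stall is attributed to IO and the task sleeps the HZ/10 timeout; on mostly clean LRUs it merely yields with cond_resched() and retries.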
* Re: Silent hang up caused by pages being not scanned?
  2015-10-12 15:25 ` Tetsuo Handa
@ 2015-10-13 13:32   ` Michal Hocko
  1 sibling, 0 replies; 213+ messages in thread
From: Michal Hocko @ 2015-10-13 13:32 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

On Tue 13-10-15 00:25:53, Tetsuo Handa wrote:
[...]
> What is strange, the values printed by this debug printk() patch did not
> change as time went by. Thus, I think that this is not a problem of lack
> of CPU time for scanning pages. I suspect that there is a bug and nobody
> is scanning pages.
>
> ----------
> [   66.821450] zone_reclaimable returned 1 at line 2646
> [   66.823020] (ACTIVE_FILE=26+INACTIVE_FILE=10) * 6 > PAGES_SCANNED=32
> [   66.824935] shrink_zones returned 1 at line 2706
> [   66.826392] zones_reclaimable=1 at line 2765
> [   66.827865] do_try_to_free_pages returned 1 at line 2938
> [   67.102322] __perform_reclaim returned 1 at line 2854
> [   67.103968] did_some_progress=1 at line 3301
> (...snipped...)
> [  281.439977] zone_reclaimable returned 1 at line 2646
> [  281.439977] (ACTIVE_FILE=26+INACTIVE_FILE=10) * 6 > PAGES_SCANNED=32
> [  281.439978] shrink_zones returned 1 at line 2706
> [  281.439978] zones_reclaimable=1 at line 2765
> [  281.439979] do_try_to_free_pages returned 1 at line 2938
> [  281.439979] __perform_reclaim returned 1 at line 2854
> [  281.439980] did_some_progress=1 at line 3301

This is really interesting because even with reclaimable LRUs this low we
should eventually scan them enough times to convince zone_reclaimable to
fail. PAGES_SCANNED in your logs seems to be constant, though, which
suggests somebody manages to free a page every time before we get down to
priority 0 and finally manage to scan something. This is pretty much
pathological behavior and I have a hard time imagining how that would be
possible, but it clearly shows that the zone_reclaimable heuristic is not
working properly.
I can see two options here. Either we teach zone_reclaimable to be less
fragile or we remove zone_reclaimable from shrink_zones altogether. Both
of them are risky because we have a long history of changes in this area
which made other subtle behavior changes, but I guess the first option
should be less fragile. What about the following patch? I am not happy
about it because the condition is rather rough and a deeper inspection is
really needed to check all the call sites, but it should be good for
testing.
---
>From afe1c5ef4726b78f51e850ed93564b52f3c73905 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Tue, 13 Oct 2015 15:12:13 +0200
Subject: [PATCH] mm, vmscan: Make zone_reclaimable less fragile

zone_reclaimable considers a zone unreclaimable if we have scanned all
the reclaimable pages sufficient times since the last page has been
freed and that still hasn't led to an allocation success. This can,
however, lead to a livelock/thrashing when a single freed page resets
PAGES_SCANNED while memory consumers are looping over small LRUs without
making any progress (e.g. remaining pages on the LRU are dirty and all
the flushers are blocked) and failing to invoke the OOM killer because
zone_reclaimable would consider the zone reclaimable.

Tetsuo Handa has reported the following:

: [   66.821450] zone_reclaimable returned 1 at line 2646
: [   66.823020] (ACTIVE_FILE=26+INACTIVE_FILE=10) * 6 > PAGES_SCANNED=32
: [   66.824935] shrink_zones returned 1 at line 2706
: [   66.826392] zones_reclaimable=1 at line 2765
: [   66.827865] do_try_to_free_pages returned 1 at line 2938
: [   67.102322] __perform_reclaim returned 1 at line 2854
: [   67.103968] did_some_progress=1 at line 3301
: (...snipped...)
: [  281.439977] zone_reclaimable returned 1 at line 2646
: [  281.439977] (ACTIVE_FILE=26+INACTIVE_FILE=10) * 6 > PAGES_SCANNED=32
: [  281.439978] shrink_zones returned 1 at line 2706
: [  281.439978] zones_reclaimable=1 at line 2765
: [  281.439979] do_try_to_free_pages returned 1 at line 2938
: [  281.439979] __perform_reclaim returned 1 at line 2854
: [  281.439980] did_some_progress=1 at line 3301

In his case anon LRUs are not reclaimable because there is no swap
enabled. It is not clear who frees a page that regularly, but it is clear
that no progress can be made, yet zone_reclaimable still considers the
zone reclaimable.

This patch makes zone_reclaimable less fragile by checking the number of
reclaimable pages against the min watermark. It doesn't make much sense
to rely on the PAGES_SCANNED heuristic if there are not enough
reclaimable pages to get us over the min watermark.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/vmscan.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c88d74ad9304..f16266e0af70 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -209,8 +209,14 @@ static unsigned long zone_reclaimable_pages(struct zone *zone)
 
 bool zone_reclaimable(struct zone *zone)
 {
+	unsigned long reclaimable = zone_reclaimable_pages(zone);
+	unsigned long free = zone_page_state(zone, NR_FREE_PAGES);
+
+	if (reclaimable + free < min_wmark_pages(zone))
+		return false;
+
 	return zone_page_state(zone, NR_PAGES_SCANNED) <
-		zone_reclaimable_pages(zone) * 6;
+		reclaimable * 6;
 }
 
 static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru)
-- 
2.5.1

-- 
Michal Hocko
SUSE Labs
* Re: Silent hang up caused by pages being not scanned?
  2015-10-13 13:32 ` Michal Hocko
@ 2015-10-13 16:19   ` Tetsuo Handa
  1 sibling, 0 replies; 213+ messages in thread
From: Tetsuo Handa @ 2015-10-13 16:19 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

Michal Hocko wrote:
> I can see two options here. Either we teach zone_reclaimable to be less
> fragile or remove zone_reclaimable from shrink_zones altogether. Both of
> them are risky because we have a long history of changes in this areas
> which made other subtle behavior changes but I guess that the first
> option should be less fragile. What about the following patch? I am not
> happy about it because the condition is rather rough and a deeper
> inspection is really needed to check all the call sites but it should be
> good for testing.

While zone_reclaimable() for Node 0 DMA32 became false with your patch,
zone_reclaimable() for Node 0 DMA kept returning true, and as a result
the overall result (i.e. zones_reclaimable) remained true.

$ ./a.out

---------- When there is no data to write ----------
[  162.942371] MIN=11163 FREE=11155 (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=16
[  162.944541] MIN=100 FREE=1824 (ACTIVE_FILE=3+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  162.946560] zone_reclaimable returned 1 at line 2665
[  162.948722] shrink_zones returned 1 at line 2716
(...snipped...)
[  164.897587] zones_reclaimable=1 at line 2775
[  164.899172] do_try_to_free_pages returned 1 at line 2948
[  167.087119] __perform_reclaim returned 1 at line 2854
[  167.088868] did_some_progress=1 at line 3301
(...snipped...)
[  261.577944] MIN=11163 FREE=11155 (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  261.580093] MIN=100 FREE=1824 (ACTIVE_FILE=3+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  261.582333] zone_reclaimable returned 1 at line 2665
[  261.583841] shrink_zones returned 1 at line 2716
(...snipped...)
[  264.728434] zones_reclaimable=1 at line 2775
[  264.730002] do_try_to_free_pages returned 1 at line 2948
[  268.191368] __perform_reclaim returned 1 at line 2854
[  268.193113] did_some_progress=1 at line 3301
---------- When there is no data to write ----------

Complete log (with your patch inside) is at
http://I-love.SAKURA.ne.jp/tmp/serial-20151014.txt.xz .

By the way, the OOM killer seems to be invoked prematurely for a
different load if your patch is applied.

$ cat < /dev/zero > /tmp/log & sleep 10; ./a.out

---------- When there is a lot of data to write ----------
[   69.019271] Mem-Info:
[   69.019755] active_anon:335006 inactive_anon:2084 isolated_anon:23
[   69.019755]  active_file:12197 inactive_file:65310 isolated_file:31
[   69.019755]  unevictable:0 dirty:533 writeback:51020 unstable:0
[   69.019755]  slab_reclaimable:4753 slab_unreclaimable:4134
[   69.019755]  mapped:9639 shmem:2144 pagetables:2030 bounce:0
[   69.019755]  free:12972 free_pcp:45 free_cma:0
[   69.026260] Node 0 DMA free:7300kB min:400kB low:500kB high:600kB active_anon:5232kB inactive_anon:96kB active_file:424kB inactive_file:1068kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:164kB writeback:972kB mapped:416kB shmem:104kB slab_reclaimable:304kB slab_unreclaimable:244kB kernel_stack:96kB pagetables:256kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:128 all_unreclaimable? no
[   69.037189] lowmem_reserve[]: 0 1729 1729 1729
[   69.039152] Node 0 DMA32 free:74224kB min:44652kB low:55812kB high:66976kB active_anon:1334792kB inactive_anon:8240kB active_file:48364kB inactive_file:230752kB unevictable:0kB isolated(anon):92kB isolated(file):0kB present:2080640kB managed:1774264kB mlocked:0kB dirty:9328kB writeback:199060kB mapped:38140kB shmem:8472kB slab_reclaimable:17840kB slab_unreclaimable:16292kB kernel_stack:3840kB pagetables:7864kB unstable:0kB bounce:0kB free_pcp:784kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[   69.052017] lowmem_reserve[]: 0 0 0 0
[   69.053818] Node 0 DMA: 17*4kB (UME) 8*8kB (UME) 6*16kB (UME) 2*32kB (UM) 2*64kB (UE) 4*128kB (UME) 1*256kB (U) 2*512kB (UE) 3*1024kB (UME) 1*2048kB (U) 0*4096kB = 7332kB
[   69.059597] Node 0 DMA32: 632*4kB (UME) 454*8kB (UME) 507*16kB (UME) 310*32kB (UME) 177*64kB (UE) 61*128kB (UME) 15*256kB (ME) 19*512kB (M) 10*1024kB (M) 0*2048kB 0*4096kB = 67136kB
[   69.065810] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   69.068305] 72477 total pagecache pages
[   69.069932] 0 pages in swap cache
[   69.071435] Swap cache stats: add 0, delete 0, find 0/0
[   69.073354] Free swap  = 0kB
[   69.074822] Total swap = 0kB
[   69.076660] 524157 pages RAM
[   69.078113] 0 pages HighMem/MovableOnly
[   69.079930] 76615 pages reserved
[   69.081406] 0 pages hwpoisoned
---------- When there is a lot of data to write ----------
* Re: Silent hang up caused by pages being not scanned?
  2015-10-13 16:19 ` Tetsuo Handa
@ 2015-10-14 13:22   ` Michal Hocko
  1 sibling, 0 replies; 213+ messages in thread
From: Michal Hocko @ 2015-10-14 13:22 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

On Wed 14-10-15 01:19:09, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > I can see two options here. Either we teach zone_reclaimable to be less
> > fragile or remove zone_reclaimable from shrink_zones altogether. Both of
> > them are risky because we have a long history of changes in this areas
> > which made other subtle behavior changes but I guess that the first
> > option should be less fragile. What about the following patch? I am not
> > happy about it because the condition is rather rough and a deeper
> > inspection is really needed to check all the call sites but it should be
> > good for testing.
>
> While zone_reclaimable() for Node 0 DMA32 became false by your patch,
> zone_reclaimable() for Node 0 DMA kept returning true, and as a result
> overall result (i.e. zones_reclaimable) remained true.

Ahh, right you are. ZONE_DMA might have 0 or close to 0 pages on its LRUs
while it is still protected from allocations which are not targeted at
this zone. My patch clearly hadn't considered that. The fix would be
quite straightforward: we have to consider the lowmem_reserve of the zone
wrt. the allocation/reclaim gfp target zone. But this is getting more and
more ugly (see the patch below, just for testing/demonstration purposes).
The OOM report is really interesting:

> [   69.039152] Node 0 DMA32 free:74224kB min:44652kB low:55812kB high:66976kB active_anon:1334792kB inactive_anon:8240kB active_file:48364kB inactive_file:230752kB unevictable:0kB isolated(anon):92kB isolated(file):0kB present:2080640kB managed:1774264kB mlocked:0kB dirty:9328kB writeback:199060kB mapped:38140kB shmem:8472kB slab_reclaimable:17840kB slab_unreclaimable:16292kB kernel_stack:3840kB pagetables:7864kB unstable:0kB bounce:0kB free_pcp:784kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no

so your whole file LRUs are either dirty or under writeback and the
reclaimable pages are below the min wmark. This alone is quite
suspicious. Why hasn't balance_dirty_pages throttled writers and allowed
them to make the whole LRU dirty? What is your
dirty{_background}_{ratio,bytes} configuration on that system? Also, why
hasn't throttle_vm_writeout slowed the reclaim down?

Anyway, this is exactly the case where zone_reclaimable helps us to
prevent OOM, because we are looping over the remaining LRU pages without
making progress... This just shows how subtle all this is :/

I have to think about this much more...
---
>From c54a894490650dd65a98a2a0efa5324ecf3de61d Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Tue, 13 Oct 2015 15:12:13 +0200
Subject: [PATCH] mm, vmscan: Make zone_reclaimable less fragile

zone_reclaimable considers a zone unreclaimable if we have scanned all
the reclaimable pages sufficient times since the last page has been
freed and that still hasn't led to an allocation success. This can,
however, lead to a livelock/thrashing when a single freed page resets
PAGES_SCANNED while memory consumers are looping over small LRUs without
making any progress (e.g. remaining pages on the LRU are dirty and all
the flushers are blocked) and failing to invoke the OOM killer because
zone_reclaimable would consider the zone reclaimable.
Tetsuo Handa has reported the following:

: [   66.821450] zone_reclaimable returned 1 at line 2646
: [   66.823020] (ACTIVE_FILE=26+INACTIVE_FILE=10) * 6 > PAGES_SCANNED=32
: [   66.824935] shrink_zones returned 1 at line 2706
: [   66.826392] zones_reclaimable=1 at line 2765
: [   66.827865] do_try_to_free_pages returned 1 at line 2938
: [   67.102322] __perform_reclaim returned 1 at line 2854
: [   67.103968] did_some_progress=1 at line 3301
: (...snipped...)
: [  281.439977] zone_reclaimable returned 1 at line 2646
: [  281.439977] (ACTIVE_FILE=26+INACTIVE_FILE=10) * 6 > PAGES_SCANNED=32
: [  281.439978] shrink_zones returned 1 at line 2706
: [  281.439978] zones_reclaimable=1 at line 2765
: [  281.439979] do_try_to_free_pages returned 1 at line 2938
: [  281.439979] __perform_reclaim returned 1 at line 2854
: [  281.439980] did_some_progress=1 at line 3301

In his case anon LRUs are not reclaimable because there is no swap
enabled. It is not clear who frees a page that regularly, but it is clear
that no progress can be made, yet zone_reclaimable still considers the
zone reclaimable.

This patch makes sure that we do not follow zone_reclaimable without
prior consideration in the direct reclaim path. Reclaimable LRU lists
have to contain sufficient pages to move us over the min watermark,
otherwise we wouldn't be able to make progress anyway. Please note that
we have to consider the lowmem reserves for each zone because ZONE_DMA is
protected from most allocations and so its LRU list might be too small to
scan enough pages to consider the zone unreclaimable.
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/vmscan.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c88d74ad9304..35a384c5bdab 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2640,9 +2640,22 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 		if (shrink_zone(zone, sc, zone_idx(zone) == classzone_idx))
 			reclaimable = true;
 
-		if (global_reclaim(sc) &&
-				!reclaimable && zone_reclaimable(zone))
-			reclaimable = true;
+		/*
+		 * Consider the current zone reclaimable even if we haven't
+		 * reclaimed anything if there are enough pages on reclaimable
+		 * LRU lists (they might be dirty or under writeback).
+		 */
+		if (global_reclaim(sc) && !reclaimable) {
+			unsigned long reclaimable = zone_reclaimable_pages(zone);
+			unsigned long free = zone_page_state(zone, NR_FREE_PAGES);
+			unsigned long reserve = zone->lowmem_reserve[gfp_zone(sc->gfp_mask)];
+
+			if (reclaimable + free < min_wmark_pages(zone) + reserve)
+				continue;
+
+			if (zone_reclaimable(zone))
+				reclaimable = true;
+		}
 	}
 
 	/*
-- 
2.5.1

-- 
Michal Hocko
SUSE Labs
* Re: Silent hang up caused by pages being not scanned?
  2015-10-14 13:22 ` Michal Hocko
@ 2015-10-14 14:38   ` Tetsuo Handa
  1 sibling, 0 replies; 213+ messages in thread
From: Tetsuo Handa @ 2015-10-14 14:38 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

Michal Hocko wrote:
> The OOM report is really interesting:
>
> > [   69.039152] Node 0 DMA32 free:74224kB min:44652kB low:55812kB high:66976kB active_anon:1334792kB inactive_anon:8240kB active_file:48364kB inactive_file:230752kB unevictable:0kB isolated(anon):92kB isolated(file):0kB present:2080640kB managed:1774264kB mlocked:0kB dirty:9328kB writeback:199060kB mapped:38140kB shmem:8472kB slab_reclaimable:17840kB slab_unreclaimable:16292kB kernel_stack:3840kB pagetables:7864kB unstable:0kB bounce:0kB free_pcp:784kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
>
> so your whole file LRUs are either dirty or under writeback and
> reclaimable pages are below min wmark. This alone is quite suspicious.

I did

  $ cat < /dev/zero > /tmp/log

for 10 seconds before starting

  $ ./a.out

Thus, that much memory was waiting for writeback on the XFS filesystem.

> Why hasn't balance_dirty_pages throttled writers and allowed them to
> make the whole LRU dirty? What is your dirty{_background}_{ratio,bytes}
> configuration on that system.

All values are the defaults of a plain CentOS 7 installation.

# sysctl -a | grep ^vm.
vm.admin_reserve_kbytes = 8192
vm.block_dump = 0
vm.compact_unevictable_allowed = 1
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 30
vm.dirty_writeback_centisecs = 500
vm.dirtytime_expire_seconds = 43200
vm.drop_caches = 0
vm.extfrag_threshold = 500
vm.hugepages_treat_as_movable = 0
vm.hugetlb_shm_group = 0
vm.laptop_mode = 0
vm.legacy_va_layout = 0
vm.lowmem_reserve_ratio = 256 256 32
vm.max_map_count = 65530
vm.memory_failure_early_kill = 0
vm.memory_failure_recovery = 1
vm.min_free_kbytes = 45056
vm.min_slab_ratio = 5
vm.min_unmapped_ratio = 1
vm.mmap_min_addr = 4096
vm.nr_hugepages = 0
vm.nr_hugepages_mempolicy = 0
vm.nr_overcommit_hugepages = 0
vm.nr_pdflush_threads = 0
vm.numa_zonelist_order = default
vm.oom_dump_tasks = 1
vm.oom_kill_allocating_task = 0
vm.overcommit_kbytes = 0
vm.overcommit_memory = 0
vm.overcommit_ratio = 50
vm.page-cluster = 3
vm.panic_on_oom = 0
vm.percpu_pagelist_fraction = 0
vm.stat_interval = 1
vm.swappiness = 30
vm.user_reserve_kbytes = 54808
vm.vfs_cache_pressure = 100
vm.zone_reclaim_mode = 0

> Also why throttle_vm_writeout haven't slown the reclaim down?

Too difficult a question for me.

> Anyway this is exactly the case where zone_reclaimable helps us to
> prevent OOM because we are looping over the remaining LRU pages without
> making progress... This just shows how subtle all this is :/
>
> I have to think about this much more..

I'm suspicious about tweaking the current reclaim logic. Could you please
respond to Linus's comments? There are more moles than kernel developers
can find. I think that what we can do in the short term is to prepare for
moles that kernel developers could not find, and in the long term to
reform the page allocator so as to prevent moles from living.
* Re: Silent hang up caused by pages being not scanned?
  2015-10-14 14:38 ` Tetsuo Handa
@ 2015-10-14 14:59 ` Michal Hocko
  -1 siblings, 0 replies; 213+ messages in thread
From: Michal Hocko @ 2015-10-14 14:59 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
      linux-mm, linux-kernel, skozina

On Wed 14-10-15 23:38:00, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > Why hasn't balance_dirty_pages throttled writers and allowed them to
> > make the whole LRU dirty? What is your dirty{_background}_{ratio,bytes}
> > configuration on that system.
>
> All values are defaults of plain CentOS 7 installation.

So this is 3.10 kernel, right?

> # sysctl -a | grep ^vm.
> vm.dirty_background_ratio = 10
> vm.dirty_bytes = 0
> vm.dirty_expire_centisecs = 3000
> vm.dirty_ratio = 30
[...]

OK, this is nothing unusual. And I _suspect_ that the throttling simply
didn't cope with the writer speed and a large anon memory consumer.
Dirtyable memory was quite high until your anon hammer bumped in and
reduced dirtyable memory, so the file LRU is full of dirty pages when we
get under serious memory pressure. Anonymous pages are not reclaimable
(without swap), so the whole memory pressure goes to file LRUs and bang.

> > Also why throttle_vm_writeout haven't slown the reclaim down?
>
> Too difficult question for me.
>
> > Anyway this is exactly the case where zone_reclaimable helps us to
> > prevent OOM because we are looping over the remaining LRU pages without
> > making progress... This just shows how subtle all this is :/
> >
> > I have to think about this much more..
>
> I'm suspicious about tweaking current reclaim logic.
> Could you please respond to Linus's comments?

Yes, I plan to; I just didn't get to finish my email yet.

> There are more moles than kernel developers can find. I think that
> what we can do for short term is to prepare for moles that kernel
> developers could not find, and for long term is to reform page
> allocator for preventing moles from living.

This is much easier said than done :/ The current code is full of
heuristics grown over time based on very different requirements from
different kernel subsystems. There is no simple solution for this
problem, I am afraid.
--
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: Silent hang up caused by pages being not scanned?
  2015-10-14 14:59 ` Michal Hocko
@ 2015-10-14 15:06 ` Tetsuo Handa
  -1 siblings, 0 replies; 213+ messages in thread
From: Tetsuo Handa @ 2015-10-14 15:06 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
      linux-mm, linux-kernel, skozina

Michal Hocko wrote:
> On Wed 14-10-15 23:38:00, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> [...]
> > > Why hasn't balance_dirty_pages throttled writers and allowed them to
> > > make the whole LRU dirty? What is your dirty{_background}_{ratio,bytes}
> > > configuration on that system.
> >
> > All values are defaults of plain CentOS 7 installation.
>
> So this is 3.10 kernel, right?

The userland is CentOS 7 but the kernel is linux-next-20151009.

^ permalink raw reply [flat|nested] 213+ messages in thread
* Newbie's question: memory allocation when reclaiming memory
  2015-10-12 6:43 ` Tetsuo Handa
@ 2015-10-26 11:44 ` Tetsuo Handa
  -1 siblings, 0 replies; 213+ messages in thread
From: Tetsuo Handa @ 2015-10-26 11:44 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
      linux-mm, linux-kernel, skozina

May I ask a newbie question? Say, there is some amount of memory pages
which can be reclaimed if they are flushed to storage. And lower layer
might issue memory allocation request in a way which won't cause reclaim
deadlock (e.g. using GFP_NOFS or GFP_NOIO) when flushing to storage,
isn't it?

What I'm worrying is a dependency that __GFP_FS allocation requests think
that there are reclaimable pages and therefore there is no need to call
out_of_memory(); and GFP_NOFS allocation requests which the __GFP_FS
allocation requests depend on (in order to flush to storage) is waiting
for GFP_NOIO allocation requests; and the GFP_NOIO allocation requests
which the GFP_NOFS allocation requests depend on (in order to flush to
storage) are waiting for memory pages to be reclaimed without calling
out_of_memory(); because gfp_to_alloc_flags() does not favor GFP_NOIO over
GFP_NOFS nor GFP_NOFS over __GFP_FS which will throttle all allocations
at the same watermark level.

How do we guarantee that GFP_NOFS/GFP_NOIO allocations make forward
progress? What mechanism guarantees that memory pages which __GFP_FS
allocation requests are waiting for are reclaimed? I assume that there
is some mechanism; otherwise we can hit silent livelock, can't we?

^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: Newbie's question: memory allocation when reclaiming memory
  2015-10-26 11:44 ` Tetsuo Handa
@ 2015-11-05 8:46 ` Vlastimil Babka
  -1 siblings, 0 replies; 213+ messages in thread
From: Vlastimil Babka @ 2015-11-05 8:46 UTC (permalink / raw)
  To: Tetsuo Handa, mhocko
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
      linux-mm, linux-kernel, skozina

On 10/26/2015 12:44 PM, Tetsuo Handa wrote:
> May I ask a newbie question? Say, there is some amount of memory pages
> which can be reclaimed if they are flushed to storage. And lower layer
> might issue memory allocation request in a way which won't cause reclaim
> deadlock (e.g. using GFP_NOFS or GFP_NOIO) when flushing to storage,
> isn't it?
>
> What I'm worrying is a dependency that __GFP_FS allocation requests think
> that there are reclaimable pages and therefore there is no need to call
> out_of_memory(); and GFP_NOFS allocation requests which the __GFP_FS
> allocation requests depend on (in order to flush to storage) is waiting
> for GFP_NOIO allocation requests; and the GFP_NOIO allocation requests
> which the GFP_NOFS allocation requests depend on (in order to flush to
> storage) are waiting for memory pages to be reclaimed without calling
> out_of_memory(); because gfp_to_alloc_flags() does not favor GFP_NOIO over
> GFP_NOFS nor GFP_NOFS over __GFP_FS which will throttle all allocations
> at the same watermark level.
>
> How do we guarantee that GFP_NOFS/GFP_NOIO allocations make forward
> progress? What mechanism guarantees that memory pages which __GFP_FS
> allocation requests are waiting for are reclaimed? I assume that there
> is some mechanism; otherwise we can hit silent livelock, can't we?

I've never studied the code myself, but IIRC in all the LSF/MM debates
I've heard it said that GFP_NOIO allocations have mempools that guarantee
forward progress, so when they allocate from this mempool, there should be
nothing else to block the request other than waiting for the actual
hardware to finish the I/O request, and then the memory is returned to the
mempool and another request can use it. So there shouldn't be waiting for
reclaim at that level, breaking the livelock you described?

^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: Can't we use timeout based OOM warning/killing?
  2015-10-03 6:02 ` Tetsuo Handa
@ 2015-10-06 15:25 ` Linus Torvalds
  2015-10-08 15:33 ` Tetsuo Handa
  -1 siblings, 1 reply; 213+ messages in thread
From: Linus Torvalds @ 2015-10-06 15:25 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Christoph Lameter, Linux Kernel Mailing List, Michal Hocko,
      Kyle Walker, Oleg Nesterov, Vladimir Davydov, Stanislav Kozina,
      linux-mm, David Rientjes, Johannes Weiner, Andrew Morton

On Oct 3, 2015 7:02 AM, "Tetsuo Handa" <penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> Kernel developers are not interested in testing OOM cases. I proposed a
> SystemTap-based mandatory memory allocation failure injection for testing
> OOM cases, but there was no response.

I don't know if it's so much "not interested" as just "it's fairly hard to
be realistic and on the same page".

We used to have some simple oom testing that just did tons of allocations
in user space, but then all the actual allocations that go on tend to be
just the normal anonymous pages. Or then it's the same thing with shared
memory (which is harder) or some other case. It's seldom a complex and
varied load with lots of different allocations.

I think it might be interesting to have some VM image case with fairly
limited memory (so you can easily run it on different machines, whether you
have a workstation with 16GB or some big iron with 1TB of ram). And a
reasonable load that does at least a few different cases (ie do not just
some server load, but maybe Xorg and chrome or something).

Because another thing that tends to affect this is that oom without swap is
very different from oom with lots of swap, so different people will see
very different issues. If you have some particular case you want to check,
and could make a VM image for it, maybe that would get more mm people
looking at it and agreeing about the issues.

Would something like that perhaps work? I dunno, but it *might* get more
people on the same page (although maybe then people just start complaining
about the choice of load instead..)

      Linus (on mobile at LinuxCon, so the mailing list will bounce this) Torvalds

^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: Can't we use timeout based OOM warning/killing?
  2015-10-06 15:25 ` Can't we use timeout based OOM warning/killing? Linus Torvalds
@ 2015-10-08 15:33 ` Tetsuo Handa
  0 siblings, 0 replies; 213+ messages in thread
From: Tetsuo Handa @ 2015-10-08 15:33 UTC (permalink / raw)
  To: torvalds
  Cc: cl, linux-kernel, mhocko, kwalker, oleg, vdavydov, skozina,
      linux-mm, rientjes, hannes, akpm

Linus Torvalds wrote:
> Because another thing that tends to affect this is that oom without swap is
> very different from oom with lots of swap, so different people will see
> very different issues. If you have some particular case you want to check,
> and could make a VM image for it, maybe that would get more mm people
> looking at it and agreeing about the issues.

I was working at a support center, troubleshooting RHEL systems. I saw
many trouble cases where customers' servers hung up or rebooted
unexpectedly. In most cases, their servers hung up without OOM killer
messages. (I saw a few cases where OOM killer messages were discovered by
analyzing vmcore.) No messages were recorded to log files such as
/var/log/messages and /var/log/sa/ when their servers hung up. According
to /var/log/sa/ , there was little free memory just before their servers
hung up. I suspected that some memory related problem had happened and
suggested that customers install a serial console or netconsole in case
the kernel was printing some messages, but I don't know whether they were
able to install a serial console or netconsole into their production
systems.

The origin of this OOM livelock discussion was a local OOM-DoS
vulnerability which has existed since Linux 2.0. When I tested this
vulnerability on RHEL 7, I saw strange stalls on XFS. The discussion went
public by developing a reproducer which does not make use of the
vulnerability. We recognized the "too small to fail" memory-allocation
rule. I tested various corner cases using variants of the reproducer.
I realized that we have a race window where a memory allocation can fall
into an infinite loop without OOM killer messages. I made a hypothesis
that customers' servers hit a race where __GFP_FS allocations are blocked
at too_many_isolated() or unkillable locks in direct reclaim paths
whereas !__GFP_FS allocations are retrying forever without calling
out_of_memory(). But even if they install a serial console or netconsole,
we are currently emitting no warning messages.

The timeout based OOM warning corresponds to check_memalloc_delay() in
http://marc.info/?l=linux-kernel&m=143239201905479 .

The timeout based OOM warning is not only for stalls after an OOM victim
was chosen but also for stalls before an OOM victim is chosen. Whether we
should call out_of_memory() upon timeout might depend on hardware / ram /
swap / workload etc. But I think that whether we can have a mechanism for
warning about possible OOM livelock is an independent question. Thus,
I think that making a VM image is not helpful.

^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: Can't we use timeout based OOM warning/killing?
  2015-10-03 6:02 ` Tetsuo Handa
@ 2015-10-10 12:50 ` Tetsuo Handa
  -1 siblings, 0 replies; 213+ messages in thread
From: Tetsuo Handa @ 2015-10-10 12:50 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
      linux-mm, linux-kernel, skozina

Tetsuo Handa wrote:
> Without means to find out what was happening, we will "overlook real bugs"
> before "paper over real bugs". The means are expected to work without
> knowledge to use trace points functionality, are expected to run without
> memory allocation, are expected to dump output without administrator's
> operation, are expected to work before power reset by watchdog timers.

I want to use something like this patch (CONFIG_DEBUG_something is fine).
Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20151010.txt.xz
----------------------------------------
>From 0f749ddbc2bd9ce57ba56787e77595c3f13e9cc3 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sat, 10 Oct 2015 20:48:09 +0900
Subject: [PATCH] Memory allocation watchdog kernel thread.

This patch adds a kernel thread which periodically reports number of
memory allocating tasks, dying tasks and OOM victim tasks. This kernel
thread helps reporting whether we are failing to solve OOM conditions
after OOM killer is invoked, in addition to reporting stalls before OOM
killer is invoked (e.g. all __GFP_FS allocating tasks are blocked by
locks or throttling whereas all !__GFP_FS allocating tasks are unable
to invoke the OOM killer).

$ grep MemAlloc serial.txt | grep -A 5 MemAlloc-Info:
[ 101.937548] MemAlloc-Info: 4 stalling task, 32 dying task, 1 victim task.
[ 101.939460] MemAlloc: sync4(10598) gfp=0x24280ca order=0 delay=17338
[ 101.975433] MemAlloc: sync4(10602) gfp=0x24280ca order=0 delay=17115
[ 102.015519] MemAlloc: sync4(10599) gfp=0x24280ca order=0 delay=17097
[ 102.053884] MemAlloc: sync4(10607) gfp=0x24280ca order=0 delay=15970
[ 112.094349] MemAlloc-Info: 176 stalling task, 32 dying task, 1 victim task.
[ 112.098411] MemAlloc: sync4(10598) gfp=0x24280ca order=0 delay=27494
[ 112.138381] MemAlloc: sync4(10602) gfp=0x24280ca order=0 delay=27271
[ 112.178710] MemAlloc: sync4(10599) gfp=0x24280ca order=0 delay=27253
[ 112.218674] MemAlloc: sync4(10607) gfp=0x24280ca order=0 delay=26126
[ 112.257749] MemAlloc: sync4(10608) gfp=0x24280ca order=0 delay=14083
--
[ 128.952137] MemAlloc-Info: 176 stalling task, 32 dying task, 1 victim task.
[ 128.954056] MemAlloc: sync4(10598) gfp=0x24280ca order=0 delay=44352
[ 128.992231] MemAlloc: sync4(10602) gfp=0x24280ca order=0 delay=44129
[ 129.034180] MemAlloc: sync4(10599) gfp=0x24280ca order=0 delay=44111
[ 129.071755] MemAlloc: sync4(10607) gfp=0x24280ca order=0 delay=42984
[ 129.109851] MemAlloc: sync4(10608) gfp=0x24280ca order=0 delay=30941
--
[ 145.683171] MemAlloc-Info: 175 stalling task, 32 dying task, 1 victim task.
[ 145.685344] MemAlloc: sync4(10598) gfp=0x24280ca order=0 delay=61084
[ 145.736475] MemAlloc: sync4(10599) gfp=0x24280ca order=0 delay=60843
[ 145.778084] MemAlloc: sync4(10607) gfp=0x24280ca order=0 delay=59716
[ 145.815363] MemAlloc: sync4(10608) gfp=0x24280ca order=0 delay=47673
[ 145.853610] MemAlloc: sync4(10601) gfp=0x24280ca order=0 delay=47673
--
[ 158.030038] MemAlloc-Info: 178 stalling task, 32 dying task, 1 victim task.
[ 158.031945] MemAlloc: sync4(10598) gfp=0x24280ca order=0 delay=73430
[ 158.071066] MemAlloc: sync4(10599) gfp=0x24280ca order=0 delay=73189
[ 158.108835] MemAlloc: sync4(10607) gfp=0x24280ca order=0 delay=72062
[ 158.146500] MemAlloc: sync4(10608) gfp=0x24280ca order=0 delay=60019
[ 158.184146] MemAlloc: sync4(10601) gfp=0x24280ca order=0 delay=60019
--
[ 174.851184] MemAlloc-Info: 178 stalling task, 32 dying task, 1 victim task.
[ 174.853106] MemAlloc: sync4(10598) gfp=0x24280ca order=0 delay=90252
[ 174.896592] MemAlloc: sync4(10599) gfp=0x24280ca order=0 delay=90011
[ 174.935838] MemAlloc: sync4(10607) gfp=0x24280ca order=0 delay=88884
[ 174.978799] MemAlloc: sync4(10608) gfp=0x24280ca order=0 delay=76841
[ 175.022003] MemAlloc: sync4(10601) gfp=0x24280ca order=0 delay=76841
--

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/page_alloc.c | 145 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 145 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0d6f540..0473eec 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2972,6 +2972,147 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
 	return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == GFP_TRANSHUGE;
 }
 
+#if 1
+
+static u8 memalloc_counter_active_index; /* Either 0 or 1. */
+static int memalloc_counter[2]; /* Number of tasks doing memory allocation. */
+
+struct memalloc {
+	struct list_head list; /* Connected to memalloc_list. */
+	struct task_struct *task; /* Initialized to current. */
+	unsigned long start; /* Initialized to jiffies. */
+	unsigned int order;
+	gfp_t gfp;
+	u8 index; /* Initialized to memalloc_counter_active_index. */
+};
+
+static LIST_HEAD(memalloc_list); /* List of "struct memalloc". */
+static DEFINE_SPINLOCK(memalloc_list_lock); /* Lock for memalloc_list. */
+
+/*
+ * malloc_watchdog - A kernel thread for monitoring memory allocation stalls.
+ *
+ * @unused: Not used.
+ *
+ * This kernel thread does not terminate.
+ */
+static int malloc_watchdog(void *unused)
+{
+	static const unsigned long timeout = 10 * HZ;
+	struct memalloc *m;
+	struct task_struct *g, *p;
+	unsigned long now;
+	unsigned long spent;
+	unsigned int sigkill_pending;
+	unsigned int memdie_pending;
+	unsigned int stalling_tasks;
+	u8 index;
+
+ not_stalling: /* Healthy case. */
+	/*
+	 * Switch active counter and wait for timeout duration.
+	 * This is a kind of open coded implementation of synchronize_srcu()
+	 * because synchronize_srcu_timeout() is missing.
+	 */
+	spin_lock(&memalloc_list_lock);
+	index = memalloc_counter_active_index;
+	memalloc_counter_active_index ^= 1;
+	spin_unlock(&memalloc_list_lock);
+	schedule_timeout_interruptible(timeout);
+	/*
+	 * If memory allocations are working, the counter should remain 0
+	 * because tasks will be able to call both start_memalloc_timer()
+	 * and stop_memalloc_timer() within timeout duration.
+	 */
+	if (likely(!memalloc_counter[index]))
+		goto not_stalling;
+ maybe_stalling: /* Maybe something is wrong. Let's check. */
+	/* First, report whether there are SIGKILL tasks and/or OOM victims. */
+	sigkill_pending = 0;
+	memdie_pending = 0;
+	stalling_tasks = 0;
+	preempt_disable();
+	rcu_read_lock();
+	for_each_process_thread(g, p) {
+		if (test_tsk_thread_flag(p, TIF_MEMDIE))
+			memdie_pending++;
+		if (fatal_signal_pending(p))
+			sigkill_pending++;
+	}
+	rcu_read_unlock();
+	preempt_enable();
+	spin_lock(&memalloc_list_lock);
+	now = jiffies;
+	list_for_each_entry(m, &memalloc_list, list) {
+		spent = now - m->start;
+		if (time_before(spent, timeout))
+			continue;
+		stalling_tasks++;
+	}
+	pr_warn("MemAlloc-Info: %u stalling task, %u dying task, %u victim task.\n",
+		stalling_tasks, sigkill_pending, memdie_pending);
+	/* Next, report tasks stalled at memory allocation. */
+	list_for_each_entry(m, &memalloc_list, list) {
+		spent = now - m->start;
+		if (time_before(spent, timeout))
+			continue;
+		p = m->task;
+		pr_warn("MemAlloc%s: %s(%u) gfp=0x%x order=%u delay=%lu\n",
+			test_tsk_thread_flag(p, TIF_MEMDIE) ? "-victim" :
+			(fatal_signal_pending(p) ? "-dying" : ""),
+			p->comm, p->pid, m->gfp, m->order, spent);
+		show_stack(p, NULL);
+	}
+	spin_unlock(&memalloc_list_lock);
+	/* Wait until next timeout duration. */
+	schedule_timeout_interruptible(timeout);
+	if (memalloc_counter[index])
+		goto maybe_stalling;
+	goto not_stalling;
+	return 0;
+}
+
+static int __init start_malloc_watchdog(void)
+{
+	struct task_struct *task = kthread_run(malloc_watchdog, NULL,
+					       "MallocWatchdog");
+	BUG_ON(IS_ERR(task));
+	return 0;
+}
+late_initcall(start_malloc_watchdog);
+
+#define DEFINE_MEMALLOC_TIMER(m) struct memalloc m = { .task = NULL }
+
+static void start_memalloc_timer(struct memalloc *m, gfp_t gfp_mask, int order)
+{
+	if (m->task)
+		return;
+	m->task = current;
+	m->start = jiffies;
+	m->gfp = gfp_mask;
+	m->order = order;
+	spin_lock(&memalloc_list_lock);
+	m->index = memalloc_counter_active_index;
+	memalloc_counter[m->index]++;
+	list_add_tail(&m->list, &memalloc_list);
+	spin_unlock(&memalloc_list_lock);
+}
+
+static void stop_memalloc_timer(struct memalloc *m)
+{
+	if (!m->task)
+		return;
+	spin_lock(&memalloc_list_lock);
+	memalloc_counter[m->index]--;
+	list_del(&m->list);
+	spin_unlock(&memalloc_list_lock);
+}
+#else
+#define DEFINE_MEMALLOC_TIMER(m)
+#define start_memalloc_timer(m, gfp_mask, order)
+#define stop_memalloc_timer(m)
+#endif
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 						struct alloc_context *ac)
@@ -2984,6 +3125,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
 	bool deferred_compaction = false;
 	int contended_compaction = COMPACT_CONTENDED_NONE;
+	DEFINE_MEMALLOC_TIMER(m);
 
 	/*
 	 * In the slowpath, we sanity check order
to avoid ever trying to @@ -3075,6 +3217,8 @@ retry: if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL)) goto nopage; + start_memalloc_timer(&m, gfp_mask, order); + /* * Try direct compaction. The first pass is asynchronous. Subsequent * attempts after direct reclaim are synchronous @@ -3168,6 +3312,7 @@ noretry: nopage: warn_alloc_failed(gfp_mask, order, NULL); got_pg: + stop_memalloc_timer(&m); return page; } -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 213+ messages in thread
* Re: Can't we use timeout based OOM warning/killing?
@ 2015-10-10 12:50 ` Tetsuo Handa
  0 siblings, 0 replies; 213+ messages in thread
From: Tetsuo Handa @ 2015-10-10 12:50 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
      linux-mm, linux-kernel, skozina

Tetsuo Handa wrote:
> Without means to find out what was happening, we will "overlook real bugs"
> before "paper over real bugs". The means are expected to work without
> knowledge to use trace points functionality, are expected to run without
> memory allocation, are expected to dump output without administrator's
> operation, are expected to work before power reset by watchdog timers.

I want to use something like this patch (CONFIG_DEBUG_something is fine).
Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20151010.txt.xz
----------------------------------------
>From 0f749ddbc2bd9ce57ba56787e77595c3f13e9cc3 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sat, 10 Oct 2015 20:48:09 +0900
Subject: [PATCH] Memory allocation watchdog kernel thread.

This patch adds a kernel thread which periodically reports the number of
memory allocating tasks, dying tasks and OOM victim tasks. This kernel
thread helps report whether we are failing to solve OOM conditions after
the OOM killer is invoked, in addition to reporting stalls before the OOM
killer is invoked (e.g. all __GFP_FS allocating tasks are blocked by locks
or throttling whereas all !__GFP_FS allocating tasks are unable to invoke
the OOM killer).

$ grep MemAlloc serial.txt | grep -A 5 MemAlloc-Info:
[  101.937548] MemAlloc-Info: 4 stalling task, 32 dying task, 1 victim task.
[  101.939460] MemAlloc: sync4(10598) gfp=0x24280ca order=0 delay=17338
[  101.975433] MemAlloc: sync4(10602) gfp=0x24280ca order=0 delay=17115
[  102.015519] MemAlloc: sync4(10599) gfp=0x24280ca order=0 delay=17097
[  102.053884] MemAlloc: sync4(10607) gfp=0x24280ca order=0 delay=15970
[  112.094349] MemAlloc-Info: 176 stalling task, 32 dying task, 1 victim task.
[  112.098411] MemAlloc: sync4(10598) gfp=0x24280ca order=0 delay=27494
[  112.138381] MemAlloc: sync4(10602) gfp=0x24280ca order=0 delay=27271
[  112.178710] MemAlloc: sync4(10599) gfp=0x24280ca order=0 delay=27253
[  112.218674] MemAlloc: sync4(10607) gfp=0x24280ca order=0 delay=26126
[  112.257749] MemAlloc: sync4(10608) gfp=0x24280ca order=0 delay=14083
--
[  128.952137] MemAlloc-Info: 176 stalling task, 32 dying task, 1 victim task.
[  128.954056] MemAlloc: sync4(10598) gfp=0x24280ca order=0 delay=44352
[  128.992231] MemAlloc: sync4(10602) gfp=0x24280ca order=0 delay=44129
[  129.034180] MemAlloc: sync4(10599) gfp=0x24280ca order=0 delay=44111
[  129.071755] MemAlloc: sync4(10607) gfp=0x24280ca order=0 delay=42984
[  129.109851] MemAlloc: sync4(10608) gfp=0x24280ca order=0 delay=30941
--
[  145.683171] MemAlloc-Info: 175 stalling task, 32 dying task, 1 victim task.
[  145.685344] MemAlloc: sync4(10598) gfp=0x24280ca order=0 delay=61084
[  145.736475] MemAlloc: sync4(10599) gfp=0x24280ca order=0 delay=60843
[  145.778084] MemAlloc: sync4(10607) gfp=0x24280ca order=0 delay=59716
[  145.815363] MemAlloc: sync4(10608) gfp=0x24280ca order=0 delay=47673
[  145.853610] MemAlloc: sync4(10601) gfp=0x24280ca order=0 delay=47673
--
[  158.030038] MemAlloc-Info: 178 stalling task, 32 dying task, 1 victim task.
[  158.031945] MemAlloc: sync4(10598) gfp=0x24280ca order=0 delay=73430
[  158.071066] MemAlloc: sync4(10599) gfp=0x24280ca order=0 delay=73189
[  158.108835] MemAlloc: sync4(10607) gfp=0x24280ca order=0 delay=72062
[  158.146500] MemAlloc: sync4(10608) gfp=0x24280ca order=0 delay=60019
[  158.184146] MemAlloc: sync4(10601) gfp=0x24280ca order=0 delay=60019
--
[  174.851184] MemAlloc-Info: 178 stalling task, 32 dying task, 1 victim task.
[  174.853106] MemAlloc: sync4(10598) gfp=0x24280ca order=0 delay=90252
[  174.896592] MemAlloc: sync4(10599) gfp=0x24280ca order=0 delay=90011
[  174.935838] MemAlloc: sync4(10607) gfp=0x24280ca order=0 delay=88884
[  174.978799] MemAlloc: sync4(10608) gfp=0x24280ca order=0 delay=76841
[  175.022003] MemAlloc: sync4(10601) gfp=0x24280ca order=0 delay=76841
--

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/page_alloc.c | 145 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 145 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0d6f540..0473eec 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2972,6 +2972,147 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
 	return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == GFP_TRANSHUGE;
 }
 
+#if 1
+
+static u8 memalloc_counter_active_index; /* Either 0 or 1. */
+static int memalloc_counter[2]; /* Number of tasks doing memory allocation. */
+
+struct memalloc {
+	struct list_head list; /* Connected to memalloc_list. */
+	struct task_struct *task; /* Initialized to current. */
+	unsigned long start; /* Initialized to jiffies. */
+	unsigned int order;
+	gfp_t gfp;
+	u8 index; /* Initialized to memalloc_counter_active_index. */
+};
+
+static LIST_HEAD(memalloc_list); /* List of "struct memalloc". */
+static DEFINE_SPINLOCK(memalloc_list_lock); /* Lock for memalloc_list. */
+
+/*
+ * malloc_watchdog - A kernel thread for monitoring memory allocation stalls.
+ *
+ * @unused: Not used.
+ *
+ * This kernel thread does not terminate.
+ */
+static int malloc_watchdog(void *unused)
+{
+	static const unsigned long timeout = 10 * HZ;
+	struct memalloc *m;
+	struct task_struct *g, *p;
+	unsigned long now;
+	unsigned long spent;
+	unsigned int sigkill_pending;
+	unsigned int memdie_pending;
+	unsigned int stalling_tasks;
+	u8 index;
+
+ not_stalling: /* Healthy case. */
+	/*
+	 * Switch active counter and wait for timeout duration.
+	 * This is a kind of open coded implementation of synchronize_srcu()
+	 * because synchronize_srcu_timeout() is missing.
+	 */
+	spin_lock(&memalloc_list_lock);
+	index = memalloc_counter_active_index;
+	memalloc_counter_active_index ^= 1;
+	spin_unlock(&memalloc_list_lock);
+	schedule_timeout_interruptible(timeout);
+	/*
+	 * If memory allocations are working, the counter should remain 0
+	 * because tasks will be able to call both start_memalloc_timer()
+	 * and stop_memalloc_timer() within timeout duration.
+	 */
+	if (likely(!memalloc_counter[index]))
+		goto not_stalling;
+ maybe_stalling: /* Maybe something is wrong. Let's check. */
+	/* First, report whether there are SIGKILL tasks and/or OOM victims. */
+	sigkill_pending = 0;
+	memdie_pending = 0;
+	stalling_tasks = 0;
+	preempt_disable();
+	rcu_read_lock();
+	for_each_process_thread(g, p) {
+		if (test_tsk_thread_flag(p, TIF_MEMDIE))
+			memdie_pending++;
+		if (fatal_signal_pending(p))
+			sigkill_pending++;
+	}
+	rcu_read_unlock();
+	preempt_enable();
+	spin_lock(&memalloc_list_lock);
+	now = jiffies;
+	list_for_each_entry(m, &memalloc_list, list) {
+		spent = now - m->start;
+		if (time_before(spent, timeout))
+			continue;
+		stalling_tasks++;
+	}
+	pr_warn("MemAlloc-Info: %u stalling task, %u dying task, %u victim task.\n",
+		stalling_tasks, sigkill_pending, memdie_pending);
+	/* Next, report tasks stalled at memory allocation. */
+	list_for_each_entry(m, &memalloc_list, list) {
+		spent = now - m->start;
+		if (time_before(spent, timeout))
+			continue;
+		p = m->task;
+		pr_warn("MemAlloc%s: %s(%u) gfp=0x%x order=%u delay=%lu\n",
+			test_tsk_thread_flag(p, TIF_MEMDIE) ? "-victim" :
+			(fatal_signal_pending(p) ? "-dying" : ""),
+			p->comm, p->pid, m->gfp, m->order, spent);
+		show_stack(p, NULL);
+	}
+	spin_unlock(&memalloc_list_lock);
+	/* Wait until next timeout duration. */
+	schedule_timeout_interruptible(timeout);
+	if (memalloc_counter[index])
+		goto maybe_stalling;
+	goto not_stalling;
+	return 0;
+}
+
+static int __init start_malloc_watchdog(void)
+{
+	struct task_struct *task = kthread_run(malloc_watchdog, NULL,
+					       "MallocWatchdog");
+	BUG_ON(IS_ERR(task));
+	return 0;
+}
+late_initcall(start_malloc_watchdog);
+
+#define DEFINE_MEMALLOC_TIMER(m) struct memalloc m = { .task = NULL }
+
+static void start_memalloc_timer(struct memalloc *m, gfp_t gfp_mask, int order)
+{
+	if (m->task)
+		return;
+	m->task = current;
+	m->start = jiffies;
+	m->gfp = gfp_mask;
+	m->order = order;
+	spin_lock(&memalloc_list_lock);
+	m->index = memalloc_counter_active_index;
+	memalloc_counter[m->index]++;
+	list_add_tail(&m->list, &memalloc_list);
+	spin_unlock(&memalloc_list_lock);
+}
+
+static void stop_memalloc_timer(struct memalloc *m)
+{
+	if (!m->task)
+		return;
+	spin_lock(&memalloc_list_lock);
+	memalloc_counter[m->index]--;
+	list_del(&m->list);
+	spin_unlock(&memalloc_list_lock);
+}
+#else
+#define DEFINE_MEMALLOC_TIMER(m)
+#define start_memalloc_timer(m, gfp_mask, order)
+#define stop_memalloc_timer(m)
+#endif
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 						struct alloc_context *ac)
@@ -2984,6 +3125,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
 	bool deferred_compaction = false;
 	int contended_compaction = COMPACT_CONTENDED_NONE;
+	DEFINE_MEMALLOC_TIMER(m);
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3075,6 +3217,8 @@ retry:
 	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
 		goto nopage;
 
+	start_memalloc_timer(&m, gfp_mask, order);
+
 	/*
 	 * Try direct compaction. The first pass is asynchronous. Subsequent
 	 * attempts after direct reclaim are synchronous
@@ -3168,6 +3312,7 @@ noretry:
 nopage:
 	warn_alloc_failed(gfp_mask, order, NULL);
 got_pg:
+	stop_memalloc_timer(&m);
 	return page;
 }
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory? 2015-09-25 9:35 ` Michal Hocko @ 2015-09-28 22:24 ` David Rientjes -1 siblings, 0 replies; 213+ messages in thread From: David Rientjes @ 2015-09-28 22:24 UTC (permalink / raw) To: Michal Hocko Cc: Oleg Nesterov, Linus Torvalds, Kyle Walker, Christoph Lameter, Andrew Morton, Johannes Weiner, Vladimir Davydov, linux-mm, Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa On Fri, 25 Sep 2015, Michal Hocko wrote: > > > I am still not sure how you want to implement that kernel thread but I > > > am quite skeptical it would be very much useful because all the current > > > allocations which end up in the OOM killer path cannot simply back off > > > and drop the locks with the current allocator semantic. So they will > > > be sitting on top of unknown pile of locks whether you do an additional > > > reclaim (unmap the anon memory) in the direct OOM context or looping > > > in the allocator and waiting for kthread/workqueue to do its work. The > > > only argument that I can see is the stack usage but I haven't seen stack > > > overflows in the OOM path AFAIR. > > > > > > > Which locks are you specifically interested in? > > Any locks they were holding before they entered the page allocator (e.g. > i_mutex is the easiest one to trigger from the userspace but mmap_sem > might be involved as well because we are doing kmalloc(GFP_KERNEL) with > mmap_sem held for write). Those would be locked until the page allocator > returns, which with the current semantic might be _never_. > I agree that i_mutex seems to be one of the most common offenders. However, I'm not sure I understand why holding it while trying to allocate infinitely for an order-0 allocation is problematic wrt the proposed kthread. The kthread itself need only take mmap_sem for read. If all threads sharing the mm with a victim have been SIGKILL'd, they should get TIF_MEMDIE set when reclaim fails and be able to allocate so that they can drop mmap_sem. 
We must ensure that any holder of mmap_sem cannot quickly deplete memory reserves without properly checking for fatal_signal_pending(). > > We have already discussed > > the usefulness of killing all threads on the system sharing the same ->mm, > > meaning all threads that are either holding or want to hold mm->mmap_sem > > will be able to allocate into memory reserves. Any allocator holding > > down_write(&mm->mmap_sem) should be able to allocate and drop its lock. > > (Are you concerned about MAP_POPULATE?) > > I am not sure I understand. We would have to fail the request in order > the context which requested the memory could drop the lock. Are we > talking about the same thing here? > Not fail the request, they should be able to allocate from memory reserves when TIF_MEMDIE gets set. This would require that threads is all gfp contexts are able to get TIF_MEMDIE set without an explicit call to out_of_memory() for !__GFP_FS. > > Heh, it's actually imperative to avoid livelocking based on mm->mmap_sem, > > it's the reason the code exists. Any optimizations to that is certainly > > welcome, but we definitely need to send SIGKILL to all threads sharing the > > mm to make forward progress, otherwise we are going back to pre-2008 > > livelocks. > > Yes but mm is not shared between processes most of the time. CLONE_VM > without CLONE_THREAD is more a corner case yet we have to crawl all the > task_structs for _each_ OOM killer invocation. Yes this is an extreme > slow path but still might take quite some unnecessarily time. > It must solve the issue you describe, killing other processes that share the ->mm, otherwise we have mm->mmap_sem livelock. We are not concerned about iterating over all task_structs in the oom killer as a painpoint, such users should already be using oom_kill_allocating_task which is why it was introduced. ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory?
  2015-09-28 22:24 ` David Rientjes
@ 2015-09-29  7:57 ` Tetsuo Handa
  -1 siblings, 0 replies; 213+ messages in thread
From: Tetsuo Handa @ 2015-09-29 7:57 UTC (permalink / raw)
  To: rientjes, mhocko
  Cc: oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov, linux-mm,
      linux-kernel, skozina

David Rientjes wrote:
> On Fri, 25 Sep 2015, Michal Hocko wrote:
> > > > I am still not sure how you want to implement that kernel thread but I
> > > > am quite skeptical it would be very much useful because all the current
> > > > allocations which end up in the OOM killer path cannot simply back off
> > > > and drop the locks with the current allocator semantic. So they will
> > > > be sitting on top of unknown pile of locks whether you do an additional
> > > > reclaim (unmap the anon memory) in the direct OOM context or looping
> > > > in the allocator and waiting for kthread/workqueue to do its work. The
> > > > only argument that I can see is the stack usage but I haven't seen stack
> > > > overflows in the OOM path AFAIR.
> > > >
> > >
> > > Which locks are you specifically interested in?
> >
> > Any locks they were holding before they entered the page allocator (e.g.
> > i_mutex is the easiest one to trigger from the userspace but mmap_sem
> > might be involved as well because we are doing kmalloc(GFP_KERNEL) with
> > mmap_sem held for write). Those would be locked until the page allocator
> > returns, which with the current semantic might be _never_.
> >
>
> I agree that i_mutex seems to be one of the most common offenders.
> However, I'm not sure I understand why holding it while trying to allocate
> infinitely for an order-0 allocation is problematic wrt the proposed
> kthread. The kthread itself need only take mmap_sem for read. If all
> threads sharing the mm with a victim have been SIGKILL'd, they should get
> TIF_MEMDIE set when reclaim fails and be able to allocate so that they can
> drop mmap_sem. We must ensure that any holder of mmap_sem cannot quickly
> deplete memory reserves without properly checking for
> fatal_signal_pending().

Is the story really that simple? I think there are factors which disturb
memory allocation with mmap_sem held for writing.

  down_write(&mm->mmap_sem);
  kmalloc(GFP_KERNEL);
  up_write(&mm->mmap_sem);

can involve locks inside __alloc_pages_slowpath().

Say, there are three userspace tasks named P1, P2T1, P2T2 and one kernel
thread named KT1. Only P2T1 and P2T2 share the same mm. KT1 is a kernel
thread for fs writeback (maybe kswapd?). I think the sequence shown below
is possible.

(1) P1 enters into kernel mode via write() syscall.

(2) P1 allocates memory for buffered write.

(3) P2T1 enters into kernel mode and calls kmalloc().

(4) P2T1 arrives at __alloc_pages_may_oom() because there was no
    reclaimable memory. (Memory allocated by P1 is not reclaimable
    as of this moment.)

(5) P1 dirties memory allocated for buffered write.

(6) P2T2 enters into kernel mode and calls kmalloc() with
    mmap_sem held for writing.

(7) KT1 finds dirtied memory.

(8) KT1 holds fs's unkillable lock for fs writeback.

(9) P2T2 is blocked at unkillable lock for fs writeback held by KT1.

(10) P2T1 calls out_of_memory() and the OOM killer chooses P2T1 and sets
     TIF_MEMDIE on both P2T1 and P2T2.

(11) P2T2 got TIF_MEMDIE but is blocked at unkillable lock for fs writeback
     held by KT1.

(12) KT1 is trying to allocate memory for fs writeback. But since P2T1 and
     P2T2 cannot release memory because memory unmapping code cannot hold
     mmap_sem for reading, KT1 waits forever.... OOM livelock completed!

I think the sequence shown below is also possible. Say, there are three
userspace tasks named P1, P2, P3 and one kernel thread named KT1.

(1) P1 enters into kernel mode via write() syscall.

(2) P1 allocates memory for buffered write.

(3) P2 enters into kernel mode and holds mmap_sem for writing.

(4) P3 enters into kernel mode and calls kmalloc().

(5) P3 arrives at __alloc_pages_may_oom() because there was no
    reclaimable memory. (Memory allocated by P1 is not reclaimable
    as of this moment.)

(6) P1 dirties memory allocated for buffered write.

(7) KT1 finds dirtied memory.

(8) KT1 holds fs's unkillable lock for fs writeback.

(9) P2 calls kmalloc() and is blocked at unkillable lock for fs writeback
    held by KT1.

(10) P3 calls out_of_memory() and the OOM killer chooses P2 and sets
     TIF_MEMDIE on P2.

(11) P2 got TIF_MEMDIE but is blocked at unkillable lock for fs writeback
     held by KT1.

(12) KT1 is trying to allocate memory for fs writeback. But since P2 cannot
     release memory because memory unmapping code cannot hold mmap_sem for
     reading, KT1 waits forever.... OOM livelock completed!

So, allowing all OOM victim threads to use memory reserves does not
guarantee that a thread which holds mmap_sem for writing can make forward
progress.

^ permalink raw reply	[flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory?
  2015-09-29  7:57 ` Tetsuo Handa
@ 2015-09-29 22:56 ` David Rientjes
  -1 siblings, 0 replies; 213+ messages in thread
From: David Rientjes @ 2015-09-29 22:56 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
      linux-mm, linux-kernel, skozina

On Tue, 29 Sep 2015, Tetsuo Handa wrote:

> Is the story such simple? I think there are factors which disturb memory
> allocation with mmap_sem held for writing.
>
> down_write(&mm->mmap_sem);
> kmalloc(GFP_KERNEL);
> up_write(&mm->mmap_sem);
>
> can involve locks inside __alloc_pages_slowpath().
>
> Say, there are three userspace tasks named P1, P2T1, P2T2 and
> one kernel thread named KT1. Only P2T1 and P2T2 shares the same mm.
> KT1 is a kernel thread for fs writeback (maybe kswapd?).
> I think sequence shown below is possible.
>
> (1) P1 enters into kernel mode via write() syscall.
>
> (2) P1 allocates memory for buffered write.
>
> (3) P2T1 enters into kernel mode and calls kmalloc().
>
> (4) P2T1 arrives at __alloc_pages_may_oom() because there was no
> reclaimable memory. (Memory allocated by P1 is not reclaimable
> as of this moment.)
>
> (5) P1 dirties memory allocated for buffered write.
>
> (6) P2T2 enters into kernel mode and calls kmalloc() with
> mmap_sem held for writing.
>
> (7) KT1 finds dirtied memory.
>
> (8) KT1 holds fs's unkillable lock for fs writeback.
>
> (9) P2T2 is blocked at unkillable lock for fs writeback held by KT1.
>
> (10) P2T1 calls out_of_memory() and the OOM killer chooses P2T1 and sets
> TIF_MEMDIE on both P2T1 and P2T2.
>
> (11) P2T2 got TIF_MEMDIE but is blocked at unkillable lock for fs writeback
> held by KT1.
>
> (12) KT1 is trying to allocate memory for fs writeback. But since P2T1 and
> P2T2 cannot release memory because memory unmapping code cannot hold
> mmap_sem for reading, KT1 waits forever.... OOM livelock completed!
>
> I think sequence shown below is also possible. Say, there are three
> userspace tasks named P1, P2, P3 and one kernel thread named KT1.
>
> (1) P1 enters into kernel mode via write() syscall.
>
> (2) P1 allocates memory for buffered write.
>
> (3) P2 enters into kernel mode and holds mmap_sem for writing.
>
> (4) P3 enters into kernel mode and calls kmalloc().
>
> (5) P3 arrives at __alloc_pages_may_oom() because there was no
> reclaimable memory. (Memory allocated by P1 is not reclaimable
> as of this moment.)
>
> (6) P1 dirties memory allocated for buffered write.
>
> (7) KT1 finds dirtied memory.
>
> (8) KT1 holds fs's unkillable lock for fs writeback.
>
> (9) P2 calls kmalloc() and is blocked at unkillable lock for fs writeback
> held by KT1.
>
> (10) P3 calls out_of_memory() and the OOM killer chooses P2 and sets
> TIF_MEMDIE on P2.
>
> (11) P2 got TIF_MEMDIE but is blocked at unkillable lock for fs writeback
> held by KT1.
>
> (12) KT1 is trying to allocate memory for fs writeback. But since P2 cannot
> release memory because memory unmapping code cannot hold mmap_sem for
> reading, KT1 waits forever.... OOM livelock completed!
>
> So, allowing all OOM victim threads to use memory reserves does not guarantee
> that a thread which held mmap_sem for writing to make forward progress.
>

Thank you for writing this all out, it definitely helps to understand the
concerns.

This, in my understanding, is the same scenario that requires not only oom
victims to be able to access memory reserves, but also any thread after an
oom victim has failed to make a timely exit. I point out mm->mmap_sem as a
special case because we have had fixes in the past, such as the special
fatal_signal_pending() handling in __get_user_pages(), that try to ensure
forward progress since we know that we need exclusive mm->mmap_sem for the
victim to make an exit.

I think both of your illustrations show why it is not helpful to kill
additional processes after a time period has elapsed and a victim has
failed to exit. In both of your scenarios, it would require that KT1 be
killed to allow forward progress and we know that's not possible.

Perhaps this is an argument that we need to provide access to memory
reserves for threads even for !__GFP_WAIT and !__GFP_FS in such scenarios,
but I would wait to make that extension until we see it in practice.
Killing all mm->mmap_sem threads certainly isn't meant to solve all oom
killer livelocks, as you show.

^ permalink raw reply	[flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory? 2015-09-29 22:56 ` David Rientjes @ 2015-09-30 4:25 ` Tetsuo Handa -1 siblings, 0 replies; 213+ messages in thread From: Tetsuo Handa @ 2015-09-30 4:25 UTC (permalink / raw) To: rientjes Cc: mhocko, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov, linux-mm, linux-kernel, skozina David Rientjes wrote: > I think both of your illustrations show why it is not helpful to kill > additional processes after a time period has elapsed and a victim has > failed to exit. In both of your scenarios, it would require that KT1 be > killed to allow forward progress and we know that's not possible. My illustrations show why it is helpful to kill additional processes after a time period has elapsed and a victim has failed to exit. We don't need to kill KT1 if we combine the memory unmapping approach and the timeout based OOM killing approach. Simply choosing more OOM victims (processes which do not share other OOM victims' mm) based on timeout itself does not guarantee that other OOM victims can exit. But if timeout based OOM killing is used together with the memory unmapping approach, the possibility that OOM victims can exit significantly increases, because the only case where the memory unmapping approach gets stuck will be when mm->mmap_sem was held for writing (which should be unlikely to occur). If we choose only 1 OOM victim, the possibility of hitting this memory unmapping livelock is (say) 1%. But if we choose multiple OOM victims, the possibility becomes (almost) 0%. And if we still hit this livelock even after choosing many OOM victims, it is time to call panic(). (Well, do we need to change __alloc_pages_slowpath() so that OOM victims do not enter direct reclaim paths in order to avoid being blocked by unkillable fs locks?) > > Perhaps this is an argument that we need to provide access to memory > reserves for threads even for !__GFP_WAIT and !__GFP_FS in such scenarios, > but I would wait to make that extension until we see it in practice. 
I think that GFP_ATOMIC allocations already access memory reserves via ALLOC_HIGH priority. > > Killing all mm->mmap_sem threads certainly isn't meant to solve all oom > killer livelocks, as you show. > Good. I'm not denying memory unmapping approach. I'm just pointing out that use of memory unmapping approach alone still leaves room for hang up. ^ permalink raw reply [flat|nested] 213+ messages in thread
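[Editorial note: the probability argument in the message above can be made concrete. Assuming, as a deliberately rough model, that each chosen victim independently fails to exit with probability p, the chance that every one of n victims stays stuck is p**n; the 1% figure is Tetsuo's illustrative number, not a measured rate:]

```python
# Rough independence model: p = chance that one victim stays stuck forever,
# n = number of victims killed; all n must be stuck for the livelock to hold.
def all_stuck_probability(p, n):
    return p ** n

p = 0.01  # Tetsuo's illustrative "1%" per victim
for n in (1, 2, 4):
    print(f"{n} victim(s): {all_stuck_probability(p, n):.2e}")
```

The model overstates the benefit whenever the victims' fates are correlated (e.g. all of them blocked behind the same fs writeback lock, as in the scenarios above), which is exactly the case David's reply pushes back on.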
* Re: can't oom-kill zap the victim's memory? 2015-09-30 4:25 ` Tetsuo Handa @ 2015-09-30 10:21 ` Tetsuo Handa -1 siblings, 0 replies; 213+ messages in thread From: Tetsuo Handa @ 2015-09-30 10:21 UTC (permalink / raw) To: rientjes Cc: mhocko, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov, linux-mm, linux-kernel, skozina Tetsuo Handa wrote: > (Well, do we need to change __alloc_pages_slowpath() that OOM victims do not > enter direct reclaim paths in order to avoid being blocked by unkillable fs > locks?) I'm not familiar with how fs writeback manages memory. I feel I'm missing something. Can somebody please re-check whether my illustrations are really possible? If they are really possible, I think we have yet another silent hang up sequence. Say, there are one userspace task named P1 and one kernel thread named KT1. (1) P1 enters into kernel mode via write() syscall. (2) P1 allocates memory for buffered write. (3) P1 dirties memory allocated for buffered write. (4) P1 leaves kernel mode. (5) KT1 finds dirtied memory. (6) KT1 holds fs's unkillable lock for fs writeback. (7) KT1 tries to allocate memory for fs writeback, but fails to allocate because watermark is low. KT1 cannot call out_of_memory() because of !__GFP_FS allocation. (8) P1 enters into kernel mode. (9) P1 calls kmalloc(GFP_KERNEL) and is blocked at unkillable lock for fs writeback held by KT1. How do we allow KT1 to make forward progress? Are we giving access to memory reserves (e.g. ALLOC_NO_WATERMARKS priority) to KT1? ^ permalink raw reply [flat|nested] 213+ messages in thread
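[Editorial note: step (7) above hinges on the rule that an allocation without __GFP_FS may not invoke the OOM killer. A much-simplified model of that gate (hypothetical flag values, not the kernel's real bit layout; the actual check sits in the __alloc_pages_may_oom() path) looks like:]

```python
# Simplified gfp-flag gate (illustrative values only).
GFP_WAIT = 0x1                 # allocation may sleep
GFP_FS   = 0x2                 # allocation may recurse into fs code
GFP_KERNEL = GFP_WAIT | GFP_FS
GFP_NOFS   = GFP_WAIT          # e.g. KT1's fs-writeback allocation

def may_call_oom_killer(gfp_mask):
    """Only sleepable allocations that may touch the fs are allowed to OOM-kill."""
    return bool(gfp_mask & GFP_WAIT) and bool(gfp_mask & GFP_FS)

print(may_call_oom_killer(GFP_KERNEL))  # True: P1's kmalloc(GFP_KERNEL) could
print(may_call_oom_killer(GFP_NOFS))    # False: KT1 just loops, as in step (7)
```

So in the P1/KT1 sequence, KT1 can neither make progress nor trigger the OOM killer, which is why the question of giving KT1 watermark-free access arises at all.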
* Re: can't oom-kill zap the victim's memory? 2015-09-30 4:25 ` Tetsuo Handa @ 2015-09-30 21:11 ` David Rientjes -1 siblings, 0 replies; 213+ messages in thread From: David Rientjes @ 2015-09-30 21:11 UTC (permalink / raw) To: Tetsuo Handa Cc: mhocko, oleg, kwalker, cl, Andrew Morton, hannes, vdavydov, linux-mm, linux-kernel, skozina On Wed, 30 Sep 2015, Tetsuo Handa wrote: > If we choose only 1 OOM victim, the possibility of hitting this memory > unmapping livelock is (say) 1%. But if we choose multiple OOM victims, the > possibility becomes (almost) 0%. And if we still hit this livelock even > after choosing many OOM victims, it is time to call panic(). > Again, this is a fundamental disagreement between your approach of randomly killing processes hoping that we target one that can make a quick exit vs. my approach where we give threads access to memory reserves after reclaim has failed in an oom livelock so they at least make forward progress. We're going around in circles. > (Well, do we need to change __alloc_pages_slowpath() that OOM victims do not > enter direct reclaim paths in order to avoid being blocked by unkillable fs > locks?) > OOM victims shouldn't need to enter reclaim, and there have been patches before to abort reclaim if current has a pending SIGKILL, if they have access to memory reserves. Nothing prevents the victim from already being in reclaim, however, when it is killed. > > Perhaps this is an argument that we need to provide access to memory > > reserves for threads even for !__GFP_WAIT and !__GFP_FS in such scenarios, > > but I would wait to make that extension until we see it in practice. > > I think that GFP_ATOMIC allocations already access memory reserves via > ALLOC_HIGH priority. > Yes, that's true. It doesn't help for GFP_NOFS, however. It may be possible that GFP_ATOMIC reserves have been depleted or there is a GFP_NOFS allocation that gets stuck looping forever that doesn't get the ability to allocate without watermarks. 
I'd wait to see it in practice before making this extension since it relies on scanning the tasklist. ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory? 2015-09-30 21:11 ` David Rientjes @ 2015-10-01 12:13 ` Tetsuo Handa -1 siblings, 0 replies; 213+ messages in thread From: Tetsuo Handa @ 2015-10-01 12:13 UTC (permalink / raw) To: rientjes Cc: mhocko, oleg, kwalker, cl, akpm, hannes, vdavydov, linux-mm, linux-kernel, skozina David Rientjes wrote: > On Wed, 30 Sep 2015, Tetsuo Handa wrote: > > > If we choose only 1 OOM victim, the possibility of hitting this memory > > unmapping livelock is (say) 1%. But if we choose multiple OOM victims, the > > possibility becomes (almost) 0%. And if we still hit this livelock even > > after choosing many OOM victims, it is time to call panic(). > > > > Again, this is a fundamental disagreement between your approach of > randomly killing processes hoping that we target one that can make a quick > exit vs. my approach where we give threads access to memory reserves after > reclaim has failed in an oom livelock so they at least make forward > progress. We're going around in circles. I don't like that memory management subsystem shows an expectant attitude when memory allocation is failing. There are many possible silent hang up paths. And my customer's servers might be hitting such paths. But I can't go in front of their servers and capture SysRq. Thus, I want to let memory management subsystem try to recover automatically; at least emit some diagnostic kernel messages automatically. > > > (Well, do we need to change __alloc_pages_slowpath() that OOM victims do not > > enter direct reclaim paths in order to avoid being blocked by unkillable fs > > locks?) > > > > OOM victims shouldn't need to enter reclaim, and there have been patches > before to abort reclaim if current has a pending SIGKILL, Yes. shrink_inactive_list() and throttle_direct_reclaim() recognize fatal_signal_pending() tasks. > if they have > access to memory reserves. What does this mean? 
shrink_inactive_list() and throttle_direct_reclaim() do not check whether OOM victims have access to memory reserves, do they? We don't allow access to memory reserves by OOM victims without TIF_MEMDIE. I think that we should favor kthread and dying threads over normal threads at __alloc_pages_slowpath() but there is no response on http://lkml.kernel.org/r/1442939668-4421-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp . > Nothing prevents the victim from already being > in reclaim, however, when it is killed. I think this is problematic because there are unkillable locks in reclaim paths. The memory management subsystem reports nothing. > > > > Perhaps this is an argument that we need to provide access to memory > > > reserves for threads even for !__GFP_WAIT and !__GFP_FS in such scenarios, > > > but I would wait to make that extension until we see it in practice. > > > > I think that GFP_ATOMIC allocations already access memory reserves via > > ALLOC_HIGH priority. > > > > Yes, that's true. It doesn't help for GFP_NOFS, however. It may be > possible that GFP_ATOMIC reserves have been depleted or there is a > GFP_NOFS allocation that gets stuck looping forever that doesn't get the > ability to allocate without watermarks. Why can't we emit some diagnostic kernel messages automatically? Memory allocation requests which did not complete within e.g. 30 seconds deserve possible memory allocation deadlock warning messages. > I'd wait to see it in practice > before making this extension since it relies on scanning the tasklist. > Is this extension something like check_hung_uninterruptible_tasks()? ^ permalink raw reply [flat|nested] 213+ messages in thread
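[Editorial note: Tetsuo's "30 seconds" suggestion amounts to a stall watchdog over in-flight allocations. A userspace sketch of the check such a watchdog would run (hypothetical data shape; timestamps are monotonic seconds, and the threshold mirrors the e.g. 30s figure in the message above):]

```python
import time

WARN_AFTER = 30.0  # seconds; the illustrative threshold from the message above

def check_stalled_allocations(in_flight, now=None):
    """Return tasks whose allocation has been in flight longer than WARN_AFTER.
    `in_flight` maps task name -> allocation start timestamp (hypothetical)."""
    now = time.monotonic() if now is None else now
    return [task for task, start in in_flight.items() if now - start > WARN_AFTER]

# A task that started 40s ago is flagged; one that started 5s ago is not.
print(check_stalled_allocations({"P1": 60.0, "P2": 95.0}, now=100.0))  # ['P1']
```

This is the diagnostic-only half of Tetsuo's request: emitting a warning does not break the livelock, but it turns a silent hang into something a remote administrator can act on without capturing SysRq.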
* Re: can't oom-kill zap the victim's memory? 2015-09-28 22:24 ` David Rientjes @ 2015-10-01 14:48 ` Michal Hocko -1 siblings, 0 replies; 213+ messages in thread From: Michal Hocko @ 2015-10-01 14:48 UTC (permalink / raw) To: David Rientjes Cc: Oleg Nesterov, Linus Torvalds, Kyle Walker, Christoph Lameter, Andrew Morton, Johannes Weiner, Vladimir Davydov, linux-mm, Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa On Mon 28-09-15 15:24:06, David Rientjes wrote: > On Fri, 25 Sep 2015, Michal Hocko wrote: > > > > > I am still not sure how you want to implement that kernel thread but I > > > > am quite skeptical it would be very much useful because all the current > > > > allocations which end up in the OOM killer path cannot simply back off > > > > and drop the locks with the current allocator semantic. So they will > > > > be sitting on top of unknown pile of locks whether you do an additional > > > > reclaim (unmap the anon memory) in the direct OOM context or looping > > > > in the allocator and waiting for kthread/workqueue to do its work. The > > > > only argument that I can see is the stack usage but I haven't seen stack > > > > overflows in the OOM path AFAIR. > > > > > > > > > > Which locks are you specifically interested in? > > > > Any locks they were holding before they entered the page allocator (e.g. > > i_mutex is the easiest one to trigger from the userspace but mmap_sem > > might be involved as well because we are doing kmalloc(GFP_KERNEL) with > > mmap_sem held for write). Those would be locked until the page allocator > > returns, which with the current semantic might be _never_. > > > > I agree that i_mutex seems to be one of the most common offenders. > However, I'm not sure I understand why holding it while trying to allocate > infinitely for an order-0 allocation is problematic wrt the proposed > kthread. I didn't say it would be problematic. We are talking past each other here. 
All I wanted to say was that a separate kernel oom thread wouldn't _help_ with the lock dependencies. > The kthread itself need only take mmap_sem for read. If all > threads sharing the mm with a victim have been SIGKILL'd, they should get > TIF_MEMDIE set when reclaim fails and be able to allocate so that they can > drop mmap_sem. which is the case if the direct oom context used trylock... So just to make it clear. I am not objecting a specialized oom kernel thread. It would work as well. I am just not convinced that it is really needed because the direct oom context can use trylock and do the same work directly. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 213+ messages in thread
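[Editorial note: Michal's point is that the direct OOM context can attempt the unmapping work opportunistically with a trylock, so it never sleeps on the victim's mmap_sem. A userspace model (Python `threading.Lock` standing in for the rw-semaphore's down_read_trylock; purely illustrative) of that back-off pattern:]

```python
import threading

mmap_sem = threading.Lock()  # stand-in for the victim's mm->mmap_sem

def oom_reap_attempt(sem):
    """Direct-OOM-context sketch: take the lock only if it is free,
    otherwise back off instead of blocking (mirrors down_read_trylock)."""
    if not sem.acquire(blocking=False):
        return "backed off"          # someone holds mmap_sem; retry later
    try:
        return "unmapped"            # would zap the victim's anon memory here
    finally:
        sem.release()

print(oom_reap_attempt(mmap_sem))    # lock free -> "unmapped"
mmap_sem.acquire()                   # simulate a writer holding mmap_sem
print(oom_reap_attempt(mmap_sem))    # -> "backed off", never a new deadlock
```

The design point is that a trylock makes the unmapping work safe to run from any context: the worst case is that no progress is made on this attempt, never that the OOM path itself blocks.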
* Re: can't oom-kill zap the victim's memory? 2015-10-01 14:48 ` Michal Hocko @ 2015-10-02 13:06 ` Tetsuo Handa -1 siblings, 0 replies; 213+ messages in thread From: Tetsuo Handa @ 2015-10-02 13:06 UTC (permalink / raw) To: mhocko, rientjes Cc: oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov, linux-mm, linux-kernel, skozina Michal Hocko wrote: > On Mon 28-09-15 15:24:06, David Rientjes wrote: > > I agree that i_mutex seems to be one of the most common offenders. > > However, I'm not sure I understand why holding it while trying to allocate > > infinitely for an order-0 allocation is problematic wrt the proposed > > kthread. > > I didn't say it would be problematic. We are talking past each other > here. All I wanted to say was that a separate kernel oom thread wouldn't > _help_ with the lock dependencies. > Oops. I misunderstood: I thought you were skeptical about the memory unmapping approach because of lock dependencies, but rather you are skeptical about using a dedicated kernel thread for the memory unmapping approach. > > The kthread itself need only take mmap_sem for read. If all > > threads sharing the mm with a victim have been SIGKILL'd, they should get > > TIF_MEMDIE set when reclaim fails and be able to allocate so that they can > > drop mmap_sem. > > which is the case if the direct oom context used trylock... > So just to make it clear. I am not objecting a specialized oom kernel > thread. It would work as well. I am just not convinced that it is really > needed because the direct oom context can use trylock and do the same > work directly. Well, I think it depends on from where we call the memory unmapping code. The first candidate is oom_kill_process() because it is the location where the mm struct to unmap is determined. But since select_bad_process() aborts upon encountering a TIF_MEMDIE task, we will fail to call the memory unmapping code again if the first down_trylock(&mm->mmap_sem) attempt in oom_kill_process() failed. 
(Here I assumed that we allow all OOM victims to access memory reserves so that subsequent down_trylock(&mm->mmap_sem) attempts could succeed.) The second candidate is select_bad_process() because it is a location where we can call the memory unmapping code again upon encountering a TIF_MEMDIE task. The third candidate is the caller of out_of_memory() because it is a location where we can call the memory unmapping code again even when the OOM victims are blocked. (Our discussion seems to assume that TIF_MEMDIE tasks can make forward progress and die. But since TIF_MEMDIE tasks might encounter unkillable locks after returning from allocation (e.g. http://lkml.kernel.org/r/201509290118.BCJ43256.tSFFFMOLHVOJOQ@I-love.SAKURA.ne.jp ), it will be safer not to assume that out_of_memory() can always be called.) So, I thought that a dedicated kernel thread would make it easy to call the memory unmapping code periodically, again and again. ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory? 2015-09-23 20:59 ` Michal Hocko @ 2015-10-06 18:45 ` Oleg Nesterov -1 siblings, 0 replies; 213+ messages in thread From: Oleg Nesterov @ 2015-10-06 18:45 UTC (permalink / raw) To: Michal Hocko Cc: Linus Torvalds, Kyle Walker, Christoph Lameter, Andrew Morton, David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm, Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa Damn. I can't believe this, but I still can't make the initial change. And no, it is not that I hit some technical problems, just I can't decide what exactly the first step should do to be a) really simple and b) useful. I am starting to think I'll just update my draft patch which uses queue_work() and send it tomorrow (yes, tomorrow again ;). But let me at least answer this email, On 09/23, Michal Hocko wrote: > > On Tue 22-09-15 18:06:08, Oleg Nesterov wrote: > > > > OK, let it be a kthread from the very beginning, I won't argue. This > > is really minor compared to other problems. > > I am still not sure how you want to implement that kernel thread but I > am quite skeptical it would be very much useful because all the current > allocations which end up in the OOM killer path cannot simply back off > and drop the locks with the current allocator semantic. So they will > be sitting on top of unknown pile of locks whether you do an additional > reclaim (unmap the anon memory) in the direct OOM context or looping > in the allocator and waiting for kthread/workqueue to do its work. The > only argument that I can see is the stack usage but I haven't seen stack > overflows in the OOM path AFAIR. Please see below, > > And note that the caller can hold other locks we do not even know about. > > Most probably we should not deadlock, at least if we only unmap the anon > > pages, but still this doesn't look safe. 
> > The unmapper cannot fall back to reclaim and/or trigger the OOM so > we should be indeed very careful and mark the allocation context > appropriately. I can remember mmu_gather but it is only doing > opportunistic allocation AFAIR. And I was going to make V1 which avoids queue_work/kthread and zaps the memory in oom_kill_process() context. But this can't work because we need to increment ->mm_users to avoid the race with exit_mmap/etc. And this means that we need mmput() after that, and as we recently discussed it can deadlock if mm_users goes to zero, we can't do exit_mmap/etc in oom_kill_process(). > > Hmm. If we already have mmap_sem and started zap_page_range() then > > I do not think it makes sense to stop until we free everything we can. > > Zapping a huge address space can take quite some time Yes, and this is another reason we should do this asynchronously. > and we really do > not have to free it all on behalf of the killer when enough memory is > freed to allow for further progress and the rest can be done by the > victim. If one batch doesn't seem sufficient then another retry can > continue. > > I do not think that a limited scan would make the implementation more > complicated But we can't even know how much memory unmap_single_vma() actually frees. Even if we could, how can we know we freed enough? Anyway. Perhaps it makes sense to abort the for_each_vma() loop if freed_enough_mem() == T. But it is absolutely not clear to me how we should define this freed_enough_mem(), so I think we should do this later. > > But. Can't we just remove another ->oom_score_adj check when we try > > to kill all mm users (the last for_each_process loop). If yes, this > > all can be simplified. > > > > I guess we can't and it's a pity. Because it looks simply pointless > > to not kill all mm users. This just means the select_bad_process() > > picked the wrong task. 
> > Yes I am not really sure why oom_score_adj is not per-mm and we are > doing that per signal struct to be honest. Heh ;) Yes, but I guess it is too late to move it back. > Maybe we can revisit this... I hope, but I am not going to try to remove this OOM_SCORE_ADJ_MIN check now. We just should not zap this mm if we find an OOM-unkillable user. Oleg. ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory? 2015-10-06 18:45 ` Oleg Nesterov @ 2015-10-07 11:03 ` Tetsuo Handa -1 siblings, 0 replies; 213+ messages in thread From: Tetsuo Handa @ 2015-10-07 11:03 UTC (permalink / raw) To: oleg, mhocko Cc: torvalds, kwalker, cl, akpm, rientjes, hannes, vdavydov, linux-mm, linux-kernel, skozina Oleg Nesterov wrote: > > > Hmm. If we already have mmap_sem and started zap_page_range() then > > > I do not think it makes sense to stop until we free everything we can. > > > > Zapping a huge address space can take quite some time > > Yes, and this is another reason we should do this asynchronously. > > > and we really do > > not have to free it all on behalf of the killer when enough memory is > > freed to allow for further progress and the rest can be done by the > > victim. If one batch doesn't seem sufficient then another retry can > > continue. > > > > I do not think that a limited scan would make the implementation more > > complicated > > But we can't even know how much memory unmap_single_vma() actually frees. > Even if we could, how can we know we freed enough? > > Anyway. Perhaps it makes sense to abort the for_each_vma() loop if > freed_enough_mem() == T. But it is absolutely not clear to me how we > should define this freed_enough_mem(), so I think we should do this > later. Maybe bool freed_enough_mem(void) { return !atomic_read(&oom_victims); } if we change to call mark_oom_victim() on all threads which should be killed as OOM victims. ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory? 2015-10-07 11:03 ` Tetsuo Handa @ 2015-10-07 12:00 ` Oleg Nesterov -1 siblings, 0 replies; 213+ messages in thread From: Oleg Nesterov @ 2015-10-07 12:00 UTC (permalink / raw) To: Tetsuo Handa Cc: mhocko, torvalds, kwalker, cl, akpm, rientjes, hannes, vdavydov, linux-mm, linux-kernel, skozina On 10/07, Tetsuo Handa wrote: > > Oleg Nesterov wrote: > > Anyway. Perhaps it makes sense to abort the for_each_vma() loop if > > freed_enough_mem() == T. But it is absolutely not clear to me how we > > should define this freed_enough_mem(), so I think we should do this > > later. > > Maybe > > bool freed_enough_mem(void) { return !atomic_read(&oom_victims); } > > if we change to call mark_oom_victim() on all threads which should be > killed as OOM victims. Well, in this case if (atomic_read(&mm->mm_users) == 1) break; makes much more sense. Plus we do not need to change mark_oom_victim(). Let's discuss this later? Oleg. ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory? 2015-10-07 12:00 ` Oleg Nesterov @ 2015-10-08 14:04 ` Michal Hocko -1 siblings, 0 replies; 213+ messages in thread From: Michal Hocko @ 2015-10-08 14:04 UTC (permalink / raw) To: Oleg Nesterov Cc: Tetsuo Handa, torvalds, kwalker, cl, akpm, rientjes, hannes, vdavydov, linux-mm, linux-kernel, skozina On Wed 07-10-15 14:00:16, Oleg Nesterov wrote: > On 10/07, Tetsuo Handa wrote: > > > > Oleg Nesterov wrote: > > > Anyway. Perhaps it makes sense to abort the for_each_vma() loop if > > > freed_enough_mem() == T. But it is absolutely not clear to me how we > > > should define this freed_enough_mem(), so I think we should do this > > > later. > > > > Maybe > > > > bool freed_enough_mem(void) { return !atomic_read(&oom_victims); } > > > > if we change to call mark_oom_victim() on all threads which should be > > killed as OOM victims. > > Well, in this case > > if (atomic_read(&mm->mm_users) == 1) > break; > > makes much more sense. Plus we do not need to change mark_oom_victim(). > > Let's discuss this later? Yes, I do not think this is that important if a kernel thread is going to reclaim the address space. It will effectively free memory on behalf of the victim so a longer scan shouldn't be such a big problem. At least not for the first implementation. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory? 2015-10-06 18:45 ` Oleg Nesterov @ 2015-10-08 14:01 ` Michal Hocko -1 siblings, 0 replies; 213+ messages in thread From: Michal Hocko @ 2015-10-08 14:01 UTC (permalink / raw) To: Oleg Nesterov Cc: Linus Torvalds, Kyle Walker, Christoph Lameter, Andrew Morton, David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm, Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa On Tue 06-10-15 20:45:02, Oleg Nesterov wrote: [...] > And I was going to make V1 which avoids queue_work/kthread and zaps the > memory in oom_kill_process() context. > > But this can't work because we need to increment ->mm_users to avoid > the race with exit_mmap/etc. And this means that we need mmput() after > that, and as we recently discussed it can deadlock if mm_users goes > to zero, we can't do exit_mmap/etc in oom_kill_process(). Right. I hoped we could rely on mm_count just to pin mm but that is not sufficient because exit_mmap doesn't rely on mmap_sem so we do not have any synchronization there. Unfortunate. This means that we indeed have to do it asynchronously. Maybe we can come up with some trickery but let's do it later. I do agree that going with a kernel thread for now would be easier. Sorry about misleading you, I should have realized that mmput from the oom killing path is dangerous. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory? 2015-09-21 15:32 ` Oleg Nesterov @ 2015-09-21 16:51 ` Tetsuo Handa -1 siblings, 0 replies; 213+ messages in thread From: Tetsuo Handa @ 2015-09-21 16:51 UTC (permalink / raw) To: oleg, mhocko Cc: torvalds, kwalker, cl, akpm, rientjes, hannes, vdavydov, linux-mm, linux-kernel, skozina Oleg Nesterov wrote: > Yes, yes, and I already tried to comment this part. We probably need a > dedicated kernel thread, but I still think (although I am not sure) that > initial change can use workqueue. In the likely case system_unbound_wq pool > should have an idle thread, if not - OK, this change won't help in this > case. This is minor. > I imagined a dedicated kernel thread doing something like shown below. (I don't know about mm->mmap management.) mm->mmap_zapped corresponds to MMF_MEMDIE. I think this kernel thread can be used for normal kill(pid, SIGKILL) cases. ---------- bool has_sigkill_task; wait_queue_head_t kick_mm_zapper; static void mm_zapper(void *unused) { struct task_struct *g, *p; struct mm_struct *mm; sleep: wait_event(kick_mm_zapper, has_sigkill_task); has_sigkill_task = false; restart: rcu_read_lock(); for_each_process_thread(g, p) { if (likely(!fatal_signal_pending(p))) continue; task_lock(p); mm = p->mm; if (mm && mm->mmap && !mm->mmap_zapped && down_read_trylock(&mm->mmap_sem)) { atomic_inc(&mm->mm_users); task_unlock(p); rcu_read_unlock(); if (mm->mmap && !mm->mmap_zapped) zap_page_range(mm->mmap, 0, TASK_SIZE, NULL); mm->mmap_zapped = 1; up_read(&mm->mmap_sem); mmput(mm); cond_resched(); goto restart; } task_unlock(p); } rcu_read_unlock(); goto sleep; } kthread_run(mm_zapper, NULL, "mm_zapper"); ---------- ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory? 2015-09-21 16:51 ` Tetsuo Handa @ 2015-09-22 12:43 ` Oleg Nesterov -1 siblings, 0 replies; 213+ messages in thread From: Oleg Nesterov @ 2015-09-22 12:43 UTC (permalink / raw) To: Tetsuo Handa Cc: mhocko, torvalds, kwalker, cl, akpm, rientjes, hannes, vdavydov, linux-mm, linux-kernel, skozina On 09/22, Tetsuo Handa wrote: > > I imagined a dedicated kernel thread doing something like shown below. > (I don't know about mm->mmap management.) > mm->mmap_zapped corresponds to MMF_MEMDIE. No, it doesn't, please see below. > bool has_sigkill_task; > wait_queue_head_t kick_mm_zapper; OK, if this kthread is kicked by oom this makes more sense, but still doesn't look right at least initially. Let me repeat, I do think we need MMF_MEMDIE or something like it before we do something more clever. And in fact I think this flag makes sense regardless. > static void mm_zapper(void *unused) > { > struct task_struct *g, *p; > struct mm_struct *mm; > > sleep: > wait_event(kick_mm_zapper, has_sigkill_task); > has_sigkill_task = false; > restart: > rcu_read_lock(); > for_each_process_thread(g, p) { > if (likely(!fatal_signal_pending(p))) > continue; > task_lock(p); > mm = p->mm; > if (mm && mm->mmap && !mm->mmap_zapped && down_read_trylock(&mm->mmap_sem)) { ^^^^^^^^^^^^^^^ We do not want mm->mmap_zapped, it can't work. We need mm->needs_zap set by oom_kill_process() and cleared after zap_page_range(). Because otherwise we can not handle CLONE_VM correctly. Suppose that an innocent process P does vfork() and the child is killed but not exited yet. mm_zapper() can find the child, do zap_page_range(), and surprise its alive parent P which uses the same ->mm. And if we rely on MMF_MEMDIE or mm->needs_zap or whatever then for_each_process_thread() doesn't really make sense. And if we have a single MMF_MEMDIE process (likely case) then the unconditional _trylock is suboptimal. 
Tetsuo, can't we do something simple which "obviously can't hurt at least" and then discuss the potential improvements? And yes, yes, the "Kill all user processes sharing victim->mm" logic in oom_kill_process() doesn't 100% look right, at least wrt the change we discuss. Oleg. ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory? 2015-09-22 12:43 ` Oleg Nesterov @ 2015-09-22 14:30 ` Tetsuo Handa -1 siblings, 0 replies; 213+ messages in thread From: Tetsuo Handa @ 2015-09-22 14:30 UTC (permalink / raw) To: oleg Cc: mhocko, torvalds, kwalker, cl, akpm, rientjes, hannes, vdavydov, linux-mm, linux-kernel, skozina Oleg Nesterov wrote: > On 09/22, Tetsuo Handa wrote: > > > > I imagined a dedicated kernel thread doing something like shown below. > > (I don't know about mm->mmap management.) > > mm->mmap_zapped corresponds to MMF_MEMDIE. > > No, it doesn't, please see below. > > > bool has_sigkill_task; > > wait_queue_head_t kick_mm_zapper; > > OK, if this kthread is kicked by oom this makes more sense, but still > doesn't look right at least initially. Yes, I meant this kthread is kicked upon sending SIGKILL. But I forgot that > > Let me repeat, I do think we need MMF_MEMDIE or something like it before > we do something more clever. And in fact I think this flag makes sense > regardless. > > > static void mm_zapper(void *unused) > > { > > struct task_struct *g, *p; > > struct mm_struct *mm; > > > > sleep: > > wait_event(kick_remover, has_sigkill_task); > > has_sigkill_task = false; > > restart: > > rcu_read_lock(); > > for_each_process_thread(g, p) { > > if (likely(!fatal_signal_pending(p))) > > continue; > > task_lock(p); > > mm = p->mm; > > if (mm && mm->mmap && !mm->mmap_zapped && down_read_trylock(&mm->mmap_sem)) { > ^^^^^^^^^^^^^^^ > > We do not want mm->mmap_zapped, it can't work. We need mm->needs_zap > set by oom_kill_process() and cleared after zap_page_range(). > > Because otherwise we can not handle CLONE_VM correctly. Suppose that > an innocent process P does vfork() and the child is killed but not > exited yet. mm_zapper() can find the child, do zap_page_range(), and > surprise its alive parent P which uses the same ->mm. kill(P's-child, SIGKILL) does not kill P sharing the same ->mm. 
Thus, mm_zapper() can be used only for the OOM-kill case, and test_tsk_thread_flag(p, TIF_MEMDIE) should be used rather than fatal_signal_pending(p). > > And if we rely on MMF_MEMDIE or mm->needs_zap or whaveter then > for_each_process_thread() doesn't really make sense. And if we have > a single MMF_MEMDIE process (likely case) then the unconditional > _trylock is suboptimal. I guess the more likely case is that the OOM victim successfully exits before mm_zapper() finds it. I thought that a dedicated kernel thread which scans the task list could do deferred zapping by automatically retrying (at an interval of a few seconds?) when down_read_trylock() fails. > > Tetsuo, can't we do something simple which "obviously can't hurt at > least" and then discuss the potential improvements? No problem. I can wait for your version. > > And yes, yes, the "Kill all user processes sharing victim->mm" logic > in oom_kill_process() doesn't 100% look right, at least wrt the change > we discuss. If we use test_tsk_thread_flag(p, TIF_MEMDIE), we will need to set TIF_MEMDIE on the victim after sending SIGKILL to all processes sharing the victim's mm. Well, then the likely case that the OOM victim exits before mm_zapper() finds it becomes a not-so-likely case? Then, MMF_MEMDIE is better than test_tsk_thread_flag(p, TIF_MEMDIE)... > > Oleg. ^ permalink raw reply [flat|nested] 213+ messages in thread
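Tetsuo's deferred-retry idea — a dedicated kthread that periodically revisits mms whose mmap_sem trylock failed — could be shaped like the loop below. Again a hypothetical sketch in kernel style, not compilable code: `reap_list`, `reap_lock`, `reap_wq`, `reap_entry` and `try_zap_mm()` are all invented names, where `try_zap_mm()` stands for a helper that trylocks mmap_sem, zaps the anonymous mappings on success, and returns false when the lock is busy.

```c
/* Sketch of a dedicated reaper kthread (hypothetical names throughout).
 * The OOM killer queues an mm on reap_list and wakes reap_wq; if the
 * trylock inside try_zap_mm() fails, the timeout below retries the
 * whole queue every couple of seconds instead of spinning. */
static int mm_reaper(void *unused)
{
	while (!kthread_should_stop()) {
		struct mm_struct *mm, *tmp;

		wait_event_timeout(reap_wq, !list_empty(&reap_list), 2 * HZ);

		spin_lock(&reap_lock);
		list_for_each_entry_safe(mm, tmp, &reap_list, reap_entry) {
			/* done if we zapped it, or if the victim already
			 * exited and released its memory on its own */
			if (try_zap_mm(mm)) {
				list_del(&mm->reap_entry);
				mmdrop(mm);
			}
		}
		spin_unlock(&reap_lock);
	}
	return 0;
}
```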
* Re: can't oom-kill zap the victim's memory? 2015-09-22 14:30 ` Tetsuo Handa @ 2015-09-22 14:45 ` Oleg Nesterov -1 siblings, 0 replies; 213+ messages in thread From: Oleg Nesterov @ 2015-09-22 14:45 UTC (permalink / raw) To: Tetsuo Handa Cc: mhocko, torvalds, kwalker, cl, akpm, rientjes, hannes, vdavydov, linux-mm, linux-kernel, skozina On 09/22, Tetsuo Handa wrote: > > Oleg Nesterov wrote: > > On 09/22, Tetsuo Handa wrote: > > > rcu_read_lock(); > > > for_each_process_thread(g, p) { > > > if (likely(!fatal_signal_pending(p))) > > > continue; > > > task_lock(p); > > > mm = p->mm; > > > if (mm && mm->mmap && !mm->mmap_zapped && down_read_trylock(&mm->mmap_sem)) { > > ^^^^^^^^^^^^^^^ > > > > We do not want mm->mmap_zapped, it can't work. We need mm->needs_zap > > set by oom_kill_process() and cleared after zap_page_range(). > > > > Because otherwise we can not handle CLONE_VM correctly. Suppose that > > an innocent process P does vfork() and the child is killed but not > > exited yet. mm_zapper() can find the child, do zap_page_range(), and > > surprise its alive parent P which uses the same ->mm. > > kill(P's-child, SIGKILL) does not kill P sharing the same ->mm. > Thus, mm_zapper() can be used for only OOM-kill case Yes, and only if we know for sure that all tasks which can use this ->mm were killed. > and > test_tsk_thread_flag(p, TIF_MEMDIE) should be used than > fatal_signal_pending(p). No. For example, just look at mark_oom_victim() at the start of out_of_memory(). > > Tetsuo, can't we do something simple which "obviously can't hurt at > > least" and then discuss the potential improvements? > > No problem. I can wait for your version. All I wanted to say is that this all is a bit more complicated than it looks at first glance. Oleg. ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory? 2015-09-21 15:32 ` Oleg Nesterov @ 2015-09-21 23:42 ` David Rientjes -1 siblings, 0 replies; 213+ messages in thread From: David Rientjes @ 2015-09-21 23:42 UTC (permalink / raw) To: Oleg Nesterov Cc: Michal Hocko, Linus Torvalds, Kyle Walker, Christoph Lameter, Andrew Morton, Johannes Weiner, Vladimir Davydov, linux-mm, Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa On Mon, 21 Sep 2015, Oleg Nesterov wrote: > Yes we should try to do this in the OOM killer context, and in this case > (of course) we need trylock. Let me quote my previous email: > > And we want to avoid using workqueues when the caller can do this > directly. And in this case we certainly need trylock. But this needs > some refactoring: we do not want to do this under oom_lock, otoh it > makes sense to do this from mark_oom_victim() if current && killed, > and a lot more details. > > and probably this is another reason why do we need MMF_MEMDIE. But again, > I think the initial change should be simple. > I agree with the direction and I don't think it would be too complex to have a dedicated kthread that is kicked when we queue an mm to do MADV_DONTNEED behavior, and have that happen only if a trylock in oom_kill_process() fails to do it itself for anonymous mappings. We may have different opinions of simplicity. ^ permalink raw reply [flat|nested] 213+ messages in thread
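The division of labour David describes — reap inline from the killer context when the lock is uncontended, and hand off to a dedicated thread only when the trylock fails — can be modelled outside the kernel. The userspace C sketch below uses a pthread mutex to stand in for mmap_sem and a pthread for the kthread; every name in it (`mm_sim`, `oom_reap_or_queue`, `reaper`) is invented for illustration and has no kernel counterpart.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <assert.h>

struct mm_sim {
	pthread_mutex_t mmap_sem;	/* stands in for mm->mmap_sem */
	bool zapped;			/* "memory has been reaped" */
};

static struct mm_sim *deferred;		/* at most one queued mm */
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

static void zap(struct mm_sim *mm) { mm->zapped = true; }

/* Killer context: must not sleep on mmap_sem, so trylock or defer. */
bool oom_reap_or_queue(struct mm_sim *mm)
{
	if (pthread_mutex_trylock(&mm->mmap_sem) == 0) {
		zap(mm);
		pthread_mutex_unlock(&mm->mmap_sem);
		return true;	/* reaped inline */
	}
	pthread_mutex_lock(&queue_lock);
	deferred = mm;		/* dedicated thread picks it up */
	pthread_mutex_unlock(&queue_lock);
	return false;		/* deferred */
}

/* Dedicated "kthread": unlike the killer, it may block on mmap_sem. */
void *reaper(void *arg)
{
	struct mm_sim *mm;

	(void)arg;
	pthread_mutex_lock(&queue_lock);
	mm = deferred;
	deferred = NULL;
	pthread_mutex_unlock(&queue_lock);
	if (mm) {
		pthread_mutex_lock(&mm->mmap_sem);
		zap(mm);
		pthread_mutex_unlock(&mm->mmap_sem);
	}
	return NULL;
}
```

The point of the split is that the OOM-killer path stays non-blocking (a failed trylock costs nothing), while the slow, possibly-blocking work runs in a context that is allowed to wait.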
* Re: can't oom-kill zap the victim's memory? 2015-09-21 13:44 ` Oleg Nesterov @ 2015-09-21 16:55 ` Linus Torvalds -1 siblings, 0 replies; 213+ messages in thread From: Linus Torvalds @ 2015-09-21 16:55 UTC (permalink / raw) To: Oleg Nesterov Cc: Kyle Walker, Christoph Lameter, Michal Hocko, Andrew Morton, David Rientjes, Johannes Weiner, Vladimir Davydov, linux-mm, Linux Kernel Mailing List, Stanislav Kozina, Tetsuo Handa On Mon, Sep 21, 2015 at 6:44 AM, Oleg Nesterov <oleg@redhat.com> wrote: > > I must have missed something. I can't understand your and Michal's > concerns. Heh. I looked at that patch, and apparently entirely missed the queue_work() part of the whole patch, thinking it was a direct call. So never mind. Linus ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory? 2015-09-19 15:03 ` Oleg Nesterov @ 2015-09-20 14:50 ` Tetsuo Handa -1 siblings, 0 replies; 213+ messages in thread From: Tetsuo Handa @ 2015-09-20 14:50 UTC (permalink / raw) To: oleg, kwalker, cl, torvalds, mhocko Cc: akpm, rientjes, hannes, vdavydov, linux-mm, linux-kernel, skozina Oleg Nesterov wrote: > On 09/17, Kyle Walker wrote: > > > > Currently, the oom killer will attempt to kill a process that is in > > TASK_UNINTERRUPTIBLE state. For tasks in this state for an exceptional > > period of time, such as processes writing to a frozen filesystem during > > a lengthy backup operation, this can result in a deadlock condition as > > related processes memory access will stall within the page fault > > handler. > > And there are other potential reasons for deadlock. > > Stupid idea. Can't we help the memory hog to free its memory? This is > orthogonal to other improvements we can do. So, we are trying to release memory without waiting for arriving at exit_mm() from do_exit(), right? If it works, it will be a simple and small change that will be easy to backport. The idea is that since fatal_signal_pending() tasks no longer return to user space, we can release memory allocated for use by user space, right? Then, I think that this approach can be applied to not only OOM-kill case but also regular kill(pid, SIGKILL) case (i.e. kick from signal_wake_up(1) or somewhere?). A dedicated kernel thread (not limited to OOM-kill purpose) scans for fatal_signal_pending() tasks and release that task's memory. ^ permalink raw reply [flat|nested] 213+ messages in thread
* Re: can't oom-kill zap the victim's memory? 2015-09-20 14:50 ` Tetsuo Handa @ 2015-09-20 14:55 ` Oleg Nesterov -1 siblings, 0 replies; 213+ messages in thread From: Oleg Nesterov @ 2015-09-20 14:55 UTC (permalink / raw) To: Tetsuo Handa Cc: kwalker, cl, torvalds, mhocko, akpm, rientjes, hannes, vdavydov, linux-mm, linux-kernel, skozina On 09/20, Tetsuo Handa wrote: > > Oleg Nesterov wrote: > > On 09/17, Kyle Walker wrote: > > > > > > Currently, the oom killer will attempt to kill a process that is in > > > TASK_UNINTERRUPTIBLE state. For tasks in this state for an exceptional > > > period of time, such as processes writing to a frozen filesystem during > > > a lengthy backup operation, this can result in a deadlock condition as > > > related processes memory access will stall within the page fault > > > handler. > > > > And there are other potential reasons for deadlock. > > > > Stupid idea. Can't we help the memory hog to free its memory? This is > > orthogonal to other improvements we can do. > > So, we are trying to release memory without waiting for arriving at > exit_mm() from do_exit(), right? If it works, it will be a simple and > small change that will be easy to backport. > > The idea is that since fatal_signal_pending() tasks no longer return to > user space, we can release memory allocated for use by user space, right? Yes. > Then, I think that this approach can be applied to not only OOM-kill case > but also regular kill(pid, SIGKILL) case (i.e. kick from signal_wake_up(1) > or somewhere?). I don't think so... but we might want to do this if (say) we are not going to kill someone else because fatal_signal_pending(current). > A dedicated kernel thread (not limited to OOM-kill purpose) > scans for fatal_signal_pending() tasks and release that task's memory. Perhaps a dedicated kernel thread makes sense (see other emails), but I don't think it should scan the killed threads. oom-kill should kict it. 
Anyway, let me repeat, there are a lot of details we might want to discuss. But the initial changes should be as simple as possible, imo. Oleg. ^ permalink raw reply [flat|nested] 213+ messages in thread