* [RFC PATCH] oom: Don't count on mm-less current process.
From: Tetsuo Handa @ 2014-12-12 13:54 UTC
To: linux-mm; +Cc: mhocko, rientjes, oleg

>From 29d0b34a1c60e91ace8e1208a415ca371e6851fe Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 12 Dec 2014 21:29:06 +0900
Subject: [PATCH] oom: Don't count on mm-less current process.

out_of_memory() does not trigger the OOM killer if the current task is
already exiting or has fatal signals pending; instead it gives the task
access to memory reserves. This is done to prevent the livelocks described
by commit 9ff4868e3051d912 ("mm, oom: allow exiting threads to have access
to memory reserves") and commit 7b98c2e402eaa1f2 ("oom: give current
access to memory reserves if it has been killed"), as well as to avoid
needlessly killing other tasks, on the heuristic that the current task
will finish soon and release its resources.

However, this heuristic does not work as expected when out_of_memory() is
triggered by an allocation made after the current task has already
released its memory in exit_mm() (e.g. from exit_task_work()), because the
task may livelock waiting for memory that is never released while other
tasks sit on a lot of memory.

Therefore, perform checks similar to the sysctl_oom_kill_allocating_task
case before giving the current task access to memory reserves.

Note that this patch cannot prevent somebody from calling
oom_kill_process() on a victim task after that task has already got the
PF_EXITING flag and released its memory. This means that the OOM killer
stays disabled for an unpredictable duration when the victim task is
unkillable due to a dependency invisible to the OOM killer (e.g. waiting
for a lock held by somebody else) after somebody set the TIF_MEMDIE flag
on the victim task via oom_kill_process().
Unfortunately, a local unprivileged user can make the victim task
unkillable on purpose. There are two approaches to mitigating this
problem. The workaround is a sysctl-tunable panic on TIF_MEMDIE timeout
(detect DoS attacks and react; easy to backport; works for memory
depletion bugs in kernel code). The preferred fix is complete kernel
memory allocation tracking (try to avoid the DoS, but do nothing when
avoidance fails; hard to backport; works for memory depletion attacks by
user programs). Either way, that is beyond what this patch can do.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/oom.h |  3 +++
 mm/memcontrol.c     |  8 +++++++-
 mm/oom_kill.c       | 12 +++++++++---
 3 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 4971874..eee5802 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -64,6 +64,9 @@ extern void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_flags);
 
 extern void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
 			       int order, const nodemask_t *nodemask);
+extern bool oom_unkillable_task(struct task_struct *p,
+			struct mem_cgroup *memcg,
+			const nodemask_t *nodemask);
 extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 		unsigned long totalpages, const nodemask_t *nodemask,
 		bool force_kill);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c6ac50e..6d9532d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1558,8 +1558,14 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * If current has a pending SIGKILL or is exiting, then automatically
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
+	 *
+	 * However, if current is calling out_of_memory() by doing memory
+	 * allocation from e.g. exit_task_work() in do_exit() after PF_EXITING
+	 * was set by exit_signals() and mm was released by exit_mm(), it is
+	 * wrong to expect current to exit and free its memory quickly.
 	 */
-	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
+	if ((fatal_signal_pending(current) || current->flags & PF_EXITING) &&
+	    current->mm && !oom_unkillable_task(current, memcg, NULL)) {
 		set_thread_flag(TIF_MEMDIE);
 		return;
 	}
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 481d550..01719d6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -118,8 +118,8 @@ found:
 }
 
 /* return true if the task is not adequate as candidate victim task. */
-static bool oom_unkillable_task(struct task_struct *p,
-		struct mem_cgroup *memcg, const nodemask_t *nodemask)
+bool oom_unkillable_task(struct task_struct *p, struct mem_cgroup *memcg,
+		const nodemask_t *nodemask)
 {
 	if (is_global_init(p))
 		return true;
@@ -649,8 +649,14 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 * If current has a pending SIGKILL or is exiting, then automatically
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
+	 *
+	 * However, if current is calling out_of_memory() by doing memory
+	 * allocation from e.g. exit_task_work() in do_exit() after PF_EXITING
+	 * was set by exit_signals() and mm was released by exit_mm(), it is
+	 * wrong to expect current to exit and free its memory quickly.
 	 */
-	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
+	if ((fatal_signal_pending(current) || task_will_free_mem(current)) &&
+	    current->mm && !oom_unkillable_task(current, NULL, nodemask)) {
 		set_thread_flag(TIF_MEMDIE);
 		return;
 	}
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* Re: [RFC PATCH] oom: Don't count on mm-less current process.
From: Michal Hocko @ 2014-12-16 12:47 UTC
To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg

On Fri 12-12-14 22:54:53, Tetsuo Handa wrote:
> >From 29d0b34a1c60e91ace8e1208a415ca371e6851fe Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Fri, 12 Dec 2014 21:29:06 +0900
> Subject: [PATCH] oom: Don't count on mm-less current process.
>
> out_of_memory() does not trigger the OOM killer if the current task is
> already exiting or has fatal signals pending; instead it gives the task
> access to memory reserves. This is done to prevent the livelocks
> described by commit 9ff4868e3051d912 ("mm, oom: allow exiting threads to
> have access to memory reserves") and commit 7b98c2e402eaa1f2 ("oom: give
> current access to memory reserves if it has been killed"), as well as to
> avoid needlessly killing other tasks, on the heuristic that the current
> task will finish soon and release its resources.
>
> However, this heuristic does not work as expected when out_of_memory()
> is triggered by an allocation made after the current task has already
> released its memory in exit_mm() (e.g. from exit_task_work()), because
> the task may livelock waiting for memory that is never released while
> other tasks sit on a lot of memory.
>
> Therefore, perform checks similar to the sysctl_oom_kill_allocating_task
> case before giving the current task access to memory reserves.

The most important part is to check whether current still has its address
space. So please be explicit about that; referring to a sysctl without
saying what the check is is not very helpful.

Besides that, I do not think the oom_unkillable_task check you have added
is really correct. See below.

> Note that this patch cannot prevent somebody from calling
> oom_kill_process() on a victim task after that task has already got the
> PF_EXITING flag and released its memory. This means that the OOM killer
> stays disabled for an unpredictable duration when the victim task is
> unkillable due to a dependency invisible to the OOM killer (e.g. waiting
> for a lock held by somebody else) after somebody set the TIF_MEMDIE flag
> on the victim task via oom_kill_process(). Unfortunately, a local
> unprivileged user can make the victim task unkillable on purpose. There
> are two approaches to mitigating this problem. The workaround is a
> sysctl-tunable panic on TIF_MEMDIE timeout (detect DoS attacks and
> react; easy to backport; works for memory depletion bugs in kernel
> code). The preferred fix is complete kernel memory allocation tracking
> (try to avoid the DoS, but do nothing when avoidance fails; hard to
> backport; works for memory depletion attacks by user programs). Either
> way, that is beyond what this patch can do.

And I think this whole paragraph is not really relevant to the patch.

> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  include/linux/oom.h |  3 +++
>  mm/memcontrol.c     |  8 +++++++-
>  mm/oom_kill.c       | 12 +++++++++---
>  3 files changed, 19 insertions(+), 4 deletions(-)
>
[...]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c6ac50e..6d9532d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1558,8 +1558,14 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 * If current has a pending SIGKILL or is exiting, then automatically
>  	 * select it.  The goal is to allow it to allocate so that it may
>  	 * quickly exit and free its memory.
> +	 *
> +	 * However, if current is calling out_of_memory() by doing memory
> +	 * allocation from e.g. exit_task_work() in do_exit() after PF_EXITING
> +	 * was set by exit_signals() and mm was released by exit_mm(), it is
> +	 * wrong to expect current to exit and free its memory quickly.
>  	 */
> -	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
> +	if ((fatal_signal_pending(current) || current->flags & PF_EXITING) &&
> +	    current->mm && !oom_unkillable_task(current, memcg, NULL)) {
>  		set_thread_flag(TIF_MEMDIE);
>  		return;
>  	}

Why do you check oom_unkillable_task for the memcg OOM killer?

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 481d550..01719d6 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
[...]
> @@ -649,8 +649,14 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  	 * If current has a pending SIGKILL or is exiting, then automatically
>  	 * select it.  The goal is to allow it to allocate so that it may
>  	 * quickly exit and free its memory.
> +	 *
> +	 * However, if current is calling out_of_memory() by doing memory
> +	 * allocation from e.g. exit_task_work() in do_exit() after PF_EXITING
> +	 * was set by exit_signals() and mm was released by exit_mm(), it is
> +	 * wrong to expect current to exit and free its memory quickly.
>  	 */
> -	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
> +	if ((fatal_signal_pending(current) || task_will_free_mem(current)) &&
> +	    current->mm && !oom_unkillable_task(current, NULL, nodemask)) {
>  		set_thread_flag(TIF_MEMDIE);
>  		return;
>  	}

Calling oom_unkillable_task doesn't make much sense to me. Even if it made
sense, it should be in a separate patch, no?

-- 
Michal Hocko
SUSE Labs
* Re: [RFC PATCH] oom: Don't count on mm-less current process.
From: Tetsuo Handa @ 2014-12-17 11:54 UTC
To: mhocko; +Cc: linux-mm, rientjes, oleg

Michal Hocko wrote:
> > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > ---
> >  include/linux/oom.h |  3 +++
> >  mm/memcontrol.c     |  8 +++++++-
> >  mm/oom_kill.c       | 12 +++++++++---
> >  3 files changed, 19 insertions(+), 4 deletions(-)
> >
> [...]
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index c6ac50e..6d9532d 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1558,8 +1558,14 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  	 * If current has a pending SIGKILL or is exiting, then automatically
> >  	 * select it.  The goal is to allow it to allocate so that it may
> >  	 * quickly exit and free its memory.
> > +	 *
> > +	 * However, if current is calling out_of_memory() by doing memory
> > +	 * allocation from e.g. exit_task_work() in do_exit() after PF_EXITING
> > +	 * was set by exit_signals() and mm was released by exit_mm(), it is
> > +	 * wrong to expect current to exit and free its memory quickly.
> >  	 */
> > -	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
> > +	if ((fatal_signal_pending(current) || current->flags & PF_EXITING) &&
> > +	    current->mm && !oom_unkillable_task(current, memcg, NULL)) {
> >  		set_thread_flag(TIF_MEMDIE);
> >  		return;
> >  	}
>
> Why do you check oom_unkillable_task for the memcg OOM killer?

I'm not familiar with memcg. But I think the condition for setting the
TIF_MEMDIE flag should be the same between the memcg OOM killer and the
global OOM killer, because a thread inside some memcg holding TIF_MEMDIE
can prevent the global OOM killer from killing other threads when the
memcg OOM killer and the global OOM killer run concurrently (the worst
corner case). When a malicious user runs a memory consumer program that
deadlocks the memcg OOM killer inside some memcg, the result is a global
OOM killer deadlock once the global OOM killer is triggered by another
user's tasks.

> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 481d550..01719d6 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> [...]
> > @@ -649,8 +649,14 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> >  	 * If current has a pending SIGKILL or is exiting, then automatically
> >  	 * select it.  The goal is to allow it to allocate so that it may
> >  	 * quickly exit and free its memory.
> > +	 *
> > +	 * However, if current is calling out_of_memory() by doing memory
> > +	 * allocation from e.g. exit_task_work() in do_exit() after PF_EXITING
> > +	 * was set by exit_signals() and mm was released by exit_mm(), it is
> > +	 * wrong to expect current to exit and free its memory quickly.
> >  	 */
> > -	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
> > +	if ((fatal_signal_pending(current) || task_will_free_mem(current)) &&
> > +	    current->mm && !oom_unkillable_task(current, NULL, nodemask)) {
> >  		set_thread_flag(TIF_MEMDIE);
> >  		return;
> >  	}
>
> Calling oom_unkillable_task doesn't make much sense to me. Even if it
> made sense, it should be in a separate patch, no?

At least in the global OOM case, current may be a kernel thread, may it
not? Such a kernel thread can allocate memory from exit_task_work(),
trigger the global OOM killer, and thereby disable the global OOM killer
and prevent other threads from allocating memory, can it not?

We can use memcg to reduce the chance of triggering the global OOM
killer. But if we fail to prevent the global OOM killer from triggering,
the global OOM killer is responsible for resolving the OOM condition
rather than leaving the system stalled, presumably forever. Panic on
TIF_MEMDIE timeout can act like /proc/sys/vm/panic_on_oom only when the
OOM killer chose (by chance or by a trap) an unkillable task (unkillable
due to e.g. a lock dependency loop). Of course, for those who prefer the
system stalled over the OOM condition resolved, such action should be
optional, which is why I am happy to propose a sysctl-tunable version.

I think that a

	if (!task->mm && test_tsk_thread_flag(task, TIF_MEMDIE))
		return true;

check should be added to oom_unkillable_task(), because an mm-less thread
can release little memory (except invisible memory, if any). And if we
add a TIF_MEMDIE timeout check to oom_unkillable_task(), we can wait for
an mm-less TIF_MEMDIE thread for a short period before trying to kill
other threads (as with the with-mm TIF_MEMDIE threads which I
demonstrated to you off-list on Sat, 13 Dec 2014 23:28:33 +0900).

The post-exit_mm() issues will remain as long as OOM deadlock from
pre-exit_mm() issues remains. And, as I demonstrated to you off-list, OOM
deadlock from pre-exit_mm() issues is too difficult to solve because you
would need to track every lock dependency the way lockdep does. Thus, I
think this "oom: Don't count on mm-less current process." patch itself is
junk, and I added "the whole paragraph" to guide you toward how to handle
TIF_MEMDIE deadlock caused by pre-exit_mm() issues.

Generally memcg should work, but memcg depends on coordination with
userspace, which the targets I am troubleshooting (i.e. currently
deployed enterprise servers) do not have. The cause of a deadlock or
slowdown may be not a malicious user's attack but a bug in an enterprise
application or a kernel module. To debug troubles on currently deployed
enterprise servers, I want a solution that handles TIF_MEMDIE deadlock
caused by pre-exit_mm() issues without depending on memcg. But to
backport the solution to currently deployed enterprise servers, it first
needs to be accepted upstream. You say "Upstream kernels do not need a
TIF_MEMDIE timeout. Use memcg and you will not see the global OOM
condition.", but I can't force the targets to use memcg. Well, it's a
chicken-and-egg situation...
* Re: [RFC PATCH] oom: Don't count on mm-less current process.
From: Michal Hocko @ 2014-12-17 13:08 UTC
To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg

On Wed 17-12-14 20:54:53, Tetsuo Handa wrote:
[...]
> I'm not familiar with memcg.

This check doesn't make any sense for this path, because the task is part
of the memcg; otherwise it wouldn't trigger a charge and couldn't cause
the OOM killer. Kernel threads do not have their own address space, so
they cannot trigger the memcg OOM killer. As you provide a NULL nodemask,
this is basically a check for the task being part of the memcg. The check
for current->mm is not needed as well, because the task will not trigger
a charge after exit_mm.

> But I think the condition for setting the TIF_MEMDIE flag should be the
> same between the memcg OOM killer and the global OOM killer, because a
> thread inside some memcg holding TIF_MEMDIE can prevent the global OOM
> killer from killing other threads when the memcg OOM killer and the
> global OOM killer run concurrently (the worst corner case). When a
> malicious user runs a memory consumer program that deadlocks the memcg
> OOM killer inside some memcg, the result is a global OOM killer deadlock
> once the global OOM killer is triggered by another user's tasks.

Hope that the above explains your concerns here.

> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index 481d550..01719d6 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > [...]
> > > @@ -649,8 +649,14 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> > >  	 * If current has a pending SIGKILL or is exiting, then automatically
> > >  	 * select it.  The goal is to allow it to allocate so that it may
> > >  	 * quickly exit and free its memory.
> > > +	 *
> > > +	 * However, if current is calling out_of_memory() by doing memory
> > > +	 * allocation from e.g. exit_task_work() in do_exit() after PF_EXITING
> > > +	 * was set by exit_signals() and mm was released by exit_mm(), it is
> > > +	 * wrong to expect current to exit and free its memory quickly.
> > >  	 */
> > > -	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
> > > +	if ((fatal_signal_pending(current) || task_will_free_mem(current)) &&
> > > +	    current->mm && !oom_unkillable_task(current, NULL, nodemask)) {
> > >  		set_thread_flag(TIF_MEMDIE);
> > >  		return;
> > >  	}
> >
> > Calling oom_unkillable_task doesn't make much sense to me. Even if it
> > made sense, it should be in a separate patch, no?
>
> At least in the global OOM case, current may be a kernel thread, may it
> not?

Then mm would be NULL most of the time, so the current->mm check wouldn't
give it TIF_MEMDIE, and the task itself will be excluded later on during
task scanning.

> Such a kernel thread can allocate memory from exit_task_work(), trigger
> the global OOM killer, and thereby disable the global OOM killer and
> prevent other threads from allocating memory, can it not?
>
> We can use memcg to reduce the chance of triggering the global OOM
> killer.

I do not get this. A memcg charge happens after the allocation is done,
so the global OOM killer would trigger before the memcg one.

> But if we fail to prevent the global OOM killer from triggering, the
> global OOM killer is responsible for resolving the OOM condition rather
> than leaving the system stalled, presumably forever. Panic on TIF_MEMDIE
> timeout can act like /proc/sys/vm/panic_on_oom only when the OOM killer
> chose (by chance or by a trap) an unkillable task (unkillable due to
> e.g. a lock dependency loop). Of course, for those who prefer the system
> stalled over the OOM condition resolved, such action should be optional,
> which is why I am happy to propose a sysctl-tunable version.

You are getting off-topic again (which is pretty annoying, to be honest,
as it keeps going over and over again). Please focus on a single thing at
a time.

> I think that a
>
> 	if (!task->mm && test_tsk_thread_flag(task, TIF_MEMDIE))
> 		return true;
>
> check should be added to oom_unkillable_task(), because an mm-less
> thread can release little memory (except invisible memory, if any).

Why do you think this makes more sense than handling this very special
case in out_of_memory()? I really do not see any reason to make
oom_unkillable_task more complicated.

[...]
-- 
Michal Hocko
SUSE Labs
* Re: [RFC PATCH] oom: Don't count on mm-less current process.
From: Tetsuo Handa @ 2014-12-18 12:11 UTC
To: mhocko; +Cc: linux-mm, rientjes, oleg

Michal Hocko wrote:
> On Wed 17-12-14 20:54:53, Tetsuo Handa wrote:
> [...]
> > I'm not familiar with memcg.
>
> This check doesn't make any sense for this path, because the task is
> part of the memcg; otherwise it wouldn't trigger a charge and couldn't
> cause the OOM killer. Kernel threads do not have their own address
> space, so they cannot trigger the memcg OOM killer. As you provide a
> NULL nodemask, this is basically a check for the task being part of the
> memcg.

So !oom_unkillable_task(current, memcg, NULL) is always true for the
mem_cgroup_out_of_memory() case, isn't it?

> The check for current->mm is not needed as well, because the task will
> not trigger a charge after exit_mm.

So current->mm != NULL is always true for the mem_cgroup_out_of_memory()
case, isn't it?

> > But I think the condition for setting the TIF_MEMDIE flag should be
> > the same between the memcg OOM killer and the global OOM killer,
> > because a thread inside some memcg holding TIF_MEMDIE can prevent the
> > global OOM killer from killing other threads when the memcg OOM
> > killer and the global OOM killer run concurrently (the worst corner
> > case). When a malicious user runs a memory consumer program that
> > deadlocks the memcg OOM killer inside some memcg, the result is a
> > global OOM killer deadlock once the global OOM killer is triggered by
> > another user's tasks.
>
> Hope that the above explains your concerns here.

Thread1 in memcg1 asks for memory; thread1 gets the requested amount of
memory without triggering the global OOM killer, the requested amount is
charged to memcg1, and the memcg OOM killer is triggered. While the memcg
OOM killer is searching for a victim among the threads in memcg1, thread2
in memcg2 asks for memory. Thread2 fails to get the requested amount of
memory without triggering the global OOM killer. Now the global OOM
killer starts searching for a victim among all threads, while the memcg
OOM killer chooses thread1 in memcg1 and sets the TIF_MEMDIE flag on it.
Then the global OOM killer finds that thread1 in memcg1 already has
TIF_MEMDIE set, and waits for thread1 in memcg1 to terminate rather than
choosing another victim from all threads. However, when thread1 in memcg1
cannot terminate immediately for some reason, thread2 in memcg2 is
blocked by thread1 in memcg1.

> > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > index 481d550..01719d6 100644
> > > > --- a/mm/oom_kill.c
> > > > +++ b/mm/oom_kill.c
> > > [...]
> > > > @@ -649,8 +649,14 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> > > >  	 * If current has a pending SIGKILL or is exiting, then automatically
> > > >  	 * select it.  The goal is to allow it to allocate so that it may
> > > >  	 * quickly exit and free its memory.
> > > > +	 *
> > > > +	 * However, if current is calling out_of_memory() by doing memory
> > > > +	 * allocation from e.g. exit_task_work() in do_exit() after PF_EXITING
> > > > +	 * was set by exit_signals() and mm was released by exit_mm(), it is
> > > > +	 * wrong to expect current to exit and free its memory quickly.
> > > >  	 */
> > > > -	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
> > > > +	if ((fatal_signal_pending(current) || task_will_free_mem(current)) &&
> > > > +	    current->mm && !oom_unkillable_task(current, NULL, nodemask)) {
> > > >  		set_thread_flag(TIF_MEMDIE);
> > > >  		return;
> > > >  	}
> > >
> > > Calling oom_unkillable_task doesn't make much sense to me. Even if
> > > it made sense, it should be in a separate patch, no?
> >
> > At least in the global OOM case, current may be a kernel thread, may
> > it not?
>
> Then mm would be NULL most of the time, so the current->mm check
> wouldn't give it TIF_MEMDIE, and the task itself will be excluded later
> on during task scanning.
>
> > Such a kernel thread can allocate memory from exit_task_work(),
> > trigger the global OOM killer, and thereby disable the global OOM
> > killer and prevent other threads from allocating memory, can it not?
> >
> > We can use memcg to reduce the chance of triggering the global OOM
> > killer.
>
> I do not get this. A memcg charge happens after the allocation is done,
> so the global OOM killer would trigger before the memcg one.

I mean, someone triggers the global OOM killer between the time somebody
else triggered the memcg OOM killer and the time the memcg OOM killer
finishes.

> > But if we fail to prevent the global OOM killer from triggering, the
> > global OOM killer is responsible for resolving the OOM condition
> > rather than leaving the system stalled, presumably forever. Panic on
> > TIF_MEMDIE timeout can act like /proc/sys/vm/panic_on_oom only when
> > the OOM killer chose (by chance or by a trap) an unkillable task
> > (unkillable due to e.g. a lock dependency loop). Of course, for those
> > who prefer the system stalled over the OOM condition resolved, such
> > action should be optional, which is why I am happy to propose a
> > sysctl-tunable version.
>
> You are getting off-topic again (which is pretty annoying, to be
> honest, as it keeps going over and over again). Please focus on a
> single thing at a time.

I think focusing only on the mm-less case makes no sense, for the with-mm
case ruins the efforts made for the mm-less case. My question is quite
simple. How can we avoid memory allocation stalls when

  System has 2048MB of RAM and no swap.
  Memcg1 for task1 has quota 512MB and 400MB in use.
  Memcg2 for task2 has quota 512MB and 400MB in use.
  Memcg3 for task3 has quota 512MB and 400MB in use.
  Memcg4 for task4 has quota 512MB and 400MB in use.
  Memcg5 for task5 has quota 512MB and 1MB in use.

and task5 launches the memory consumption program below, which would
trigger the global OOM killer before triggering the memcg OOM killer?

---------- XFS + OOM killer dependency stall reproducer start ----------
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>
#include <sched.h>

static int file_writer(void *unused)
{
	static char buf[4096];
	const int fd = open("file", O_CREAT | O_WRONLY, 0600);

	while (write(fd, buf, sizeof(buf)) == sizeof(buf))
		fsync(fd);
	close(fd);
	return 0;
}

int main(int argc, char *argv[])
{
	int i;
	unsigned long size;
	const int fd = open("/dev/zero", O_RDONLY);
	char *buf = NULL;

	if (fd == -1)
		return 1;
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);

		if (!cp)
			break;
		buf = cp;
	}
	for (i = 0; i < 128; i++) {
		char *cp = malloc(4096);

		if (!cp || clone(file_writer, cp + 4096,
				 CLONE_SIGHAND | CLONE_VM, NULL) == -1)
			break;
	}
	read(fd, buf, size);
	return 0;
}
---------- XFS + OOM killer dependency stall reproducer end ----------

The global OOM killer will try to kill this program because it will be
using 400MB+ of RAM by the time the global OOM killer is triggered. But
sometimes this program cannot be terminated by the global OOM killer due
to an XFS lock dependency. You can see what is happening from the OOM
traces after uptime > 320 seconds in
http://I-love.SAKURA.ne.jp/tmp/serial-20141213.txt.xz , even though memcg
is not configured for this program.

Applying quotas via memcg as a safeguard is fine, but don't forget to
prepare for the global OOM killer. And please don't reject this with "use
memcg and never over-commit", for my proposal is about analyzing and
avoiding stalls caused not only by a malicious user's attacks but also by
bugs in enterprise applications or kernel modules, and/or stalls of
servers where coordination with userspace is impossible.

> > I think that a
> >
> > 	if (!task->mm && test_tsk_thread_flag(task, TIF_MEMDIE))
> > 		return true;
> >
> > check should be added to oom_unkillable_task(), because an mm-less
> > thread can release little memory (except invisible memory, if any).
>
> Why do you think this makes more sense than handling this very special
> case in out_of_memory()? I really do not see any reason to make
> oom_unkillable_task more complicated.

Because everyone can safely skip victim threads that have no mm. Handling
the setting of TIF_MEMDIE in the caller is racy: somebody may set
TIF_MEMDIE in oom_kill_process() even if we avoided setting TIF_MEMDIE in
out_of_memory(). There will be more locations where TIF_MEMDIE is set;
even out-of-tree modules might set TIF_MEMDIE.

Nonetheless, I don't think the

	if (!task->mm && test_tsk_thread_flag(task, TIF_MEMDIE))
		return true;

check is perfect, because we need to prepare for both mm-less and with-mm
cases anyway. My concern is not whether the TIF_MEMDIE flag should be
set, nor whether task->mm is NULL. My concern is whether threads with the
TIF_MEMDIE flag retard other processes' memory allocation. The program
above is an example of with-mm threads retarding other processes' memory
allocation. I know you don't like the timeout approach, but adding a

	if (sysctl_memdie_timeout_secs &&
	    test_tsk_thread_flag(task, TIF_MEMDIE) &&
	    time_after(jiffies, task->memdie_start +
			        sysctl_memdie_timeout_secs * HZ))
		return true;

check to oom_unkillable_task() will take care of both the mm-less and
with-mm cases, because everyone can safely skip TIF_MEMDIE victim threads
that cannot terminate immediately for some reason.
* Re: [RFC PATCH] oom: Don't count on mm-less current process. 2014-12-18 12:11 ` Tetsuo Handa @ 2014-12-18 15:33 ` Michal Hocko 2014-12-19 12:07 ` Tetsuo Handa 2014-12-19 12:22 ` How to handle TIF_MEMDIE stalls? Tetsuo Handa 0 siblings, 2 replies; 276+ messages in thread From: Michal Hocko @ 2014-12-18 15:33 UTC (permalink / raw) To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg On Thu 18-12-14 21:11:26, Tetsuo Handa wrote: > Michal Hocko wrote: > > On Wed 17-12-14 20:54:53, Tetsuo Handa wrote: > > [...] > > > I'm not familiar with memcg. > > > > This check doesn't make any sense for this path because the task is part > > of the memcg, otherwise it wouldn't trigger charge for it and couldn't > > cause the OOM killer. Kernel threads do not have their address space > > they cannot trigger memcg OOM killer. As you provide NULL nodemask then > > this is basically a check for task being part of the memcg. > > So !oom_unkillable_task(current, memcg, NULL) is always true for > mem_cgroup_out_of_memory() case, isn't it? yes, unless the task has moved away from the memcg since the charge happened but that is not important because the charge happened for the given memcg and so the OOM should happen there. > > The check > > for current->mm is not needed as well because task will not trigger a > > charge after exit_mm. > > So current->mm != NULL is always true for mem_cgroup_out_of_memory() > case, isn't it? yes > > > But I think the condition whether TIF_MEMDIE > > > flag should be set or not should be same between the memcg OOM killer and > > > the global OOM killer, for a thread inside some memcg with TIF_MEMDIE flag > > > can prevent the global OOM killer from killing other threads when the memcg > > > OOM killer and the global OOM killer run concurrently (the worst corner case). 
> > > When a malicious user runs a memory consumer program which triggers memcg OOM > > > killer deadlock inside some memcg, it will result in the global OOM killer > > > deadlock when the global OOM killer is triggered by other user's tasks. > > > > Hope that the above exaplains your concerns here. > > > > Thread1 in memcg1 asks for memory, and thread1 gets requested amount of > memory without triggering the global OOM killer, and requested amount of > memory is charged to memcg1, and the memcg OOM killer is triggered. > While the memcg OOM killer is searching for a victim from threads in > memcg1, thread2 in memcg2 asks for the memory. Thread2 fails to get > requested amount of memory without triggering the global OOM killer. > Now the global OOM killer starts searching for a victim from all threads > whereas the memcg OOM killer chooses thread1 in memcg1 and sets TIF_MEMDIE > flag on thread1 in memcg1. Then, the global OOM killer finds that thread1 > in memcg1 already has TIF_MEMDIE flag set, and waits for thread1 in memcg1 > to terminate than chooses another victim from all threads. However, when > thread1 in memcg1 cannot be terminated immediately for some reason, thread2 > in memcg2 is blocked by thread1 in memcg1. Sigh... T1 triggers memcg OOM killer _only_ from the page fault path and so it will get to signal processing right away and eventually gets to exit_mm where it releases its memory. If that doesn't suffice to release enough memory then we are back to the original problem. So I do not think memcg adds anything new to the problem. [...] > I think focusing on only mm-less case makes no sense, for with-mm case > ruins efforts made for mm-less case. No. It is quite opposite. Excluding mm less current from PF_EXITING resp. fatal_signal_pending heuristics makes perfect sense from the OOM killer POV. The reasons are described in the changelog. > My question is quite simple. 
How can we avoid memory allocation stalls when
>
> System has 2048MB of RAM and no swap.
> Memcg1 for task1 has quota 512MB and 400MB in use.
> Memcg2 for task2 has quota 512MB and 400MB in use.
> Memcg3 for task3 has quota 512MB and 400MB in use.
> Memcg4 for task4 has quota 512MB and 400MB in use.
> Memcg5 for task5 has quota 512MB and 1MB in use.
>
> and task5 launches below memory consumption program which would trigger
> the global OOM killer before triggering the memcg OOM killer?
> [...]
> The global OOM killer will try to kill this program because this program
> will be using 400MB+ of RAM by the time the global OOM killer is triggered.
> But sometimes this program cannot be terminated by the global OOM killer
> due to XFS lock dependency.
>
> You can see what is happening from OOM traces after uptime > 320 seconds of
> http://I-love.SAKURA.ne.jp/tmp/serial-20141213.txt.xz though memcg is not
> configured on this program.

This is clearly a separate issue. It is a lock dependency and that alone _cannot_ be handled from the OOM killer as it doesn't understand lock dependencies. This should be addressed from the xfs point of view IMHO, but I am not familiar enough with this filesystem to tell you how or whether it is possible.

[...]
> > > 	if (!task->mm && test_tsk_thread_flag(task, TIF_MEMDIE))
> > > 		return true;
> > >
> > > check should be added to oom_unkillable_task() because mm-less thread can
> > > release little memory (except invisible memory if any).
> >
> > Why do you think this makes more sense than handling this very special
> > case in out_of_memory? I really do not see any reason to to make
> > oom_unkillable_task more complicated.
>
> Because everyone can safely skip victim threads who don't have mm.

And that is handled already. Check oom_badness and its find_lock_task_mm, oom_scan_process_thread and its task->mm check, and out_of_memory and the complete sysctl_oom_kill_allocating_task check.

> Handling setting of TIF_MEMDIE in the caller is racy.
Any operation on another task is racy, that's why I prefer the current->mm check in out_of_memory.

> Somebody may set
> TIF_MEMDIE at oom_kill_process() even if we avoided setting TIF_MEMDIE at
> out_of_memory(). There will be more locations where TIF_MEMDIE is set; even
> out-of-tree modules might set TIF_MEMDIE.

TIF_MEMDIE should be set only when we _know_ the task will free _some_ memory and when we are killing the OOM victim. The only place I can see that would break the first condition is out_of_memory for the current which passed exit_mm(). That is why I've suggested this patch to you, and it would be much easier if we could simply finish that one without pulling other things in.

Out-of-tree and even in-tree modules have no business in setting the flag. lowmemory killer is doing that but that is an abuse and should be fixed in another way. TIF_MEMDIE is not a flag anybody can touch.

> Nonetheless, I don't think
>
> 	if (!task->mm && test_tsk_thread_flag(task, TIF_MEMDIE))
> 		return true;
>
> check is perfect because we anyway need to prepare for both mm-less and
> with-mm cases.
>
> My concern is not "whether TIF_MEMDIE flag should be set or not". My concern
> is not "whether task->mm is NULL or not". My concern is "whether threads with
> TIF_MEMDIE flag retard other process' memory allocation or not".
> Above-mentioned program is an example of with-mm threads retarding
> other process' memory allocation.

There is no way you can guarantee something like that. OOM is the _last_ resort. Things are in a pretty bad state already when it hits. It is the last attempt to reclaim some memory. System might be in an arbitrary state at this time. I really hate to repeat myself but you are trying to "fix" your problem at the wrong level.
> I know you don't like timeout approach, but adding
>
> 	if (sysctl_memdie_timeout_secs && test_tsk_thread_flag(task, TIF_MEMDIE) &&
> 	    time_after(jiffies, task->memdie_start + sysctl_memdie_timeout_secs * HZ))
> 		return true;
>
> check to oom_unkillable_task() will take care of both mm-less and with-mm
> cases because everyone can safely skip the TIF_MEMDIE victim threads who
> cannot be terminated immediately for some reason.

It will not take care of anything. It will start shooting more processes after some timeout, which is hard to get right, and there wouldn't be any guarantee multiple victims will help because they might end up blocking on the very same or another lock on the way out. Jeez, are you even reading the feedback you are getting?

--
Michal Hocko
SUSE Labs
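For readers following the timeout proposal quoted above, here is a minimal user-space model of the check. This is a sketch, not kernel code: `sysctl_memdie_timeout_secs` and `task->memdie_start` are hypothetical fields from the proposal that never landed upstream, and `jiffies`, `HZ`, and `TIF_MEMDIE` are simplified stand-ins.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the kernel's timekeeping and flag machinery. */
#define HZ 100
#define TIF_MEMDIE 0x1

struct task {
	unsigned int flags;          /* thread-info flags (model only) */
	unsigned long memdie_start;  /* jiffies when TIF_MEMDIE was set */
};

static unsigned long sysctl_memdie_timeout_secs; /* hypothetical sysctl */

/* Kernel-style time_after(): true when a is after b, wrap-safe. */
static bool time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/*
 * Mirrors the extra test proposed for oom_unkillable_task(): once the
 * timeout expires, a stuck TIF_MEMDIE victim would be treated as
 * unkillable, re-enabling victim selection for everyone else.
 */
static bool memdie_timed_out(const struct task *t, unsigned long jiffies)
{
	return sysctl_memdie_timeout_secs &&
	       (t->flags & TIF_MEMDIE) &&
	       time_after(jiffies,
			  t->memdie_start + sysctl_memdie_timeout_secs * HZ);
}
```

Michal's objection applies unchanged to this model: the threshold value is policy, and expiring one victim says nothing about whether the next victim will block on the same lock.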
* Re: [RFC PATCH] oom: Don't count on mm-less current process. 2014-12-18 15:33 ` Michal Hocko @ 2014-12-19 12:07 ` Tetsuo Handa 2014-12-19 12:49 ` Michal Hocko 2014-12-19 12:22 ` How to handle TIF_MEMDIE stalls? Tetsuo Handa 1 sibling, 1 reply; 276+ messages in thread From: Tetsuo Handa @ 2014-12-19 12:07 UTC (permalink / raw) To: mhocko; +Cc: linux-mm, rientjes, oleg Michal Hocko wrote: > On Thu 18-12-14 21:11:26, Tetsuo Handa wrote: > > > > But I think the condition whether TIF_MEMDIE > > > > flag should be set or not should be same between the memcg OOM killer and > > > > the global OOM killer, for a thread inside some memcg with TIF_MEMDIE flag > > > > can prevent the global OOM killer from killing other threads when the memcg > > > > OOM killer and the global OOM killer run concurrently (the worst corner case). > > > > When a malicious user runs a memory consumer program which triggers memcg OOM > > > > killer deadlock inside some memcg, it will result in the global OOM killer > > > > deadlock when the global OOM killer is triggered by other user's tasks. > > > > > > Hope that the above exaplains your concerns here. > > > > > > > Thread1 in memcg1 asks for memory, and thread1 gets requested amount of > > memory without triggering the global OOM killer, and requested amount of > > memory is charged to memcg1, and the memcg OOM killer is triggered. > > While the memcg OOM killer is searching for a victim from threads in > > memcg1, thread2 in memcg2 asks for the memory. Thread2 fails to get > > requested amount of memory without triggering the global OOM killer. > > Now the global OOM killer starts searching for a victim from all threads > > whereas the memcg OOM killer chooses thread1 in memcg1 and sets TIF_MEMDIE > > flag on thread1 in memcg1. Then, the global OOM killer finds that thread1 > > in memcg1 already has TIF_MEMDIE flag set, and waits for thread1 in memcg1 > > to terminate than chooses another victim from all threads. 
However, when
> > thread1 in memcg1 cannot be terminated immediately for some reason, thread2
> > in memcg2 is blocked by thread1 in memcg1.
>
> Sigh... T1 triggers memcg OOM killer _only_ from the page fault path and so it
> will get to signal processing right away and eventually gets to exit_mm
> where it releases its memory. If that doesn't suffice to release enough
> memory then we are back to the original problem. So I do not think memcg
> adds anything new to the problem.
>

The memcg OOM killer is triggered upon page fault rather than upon memory charge, I see. But the memcg OOM killer is not relevant to my concern. It's a matter of which OOM killer sets the TIF_MEMDIE flag.

>
> [...]
> > I think focusing on only mm-less case makes no sense, for with-mm case
> > ruins efforts made for mm-less case.
>
> No. It is quite opposite. Excluding mm less current from PF_EXITING
> resp. fatal_signal_pending heuristics makes perfect sense from the OOM
> killer POV. The reasons are described in the changelog.
>

OK. Below is an updated patch.

----------------------------------------
>From 3c68c66a72f0dbfc66f9799a00fbaa1f0217befb Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 19 Dec 2014 20:49:06 +0900
Subject: [PATCH v2] oom: Don't count on mm-less current process.

out_of_memory() doesn't trigger the OOM killer if the current task is already exiting or it has fatal signals pending, and gives the task access to memory reserves instead. However, doing so is wrong if out_of_memory() is called by an allocation (e.g. from exit_task_work()) after the current task has already released its memory and cleared TIF_MEMDIE at exit_mm(). If we again set TIF_MEMDIE to post-exit_mm() current task, the OOM killer will be blocked by the task sitting in the final schedule() waiting for its parent to reap it. It will trigger an OOM livelock if its parent is unable to reap it due to doing an allocation and waiting for the OOM killer to kill it.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 481d550..e87391f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -649,8 +649,12 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 * If current has a pending SIGKILL or is exiting, then automatically
 	 * select it. The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
+	 *
+	 * But don't select if current has already released its mm and cleared
+	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
 	 */
-	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
+	if ((fatal_signal_pending(current) || task_will_free_mem(current)) &&
+	    current->mm) {
 		set_thread_flag(TIF_MEMDIE);
 		return;
 	}
--
1.8.3.1
* Re: [RFC PATCH] oom: Don't count on mm-less current process. 2014-12-19 12:07 ` Tetsuo Handa @ 2014-12-19 12:49 ` Michal Hocko 2014-12-20 9:13 ` Tetsuo Handa 0 siblings, 1 reply; 276+ messages in thread From: Michal Hocko @ 2014-12-19 12:49 UTC (permalink / raw) To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg On Fri 19-12-14 21:07:53, Tetsuo Handa wrote: [...] > >From 3c68c66a72f0dbfc66f9799a00fbaa1f0217befb Mon Sep 17 00:00:00 2001 > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > Date: Fri, 19 Dec 2014 20:49:06 +0900 > Subject: [PATCH v2] oom: Don't count on mm-less current process. > > out_of_memory() doesn't trigger the OOM killer if the current task is already > exiting or it has fatal signals pending, and gives the task access to memory > reserves instead. However, doing so is wrong if out_of_memory() is called by > an allocation (e.g. from exit_task_work()) after the current task has already > released its memory and cleared TIF_MEMDIE at exit_mm(). If we again set > TIF_MEMDIE to post-exit_mm() current task, the OOM killer will be blocked by > the task sitting in the final schedule() waiting for its parent to reap it. > It will trigger an OOM livelock if its parent is unable to reap it due to > doing an allocation and waiting for the OOM killer to kill it. > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.cz> Just a nit, You could start the condition with current->mm because it is the simplest check. We do not have to check for signals pending or PF_EXITING at all if it is NULL. But this is not a hot path so it doesn't matter much. It is just a good practice to start with the simplest tests first. Please also make sure to add Andrew to CC when sending the patch again so that he knows about it and picks it up. Thanks! 
> ---
>  mm/oom_kill.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 481d550..e87391f 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -649,8 +649,12 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  	 * If current has a pending SIGKILL or is exiting, then automatically
>  	 * select it. The goal is to allow it to allocate so that it may
>  	 * quickly exit and free its memory.
> +	 *
> +	 * But don't select if current has already released its mm and cleared
> +	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
>  	 */
> -	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
> +	if ((fatal_signal_pending(current) || task_will_free_mem(current)) &&
> +	    current->mm) {
>  		set_thread_flag(TIF_MEMDIE);
>  		return;
>  	}
> --
> 1.8.3.1

--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH] oom: Don't count on mm-less current process. 2014-12-19 12:49 ` Michal Hocko @ 2014-12-20 9:13 ` Tetsuo Handa 2014-12-20 11:42 ` Tetsuo Handa 0 siblings, 1 reply; 276+ messages in thread From: Tetsuo Handa @ 2014-12-20 9:13 UTC (permalink / raw) To: mhocko, akpm; +Cc: linux-mm, rientjes, oleg Michal Hocko wrote: > On Fri 19-12-14 21:07:53, Tetsuo Handa wrote: > [...] > > >From 3c68c66a72f0dbfc66f9799a00fbaa1f0217befb Mon Sep 17 00:00:00 2001 > > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > > Date: Fri, 19 Dec 2014 20:49:06 +0900 > > Subject: [PATCH v2] oom: Don't count on mm-less current process. > > > > out_of_memory() doesn't trigger the OOM killer if the current task is already > > exiting or it has fatal signals pending, and gives the task access to memory > > reserves instead. However, doing so is wrong if out_of_memory() is called by > > an allocation (e.g. from exit_task_work()) after the current task has already > > released its memory and cleared TIF_MEMDIE at exit_mm(). If we again set > > TIF_MEMDIE to post-exit_mm() current task, the OOM killer will be blocked by > > the task sitting in the final schedule() waiting for its parent to reap it. > > It will trigger an OOM livelock if its parent is unable to reap it due to > > doing an allocation and waiting for the OOM killer to kill it. > > > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > > Acked-by: Michal Hocko <mhocko@suse.cz> > > Just a nit, You could start the condition with current->mm because it > is the simplest check. We do not have to check for signals pending or > PF_EXITING at all if it is NULL. But this is not a hot path so it > doesn't matter much. It is just a good practice to start with the > simplest tests first. > > Please also make sure to add Andrew to CC when sending the patch again > so that he knows about it and picks it up. > > Thanks! > I see. Here is v3 patch. Andrew, would you please pick this up? 
By the way, Michal, I think there is still an unlikely race window at set_tsk_thread_flag(p, TIF_MEMDIE) in oom_kill_process(). For example, task1 calls out_of_memory() and select_bad_process() is called from out_of_memory(). oom_scan_process_thread(task2) is called from select_bad_process(). oom_scan_process_thread() returns OOM_SCAN_OK because task2->mm != NULL and task_will_free_mem(task2) == false. select_bad_process() calls get_task_struct(task2) and returns task2. Task1 goes to sleep and task2 is woken up. Task2 enters into do_exit() and gets PF_EXITING at exit_signals() and releases mm at exit_mm(). Task2 goes to sleep and task1 is woken up. Task1 calls oom_kill_process(task2). oom_kill_process() sets TIF_MEMDIE on task2 because task_will_free_mem(task2) == true due to PF_EXITING already set... Should we do something like

	if (task_will_free_mem(p)) {
		if (p->mm)
			set_tsk_thread_flag(p, TIF_MEMDIE);
		put_task_struct(p);
		return;
	}

at oom_kill_process()? Or even if we do so, how do we check whether task1 went to sleep between the task2->mm check and set_tsk_thread_flag(task2, TIF_MEMDIE)? This race window is very very unlikely because releasing task2->mm is expected to release some memory. But if somebody else consumed the memory released by exit_mm(task2), I think there is nothing to protect.

----------------------------------------
>From 3a75c92a03cf17d9505bbb7fc9c81603daac9da0 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sat, 20 Dec 2014 17:18:37 +0900
Subject: [PATCH v3] oom: Don't count on mm-less current process.

out_of_memory() doesn't trigger the OOM killer if the current task is already exiting or it has fatal signals pending, and gives the task access to memory reserves instead. However, doing so is wrong if out_of_memory() is called by an allocation (e.g. from exit_task_work()) after the current task has already released its memory and cleared TIF_MEMDIE at exit_mm().
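The interleaving described above can be sketched as a single-threaded user-space model. This is a toy, not kernel code: `task_will_free_mem()` is reduced to a PF_EXITING test, and the mm pointer, flag words, and ordering stand in for the real task_struct and scheduler behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the task1/task2 interleaving; constants are stand-ins. */
#define PF_EXITING 0x4
#define TIF_MEMDIE 0x1

struct task {
	void *mm;           /* NULL once exit_mm() has run */
	unsigned int flags; /* PF_EXITING, etc. */
	unsigned int tif;   /* thread-info flags */
};

/* Simplified task_will_free_mem(): only looks at PF_EXITING. */
static bool task_will_free_mem(const struct task *p)
{
	return p->flags & PF_EXITING;
}

/* What task2 does between being selected and being "killed". */
static void model_exit_mm(struct task *p)
{
	p->flags |= PF_EXITING; /* actually set earlier, in exit_signals() */
	p->mm = NULL;           /* the memory is gone after this point */
}

/* The problematic fast path in oom_kill_process() before any fix. */
static void oom_kill_process_old(struct task *p)
{
	if (task_will_free_mem(p))
		p->tif |= TIF_MEMDIE; /* granted even though p->mm is NULL */
}
```

Running the steps in the order of the scenario shows the bug: the victim ends up holding TIF_MEMDIE with no mm left to release, which is exactly what keeps the OOM killer disabled.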
If we again set TIF_MEMDIE to post-exit_mm() current task, the OOM killer will be blocked by the task sitting in the final schedule() waiting for its parent to reap it. It will trigger an OOM livelock if its parent is unable to reap it due to doing an allocation and waiting for the OOM killer to kill it.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.cz>

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d503e9c..f82dd13 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -643,8 +643,12 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 * If current has a pending SIGKILL or is exiting, then automatically
 	 * select it. The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
+	 *
+	 * But don't select if current has already released its mm and cleared
+	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
 	 */
-	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
+	if (current->mm &&
+	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
 		set_thread_flag(TIF_MEMDIE);
 		return;
 	}
--
1.8.3.1
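The effect of the v3 condition change can be summarized in a small user-space model. This is a hedged sketch: `sigkill_pending` and `exiting` stand in for `fatal_signal_pending()` and `task_will_free_mem()`, and the mm pointer stands in for `current->mm`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* User-space model of the out_of_memory() fast-path condition. */
struct task {
	void *mm;             /* NULL after exit_mm() */
	bool sigkill_pending; /* stands in for fatal_signal_pending() */
	bool exiting;         /* stands in for task_will_free_mem()   */
};

/* Before the patch: any exiting/killed caller is granted TIF_MEMDIE. */
static bool grants_memdie_old(const struct task *t)
{
	return t->sigkill_pending || t->exiting;
}

/* After the v3 patch: an mm-less caller is never granted TIF_MEMDIE here. */
static bool grants_memdie_v3(const struct task *t)
{
	return t->mm && (t->sigkill_pending || t->exiting);
}
```

The only behavioral difference is the post-exit_mm() caller: it has nothing left to free, so granting it memory reserves can only prolong the OOM situation.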
* Re: [RFC PATCH] oom: Don't count on mm-less current process. 2014-12-20 9:13 ` Tetsuo Handa @ 2014-12-20 11:42 ` Tetsuo Handa 2014-12-22 20:25 ` Michal Hocko 0 siblings, 1 reply; 276+ messages in thread From: Tetsuo Handa @ 2014-12-20 11:42 UTC (permalink / raw) To: mhocko, akpm; +Cc: linux-mm, rientjes, oleg Tetsuo Handa wrote: > By the way, Michal, I think there is still an unlikely race window at > set_tsk_thread_flag(p, TIF_MEMDIE) in oom_kill_process(). For example, > task1 calls out_of_memory() and select_bad_process() is called from > out_of_memory(). oom_scan_process_thread(task2) is called from > select_bad_process(). oom_scan_process_thread() returns OOM_SCAN_OK > because task2->mm != NULL and task_will_free_mem(task2) == false. > select_bad_process() calls get_task_struct(task2) and returns task2. > Task1 goes to sleep and task2 is woken up. Task2 enters into do_exit() > and gets PF_EXITING at exit_signals() and releases mm at exit_mm(). > Task2 goes to sleep and task1 is woken up. Task1 calls > oom_kill_process(task2). oom_kill_process() sets TIF_MEMDIE on task2 > because task_will_free_mem(task2) == true due to PF_EXITING already set... > Should we do like > > if (task_will_free_mem(p)) { > if (p->mm) > set_tsk_thread_flag(p, TIF_MEMDIE); > put_task_struct(p); > return; > } > > at oom_kill_process() ? Or even if we do so, how to check if task1 went > to sleep between task2->mm and set_tsk_thread_flag(task2, TIF_MEMDIE) ? > This race window is very very unlikely because releasing task2->mm is > expected to release some memory. But if somebody else consumed memory > released by exit_mm(task2), I think there is nothing to protect. Well, this could happen if task2 is one of threads in a multi-threaded process like Java where exit_mm(task2) decrements refcount than releases memory. Below is a patch. Michal, please check. 
---------------------------------------- >From a2ebb5b873ec5af45e0bea9ea6da2a93c0f06c35 Mon Sep 17 00:00:00 2001 From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Date: Sat, 20 Dec 2014 20:05:14 +0900 Subject: [PATCH] oom: Close race of setting TIF_MEMDIE to mm-less process. exit_mm() and oom_kill_process() could race with regard to handling of TIF_MEMDIE flag if sequence described below occurred. P1 calls out_of_memory(). out_of_memory() calls select_bad_process(). select_bad_process() calls oom_scan_process_thread(P2). If P2->mm != NULL and task_will_free_mem(P2) == false, oom_scan_process_thread(P2) returns OOM_SCAN_OK. And if P2 is chosen as a victim task, select_bad_process() returns P2 after calling get_task_struct(P2). Then, P1 goes to sleep and P2 is woken up. P2 enters into do_exit() and gets PF_EXITING at exit_signals() and releases mm at exit_mm(). Then, P2 goes to sleep and P1 is woken up. P1 calls oom_kill_process(P2). oom_kill_process() sets TIF_MEMDIE on P2 because task_will_free_mem(P2) == true due to PF_EXITING already set. Afterward, oom_scan_process_thread(P2) will return OOM_SCAN_ABORT because test_tsk_thread_flag(P2, TIF_MEMDIE) is checked before P2->mm is checked. If TIF_MEMDIE was again set to P2, the OOM killer will be blocked by P2 sitting in the final schedule() waiting for P2's parent to reap P2. It will trigger an OOM livelock if P2's parent is unable to reap P2 due to doing an allocation and waiting for the OOM killer to kill P2. To close this race window, clear TIF_MEMDIE if P2->mm == NULL after set_tsk_thread_flag(P2, TIF_MEMDIE) is done. 
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 kernel/exit.c | 1 +
 mm/oom_kill.c | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/kernel/exit.c b/kernel/exit.c
index 1ea4369..46d72e6 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -435,6 +435,7 @@ static void exit_mm(struct task_struct *tsk)
 	task_unlock(tsk);
 	mm_update_next_owner(mm);
 	mmput(mm);
+	smp_wmb(); /* Avoid race with oom_kill_process(). */
 	clear_thread_flag(TIF_MEMDIE);
 }

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f82dd13..c8ae445 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -440,6 +440,9 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 */
 	if (task_will_free_mem(p)) {
 		set_tsk_thread_flag(p, TIF_MEMDIE);
+		smp_rmb(); /* Avoid race with exit_mm(). */
+		if (unlikely(!p->mm))
+			clear_tsk_thread_flag(p, TIF_MEMDIE);
 		put_task_struct(p);
 		return;
 	}
--
1.8.3.1
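The set-then-recheck idea in the patch above can be modeled in user space. This is a single-threaded sketch only: the real patch needs the smp_wmb()/smp_rmb() pair because the two paths run on different CPUs, and Michal's follow-up argues the window is not fully closed; the model just shows what the re-check is meant to undo.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the set-then-recheck sequence; constants are stand-ins. */
#define PF_EXITING 0x4
#define TIF_MEMDIE 0x1

struct task {
	void *mm;
	unsigned int flags;
	unsigned int tif;
};

/* Simplified task_will_free_mem(): only looks at PF_EXITING. */
static bool task_will_free_mem(const struct task *p)
{
	return p->flags & PF_EXITING;
}

static void oom_mark_victim_rechecked(struct task *p)
{
	if (task_will_free_mem(p)) {
		p->tif |= TIF_MEMDIE;
		/*
		 * Re-check after setting; the kernel patch pairs this
		 * with smp_rmb()/smp_wmb(), which a single-threaded
		 * model does not need.
		 */
		if (p->mm == NULL)
			p->tif &= ~TIF_MEMDIE;
	}
}
```

A victim that already lost its mm ends up without TIF_MEMDIE, while a still-with-mm exiting victim keeps the fast-path grant.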
* Re: [RFC PATCH] oom: Don't count on mm-less current process. 2014-12-20 11:42 ` Tetsuo Handa @ 2014-12-22 20:25 ` Michal Hocko 2014-12-23 1:00 ` Tetsuo Handa 0 siblings, 1 reply; 276+ messages in thread From: Michal Hocko @ 2014-12-22 20:25 UTC (permalink / raw) To: Tetsuo Handa; +Cc: akpm, linux-mm, rientjes, oleg On Sat 20-12-14 20:42:08, Tetsuo Handa wrote: [...] > >From a2ebb5b873ec5af45e0bea9ea6da2a93c0f06c35 Mon Sep 17 00:00:00 2001 > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > Date: Sat, 20 Dec 2014 20:05:14 +0900 > Subject: [PATCH] oom: Close race of setting TIF_MEMDIE to mm-less process. > > exit_mm() and oom_kill_process() could race with regard to handling of > TIF_MEMDIE flag if sequence described below occurred. > > P1 calls out_of_memory(). out_of_memory() calls select_bad_process(). > select_bad_process() calls oom_scan_process_thread(P2). If P2->mm != NULL > and task_will_free_mem(P2) == false, oom_scan_process_thread(P2) returns > OOM_SCAN_OK. And if P2 is chosen as a victim task, select_bad_process() > returns P2 after calling get_task_struct(P2). Then, P1 goes to sleep and > P2 is woken up. P2 enters into do_exit() and gets PF_EXITING at exit_signals() > and releases mm at exit_mm(). Then, P2 goes to sleep and P1 is woken up. > P1 calls oom_kill_process(P2). oom_kill_process() sets TIF_MEMDIE on P2 > because task_will_free_mem(P2) == true due to PF_EXITING already set. > Afterward, oom_scan_process_thread(P2) will return OOM_SCAN_ABORT because > test_tsk_thread_flag(P2, TIF_MEMDIE) is checked before P2->mm is checked. > > If TIF_MEMDIE was again set to P2, the OOM killer will be blocked by P2 > sitting in the final schedule() waiting for P2's parent to reap P2. > It will trigger an OOM livelock if P2's parent is unable to reap P2 due to > doing an allocation and waiting for the OOM killer to kill P2. > > To close this race window, clear TIF_MEMDIE if P2->mm == NULL after > set_tsk_thread_flag(P2, TIF_MEMDIE) is done. 
I do not think this patch is sufficient. P2 could pass exit_mm() right after task_unlock in oom_kill_process and we would set TIF_MEMDIE to this task as well. Something like the following should work and it doesn't add memory-barrier trickery.

---
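The patch body is missing from this archived message, but the follow-up quotes describe its idea: test task->mm and set TIF_MEMDIE under task_lock(), so the check and the flag-setting cannot interleave with exit_mm() clearing task->mm under the same lock. A user-space model of that idea (a sketch reconstructed from the quotes, not the elided patch; the lock is a plain flag with assertions instead of a real spinlock) is:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TIF_MEMDIE 0x1

struct task {
	bool locked; /* stands in for task_lock()/task_unlock() */
	void *mm;
	unsigned int tif;
};

static void model_task_lock(struct task *t)   { assert(!t->locked); t->locked = true; }
static void model_task_unlock(struct task *t) { assert(t->locked);  t->locked = false; }

/* exit_mm(): clears ->mm under the lock, then drops TIF_MEMDIE. */
static void model_exit_mm(struct task *t)
{
	model_task_lock(t);
	t->mm = NULL; /* cannot interleave with the check below */
	model_task_unlock(t);
	t->tif &= ~TIF_MEMDIE;
}

/* OOM path: grant TIF_MEMDIE only while ->mm is provably still there. */
static bool model_mark_victim(struct task *t)
{
	bool granted = false;

	model_task_lock(t);
	if (t->mm) { /* mm cannot be cleared while we hold the lock */
		t->tif |= TIF_MEMDIE;
		granted = true;
	}
	model_task_unlock(t);
	return granted;
}
```

The point of the serialization is the invariant the test below checks: once exit_mm() has run, the OOM path can never again hand out TIF_MEMDIE to that task.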
* Re: [RFC PATCH] oom: Don't count on mm-less current process. 2014-12-22 20:25 ` Michal Hocko @ 2014-12-23 1:00 ` Tetsuo Handa 2014-12-23 9:51 ` Michal Hocko 0 siblings, 1 reply; 276+ messages in thread From: Tetsuo Handa @ 2014-12-23 1:00 UTC (permalink / raw) To: mhocko; +Cc: akpm, linux-mm, rientjes, oleg

Michal Hocko wrote:
> OOM killer tries to exlude tasks which do not have mm_struct associated

s/exlude/exclude/

> Fix this by checking task->mm and setting TIF_MEMDIE flag under task_lock
> which will serialize the OOM killer with exit_mm which sets task->mm to
> NULL.

Nice idea.

By the way, find_lock_task_mm(victim) may succeed if victim->mm == NULL and one of threads in victim thread-group has non-NULL mm. That case is handled by victim != p branch below. But where was p->signal->oom_score_adj != OOM_SCORE_ADJ_MIN checked?

(In other words, don't we need to check like t->mm && t->signal->oom_score_adj != OOM_SCORE_ADJ_MIN at find_lock_task_mm() for OOM-kill case?)

Also, why not call set_tsk_thread_flag() and do_send_sig_info() together like below

	p = find_lock_task_mm(victim);
	if (!p) {
		put_task_struct(victim);
		return;
	} else if (victim != p) {
		get_task_struct(p);
		put_task_struct(victim);
		victim = p;
	}

	/* mm cannot safely be dereferenced after task_unlock(victim) */
	mm = victim->mm;
+	set_tsk_thread_flag(victim, TIF_MEMDIE);
+	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
		K(get_mm_counter(victim->mm, MM_FILEPAGES)));
	task_unlock(victim);

rather than waiting for the for_each_process() loop in case the current task went to sleep immediately after task_unlock(victim)? Or is there a reason we had been setting TIF_MEMDIE after the for_each_process() loop? If the reason was to minimize the duration of the OOM killer being disabled due to TIF_MEMDIE, shouldn't we do something like below?
	rcu_read_unlock();

-	set_tsk_thread_flag(victim, TIF_MEMDIE);
+	task_lock(victim);
+	if (victim->mm)
+		set_tsk_thread_flag(victim, TIF_MEMDIE);
+	task_unlock(victim);
	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
	put_task_struct(victim);
* Re: [RFC PATCH] oom: Don't count on mm-less current process. 2014-12-23 1:00 ` Tetsuo Handa @ 2014-12-23 9:51 ` Michal Hocko 2014-12-23 11:46 ` Tetsuo Handa 0 siblings, 1 reply; 276+ messages in thread From: Michal Hocko @ 2014-12-23 9:51 UTC (permalink / raw) To: Tetsuo Handa; +Cc: akpm, linux-mm, rientjes, oleg On Tue 23-12-14 10:00:00, Tetsuo Handa wrote: > Michal Hocko wrote: > > OOM killer tries to exlude tasks which do not have mm_struct associated > s/exlude/exclude/ Fixed > > Fix this by checking task->mm and setting TIF_MEMDIE flag under task_lock > > which will serialize the OOM killer with exit_mm which sets task->mm to > > NULL. > Nice idea. > > By the way, find_lock_task_mm(victim) may succeed if victim->mm == NULL and > one of threads in victim thread-group has non-NULL mm. That case is handled > by victim != p branch below. But where was p->signal->oom_score_adj != > OOM_SCORE_ADJ_MIN checked? > > (In other words, don't we need to check like > t->mm && t->signal->oom_score_adj != OOM_SCORE_ADJ_MIN at find_lock_task_mm() > for OOM-kill case?) oom_score_adj is shared between threads. > Also, why not to call set_tsk_thread_flag() and do_send_sig_info() together > like below What would be an advantage? I am not really sure whether the two locks might nest as well. 
> 	p = find_lock_task_mm(victim);
> 	if (!p) {
> 		put_task_struct(victim);
> 		return;
> 	} else if (victim != p) {
> 		get_task_struct(p);
> 		put_task_struct(victim);
> 		victim = p;
> 	}
>
> 	/* mm cannot safely be dereferenced after task_unlock(victim) */
> 	mm = victim->mm;
> +	set_tsk_thread_flag(victim, TIF_MEMDIE);
> +	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
> 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
> 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
> 		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
> 		K(get_mm_counter(victim->mm, MM_FILEPAGES)));
> 	task_unlock(victim);
>
> than wait for for_each_process() loop in case current task went to sleep
> immediately after task_unlock(victim)? Or is there a reason we had been
> setting TIF_MEMDIE after the for_each_process() loop? If the reason was
> to minimize the duration of OOM killer being disabled due to TIF_MEMDIE,
> shouldn't we do like below?

No, global parallel OOM killer is disabled by oom zonelist lock at this
moment for most paths, so setting TIF_MEMDIE a little bit earlier doesn't
make any difference.

> 	rcu_read_unlock();
>
> -	set_tsk_thread_flag(victim, TIF_MEMDIE);
> +	task_lock(victim);
> +	if (victim->mm)
> +		set_tsk_thread_flag(victim, TIF_MEMDIE);
> +	task_unlock(victim);
> 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
> 	put_task_struct(victim);

This would work as well but I am not sure it is much nicer. It is the
find_lock_task_mm() part which tells the final victim so setting TIF_MEMDIE
is logical there.
--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH] oom: Don't count on mm-less current process. 2014-12-23 9:51 ` Michal Hocko @ 2014-12-23 11:46 ` Tetsuo Handa 2014-12-23 11:57 ` Tetsuo Handa 2014-12-23 12:24 ` Michal Hocko 0 siblings, 2 replies; 276+ messages in thread
From: Tetsuo Handa @ 2014-12-23 11:46 UTC (permalink / raw)
To: mhocko; +Cc: akpm, linux-mm, rientjes, oleg

Michal Hocko wrote:
> > Also, why not to call set_tsk_thread_flag() and do_send_sig_info() together
> > like below
>
> What would be an advantage? I am not really sure whether the two locks
> might nest as well.

I imagined that current thread sets TIF_MEMDIE on a victim thread, then
sleeps for 30 seconds immediately after task_unlock() (it's an overdone
delay), and finally sets SIGKILL on that victim thread. If such a delay
happened, that victim thread is free to abuse TIF_MEMDIE for that period.
Thus, I thought sending SIGKILL followed by setting TIF_MEMDIE is better.

	rcu_read_unlock();

-	set_tsk_thread_flag(victim, TIF_MEMDIE);
	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
+	task_lock(victim);
+	if (victim->mm)
+		set_tsk_thread_flag(victim, TIF_MEMDIE);
+	task_unlock(victim);
	put_task_struct(victim);

If such a delay is theoretically impossible, I'm OK with your patch.
* Re: [RFC PATCH] oom: Don't count on mm-less current process. 2014-12-23 11:46 ` Tetsuo Handa @ 2014-12-23 11:57 ` Tetsuo Handa 2014-12-23 12:12 ` Tetsuo Handa 2014-12-23 12:27 ` Michal Hocko 1 sibling, 2 replies; 276+ messages in thread
From: Tetsuo Handa @ 2014-12-23 11:57 UTC (permalink / raw)
To: mhocko; +Cc: akpm, linux-mm, rientjes, oleg

Tetsuo Handa wrote:
> If such a delay is theoretically impossible, I'm OK with your patch.
>
Oops, I forgot to mention that task_unlock(p) should be called before
put_task_struct(p), in case p->usage == 1 at put_task_struct(p).

	 * If the task is already exiting, don't alarm the sysadmin or kill
	 * its children or threads, just set TIF_MEMDIE so it can die quickly
	 */
-	if (task_will_free_mem(p)) {
+	task_lock(p);
+	if (p->mm && task_will_free_mem(p)) {
 		set_tsk_thread_flag(p, TIF_MEMDIE);
 		put_task_struct(p);
+		task_unlock(p);
 		return;
 	}
+	task_unlock(p);
 
 	if (__ratelimit(&oom_rs))
 		dump_header(p, gfp_mask, order, memcg, nodemask);
* Re: [RFC PATCH] oom: Don't count on mm-less current process. 2014-12-23 11:57 ` Tetsuo Handa @ 2014-12-23 12:12 ` Tetsuo Handa 2014-12-23 12:27 ` Michal Hocko 1 sibling, 0 replies; 276+ messages in thread
From: Tetsuo Handa @ 2014-12-23 12:12 UTC (permalink / raw)
To: mhocko; +Cc: akpm, linux-mm, rientjes, oleg

Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > If such a delay is theoretically impossible, I'm OK with your patch.
>
> Oops, I forgot to mention that task_unlock(p) should be called before
> put_task_struct(p), in case p->usage == 1 at put_task_struct(p).

After all, something like below?

----------
>From 63e9317553688944e27b6054ccc059b82064605e Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Tue, 23 Dec 2014 21:04:43 +0900
Subject: [PATCH] oom: Make sure that TIF_MEMDIE is set under task_lock

OOM killer tries to exclude tasks which do not have mm_struct associated
because killing such a task wouldn't help much. The OOM victim gets
TIF_MEMDIE set to disable OOM killer while the current victim releases the
memory and then enables the OOM killer again by dropping the flag.

oom_kill_process is currently prone to a race condition when the OOM victim
is already exiting and TIF_MEMDIE is set after the task releases its address
space. This might theoretically lead to OOM livelock if the OOM victim
blocks on an allocation later during exiting because it wouldn't kill any
other process and the exiting one won't be able to exit. The situation is
highly unlikely because the OOM victim is expected to release some memory
which should help to sort out OOM situation.

Fix this by checking task->mm and setting TIF_MEMDIE flag under task_lock
which will serialize the OOM killer with exit_mm which sets task->mm to
NULL. Also, reverse the order of sending SIGKILL and setting TIF_MEMDIE so
that preemption will not allow the victim task to abuse TIF_MEMDIE. Setting
the flag for current is not necessary because check and set is not racy.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/oom_kill.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d503e9c..91079ec 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -438,11 +438,8 @@
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
-	if (task_will_free_mem(p)) {
-		set_tsk_thread_flag(p, TIF_MEMDIE);
-		put_task_struct(p);
-		return;
-	}
+	if (task_will_free_mem(victim))
+		goto set_memdie_flag;
 
 	if (__ratelimit(&oom_rs))
 		dump_header(p, gfp_mask, order, memcg, nodemask);
@@ -522,8 +519,12 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	}
 	rcu_read_unlock();
 
-	set_tsk_thread_flag(victim, TIF_MEMDIE);
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
+ set_memdie_flag:
+	task_lock(victim);
+	if (victim->mm)
+		set_tsk_thread_flag(victim, TIF_MEMDIE);
+	task_unlock(victim);
 	put_task_struct(victim);
 }
 #undef K
-- 
1.8.3.1
* Re: [RFC PATCH] oom: Don't count on mm-less current process. 2014-12-23 11:57 ` Tetsuo Handa 2014-12-23 12:12 ` Tetsuo Handa @ 2014-12-23 12:27 ` Michal Hocko 1 sibling, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2014-12-23 12:27 UTC (permalink / raw)
To: Tetsuo Handa; +Cc: akpm, linux-mm, rientjes, oleg

On Tue 23-12-14 20:57:23, Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > If such a delay is theoretically impossible, I'm OK with your patch.
>
> Oops, I forgot to mention that task_unlock(p) should be called before
> put_task_struct(p), in case p->usage == 1 at put_task_struct(p).

True. It would be quite surprising to see p->mm != NULL if the OOM killer
was the only one to hold a reference to the task. So it shouldn't make any
difference AFAICS. It is a good practice to change that though. Fixed.

[...]

Thanks!
--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH] oom: Don't count on mm-less current process. 2014-12-23 11:46 ` Tetsuo Handa 2014-12-23 11:57 ` Tetsuo Handa @ 2014-12-23 12:24 ` Michal Hocko 2014-12-23 13:00 ` Tetsuo Handa 1 sibling, 1 reply; 276+ messages in thread
From: Michal Hocko @ 2014-12-23 12:24 UTC (permalink / raw)
To: Tetsuo Handa; +Cc: akpm, linux-mm, rientjes, oleg

On Tue 23-12-14 20:46:07, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > Also, why not to call set_tsk_thread_flag() and do_send_sig_info() together
> > > like below
> >
> > What would be an advantage? I am not really sure whether the two locks
> > might nest as well.
>
> I imagined that current thread sets TIF_MEMDIE on a victim thread, then
> sleeps for 30 seconds immediately after task_unlock() (it's an overdone
> delay),

Only if the current task was preempted for such a long time. Which doesn't
sound too probable to me.

> and finally sets SIGKILL on that victim thread. If such a delay
> happened, that victim thread is free to abuse TIF_MEMDIE for that period.
> Thus, I thought sending SIGKILL followed by setting TIF_MEMDIE is better.

I don't know, I can hardly find a scenario where it would make any
difference in the real life. If the victim needs to allocate a memory to
finish then it would trigger OOM again and have to wait/loop until this
OOM killer releases the oom zonelist lock just to find out it already has
TIF_MEMDIE set and can dive into memory reserves. Which way is more correct
is a question but I wouldn't change it without having a really good reason.
This whole code is subtle already, let's not make it even more so.

> 	rcu_read_unlock();
>
> -	set_tsk_thread_flag(victim, TIF_MEMDIE);
> 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
> +	task_lock(victim);
> +	if (victim->mm)
> +		set_tsk_thread_flag(victim, TIF_MEMDIE);
> +	task_unlock(victim);
> 	put_task_struct(victim);
>
> If such a delay is theoretically impossible, I'm OK with your patch.
--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH] oom: Don't count on mm-less current process. 2014-12-23 12:24 ` Michal Hocko @ 2014-12-23 13:00 ` Tetsuo Handa 2014-12-23 13:09 ` Michal Hocko 0 siblings, 1 reply; 276+ messages in thread
From: Tetsuo Handa @ 2014-12-23 13:00 UTC (permalink / raw)
To: mhocko; +Cc: akpm, linux-mm, rientjes, oleg

Michal Hocko wrote:
> > and finally sets SIGKILL on that victim thread. If such a delay
> > happened, that victim thread is free to abuse TIF_MEMDIE for that period.
> > Thus, I thought sending SIGKILL followed by setting TIF_MEMDIE is better.
>
> I don't know, I can hardly find a scenario where it would make any
> difference in the real life. If the victim needs to allocate a memory to
> finish then it would trigger OOM again and have to wait/loop until this
> OOM killer releases the oom zonelist lock just to find out it already
> has TIF_MEMDIE set and can dive into memory reserves. Which way is more
> correct is a question but I wouldn't change it without having a really
> good reason. This whole code is subtle already, let's not make it even
> more so.

gfp_to_alloc_flags() in mm/page_alloc.c sets ALLOC_NO_WATERMARKS if
the victim task has TIF_MEMDIE flag, doesn't it?

	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
		if (gfp_mask & __GFP_MEMALLOC)
			alloc_flags |= ALLOC_NO_WATERMARKS;
		else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
			alloc_flags |= ALLOC_NO_WATERMARKS;
		else if (!in_interrupt() &&
				((current->flags & PF_MEMALLOC) ||
				 unlikely(test_thread_flag(TIF_MEMDIE))))
			alloc_flags |= ALLOC_NO_WATERMARKS;
	}

Then, I think deferring SIGKILL might widen race window for abusing TIF_MEMDIE.
* Re: [RFC PATCH] oom: Don't count on mm-less current process. 2014-12-23 13:00 ` Tetsuo Handa @ 2014-12-23 13:09 ` Michal Hocko 2014-12-23 13:20 ` Tetsuo Handa 0 siblings, 1 reply; 276+ messages in thread
From: Michal Hocko @ 2014-12-23 13:09 UTC (permalink / raw)
To: Tetsuo Handa; +Cc: akpm, linux-mm, rientjes, oleg

On Tue 23-12-14 22:00:52, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > and finally sets SIGKILL on that victim thread. If such a delay
> > > happened, that victim thread is free to abuse TIF_MEMDIE for that period.
> > > Thus, I thought sending SIGKILL followed by setting TIF_MEMDIE is better.
> >
> > I don't know, I can hardly find a scenario where it would make any
> > difference in the real life. If the victim needs to allocate a memory to
> > finish then it would trigger OOM again and have to wait/loop until this
> > OOM killer releases the oom zonelist lock just to find out it already
> > has TIF_MEMDIE set and can dive into memory reserves. Which way is more
> > correct is a question but I wouldn't change it without having a really
> > good reason. This whole code is subtle already, let's not make it even
> > more so.
>
> gfp_to_alloc_flags() in mm/page_alloc.c sets ALLOC_NO_WATERMARKS if
> the victim task has TIF_MEMDIE flag, doesn't it?

This is the whole point of TIF_MEMDIE.

[...]

> Then, I think deferring SIGKILL might widen race window for abusing TIF_MEMDIE.

How would it abuse the flag? The OOM victim has to die and if it needs
to allocate then we have to allow it to do so otherwise the whole
exercise was pointless. fatal_signal_pending check is not so widespread
in the kernel that the task would notice it immediately.
--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH] oom: Don't count on mm-less current process. 2014-12-23 13:09 ` Michal Hocko @ 2014-12-23 13:20 ` Tetsuo Handa 2014-12-23 13:43 ` Michal Hocko 0 siblings, 1 reply; 276+ messages in thread From: Tetsuo Handa @ 2014-12-23 13:20 UTC (permalink / raw) To: mhocko; +Cc: akpm, linux-mm, rientjes, oleg Michal Hocko wrote: > On Tue 23-12-14 22:00:52, Tetsuo Handa wrote: > > Michal Hocko wrote: > > > > and finally sets SIGKILL on that victim thread. If such a delay > > > > happened, that victim thread is free to abuse TIF_MEMDIE for that period. > > > > Thus, I thought sending SIGKILL followed by setting TIF_MEMDIE is better. > > > > > > I don't know, I can hardly find a scenario where it would make any > > > difference in the real life. If the victim needs to allocate a memory to > > > finish then it would trigger OOM again and have to wait/loop until this > > > OOM killer releases the oom zonelist lock just to find out it already > > > has TIF_MEMDIE set and can dive into memory reserves. Which way is more > > > correct is a question but I wouldn't change it without having a really > > > good reason. This whole code is subtle already, let's not make it even > > > more so. > > > > gfp_to_alloc_flags() in mm/page_alloc.c sets ALLOC_NO_WATERMARKS if > > the victim task has TIF_MEMDIE flag, doesn't it? > > This is the whole point of TIF_MEMDIE. > > [...] > > > Then, I think deferring SIGKILL might widen race window for abusing TIF_MEMDIE. > > How would it abuse the flag? The OOM victim has to die and if it needs > to allocate then we have to allow it to do so otherwise the whole > exercise was pointless. fatal_signal_pending check is not so widespread > in the kernel that the task would notice it immediately. I'm talking about possible delay between TIF_MEMDIE was set on the victim and SIGKILL is delivered to the victim. Why the victim has to die before receiving SIGKILL? The victim can access memory reserves until SIGKILL is delivered, can't it? 
* Re: [RFC PATCH] oom: Don't count on mm-less current process. 2014-12-23 13:20 ` Tetsuo Handa @ 2014-12-23 13:43 ` Michal Hocko 2014-12-23 14:11 ` Tetsuo Handa 0 siblings, 1 reply; 276+ messages in thread From: Michal Hocko @ 2014-12-23 13:43 UTC (permalink / raw) To: Tetsuo Handa; +Cc: akpm, linux-mm, rientjes, oleg On Tue 23-12-14 22:20:57, Tetsuo Handa wrote: > Michal Hocko wrote: > > On Tue 23-12-14 22:00:52, Tetsuo Handa wrote: [...] > > > Then, I think deferring SIGKILL might widen race window for abusing TIF_MEMDIE. > > > > How would it abuse the flag? The OOM victim has to die and if it needs > > to allocate then we have to allow it to do so otherwise the whole > > exercise was pointless. fatal_signal_pending check is not so widespread > > in the kernel that the task would notice it immediately. > > I'm talking about possible delay between TIF_MEMDIE was set on the victim > and SIGKILL is delivered to the victim. I can read what you wrote. You are just ignoring my questions it seems because I haven't got any reason _why it matters_. My point was that the victim might be looping in the kernel and doing other allocations until it notices it has fatal_signal_pending and bail out. So the delay between setting the flag and sending the signal is not that important AFAICS. > Why the victim has to die before receiving SIGKILL? It has to die to resolve the current OOM condition. I haven't written anything about dying before receiving SIGKILL. > The victim can access memory reserves until SIGKILL is delivered, > can't it? And why does that matter? It would have to do such an allocation anyway because it wouldn't proceed without it... And the only difference between having the flag and not having it is that the allocation has higher chance to succeed with the flag so it will not trigger the OOM killer again right away. See the point or am I missing something here? 
--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH] oom: Don't count on mm-less current process. 2014-12-23 13:43 ` Michal Hocko @ 2014-12-23 14:11 ` Tetsuo Handa 2014-12-23 14:57 ` Michal Hocko 0 siblings, 1 reply; 276+ messages in thread
From: Tetsuo Handa @ 2014-12-23 14:11 UTC (permalink / raw)
To: mhocko; +Cc: akpm, linux-mm, rientjes, oleg

Michal Hocko wrote:
> On Tue 23-12-14 22:20:57, Tetsuo Handa wrote:
> > I'm talking about possible delay between TIF_MEMDIE was set on the victim
> > and SIGKILL is delivered to the victim.
>
> I can read what you wrote. You are just ignoring my questions it seems
> because I haven't got any reason _why it matters_. My point was that the
> victim might be looping in the kernel and doing other allocations until
> it notices it has fatal_signal_pending and bail out. So the delay
> between setting the flag and sending the signal is not that important
> AFAICS.

My point is that the victim might not be looping in the kernel when getting
TIF_MEMDIE.

Situation:

  P1: A process who called the OOM killer
  P2: A process who is chosen by the OOM killer

P2 is running a program shown below.

----------
int main(int argc, char *argv[])
{
	const int fd = open("/dev/zero", O_RDONLY);
	char *buf = malloc(1024 * 1048576);
	if (fd == -1 || !buf)
		return 1;
	memset(buf, 0, 512 * 1048576);
	sleep(10);
	read(fd, buf, 1024 * 1048576);
	return 0;
}
----------

Sequence:

  (1) P2 is sleeping at sleep(10).
  (2) P1 triggers the OOM killer and P2 is chosen.
  (3) The OOM killer sets TIF_MEMDIE on P2.
  (4) P2 wakes up as sleep(10) expired.
  (5) P2 calls read().
  (6) P2 triggers page fault inside read().
  (7) P2 allocates from memory reserves for handling page fault.
  (8) The OOM killer sends SIGKILL to P2.
  (9) P2 receives SIGKILL after all memory reserves were allocated for handling page fault.
  (10) P2 starts steps for die, but memory reserves may be already empty.

My worry:

More the delay between (3) and (8) becomes longer (e.g. 30 seconds for an
overdone case), more likely to cause memory reserves being consumed before
(9). If (3) and (8) are reversed, P2 will notice fatal_signal_pending() and
bail out before allocating a lot of memory from memory reserves.
* Re: [RFC PATCH] oom: Don't count on mm-less current process. 2014-12-23 14:11 ` Tetsuo Handa @ 2014-12-23 14:57 ` Michal Hocko 0 siblings, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2014-12-23 14:57 UTC (permalink / raw)
To: Tetsuo Handa; +Cc: akpm, linux-mm, rientjes, oleg

On Tue 23-12-14 23:11:01, Tetsuo Handa wrote:
[...]
> (1) P2 is sleeping at sleep(10).
> (2) P1 triggers the OOM killer and P2 is chosen.
> (3) The OOM killer sets TIF_MEMDIE on P2.
> (4) P2 wakes up as sleep(10) expired.
> (5) P2 calls read().
> (6) P2 triggers page fault inside read().
> (7) P2 allocates from memory reserves for handling page fault.
> (8) The OOM killer sends SIGKILL to P2.
> (9) P2 receives SIGKILL after all memory reserves were
> allocated for handling page fault.
> (10) P2 starts steps for die, but memory reserves may be
> already empty.

How is that any different from any other task which allocates with
TIF_MEMDIE already set and fatal_signal_pending without checking for the
later?

> My worry:
>
> More the delay between (3) and (8) becomes longer (e.g. 30 seconds
> for an overdone case), more likely to cause memory reserves being
> consumed before (9). If (3) and (8) are reversed, P2 will notice
> fatal_signal_pending() and bail out before allocating a lot of
> memory from memory reserves.

And my suspicion is that this has never been a real problem and I really
do not like to fiddle with the code for non-existing problems. If you are
sure that the reverse order is correct and doesn't cause any other issues
then you are free to send a separate patch with a proper justification.
The patch I've posted fixes a different problem and putting more stuff in
it is just not right! I really hate how you conflate different issues all
the time, TBH.

Thanks!
--
Michal Hocko
SUSE Labs
* How to handle TIF_MEMDIE stalls? 2014-12-18 15:33 ` Michal Hocko 2014-12-19 12:07 ` Tetsuo Handa @ 2014-12-19 12:22 ` Tetsuo Handa 2014-12-20 2:03 ` Dave Chinner 1 sibling, 1 reply; 276+ messages in thread From: Tetsuo Handa @ 2014-12-19 12:22 UTC (permalink / raw) To: mhocko, dchinner; +Cc: linux-mm, rientjes, oleg (Renamed thread's title and invited Dave Chinner. A memory stressing program at http://marc.info/?l=linux-mm&m=141890469424353&w=2 can trigger stalls on a system with 4 CPUs/2048MB of RAM/no swap. I want to hear your opinion.) Michal Hocko wrote: > > My question is quite simple. How can we avoid memory allocation stalls when > > > > System has 2048MB of RAM and no swap. > > Memcg1 for task1 has quota 512MB and 400MB in use. > > Memcg2 for task2 has quota 512MB and 400MB in use. > > Memcg3 for task3 has quota 512MB and 400MB in use. > > Memcg4 for task4 has quota 512MB and 400MB in use. > > Memcg5 for task5 has quota 512MB and 1MB in use. > > > > and task5 launches below memory consumption program which would trigger > > the global OOM killer before triggering the memcg OOM killer? > > > [...] > > The global OOM killer will try to kill this program because this program > > will be using 400MB+ of RAM by the time the global OOM killer is triggered. > > But sometimes this program cannot be terminated by the global OOM killer > > due to XFS lock dependency. > > > > You can see what is happening from OOM traces after uptime > 320 seconds of > > http://I-love.SAKURA.ne.jp/tmp/serial-20141213.txt.xz though memcg is not > > configured on this program. > > This is clearly a separate issue. It is a lock dependency and that alone > _cannot_ be handled from OOM killer as it doesn't understand lock > dependencies. This should be addressed from the xfs point of view IMHO > but I am not familiar with this filesystem to tell you how or whether it > is possible. > Then, let's ask Dave Chinner whether he can address it. 
My opinion is that everybody is doing __GFP_WAIT memory allocation without understanding the entire dependencies. Everybody is only prepared for allocation failures because everybody is expecting that the OOM killer shall somehow solve the OOM condition (except that some are expecting that memory stress that will trigger the OOM killer must not be given). I am neither familiar with XFS, but I don't think this issue can be addressed from the XFS point of view. For example, https://lkml.org/lkml/2014/7/2/249 stalls at blk_rq_map_kern() which I'm suspecting it as one of causes of the stall due to happening inside disk I/O event of XFS partition. If XFS were responsible for avoiding stall at blk_rq_map_kern() (on the assumption that XFS triggered that disk I/O event), XFS (filesystem layer) somehow needs to drop __GFP_WAIT flag from scsi_execute() (SCSI layer). We will end up with passing gfp flags to every function which might do memory allocation. Is everybody happy with such code complication/bloat? ---------- int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd, int data_direction, void *buffer, unsigned bufflen, unsigned char *sense, int timeout, int retries, u64 flags, int *resid) { struct request *req; int write = (data_direction == DMA_TO_DEVICE); int ret = DRIVER_ERROR << 24; req = blk_get_request(sdev->request_queue, write, __GFP_WAIT); if (IS_ERR(req)) return ret; blk_rq_set_block_pc(req); if (bufflen && blk_rq_map_kern(sdev->request_queue, req, buffer, bufflen, __GFP_WAIT)) goto out; req->cmd_len = COMMAND_SIZE(cmd[0]); memcpy(req->cmd, cmd, req->cmd_len); req->sense = sense; req->sense_len = 0; req->retries = retries; req->timeout = timeout; req->cmd_flags |= flags | REQ_QUIET | REQ_PREEMPT; /* * head injection *required* here otherwise quiesce won't work */ blk_execute_rq(req->q, NULL, req, 1); /* * Some devices (USB mass-storage in particular) may transfer * garbage data together with a residue indicating that the data * is invalid. 
Prevent the garbage from being misinterpreted * and prevent security leaks by zeroing out the excess data. */ if (unlikely(req->resid_len > 0 && req->resid_len <= bufflen)) memset(buffer + (bufflen - req->resid_len), 0, req->resid_len); if (resid) *resid = req->resid_len; ret = req->errors; out: blk_put_request(req); return ret; } ---------- By the way, if __GFP_WAIT requests had higher priority (lower or ignore the watermark?) than GFP_NOIO or GFP_NOFS or GFP_KERNEL requests, could blk_rq_map_kern() avoid the stall and allow XFS to proceed (and release XFS lock and terminate the OOM victim)? > > Somebody may set > > TIF_MEMDIE at oom_kill_process() even if we avoided setting TIF_MEMDIE at > > out_of_memory(). There will be more locations where TIF_MEMDIE is set; even > > out-of-tree modules might set TIF_MEMDIE. > > TIF_MEMDIE should be set only when we _know_ the task will free _some_ > memory and when we are killing the OOM victim. The only place I can see > that would break the first condition is out_of_memory for the current > which passed exit_mm(). That is the point why I've suggested you this > patch and it would be much more easier if we could simply finished that > one without pulling other things in. I agree that TIF_MEMDIE should be set only when we know the task will free some memory, but currently setting TIF_MEMDIE on the OOM victim is causing stalls which I want to analyze/debug via patchset posted at http://marc.info/?l=linux-mm&m=141671817211121&w=2 because we forever wait until the OOM victim terminates. In serial-20141213.txt.xz, TIF_MEMDIE was set on the OOM victim which is even unkillable by SysRq-f. > > Nonetheless, I don't think > > > > if (!task->mm && test_tsk_thread_flag(task, TIF_MEMDIE)) > > return true; > > > > check is perfect because we anyway need to prepare for both mm-less and > > with-mm cases. > > > > My concern is not "whether TIF_MEMDIE flag should be set or not". My concern > > is not "whether task->mm is NULL or not". 
My concern is "whether threads with > > TIF_MEMDIE flag retard other process' memory allocation or not". > > Above-mentioned program is an example of with-mm threads retarding > > other process' memory allocation. > > There is no way you can guarantee something like that. OOM is the _last_ > resort. Things are in a pretty bad state already when it hits. It is the > last attempt to reclaim some memory. System might be in an arbitrary > state at this time. > I really hate to repeat myself but you are trying to "fix" your problem > at a wrong level. I think that the OOM killer is responsible for killing the OOM condition or triggering kernel panic. I don't like that the OOM killer is failing to kill the OOM condition as it claims to be. > > > I know you don't like timeout approach, but adding > > > > if (sysctl_memdie_timeout_secs && test_tsk_thread_flag(task, TIF_MEMDIE) && > > time_after(jiffies, task->memdie_start + sysctl_memdie_timeout_secs * HZ)) > > return true; > > > > check to oom_unkillable_task() will take care of both mm-less and with-mm > > cases because everyone can safely skip the TIF_MEMDIE victim threads who > > cannot be terminated immediately for some reason. > > It will not take care of anything. It will start shooting to more > processes after some timeout, which is hard to get right, and there > wouldn't be any guaratee multiple victims will help because they might > end up blocking on the very same or other lock on the way out. If you don't like skip on timeout approach, I'm OK with triggering kernel panic on timeout approach. Analyzing vmcore will give us some hints about what was happening. > Jeez are > you even reading feedback you are getting? Of course, I'm reading your feedback. The "[RFC PATCH 0/5] mm: Patches for mitigating memory allocation stalls." will become unnecessary after all bugs are identified and fixed. 
I agree that bugs should be identified and fixed, but the XFS stall is
nothing but an example which I can reproduce on my desktop. My role is to
analyze and respond to kernel troubles such as unexpected stalls, panics,
and reboots that occur on customers' servers which I cannot access. I will
encounter various troubles for which I cannot predict how to obtain
information. Therefore, I want some unattended built-in assistance for
understanding what was happening in chronological order and for
identifying/fixing the bugs. Existing built-in debugging hooks which
require an administrator's intervention might help after understanding
what was happening.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2014-12-19 12:22 ` How to handle TIF_MEMDIE stalls? Tetsuo Handa @ 2014-12-20 2:03 ` Dave Chinner 2014-12-20 12:41 ` Tetsuo Handa 0 siblings, 1 reply; 276+ messages in thread From: Dave Chinner @ 2014-12-20 2:03 UTC (permalink / raw) To: Tetsuo Handa; +Cc: mhocko, linux-mm, rientjes, oleg, david On Fri, Dec 19, 2014 at 09:22:49PM +0900, Tetsuo Handa wrote: > (Renamed thread's title and invited Dave Chinner. A memory stressing program > at http://marc.info/?l=linux-mm&m=141890469424353&w=2 can trigger stalls on > a system with 4 CPUs/2048MB of RAM/no swap. I want to hear your opinion.) > > Michal Hocko wrote: > > > My question is quite simple. How can we avoid memory allocation stalls when > > > > > > System has 2048MB of RAM and no swap. > > > Memcg1 for task1 has quota 512MB and 400MB in use. > > > Memcg2 for task2 has quota 512MB and 400MB in use. > > > Memcg3 for task3 has quota 512MB and 400MB in use. > > > Memcg4 for task4 has quota 512MB and 400MB in use. > > > Memcg5 for task5 has quota 512MB and 1MB in use. > > > > > > and task5 launches below memory consumption program which would trigger > > > the global OOM killer before triggering the memcg OOM killer? > > > > > [...] > > > The global OOM killer will try to kill this program because this program > > > will be using 400MB+ of RAM by the time the global OOM killer is triggered. > > > But sometimes this program cannot be terminated by the global OOM killer > > > due to XFS lock dependency. > > > > > > You can see what is happening from OOM traces after uptime > 320 seconds of > > > http://I-love.SAKURA.ne.jp/tmp/serial-20141213.txt.xz though memcg is not > > > configured on this program. > > > > This is clearly a separate issue. It is a lock dependency and that alone > > _cannot_ be handled from OOM killer as it doesn't understand lock > > dependencies. 
This should be addressed from the xfs point of view IMHO
> > but I am not familiar with this filesystem to tell you how or whether it
> > is possible.

What XFS lock dependency? I see nothing in that output file that indicates
a lock dependency problem - can you point out what the issue is here?

> Then, let's ask Dave Chinner whether he can address it. My opinion is that
> everybody is doing __GFP_WAIT memory allocation without understanding the
> entire dependencies. Everybody is only prepared for allocation failures
> because everybody is expecting that the OOM killer will somehow solve the
> OOM condition (except that some expect that memory stress which would
> trigger the OOM killer is never applied). I am not familiar with XFS
> either, but I don't think this issue can be addressed from the XFS point
> of view.

Well, I can't comment (nor am I going to waste time speculating) until
someone actually explains the XFS lock dependency that is apparently
causing reclaim problems.

Has lockdep reported any problems?

Cheers,

Dave.
--
Dave Chinner
dchinner@redhat.com
* Re: How to handle TIF_MEMDIE stalls?
  2014-12-20  2:03   ` Dave Chinner
@ 2014-12-20 12:41     ` Tetsuo Handa
  2014-12-20 22:35       ` Dave Chinner
  0 siblings, 1 reply; 276+ messages in thread
From: Tetsuo Handa @ 2014-12-20 12:41 UTC (permalink / raw)
  To: dchinner; +Cc: mhocko, linux-mm, rientjes, oleg, david

Dave Chinner wrote:
> On Fri, Dec 19, 2014 at 09:22:49PM +0900, Tetsuo Handa wrote:
> > > > The global OOM killer will try to kill this program because this program
> > > > will be using 400MB+ of RAM by the time the global OOM killer is triggered.
> > > > But sometimes this program cannot be terminated by the global OOM killer
> > > > due to XFS lock dependency.
> > > >
> > > > You can see what is happening from OOM traces after uptime > 320 seconds of
> > > > http://I-love.SAKURA.ne.jp/tmp/serial-20141213.txt.xz though memcg is not
> > > > configured on this program.
> > >
> > > This is clearly a separate issue. It is a lock dependency and that alone
> > > _cannot_ be handled from OOM killer as it doesn't understand lock
> > > dependencies. This should be addressed from the xfs point of view IMHO
> > > but I am not familiar with this filesystem to tell you how or whether it
> > > is possible.
>
> What XFS lock dependency? I see nothing in that output file that indicates a
> lock dependency problem - can you point out what the issue is here?

This is a problem which lockdep cannot report. The problem is that an
OOM-victim task is unable to terminate because it is blocked waiting for
one of the locks used by XFS (I don't know which one).
----------
[ 320.788387] Kill process 10732 (a.out) sharing same memory
(...snipped...)
[ 398.641724] a.out D ffff880077e42638 0 10732 1 0x00000084 [ 398.643705] ffff8800770ebcb8 0000000000000082 ffff8800770ebc88 ffff880077e42210 [ 398.645819] 0000000000012500 ffff8800770ebfd8 0000000000012500 ffff880077e42210 [ 398.647917] ffff8800770ebcb8 ffff88007b4a2a48 ffff88007b4a2a4c ffff880077e42210 [ 398.650009] Call Trace: [ 398.651094] [<ffffffff8159f954>] schedule_preempt_disabled+0x24/0x70 [ 398.652913] [<ffffffff815a1705>] __mutex_lock_slowpath+0xb5/0x120 [ 398.654679] [<ffffffff815a178e>] mutex_lock+0x1e/0x32 [ 398.656262] [<ffffffffa023b58a>] xfs_file_buffered_aio_write.isra.15+0x6a/0x200 [xfs] [ 398.658350] [<ffffffffa023b79e>] xfs_file_write_iter+0x7e/0x120 [xfs] [ 398.660191] [<ffffffff8117edd9>] new_sync_write+0x89/0xd0 [ 398.661829] [<ffffffff8117f742>] vfs_write+0xb2/0x1f0 [ 398.663397] [<ffffffff8101a9f4>] ? do_audit_syscall_entry+0x64/0x70 [ 398.665190] [<ffffffff81180200>] SyS_write+0x50/0xc0 [ 398.666745] [<ffffffff810f729e>] ? __audit_syscall_exit+0x22e/0x2d0 [ 398.668539] [<ffffffff815a38e9>] system_call_fastpath+0x12/0x17 (...snipped...) [ 897.190487] Out of memory: Kill process 10732 (a.out) score 898 or sacrifice child [ 897.192236] Killed process 10732 (a.out) total-vm:2166864kB, anon-rss:1727976kB, file-rss:0kB (...snipped...) 
[ 904.819053] a.out D ffff880077e42638 0 10732 1 0x00100084 [ 904.820967] ffff8800770ebcb8 0000000000000082 ffff8800770ebc88 ffff880077e42210 [ 904.823011] 0000000000012500 ffff8800770ebfd8 0000000000012500 ffff880077e42210 [ 904.825054] ffff8800770ebcb8 ffff88007b4a2a48 ffff88007b4a2a4c ffff880077e42210 [ 904.827137] Call Trace: [ 904.828174] [<ffffffff8159f954>] schedule_preempt_disabled+0x24/0x70 [ 904.829924] [<ffffffff815a1705>] __mutex_lock_slowpath+0xb5/0x120 [ 904.831634] [<ffffffff815a178e>] mutex_lock+0x1e/0x32 [ 904.833148] [<ffffffffa023b58a>] xfs_file_buffered_aio_write.isra.15+0x6a/0x200 [xfs] [ 904.835178] [<ffffffffa023b79e>] xfs_file_write_iter+0x7e/0x120 [xfs] [ 904.836980] [<ffffffff8117edd9>] new_sync_write+0x89/0xd0 [ 904.838561] [<ffffffff8117f742>] vfs_write+0xb2/0x1f0 [ 904.840094] [<ffffffff8101a9f4>] ? do_audit_syscall_entry+0x64/0x70 [ 904.841846] [<ffffffff81180200>] SyS_write+0x50/0xc0 [ 904.844026] [<ffffffff810f729e>] ? __audit_syscall_exit+0x22e/0x2d0 [ 904.845826] [<ffffffff815a38e9>] system_call_fastpath+0x12/0x17 ---------- I don't know how block layer requests are issued by filesystem layer's activities, but PID=10832 is blocked for so long at blk_rq_map_kern() doing __GFP_WAIT allocation. I'm sure that this blk_rq_map_kern() is issued by XFS filesystem's activities because this system has only /dev/sda1 formatted as XFS and there is no swap memory. 
---------- [ 393.696527] kworker/1:1 R running task 0 43 2 0x00000000 [ 393.698561] Workqueue: events_freezable_power_ disk_events_workfn [ 393.700339] ffff88007c5437d8 0000000000000046 ffff88007c5438a0 ffff88007c4b4cc0 [ 393.702513] 0000000000012500 ffff88007c543fd8 0000000000012500 ffff88007c4b4cc0 [ 393.704631] 0000000000000020 ffff88007c5438b0 0000000000000002 ffffffff81848408 [ 393.706748] Call Trace: [ 393.707924] [<ffffffff8159f814>] _cond_resched+0x24/0x40 [ 393.709572] [<ffffffff81122119>] shrink_slab+0x139/0x150 [ 393.711206] [<ffffffff811252bf>] do_try_to_free_pages+0x35f/0x4d0 [ 393.713001] [<ffffffff811254c4>] try_to_free_pages+0x94/0xc0 [ 393.714679] [<ffffffff8111a793>] __alloc_pages_nodemask+0x4e3/0xa40 [ 393.716538] [<ffffffff8115a8ce>] alloc_pages_current+0x8e/0x100 [ 393.718262] [<ffffffff8125bed6>] bio_copy_user_iov+0x1d6/0x380 [ 393.719959] [<ffffffff8125e4cd>] ? blk_rq_init+0xed/0x160 [ 393.721628] [<ffffffff8125c119>] bio_copy_kern+0x49/0x100 [ 393.723240] [<ffffffff810a14a0>] ? prepare_to_wait_event+0x100/0x100 [ 393.725043] [<ffffffff81265e6f>] blk_rq_map_kern+0x6f/0x130 [ 393.726695] [<ffffffff8116393e>] ? kmem_cache_alloc+0x48e/0x4b0 [ 393.728407] [<ffffffff813a66cf>] scsi_execute+0x12f/0x160 [ 393.730021] [<ffffffff813a7f14>] scsi_execute_req_flags+0x84/0xf0 [ 393.731776] [<ffffffffa01e29cc>] sr_check_events+0xbc/0x2e0 [sr_mod] [ 393.733561] [<ffffffff8109834c>] ? put_prev_entity+0x2c/0x3b0 [ 393.735235] [<ffffffffa01d6177>] cdrom_check_events+0x17/0x30 [cdrom] [ 393.737027] [<ffffffffa01e2e5d>] sr_block_check_events+0x2d/0x30 [sr_mod] [ 393.738918] [<ffffffff812701c6>] disk_check_events+0x56/0x1b0 [ 393.740602] [<ffffffff81270331>] disk_events_workfn+0x11/0x20 [ 393.742254] [<ffffffff8107ceaf>] process_one_work+0x13f/0x370 [ 393.743898] [<ffffffff8107de99>] worker_thread+0x119/0x500 [ 393.745495] [<ffffffff8107dd80>] ? rescuer_thread+0x350/0x350 [ 393.747152] [<ffffffff81082f7c>] kthread+0xdc/0x100 [ 393.748637] [<ffffffff81082ea0>] ? 
kthread_create_on_node+0x1b0/0x1b0 [ 393.750438] [<ffffffff815a383c>] ret_from_fork+0x7c/0xb0 [ 393.752004] [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0 (...snipped...) [ 525.157216] kworker/1:0 R running task 0 10832 2 0x00000080 [ 525.159187] Workqueue: events_freezable_power_ disk_events_workfn [ 525.160907] ffff88007c8ab7d8 0000000000000046 ffff88007c8ab8a0 ffff88007c894190 [ 525.162956] 0000000000012500 ffff88007c8abfd8 0000000000012500 ffff88007c894190 [ 525.165010] 0000000000000020 ffff88007c8ab8b0 0000000000000002 ffffffff81848408 [ 525.167068] Call Trace: [ 525.168100] [<ffffffff8159f814>] _cond_resched+0x24/0x40 [ 525.169679] [<ffffffff81122119>] shrink_slab+0x139/0x150 [ 525.171241] [<ffffffff811252bf>] do_try_to_free_pages+0x35f/0x4d0 [ 525.172960] [<ffffffff811254c4>] try_to_free_pages+0x94/0xc0 [ 525.174580] [<ffffffff8111a793>] __alloc_pages_nodemask+0x4e3/0xa40 [ 525.176302] [<ffffffff8115a8ce>] alloc_pages_current+0x8e/0x100 [ 525.177982] [<ffffffff8125bed6>] bio_copy_user_iov+0x1d6/0x380 [ 525.179631] [<ffffffff8125e4cd>] ? blk_rq_init+0xed/0x160 [ 525.181215] [<ffffffff8125c119>] bio_copy_kern+0x49/0x100 [ 525.182785] [<ffffffff810a14a0>] ? prepare_to_wait_event+0x100/0x100 [ 525.184545] [<ffffffff81265e6f>] blk_rq_map_kern+0x6f/0x130 [ 525.186156] [<ffffffff8116393e>] ? kmem_cache_alloc+0x48e/0x4b0 [ 525.187831] [<ffffffff813a66cf>] scsi_execute+0x12f/0x160 [ 525.189418] [<ffffffff813a7f14>] scsi_execute_req_flags+0x84/0xf0 [ 525.191148] [<ffffffffa01e29cc>] sr_check_events+0xbc/0x2e0 [sr_mod] [ 525.192969] [<ffffffff8109834c>] ? 
put_prev_entity+0x2c/0x3b0 [ 525.194688] [<ffffffffa01d6177>] cdrom_check_events+0x17/0x30 [cdrom] [ 525.196455] [<ffffffffa01e2e5d>] sr_block_check_events+0x2d/0x30 [sr_mod] [ 525.198291] [<ffffffff812701c6>] disk_check_events+0x56/0x1b0 [ 525.199984] [<ffffffff81270331>] disk_events_workfn+0x11/0x20 [ 525.201616] [<ffffffff8107ceaf>] process_one_work+0x13f/0x370 [ 525.203264] [<ffffffff8107de99>] worker_thread+0x119/0x500 [ 525.204799] [<ffffffff8107dd80>] ? rescuer_thread+0x350/0x350 [ 525.206436] [<ffffffff81082f7c>] kthread+0xdc/0x100 [ 525.207902] [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0 [ 525.209655] [<ffffffff815a383c>] ret_from_fork+0x7c/0xb0 [ 525.211206] [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0 (...snipped...) [ 619.934144] kworker/1:0 R running task 0 10832 2 0x00000080 [ 619.936060] Workqueue: events_freezable_power_ disk_events_workfn [ 619.937833] ffff88007c8ab7d8 0000000000000046 ffff88007c8ab8a0 ffff88007c894190 [ 619.939912] 0000000000012500 ffff88007c8abfd8 0000000000012500 ffff88007c894190 [ 619.942010] 0000000000000020 ffff88007c8ab8b0 0000000000000002 ffffffff81848408 [ 619.944123] Call Trace: [ 619.945168] [<ffffffff8159f814>] _cond_resched+0x24/0x40 [ 619.946697] [<ffffffff81122119>] shrink_slab+0x139/0x150 [ 619.948271] [<ffffffff811252bf>] do_try_to_free_pages+0x35f/0x4d0 [ 619.949968] [<ffffffff811254c4>] try_to_free_pages+0x94/0xc0 [ 619.951576] [<ffffffff8111a793>] __alloc_pages_nodemask+0x4e3/0xa40 [ 619.953387] [<ffffffff8115a8ce>] alloc_pages_current+0x8e/0x100 [ 619.955062] [<ffffffff8125bed6>] bio_copy_user_iov+0x1d6/0x380 [ 619.956726] [<ffffffff8125e4cd>] ? blk_rq_init+0xed/0x160 [ 619.958289] [<ffffffff8125c119>] bio_copy_kern+0x49/0x100 [ 619.959886] [<ffffffff810a14a0>] ? prepare_to_wait_event+0x100/0x100 [ 619.961641] [<ffffffff81265e6f>] blk_rq_map_kern+0x6f/0x130 [ 619.963229] [<ffffffff8116393e>] ? 
kmem_cache_alloc+0x48e/0x4b0 [ 619.964904] [<ffffffff813a66cf>] scsi_execute+0x12f/0x160 [ 619.966499] [<ffffffff813a7f14>] scsi_execute_req_flags+0x84/0xf0 [ 619.968182] [<ffffffffa01e29cc>] sr_check_events+0xbc/0x2e0 [sr_mod] [ 619.969936] [<ffffffff8109834c>] ? put_prev_entity+0x2c/0x3b0 [ 619.971583] [<ffffffffa01d6177>] cdrom_check_events+0x17/0x30 [cdrom] [ 619.973346] [<ffffffffa01e2e5d>] sr_block_check_events+0x2d/0x30 [sr_mod] [ 619.975213] [<ffffffff812701c6>] disk_check_events+0x56/0x1b0 [ 619.976865] [<ffffffff81270331>] disk_events_workfn+0x11/0x20 [ 619.978497] [<ffffffff8107ceaf>] process_one_work+0x13f/0x370 [ 619.980179] [<ffffffff8107de99>] worker_thread+0x119/0x500 [ 619.981793] [<ffffffff8107dd80>] ? rescuer_thread+0x350/0x350 [ 619.983468] [<ffffffff81082f7c>] kthread+0xdc/0x100 [ 619.984939] [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0 [ 619.986684] [<ffffffff815a383c>] ret_from_fork+0x7c/0xb0 [ 619.988231] [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0 (...snipped...) [ 715.930998] kworker/1:0 R running task 0 10832 2 0x00000080 [ 715.932930] Workqueue: events_freezable_power_ disk_events_workfn [ 715.934670] ffff880076fb9b40 0000000000000400 ffff88007c8ab8a0 0000000000000000 [ 715.936814] ffff88007c8ab7e8 ffff88007c8abfd8 0000000000012500 ffff88007c894190 [ 715.938869] 0000000000000020 ffff88007c8ab8b0 0000000000000002 ffffffff81848408 [ 715.940909] Call Trace: [ 715.942017] [<ffffffff8159f814>] _cond_resched+0x24/0x40 [ 715.943638] [<ffffffff81122119>] shrink_slab+0x139/0x150 [ 715.945256] [<ffffffff811252bf>] do_try_to_free_pages+0x35f/0x4d0 [ 715.947001] [<ffffffff811254c4>] try_to_free_pages+0x94/0xc0 [ 715.948603] [<ffffffff8111a793>] __alloc_pages_nodemask+0x4e3/0xa40 [ 715.950298] [<ffffffff8115a8ce>] alloc_pages_current+0x8e/0x100 [ 715.952010] [<ffffffff8125bed6>] bio_copy_user_iov+0x1d6/0x380 [ 715.953658] [<ffffffff8125e4cd>] ? 
blk_rq_init+0xed/0x160 [ 715.955324] [<ffffffff8125c119>] bio_copy_kern+0x49/0x100 [ 715.956929] [<ffffffff810a14a0>] ? prepare_to_wait_event+0x100/0x100 [ 715.958693] [<ffffffff81265e6f>] blk_rq_map_kern+0x6f/0x130 [ 715.960722] [<ffffffff8116393e>] ? kmem_cache_alloc+0x48e/0x4b0 [ 715.962488] [<ffffffff813a66cf>] scsi_execute+0x12f/0x160 [ 715.964142] [<ffffffff813a7f14>] scsi_execute_req_flags+0x84/0xf0 [ 715.965870] [<ffffffffa01e29cc>] sr_check_events+0xbc/0x2e0 [sr_mod] [ 715.967615] [<ffffffff8109834c>] ? put_prev_entity+0x2c/0x3b0 [ 715.969255] [<ffffffffa01d6177>] cdrom_check_events+0x17/0x30 [cdrom] [ 715.971061] [<ffffffffa01e2e5d>] sr_block_check_events+0x2d/0x30 [sr_mod] [ 715.972981] [<ffffffff812701c6>] disk_check_events+0x56/0x1b0 [ 715.974692] [<ffffffff81270331>] disk_events_workfn+0x11/0x20 [ 715.976330] [<ffffffff8107ceaf>] process_one_work+0x13f/0x370 [ 715.978090] [<ffffffff8107de99>] worker_thread+0x119/0x500 [ 715.979723] [<ffffffff8107dd80>] ? rescuer_thread+0x350/0x350 [ 715.981361] [<ffffffff81082f7c>] kthread+0xdc/0x100 [ 715.982794] [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0 [ 715.984554] [<ffffffff815a383c>] ret_from_fork+0x7c/0xb0 [ 715.986116] [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0 (...snipped...) [ 798.788405] kworker/1:0 R running task 0 10832 2 0x00000088 [ 798.790344] Workqueue: events_freezable_power_ disk_events_workfn [ 798.792191] ffff880035e3f340 0000000000000400 ffff88007c8ab8a0 0000000000000000 [ 798.794328] ffff88007c8ab7e8 ffffffff8112132a ffff88007c8ab908 ffff88007cfee800 [ 798.796395] 0000000000000020 0000000000000000 ffff88007c8ab838 ffff88007c8ab8b0 [ 798.798458] Call Trace: [ 798.799525] [<ffffffff8112132a>] ? shrink_slab_node+0x3a/0x1b0 [ 798.801229] [<ffffffff81122063>] ? shrink_slab+0x83/0x150 [ 798.802809] [<ffffffff811252bf>] ? do_try_to_free_pages+0x35f/0x4d0 [ 798.804586] [<ffffffff811254c4>] ? try_to_free_pages+0x94/0xc0 [ 798.806250] [<ffffffff8111a793>] ? 
__alloc_pages_nodemask+0x4e3/0xa40 [ 798.808050] [<ffffffff8115a8ce>] ? alloc_pages_current+0x8e/0x100 [ 798.809759] [<ffffffff8125bed6>] ? bio_copy_user_iov+0x1d6/0x380 [ 798.811500] [<ffffffff8125e4cd>] ? blk_rq_init+0xed/0x160 [ 798.813053] [<ffffffff8125c119>] ? bio_copy_kern+0x49/0x100 [ 798.814699] [<ffffffff810a14a0>] ? prepare_to_wait_event+0x100/0x100 [ 798.816494] [<ffffffff81265e6f>] ? blk_rq_map_kern+0x6f/0x130 [ 798.818421] [<ffffffff8116393e>] ? kmem_cache_alloc+0x48e/0x4b0 [ 798.820083] [<ffffffff813a66cf>] ? scsi_execute+0x12f/0x160 [ 798.821733] [<ffffffff813a7f14>] ? scsi_execute_req_flags+0x84/0xf0 [ 798.823454] [<ffffffffa01e29cc>] ? sr_check_events+0xbc/0x2e0 [sr_mod] [ 798.825312] [<ffffffff8109834c>] ? put_prev_entity+0x2c/0x3b0 [ 798.826930] [<ffffffffa01d6177>] ? cdrom_check_events+0x17/0x30 [cdrom] [ 798.828733] [<ffffffffa01e2e5d>] ? sr_block_check_events+0x2d/0x30 [sr_mod] [ 798.830594] [<ffffffff812701c6>] ? disk_check_events+0x56/0x1b0 [ 798.832338] [<ffffffff81270331>] ? disk_events_workfn+0x11/0x20 [ 798.834013] [<ffffffff8107ceaf>] ? process_one_work+0x13f/0x370 [ 798.835682] [<ffffffff8107de99>] ? worker_thread+0x119/0x500 [ 798.837350] [<ffffffff8107dd80>] ? rescuer_thread+0x350/0x350 [ 798.838990] [<ffffffff81082f7c>] ? kthread+0xdc/0x100 [ 798.840489] [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0 [ 798.842258] [<ffffffff815a383c>] ? ret_from_fork+0x7c/0xb0 [ 798.843837] [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0 (...snipped...) 
[ 850.354473] kworker/1:0 R running task 0 10832 2 0x00000080 [ 850.356549] Workqueue: events_freezable_power_ disk_events_workfn [ 850.358273] ffff88007c8ab7d8 0000000000000046 ffff88007c8ab8a0 ffff88007c894190 [ 850.360359] 0000000000012500 ffff88007c8abfd8 0000000000012500 ffff88007c894190 [ 850.362427] 0000000000000020 ffff88007c8ab8b0 0000000000000002 ffffffff81848408 [ 850.364505] Call Trace: [ 850.365504] [<ffffffff8159f814>] _cond_resched+0x24/0x40 [ 850.369185] [<ffffffff81122119>] shrink_slab+0x139/0x150 [ 850.371553] [<ffffffff811252bf>] do_try_to_free_pages+0x35f/0x4d0 [ 850.373384] [<ffffffff811254c4>] try_to_free_pages+0x94/0xc0 [ 850.375503] [<ffffffff8111a793>] __alloc_pages_nodemask+0x4e3/0xa40 [ 850.377333] [<ffffffff8115a8ce>] alloc_pages_current+0x8e/0x100 [ 850.379100] [<ffffffff8125bed6>] bio_copy_user_iov+0x1d6/0x380 [ 850.380763] [<ffffffff8125e4cd>] ? blk_rq_init+0xed/0x160 [ 850.382362] [<ffffffff8125c119>] bio_copy_kern+0x49/0x100 [ 850.384008] [<ffffffff810a14a0>] ? prepare_to_wait_event+0x100/0x100 [ 850.385799] [<ffffffff81265e6f>] blk_rq_map_kern+0x6f/0x130 [ 850.387572] [<ffffffff8116393e>] ? kmem_cache_alloc+0x48e/0x4b0 [ 850.389995] [<ffffffff813a66cf>] scsi_execute+0x12f/0x160 [ 850.391575] [<ffffffff813a7f14>] scsi_execute_req_flags+0x84/0xf0 [ 850.393298] [<ffffffffa01e29cc>] sr_check_events+0xbc/0x2e0 [sr_mod] [ 850.395050] [<ffffffff8109834c>] ? put_prev_entity+0x2c/0x3b0 [ 850.396696] [<ffffffffa01d6177>] cdrom_check_events+0x17/0x30 [cdrom] [ 850.398459] [<ffffffffa01e2e5d>] sr_block_check_events+0x2d/0x30 [sr_mod] [ 850.400321] [<ffffffff812701c6>] disk_check_events+0x56/0x1b0 [ 850.401986] [<ffffffff81270331>] disk_events_workfn+0x11/0x20 [ 850.403621] [<ffffffff8107ceaf>] process_one_work+0x13f/0x370 [ 850.405618] [<ffffffff8107de99>] worker_thread+0x119/0x500 [ 850.407336] [<ffffffff8107dd80>] ? rescuer_thread+0x350/0x350 [ 850.411190] [<ffffffff81082f7c>] kthread+0xdc/0x100 [ 850.412677] [<ffffffff81082ea0>] ? 
kthread_create_on_node+0x1b0/0x1b0 [ 850.414454] [<ffffffff815a383c>] ret_from_fork+0x7c/0xb0 [ 850.416010] [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0 (...snipped...) [ 907.302050] kworker/1:0 R running task 0 10832 2 0x00000080 [ 907.303961] Workqueue: events_freezable_power_ disk_events_workfn [ 907.305706] ffff88007c8ab7d8 0000000000000046 ffff88007c8ab8a0 ffff88007c894190 [ 907.307761] 0000000000012500 ffff88007c8abfd8 0000000000012500 ffff88007c894190 [ 907.309894] 0000000000000020 ffff88007c8ab8b0 0000000000000002 ffffffff81848408 [ 907.311949] Call Trace: [ 907.312989] [<ffffffff8159f814>] _cond_resched+0x24/0x40 [ 907.314578] [<ffffffff81122119>] shrink_slab+0x139/0x150 [ 907.316182] [<ffffffff811252bf>] do_try_to_free_pages+0x35f/0x4d0 [ 907.317889] [<ffffffff811254c4>] try_to_free_pages+0x94/0xc0 [ 907.319535] [<ffffffff8111a793>] __alloc_pages_nodemask+0x4e3/0xa40 [ 907.321259] [<ffffffff8115a8ce>] alloc_pages_current+0x8e/0x100 [ 907.322945] [<ffffffff8125bed6>] bio_copy_user_iov+0x1d6/0x380 [ 907.324606] [<ffffffff8125e4cd>] ? blk_rq_init+0xed/0x160 [ 907.326196] [<ffffffff8125c119>] bio_copy_kern+0x49/0x100 [ 907.327788] [<ffffffff810a14a0>] ? prepare_to_wait_event+0x100/0x100 [ 907.329549] [<ffffffff81265e6f>] blk_rq_map_kern+0x6f/0x130 [ 907.331184] [<ffffffff8116393e>] ? kmem_cache_alloc+0x48e/0x4b0 [ 907.332877] [<ffffffff813a66cf>] scsi_execute+0x12f/0x160 [ 907.334452] [<ffffffff813a7f14>] scsi_execute_req_flags+0x84/0xf0 [ 907.336156] [<ffffffffa01e29cc>] sr_check_events+0xbc/0x2e0 [sr_mod] [ 907.337893] [<ffffffff8109834c>] ? 
put_prev_entity+0x2c/0x3b0
[ 907.339539] [<ffffffffa01d6177>] cdrom_check_events+0x17/0x30 [cdrom]
[ 907.341289] [<ffffffffa01e2e5d>] sr_block_check_events+0x2d/0x30 [sr_mod]
[ 907.343115] [<ffffffff812701c6>] disk_check_events+0x56/0x1b0
[ 907.344771] [<ffffffff81270331>] disk_events_workfn+0x11/0x20
[ 907.346421] [<ffffffff8107ceaf>] process_one_work+0x13f/0x370
[ 907.348057] [<ffffffff8107de99>] worker_thread+0x119/0x500
[ 907.349650] [<ffffffff8107dd80>] ? rescuer_thread+0x350/0x350
[ 907.351295] [<ffffffff81082f7c>] kthread+0xdc/0x100
[ 907.352765] [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0
[ 907.354520] [<ffffffff815a383c>] ret_from_fork+0x7c/0xb0
[ 907.356097] [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0
----------

I don't know which process is holding the mutex that PID=10732 is waiting
for, but I suspect that the holder of that mutex is itself waiting for
completion of disk I/O which is processed by PID=10832. If my suspicion is
correct, it is an AB-BA livelock: the OOM killer is waiting for PID=10732
to terminate, while PID=10832 cannot complete disk I/O because it is
waiting for the OOM killer. Unfortunately I'm not familiar with XFS, so I
can't tell who the holder is. Maybe PID=10802 rather than PID=10832? Then
why are both PID=10802 and PID=10832 blocked on memory allocation?
---------- [ 715.162520] a.out R running task 0 10802 1 0x00000084 [ 715.164482] ffff88007b877898 0000000000000082 ffff88007b877960 ffff8800751bc050 [ 715.166574] 0000000000012500 ffff88007b877fd8 0000000000012500 ffff8800751bc050 [ 715.169036] 0000000000000020 ffff88007b877970 0000000000000003 ffffffff81848408 [ 715.171125] Call Trace: [ 715.172185] [<ffffffff8159f814>] _cond_resched+0x24/0x40 [ 715.173773] [<ffffffff81122119>] shrink_slab+0x139/0x150 [ 715.175356] [<ffffffff811252bf>] do_try_to_free_pages+0x35f/0x4d0 [ 715.177088] [<ffffffff811254c4>] try_to_free_pages+0x94/0xc0 [ 715.178721] [<ffffffff8111a793>] __alloc_pages_nodemask+0x4e3/0xa40 [ 715.180583] [<ffffffff8115a8ce>] alloc_pages_current+0x8e/0x100 [ 715.182203] [<ffffffff81111b27>] __page_cache_alloc+0xa7/0xc0 [ 715.183864] [<ffffffff8111263b>] pagecache_get_page+0x6b/0x1e0 [ 715.185533] [<ffffffffa02522ae>] ? xfs_trans_commit+0x13e/0x230 [xfs] [ 715.187314] [<ffffffff811127de>] grab_cache_page_write_begin+0x2e/0x50 [ 715.189108] [<ffffffffa02301cf>] xfs_vm_write_begin+0x2f/0xe0 [xfs] [ 715.190876] [<ffffffff8111188c>] generic_perform_write+0xcc/0x1d0 [ 715.192610] [<ffffffffa023b50f>] ? xfs_file_aio_write_checks+0xdf/0xf0 [xfs] [ 715.194526] [<ffffffffa023b5ef>] xfs_file_buffered_aio_write.isra.15+0xcf/0x200 [xfs] [ 715.196580] [<ffffffffa023b79e>] xfs_file_write_iter+0x7e/0x120 [xfs] [ 715.198368] [<ffffffff8117edd9>] new_sync_write+0x89/0xd0 [ 715.200029] [<ffffffff8117f742>] vfs_write+0xb2/0x1f0 [ 715.201576] [<ffffffff8101a9f4>] ? do_audit_syscall_entry+0x64/0x70 [ 715.203309] [<ffffffff81180200>] SyS_write+0x50/0xc0 [ 715.204866] [<ffffffff810f729e>] ? __audit_syscall_exit+0x22e/0x2d0 [ 715.206613] [<ffffffff815a38e9>] system_call_fastpath+0x12/0x17 (...snipped...) 
[ 906.533722] a.out R running task 0 10802 1 0x00000084 [ 906.535671] ffff88007b877898 0000000000000082 ffff88007b877960 ffff8800751bc050 [ 906.537699] 0000000000012500 ffff88007b877fd8 0000000000012500 ffff8800751bc050 [ 906.539838] 0000000000000020 ffff88007b877970 0000000000000003 ffffffff81848408 [ 906.541916] Call Trace: [ 906.543075] [<ffffffff8159f814>] _cond_resched+0x24/0x40 [ 906.544610] [<ffffffff81122119>] shrink_slab+0x139/0x150 [ 906.546223] [<ffffffff811252bf>] do_try_to_free_pages+0x35f/0x4d0 [ 906.547941] [<ffffffff811254c4>] try_to_free_pages+0x94/0xc0 [ 906.549622] [<ffffffff8111a793>] __alloc_pages_nodemask+0x4e3/0xa40 [ 906.551357] [<ffffffff8115a8ce>] alloc_pages_current+0x8e/0x100 [ 906.553070] [<ffffffff81111b27>] __page_cache_alloc+0xa7/0xc0 [ 906.554748] [<ffffffff8111263b>] pagecache_get_page+0x6b/0x1e0 [ 906.556409] [<ffffffffa02522ae>] ? xfs_trans_commit+0x13e/0x230 [xfs] [ 906.558180] [<ffffffff811127de>] grab_cache_page_write_begin+0x2e/0x50 [ 906.560242] [<ffffffffa02301cf>] xfs_vm_write_begin+0x2f/0xe0 [xfs] [ 906.562027] [<ffffffff8111188c>] generic_perform_write+0xcc/0x1d0 [ 906.563851] [<ffffffffa023b50f>] ? xfs_file_aio_write_checks+0xdf/0xf0 [xfs] [ 906.565838] [<ffffffffa023b5ef>] xfs_file_buffered_aio_write.isra.15+0xcf/0x200 [xfs] [ 906.567892] [<ffffffffa023b79e>] xfs_file_write_iter+0x7e/0x120 [xfs] [ 906.569719] [<ffffffff8117edd9>] new_sync_write+0x89/0xd0 [ 906.571300] [<ffffffff8117f742>] vfs_write+0xb2/0x1f0 [ 906.572836] [<ffffffff8101a9f4>] ? do_audit_syscall_entry+0x64/0x70 [ 906.574578] [<ffffffff81180200>] SyS_write+0x50/0xc0 [ 906.576198] [<ffffffff810f729e>] ? __audit_syscall_exit+0x22e/0x2d0 [ 906.577929] [<ffffffff815a38e9>] system_call_fastpath+0x12/0x17 ---------- Anyway stalling for 10 minutes upon OOM (and can't solve with SysRq-f) is unusable for me. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
* Re: How to handle TIF_MEMDIE stalls? 2014-12-20 12:41 ` Tetsuo Handa @ 2014-12-20 22:35 ` Dave Chinner 2014-12-21 8:45 ` Tetsuo Handa 2014-12-29 17:40 ` [PATCH] mm: get rid of radix tree gfp mask for pagecache_get_page (was: Re: How to handle TIF_MEMDIE stalls?) Michal Hocko 0 siblings, 2 replies; 276+ messages in thread From: Dave Chinner @ 2014-12-20 22:35 UTC (permalink / raw) To: Tetsuo Handa; +Cc: dchinner, mhocko, linux-mm, rientjes, oleg On Sat, Dec 20, 2014 at 09:41:22PM +0900, Tetsuo Handa wrote: > Dave Chinner wrote: > > On Fri, Dec 19, 2014 at 09:22:49PM +0900, Tetsuo Handa wrote: > > > > > The global OOM killer will try to kill this program because this program > > > > > will be using 400MB+ of RAM by the time the global OOM killer is triggered. > > > > > But sometimes this program cannot be terminated by the global OOM killer > > > > > due to XFS lock dependency. > > > > > > > > > > You can see what is happening from OOM traces after uptime > 320 seconds of > > > > > http://I-love.SAKURA.ne.jp/tmp/serial-20141213.txt.xz though memcg is not > > > > > configured on this program. > > > > > > > > This is clearly a separate issue. It is a lock dependency and that alone > > > > _cannot_ be handled from OOM killer as it doesn't understand lock > > > > dependencies. This should be addressed from the xfs point of view IMHO > > > > but I am not familiar with this filesystem to tell you how or whether it > > > > is possible. > > > > What XFS lock dependency? I see nothing in that output file that indicates a > > lock dependency problem - can you point out what the issue is here? > > This is a problem which lockdep cannot report. > > The problem is that an OOM-victim task is unable to terminate because it is > blocked for waiting for (I don't know which lock but) one of locks used by XFS. That's not an XFS problem - XFS relies on the memory reclaim subsystem being able to make progress. 
If the memory reclaim subsystem cannot make progress, then there's a bug in the memory reclaim subsystem, not a problem with the OOM killer. IOWs, you're not looking at the right place to solve the problem. > ---------- > [ 320.788387] Kill process 10732 (a.out) sharing same memory > (...snipped...) > [ 398.641724] a.out D ffff880077e42638 0 10732 1 0x00000084 > [ 398.643705] ffff8800770ebcb8 0000000000000082 ffff8800770ebc88 ffff880077e42210 > [ 398.645819] 0000000000012500 ffff8800770ebfd8 0000000000012500 ffff880077e42210 > [ 398.647917] ffff8800770ebcb8 ffff88007b4a2a48 ffff88007b4a2a4c ffff880077e42210 > [ 398.650009] Call Trace: > [ 398.651094] [<ffffffff8159f954>] schedule_preempt_disabled+0x24/0x70 > [ 398.652913] [<ffffffff815a1705>] __mutex_lock_slowpath+0xb5/0x120 > [ 398.654679] [<ffffffff815a178e>] mutex_lock+0x1e/0x32 > [ 398.656262] [<ffffffffa023b58a>] xfs_file_buffered_aio_write.isra.15+0x6a/0x200 [xfs] > [ 398.658350] [<ffffffffa023b79e>] xfs_file_write_iter+0x7e/0x120 [xfs] > [ 398.660191] [<ffffffff8117edd9>] new_sync_write+0x89/0xd0 > [ 398.661829] [<ffffffff8117f742>] vfs_write+0xb2/0x1f0 > [ 398.663397] [<ffffffff8101a9f4>] ? do_audit_syscall_entry+0x64/0x70 > [ 398.665190] [<ffffffff81180200>] SyS_write+0x50/0xc0 > [ 398.666745] [<ffffffff810f729e>] ? __audit_syscall_exit+0x22e/0x2d0 > [ 398.668539] [<ffffffff815a38e9>] system_call_fastpath+0x12/0x17 These processes are blocked because some other process is holding the i_mutex - likely another write that is blocked in memory reclaim during page cache allocation. 
Yup: [ 398.852364] a.out R running task 0 10739 1 0x00000084 [ 398.854312] ffff8800751d3898 0000000000000082 ffff8800751d3960 ffff880035c42a80 [ 398.856369] 0000000000012500 ffff8800751d3fd8 0000000000012500 ffff880035c42a80 [ 398.858440] 0000000000000020 ffff8800751d3970 0000000000000003 ffffffff81848408 [ 398.860497] Call Trace: [ 398.861602] [<ffffffff8159f814>] _cond_resched+0x24/0x40 [ 398.863195] [<ffffffff81122119>] shrink_slab+0x139/0x150 [ 398.864799] [<ffffffff811252bf>] do_try_to_free_pages+0x35f/0x4d0 [ 398.866536] [<ffffffff811254c4>] try_to_free_pages+0x94/0xc0 [ 398.868177] [<ffffffff8111a793>] __alloc_pages_nodemask+0x4e3/0xa40 [ 398.869920] [<ffffffff8115a8ce>] alloc_pages_current+0x8e/0x100 [ 398.871647] [<ffffffff81111b27>] __page_cache_alloc+0xa7/0xc0 [ 398.873785] [<ffffffff8111263b>] pagecache_get_page+0x6b/0x1e0 [ 398.875468] [<ffffffff811127de>] grab_cache_page_write_begin+0x2e/0x50 [ 398.881857] [<ffffffffa02301cf>] xfs_vm_write_begin+0x2f/0xe0 [xfs] [ 398.883553] [<ffffffff8111188c>] generic_perform_write+0xcc/0x1d0 [ 398.885210] [<ffffffffa023b50f>] ? xfs_file_aio_write_checks+0xdf/0xf0 [xfs] [ 398.887100] [<ffffffffa023b5ef>] xfs_file_buffered_aio_write.isra.15+0xcf/0x200 [xfs] [ 398.889135] [<ffffffffa023b79e>] xfs_file_write_iter+0x7e/0x120 [xfs] [ 398.890907] [<ffffffff8117edd9>] new_sync_write+0x89/0xd0 [ 398.892495] [<ffffffff8117f742>] vfs_write+0xb2/0x1f0 [ 398.894017] [<ffffffff8101a9f4>] ? do_audit_syscall_entry+0x64/0x70 [ 398.895768] [<ffffffff81180200>] SyS_write+0x50/0xc0 [ 398.897273] [<ffffffff810f729e>] ? __audit_syscall_exit+0x22e/0x2d0 [ 398.899013] [<ffffffff815a38e9>] system_call_fastpath+0x12/0x17 That's what's holding the i_mutex. This is normal, and *every* filesystem holds the i_mutex here for buffered writes. Stop trying to shoot the messenger... Oh, boy. 
struct page *grab_cache_page_write_begin(struct address_space *mapping,
					pgoff_t index, unsigned flags)
{
	struct page *page;
	int fgp_flags = FGP_LOCK|FGP_ACCESSED|FGP_WRITE|FGP_CREAT;

	if (flags & AOP_FLAG_NOFS)
		fgp_flags |= FGP_NOFS;
	page = pagecache_get_page(mapping, index, fgp_flags,
			mapping_gfp_mask(mapping),
			GFP_KERNEL);
	if (page)
		wait_for_stable_page(page);
	return page;
}

There are *3* different memory allocation controls passed to pagecache_get_page. The first is via AOP_FLAG_NOFS, where the caller explicitly says this allocation is in filesystem context with locks held, and so all allocations need to be done in GFP_NOFS context. This is used to override the second and third gfp parameters.

The second is mapping_gfp_mask(mapping), which is the *default allocation context* the filesystem wants the page cache to use for allocating pages to the mapping.

The third is a hard coded GFP_KERNEL, which is used for radix tree node allocation.

Why are there separate allocation contexts for the radix tree nodes and the page cache pages when they are done under *exactly the same caller context*? Either we are allowed to recurse into the filesystem or we aren't, and the inode mapping mask defines that context for all page cache allocations, not just the pages themselves.

And to point out how many filesystems this affects: the loop device, btrfs, f2fs, gfs2, jfs, logfs, nilfs2, reiserfs and XFS all use this mapping default to clear __GFP_FS from page cache allocations. Only ext4 and gfs2 use AOP_FLAG_NOFS in their ->write_begin callouts to prevent recursion.

IOWs, the multiple allocation contexts in grab_cache_page_write_begin/pagecache_get_page are just wrong. They do not match the way filesystems are informing the page cache of allocation context to avoid recursion (for avoiding stack overflow and/or deadlock). AOP_FLAG_NOFS should go away, and all filesystems should modify the mapping gfp mask to set their allocation context.
It should be used *everywhere* pages are allocated into the page cache, and for all allocations related to tracking those allocated pages.

Now, that's not the problem directly related to this lockup, but it's indicative of how far the page cache code has drifted from reality over the past few years...

So, going back to the lockup, doesn't the fact that so many processes are spinning in the shrinker tell you that there's a problem in that area? i.e. this:

[ 398.861602] [<ffffffff8159f814>] _cond_resched+0x24/0x40
[ 398.863195] [<ffffffff81122119>] shrink_slab+0x139/0x150
[ 398.864799] [<ffffffff811252bf>] do_try_to_free_pages+0x35f/0x4d0

tells me a shrinker is not making progress for some reason. I'd suggest that you run some tracing to find out what shrinker it is stuck in. There are tracepoints in shrink_slab that will tell you what shrinker is iterating for long periods of time. i.e. instead of ranting and pointing fingers at everyone, you need to keep digging until you know exactly where reclaim progress is stalling.

> I don't know how block layer requests are issued by filesystem layer's
> activities, but PID=10832 is blocked for so long at blk_rq_map_kern() doing
> __GFP_WAIT allocation. I'm sure that this blk_rq_map_kern() is issued by XFS
> filesystem's activities because this system has only /dev/sda1 formatted as
> XFS and there is no swap memory.

Sorry, what?

[ 525.184545] [<ffffffff81265e6f>] blk_rq_map_kern+0x6f/0x130
[ 525.186156] [<ffffffff8116393e>] ? kmem_cache_alloc+0x48e/0x4b0
[ 525.187831] [<ffffffff813a66cf>] scsi_execute+0x12f/0x160
[ 525.189418] [<ffffffff813a7f14>] scsi_execute_req_flags+0x84/0xf0
[ 525.191148] [<ffffffffa01e29cc>] sr_check_events+0xbc/0x2e0 [sr_mod]
[ 525.192969] [<ffffffff8109834c>] ?
put_prev_entity+0x2c/0x3b0 [ 525.194688] [<ffffffffa01d6177>] cdrom_check_events+0x17/0x30 [cdrom] [ 525.196455] [<ffffffffa01e2e5d>] sr_block_check_events+0x2d/0x30 [sr_mod] [ 525.198291] [<ffffffff812701c6>] disk_check_events+0x56/0x1b0 [ 525.199984] [<ffffffff81270331>] disk_events_workfn+0x11/0x20 [ 525.201616] [<ffffffff8107ceaf>] process_one_work+0x13f/0x370 [ 525.203264] [<ffffffff8107de99>] worker_thread+0x119/0x500 [ 525.204799] [<ffffffff8107dd80>] ? rescuer_thread+0x350/0x350 [ 525.206436] [<ffffffff81082f7c>] kthread+0xdc/0x100 [ 525.207902] [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0 [ 525.209655] [<ffffffff815a383c>] ret_from_fork+0x7c/0xb0 [ 525.211206] [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0 That's a CDROM event through the SCSI stack via a raw scsi device. If you read the code you'd see that scsi_execute() is the function using __GFP_WAIT semantics. This has *absolutely nothing* to do with XFS, and clearly has nothing to do with anything related to the problem you are seeing. > Anyway stalling for 10 minutes upon OOM (and can't solve with > SysRq-f) is unusable for me. OOM-killing is not a magic button that will miraculously make the system work when you oversubscribe it severely. Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 276+ messages in thread
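Dave's description above of the three allocation controls passed to pagecache_get_page() can be sketched outside the kernel like this. This is an illustrative model only: the flag bits and helper names are made up for the example, not the kernel's real GFP values.

```c
/*
 * Sketch of the gfp precedence described for pagecache_get_page():
 * AOP_FLAG_NOFS overrides everything, otherwise page allocation uses
 * the mapping's default mask while radix tree nodes get a hard coded
 * GFP_KERNEL.  All constants here are invented for illustration.
 */
#define X_GFP_FS        0x1u   /* allocation may recurse into fs code */
#define X_GFP_IO        0x2u
#define X_GFP_WAIT      0x4u
#define X_GFP_KERNEL    (X_GFP_FS | X_GFP_IO | X_GFP_WAIT)
#define X_AOP_FLAG_NOFS 0x1u

typedef unsigned int xgfp_t;

/* Effective mask for the page itself: the mapping's default,
 * optionally stripped of FS recursion by AOP_FLAG_NOFS. */
static xgfp_t page_alloc_gfp(xgfp_t mapping_gfp, unsigned int aop_flags)
{
	if (aop_flags & X_AOP_FLAG_NOFS)
		return mapping_gfp & ~X_GFP_FS;
	return mapping_gfp;
}

/* Effective mask for the radix tree node: ignores the mapping mask
 * entirely unless AOP_FLAG_NOFS is set -- the inconsistency being
 * pointed out, since both run under the same caller context. */
static xgfp_t radix_alloc_gfp(unsigned int aop_flags)
{
	if (aop_flags & X_AOP_FLAG_NOFS)
		return X_GFP_KERNEL & ~X_GFP_FS;
	return X_GFP_KERNEL;
}
```

With a mapping mask that clears the FS bit (as XFS and the others do), the page allocation correctly avoids filesystem recursion but the radix tree allocation does not, which is exactly the mismatch complained about above.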
* Re: How to handle TIF_MEMDIE stalls? 2014-12-20 22:35 ` Dave Chinner @ 2014-12-21 8:45 ` Tetsuo Handa 2014-12-21 20:42 ` Dave Chinner 2014-12-29 18:19 ` Michal Hocko 2014-12-29 17:40 ` [PATCH] mm: get rid of radix tree gfp mask for pagecache_get_page (was: Re: How to handle TIF_MEMDIE stalls?) Michal Hocko 1 sibling, 2 replies; 276+ messages in thread
From: Tetsuo Handa @ 2014-12-21 8:45 UTC (permalink / raw)
To: david; +Cc: dchinner, mhocko, linux-mm, rientjes, oleg

Thank you for the detailed explanation.

Dave Chinner wrote:
> So, going back to the lockup, doesn't the fact that so many
> processes are spinning in the shrinker tell you that there's a
> problem in that area? i.e. this:
>
> [ 398.861602] [<ffffffff8159f814>] _cond_resched+0x24/0x40
> [ 398.863195] [<ffffffff81122119>] shrink_slab+0x139/0x150
> [ 398.864799] [<ffffffff811252bf>] do_try_to_free_pages+0x35f/0x4d0
>
> tells me a shrinker is not making progress for some reason. I'd
> suggest that you run some tracing to find out what shrinker it is
> stuck in. There are tracepoints in shrink_slab that will tell you
> what shrinker is iterating for long periods of time. i.e. instead of
> ranting and pointing fingers at everyone, you need to keep digging
> until you know exactly where reclaim progress is stalling.

I checked with the patch below that shrink_slab() is called many times, but each call took 0 jiffies and freed 0 objects. I think shrink_slab() merely shows up in the traces because it is a convenient point for yielding the CPU.
---------- diff --git a/include/linux/sched.h b/include/linux/sched.h index 5e344bb..ac8b46a 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1661,6 +1661,14 @@ struct task_struct { unsigned int sequential_io; unsigned int sequential_io_avg; #endif + /* Jiffies spent since the start of outermost memory allocation */ + unsigned long gfp_start; + /* GFP flags passed to innermost memory allocation */ + gfp_t gfp_flags; + /* # of shrink_slab() calls since outermost memory allocation. */ + unsigned int shrink_slab_counter; + /* # of OOM-killer skipped. */ + atomic_t oom_killer_skip_counter; }; /* Future-safe accessor for struct task_struct's cpus_allowed. */ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 89e7283..26dcdf8 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4522,6 +4522,22 @@ out_unlock: return retval; } +static void print_memalloc_info(const struct task_struct *p) +{ + const gfp_t gfp = p->gfp_flags & __GFP_WAIT; + + /* + * __alloc_pages_nodemask() doesn't use smp_wmb() between + * updating ->gfp_start and ->gfp_flags. But reading stale + * ->gfp_start value harms nothing but printing bogus duration. + * Correct duration will be printed when this function is + * called for the next time. 
+ */ + if (unlikely(gfp)) + printk(KERN_INFO "MemAlloc: %ld jiffies on 0x%x\n", + jiffies - p->gfp_start, gfp); +} + static const char stat_nam[] = TASK_STATE_TO_CHAR_STR; void sched_show_task(struct task_struct *p) @@ -4554,6 +4570,7 @@ void sched_show_task(struct task_struct *p) task_pid_nr(p), ppid, (unsigned long)task_thread_info(p)->flags); + print_memalloc_info(p); print_worker_info(KERN_INFO, p); show_stack(p, NULL); } diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 5340f6b..5b014d0 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -319,6 +319,10 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, case OOM_SCAN_CONTINUE: continue; case OOM_SCAN_ABORT: + if (atomic_inc_return(&p->oom_killer_skip_counter) % 1000 == 0) + printk(KERN_INFO "%s(%d) the OOM killer was skipped " + "for %u times.\n", p->comm, p->pid, + atomic_read(&p->oom_killer_skip_counter)); rcu_read_unlock(); return (struct task_struct *)(-1UL); case OOM_SCAN_OK: @@ -444,6 +448,10 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, * its children or threads, just set TIF_MEMDIE so it can die quickly */ if (p->flags & PF_EXITING) { + if (atomic_inc_return(&p->oom_killer_skip_counter) % 1000 == 0) + printk(KERN_INFO "%s(%d) the OOM killer was skipped " + "for %u times.\n", p->comm, p->pid, + atomic_read(&p->oom_killer_skip_counter)); set_tsk_thread_flag(p, TIF_MEMDIE); put_task_struct(p); return; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 616a2c9..d1c872f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2790,6 +2790,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, unsigned int cpuset_mems_cookie; int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR; int classzone_idx; + const gfp_t old_gfp_flags = current->gfp_flags; + + if (!old_gfp_flags) { + current->gfp_start = jiffies; + current->shrink_slab_counter = 0; + } + current->gfp_flags = gfp_mask; gfp_mask &= gfp_allowed_mask; @@ -2798,7 +2805,7 @@ 
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, might_sleep_if(gfp_mask & __GFP_WAIT); if (should_fail_alloc_page(gfp_mask, order)) - return NULL; + goto nopage; /* * Check the zones suitable for the gfp_mask contain at least one @@ -2806,7 +2813,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, * of GFP_THISNODE and a memoryless node */ if (unlikely(!zonelist->_zonerefs->zone)) - return NULL; + goto nopage; if (IS_ENABLED(CONFIG_CMA) && migratetype == MIGRATE_MOVABLE) alloc_flags |= ALLOC_CMA; @@ -2850,6 +2857,9 @@ out: if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie))) goto retry_cpuset; +nopage: + current->gfp_flags = old_gfp_flags; + return page; } EXPORT_SYMBOL(__alloc_pages_nodemask); diff --git a/mm/vmscan.c b/mm/vmscan.c index dcb4707..5690f2d 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -365,6 +365,7 @@ unsigned long shrink_slab(struct shrink_control *shrinkctl, { struct shrinker *shrinker; unsigned long freed = 0; + const unsigned long start = jiffies; if (nr_pages_scanned == 0) nr_pages_scanned = SWAP_CLUSTER_MAX; @@ -397,6 +398,15 @@ unsigned long shrink_slab(struct shrink_control *shrinkctl, } up_read(&shrinker_rwsem); out: + { + struct task_struct *p = current; + if (++p->shrink_slab_counter % 100000 == 0) + printk(KERN_INFO "%s(%d) shrink_slab() was called for " + "%u times. This time freed %lu object and took " + "%lu jiffies. Spent %lu jiffies till now.\n", + p->comm, p->pid, p->shrink_slab_counter, freed, + jiffies - start, jiffies - p->gfp_start); + } cond_resched(); return freed; } ---------- Traces from uptime > 484 seconds of http://I-love.SAKURA.ne.jp/tmp/serial-20141221.txt.xz is a stalled case. PID=12718 got SIGKILL for the first time when PID=12716 got SIGKILL with TIF_MEMDIE at 484 sec. When PID=12717 got TIF_MEMDIE at 540 sec, the OOM killer was skipped for 28000 times till 547 sec, but PID=12717 was able to terminate because somebody has released enough memory for PID=12717 to call exit_mm(). 
When PID=12718 got TIF_MEMDIE at 548 sec, the OOM killer was skipped 2059000 times up until 983 sec, indicating that PID=12718 was not able to terminate because nobody released enough memory for PID=12718 to call exit_mm(). Is this interpretation correct?

> That's not an XFS problem - XFS relies on the memory reclaim
> subsystem being able to make progress. If the memory reclaim
> subsystem cannot make progress, then there's a bug in the memory
> reclaim subsystem, not a problem with the OOM killer.

Since trying to trigger the OOM killer means that the memory reclaim subsystem has given up, the memory reclaim subsystem had been unable to find reclaimable memory after PID=12718 got TIF_MEMDIE at 548 sec. Is this interpretation correct? And the traces of PID=12718 after 548 sec remained unchanged. Does this mean that there is a bug in the memory reclaim subsystem?

----------
[ 799.490009] a.out D ffff8800764918a0 0 12718 1 0x00100084
[ 799.491903] ffff880077d7fca8 0000000000000086 ffff880076491470 ffff880077d7ffd8
[ 799.493924] 0000000000013640 0000000000013640 ffff8800358c8210 ffff880076491470
[ 799.495938] 0000000000000000 ffff88007c8a3e48 ffff88007c8a3e4c ffff880076491470
[ 799.497964] Call Trace:
[ 799.498971] [<ffffffff81618669>] schedule_preempt_disabled+0x29/0x70
[ 799.500746] [<ffffffff8161a555>] __mutex_lock_slowpath+0xb5/0x120
[ 799.502402] [<ffffffff8161a5e3>] mutex_lock+0x23/0x37
[ 799.503944] [<ffffffffa025fb47>] xfs_file_buffered_aio_write.isra.9+0x77/0x270 [xfs]
[ 799.505939] [<ffffffff8109e274>] ? finish_task_switch+0x54/0x150
[ 799.507638] [<ffffffffa025fdc3>] xfs_file_write_iter+0x83/0x130 [xfs]
[ 799.509416] [<ffffffff811ce76e>] new_sync_write+0x8e/0xd0
[ 799.510990] [<ffffffff811cf0f7>] vfs_write+0xb7/0x1f0
[ 799.512484] [<ffffffff81022d9c>] ? do_audit_syscall_entry+0x6c/0x70
[ 799.514226] [<ffffffff811cfbe5>] SyS_write+0x55/0xd0
[ 799.515752] [<ffffffff8161c9e9>] system_call_fastpath+0x12/0x17
(...snipped...)
[ 954.595576] a.out D ffff8800764918a0 0 12718 1 0x00100084 [ 954.597544] ffff880077d7fca8 0000000000000086 ffff880076491470 ffff880077d7ffd8 [ 954.599565] 0000000000013640 0000000000013640 ffff8800358c8210 ffff880076491470 [ 954.601634] 0000000000000000 ffff88007c8a3e48 ffff88007c8a3e4c ffff880076491470 [ 954.604091] Call Trace: [ 954.607766] [<ffffffff81618669>] schedule_preempt_disabled+0x29/0x70 [ 954.609792] [<ffffffff8161a555>] __mutex_lock_slowpath+0xb5/0x120 [ 954.611644] [<ffffffff8161a5e3>] mutex_lock+0x23/0x37 [ 954.613256] [<ffffffffa025fb47>] xfs_file_buffered_aio_write.isra.9+0x77/0x270 [xfs] [ 954.615261] [<ffffffff8109e274>] ? finish_task_switch+0x54/0x150 [ 954.616990] [<ffffffffa025fdc3>] xfs_file_write_iter+0x83/0x130 [xfs] [ 954.619180] [<ffffffff811ce76e>] new_sync_write+0x8e/0xd0 [ 954.620798] [<ffffffff811cf0f7>] vfs_write+0xb7/0x1f0 [ 954.622345] [<ffffffff81022d9c>] ? do_audit_syscall_entry+0x6c/0x70 [ 954.624073] [<ffffffff811cfbe5>] SyS_write+0x55/0xd0 [ 954.625549] [<ffffffff8161c9e9>] system_call_fastpath+0x12/0x17 ---------- I guess __alloc_pages_direct_reclaim() returns NULL with did_some_progress > 0 so that __alloc_pages_may_oom() will not be called easily. As long as try_to_free_pages() returns non-zero, __alloc_pages_direct_reclaim() might return NULL with did_some_progress > 0. So, do_try_to_free_pages() is called for many times and is likely to return non-zero. And when __alloc_pages_may_oom() is called, TIF_MEMDIE is set on the thread waiting for mutex_lock(&"struct inode"->i_mutex) at xfs_file_buffered_aio_write() and I see no further progress. I don't know where to examine next. Would you please teach me command line for tracepoints to examine? > That's a CDROM event through the SCSI stack via a raw scsi device. > If you read the code you'd see that scsi_execute() is the function > using __GFP_WAIT semantics. 
This has *absolutely nothing* to do with
> XFS, and clearly has nothing to do with anything related to the
> problem you are seeing.

Oops, sorry. I had misunderstood that the

[ 907.336156] [<ffffffffa01e29cc>] sr_check_events+0xbc/0x2e0 [sr_mod]
[ 907.337893] [<ffffffff8109834c>] ? put_prev_entity+0x2c/0x3b0
[ 907.339539] [<ffffffffa01d6177>] cdrom_check_events+0x17/0x30 [cdrom]
[ 907.341289] [<ffffffffa01e2e5d>] sr_block_check_events+0x2d/0x30 [sr_mod]

lines were garbage. But indeed there is a chain

disk_check_events() => disk->fops->check_events(disk, clearing) == sr_block_check_events() => cdrom_check_events() => cdrom_update_events() => cdi->ops->check_events() == sr_check_events() => sr_get_events() => scsi_execute_req()

that indicates it is blocked at a CDROM event.
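On the command-line question asked above: the shrink_slab tracepoints Dave refers to can be enabled through the tracing filesystem roughly as follows. This is a sketch assuming a kernel of this era with debugfs mounted at /sys/kernel/debug and root privileges; the event names come from include/trace/events/vmscan.h.

```shell
# Enable the shrinker tracepoints (root required).
cd /sys/kernel/debug/tracing
echo 1 > events/vmscan/mm_shrink_slab_start/enable
echo 1 > events/vmscan/mm_shrink_slab_end/enable
echo 1 > tracing_on

# Reproduce the stall, then read the events.  Each start/end pair names
# the shrinker callback; a start with no matching end, or a large time
# gap between the two, points at the shrinker that is stalling.
cat trace_pipe

# Disable again when done.
echo 0 > events/vmscan/mm_shrink_slab_start/enable
echo 0 > events/vmscan/mm_shrink_slab_end/enable
```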
* Re: How to handle TIF_MEMDIE stalls? 2014-12-21 8:45 ` Tetsuo Handa @ 2014-12-21 20:42 ` Dave Chinner 2014-12-22 16:57 ` Michal Hocko 2014-12-29 18:19 ` Michal Hocko 1 sibling, 1 reply; 276+ messages in thread
From: Dave Chinner @ 2014-12-21 20:42 UTC (permalink / raw)
To: Tetsuo Handa; +Cc: dchinner, mhocko, linux-mm, rientjes, oleg

On Sun, Dec 21, 2014 at 05:45:32PM +0900, Tetsuo Handa wrote:
> Thank you for the detailed explanation.
>
> Dave Chinner wrote:
> > So, going back to the lockup, doesn't the fact that so many
> > processes are spinning in the shrinker tell you that there's a
> > problem in that area? i.e. this:
> >
> > [ 398.861602] [<ffffffff8159f814>] _cond_resched+0x24/0x40
> > [ 398.863195] [<ffffffff81122119>] shrink_slab+0x139/0x150
> > [ 398.864799] [<ffffffff811252bf>] do_try_to_free_pages+0x35f/0x4d0
> >
> > tells me a shrinker is not making progress for some reason. I'd
> > suggest that you run some tracing to find out what shrinker it is
> > stuck in. There are tracepoints in shrink_slab that will tell you
> > what shrinker is iterating for long periods of time. i.e. instead of
> > ranting and pointing fingers at everyone, you need to keep digging
> > until you know exactly where reclaim progress is stalling.
>
> I checked with the patch below that shrink_slab() is called many times,
> but each call took 0 jiffies and freed 0 objects. I think shrink_slab()
> merely shows up in the traces because it is a convenient point for
> yielding the CPU.

So we've got a situation where memory reclaim is not making progress because there's nothing left to free, and everything is backed up waiting for memory allocation to complete so that locks can be released.

> Since trying to trigger the OOM killer means that the memory reclaim
> subsystem has given up, the memory reclaim subsystem had been unable
> to find reclaimable memory after PID=12718 got TIF_MEMDIE at 548 sec.
> Is this interpretation correct?

"memory reclaim gave up"?
So why the hell isn't it returning a failure to the caller?

i.e. We have a perfectly good page cache allocation failure error path here all the way back to userspace, but we're invoking the OOM-killer to kill random processes rather than returning ENOMEM to the processes that are generating the memory demand?

Further: when did the oom-killer become the primary method of handling situations when memory allocation needs to fail? __GFP_WAIT does *not* mean memory allocation can't fail - that's what __GFP_NOFAIL means. And none of the page cache allocations use __GFP_NOFAIL, so why aren't we getting an allocation failure before the oom-killer is kicked?

> I guess __alloc_pages_direct_reclaim() returns NULL with did_some_progress > 0
> so that __alloc_pages_may_oom() will not be called easily. As long as
> try_to_free_pages() returns non-zero, __alloc_pages_direct_reclaim() might
> return NULL with did_some_progress > 0. So, do_try_to_free_pages() is called
> for many times and is likely to return non-zero. And when
> __alloc_pages_may_oom() is called, TIF_MEMDIE is set on the thread waiting
> for mutex_lock(&"struct inode"->i_mutex) at xfs_file_buffered_aio_write()
> and I see no further progress.

Of course - TIF_MEMDIE doesn't do anything to the task that is blocked, and the SIGKILL signal can't be delivered until the syscall completes or the kernel code checks for pending signals and handles EINTR directly. Mutexes are uninterruptible by design so there's no EINTR processing, hence the oom killer cannot make progress when everything is blocked on mutexes waiting for memory allocation to succeed or fail.

i.e. until the lock holder exits from direct memory reclaim and releases the locks it holds, the oom killer will not be able to save the system. IOWs, the problem is that memory allocation is not failing when it should....
Focussing on the OOM killer here is the wrong way to solve this problem - the problem that needs to be solved is sane handling of OOM conditions to avoid needing to invoke the OOM-killer...

> I don't know where to examine next. Would you please teach me command line
> for tracepoints to examine?

Tracepoints for what purpose?

Cheers, Dave.
--
Dave Chinner
david@fromorbit.com
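The retry behaviour guessed at in the quoted text (keep looping while did_some_progress is non-zero, and only consider the OOM path once reclaim reports no progress) can be modelled in userspace. This is a simplified sketch of that logic, not the real __alloc_pages_slowpath(); the mock reclaim function stands in for __alloc_pages_direct_reclaim().

```c
#include <stdbool.h>

struct alloc_state {
	int progress_reports;	/* how many times reclaim will claim progress */
	bool oom_invoked;
	int loops;
};

/* Mock of __alloc_pages_direct_reclaim(): never produces a page, but
 * keeps reporting "progress" for a while (e.g. shrinkers returning
 * nonzero counts without freeing anything allocatable). */
static void *direct_reclaim(struct alloc_state *s, unsigned long *progress)
{
	*progress = s->progress_reports > 0 ? 1 : 0;
	if (s->progress_reports > 0)
		s->progress_reports--;
	return 0;
}

/* Simplified slowpath: retry while reclaim claims progress; only when
 * no progress is reported does the OOM path get a look-in. */
static void *alloc_slowpath(struct alloc_state *s)
{
	for (;;) {
		unsigned long progress;
		void *page = direct_reclaim(s, &progress);

		s->loops++;
		if (page)
			return page;
		if (!progress) {
			s->oom_invoked = true;	/* __alloc_pages_may_oom() */
			return 0;
		}
		/* did_some_progress > 0: loop again without invoking OOM */
	}
}
```

As long as reclaim keeps reporting progress the loop never reaches the OOM path, which matches the observed behaviour of __alloc_pages_may_oom() being called only rarely.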
* Re: How to handle TIF_MEMDIE stalls? 2014-12-21 20:42 ` Dave Chinner @ 2014-12-22 16:57 ` Michal Hocko 2014-12-22 21:30 ` Dave Chinner 0 siblings, 1 reply; 276+ messages in thread
From: Michal Hocko @ 2014-12-22 16:57 UTC (permalink / raw)
To: Dave Chinner; +Cc: Tetsuo Handa, dchinner, linux-mm, rientjes, oleg

On Mon 22-12-14 07:42:49, Dave Chinner wrote:
[...]
> "memory reclaim gave up"? So why the hell isn't it returning a
> failure to the caller?
>
> i.e. We have a perfectly good page cache allocation failure error
> path here all the way back to userspace, but we're invoking the
> OOM-killer to kill random processes rather than returning ENOMEM to
> the processes that are generating the memory demand?
>
> Further: when did the oom-killer become the primary method
> of handling situations when memory allocation needs to fail?
> __GFP_WAIT does *not* mean memory allocation can't fail - that's what
> __GFP_NOFAIL means. And none of the page cache allocations use
> __GFP_NOFAIL, so why aren't we getting an allocation failure before
> the oom-killer is kicked?

Well, it has been an unwritten rule that GFP_KERNEL allocations for low-order (<=PAGE_ALLOC_COSTLY_ORDER) never fail. This is a long-ago decision which would be tricky to fix now without silently breaking a lot of code. Sad...

Nevertheless the caller can prevent an endless loop by using __GFP_NORETRY, so this could be used as a workaround. The default should be the opposite IMO, and only those who really require some guarantee should use a special flag for that purpose.

> > I guess __alloc_pages_direct_reclaim() returns NULL with did_some_progress > 0
> > so that __alloc_pages_may_oom() will not be called easily. As long as
> > try_to_free_pages() returns non-zero, __alloc_pages_direct_reclaim() might
> > return NULL with did_some_progress > 0. So, do_try_to_free_pages() is called
> > for many times and is likely to return non-zero.
> > And when
> > __alloc_pages_may_oom() is called, TIF_MEMDIE is set on the thread waiting
> > for mutex_lock(&"struct inode"->i_mutex) at xfs_file_buffered_aio_write()
> > and I see no further progress.
>
> Of course - TIF_MEMDIE doesn't do anything to the task that is
> blocked, and the SIGKILL signal can't be delivered until the syscall
> completes or the kernel code checks for pending signals and handles
> EINTR directly. Mutexes are uninterruptible by design so there's no
> EINTR processing, hence the oom killer cannot make progress when
> everything is blocked on mutexes waiting for memory allocation to
> succeed or fail.
>
> i.e. until the lock holder exits from direct memory reclaim and
> releases the locks it holds, the oom killer will not be able to save
> the system. IOWs, the problem is that memory allocation is not
> failing when it should....
>
> Focussing on the OOM killer here is the wrong way to solve this
> problem - the problem that needs to be solved is sane handling of
> OOM conditions to avoid needing to invoke the OOM-killer...

Completely agreed!

[...]
--
Michal Hocko
SUSE Labs
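The __GFP_NORETRY workaround Michal mentions is a caller-side opt-out: the allocation fails back to the caller instead of looping in reclaim and the OOM killer. The sketch below models that pattern in userspace with a mock allocator and invented flag values; real kernel code would pass GFP_KERNEL | __GFP_NORETRY to alloc_page() and friends.

```c
#include <errno.h>
#include <stddef.h>

#define M_GFP_KERNEL  0x1u
#define M_GFP_NORETRY 0x2u	/* give up instead of looping */

/* Mock standing in for the page allocator: under memory pressure it
 * would loop in reclaim/OOM forever unless M_GFP_NORETRY is set,
 * mirroring the "low-order GFP_KERNEL never fails" behaviour
 * described in the thread. */
static void *mock_alloc_page(unsigned int gfp, int under_pressure)
{
	static char fake_page[4096];

	if (!under_pressure)
		return fake_page;
	if (gfp & M_GFP_NORETRY)
		return NULL;		/* fail back to the caller */
	for (;;)			/* endless reclaim/OOM loop */
		;
}

/* Caller-side pattern: opt out of the retry loop and turn the failure
 * into a plain -ENOMEM that the rest of the stack already knows how
 * to propagate back to userspace. */
static int write_begin_like(int under_pressure)
{
	void *page = mock_alloc_page(M_GFP_KERNEL | M_GFP_NORETRY,
				     under_pressure);
	if (!page)
		return -ENOMEM;
	return 0;
}
```

As the follow-up messages note, annotating every call site this way is whack-a-mole; the sketch only shows what the per-caller workaround looks like, not a general fix.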
* Re: How to handle TIF_MEMDIE stalls? 2014-12-22 16:57 ` Michal Hocko @ 2014-12-22 21:30 ` Dave Chinner 2014-12-23 9:41 ` Johannes Weiner 0 siblings, 1 reply; 276+ messages in thread From: Dave Chinner @ 2014-12-22 21:30 UTC (permalink / raw) To: Michal Hocko; +Cc: Tetsuo Handa, dchinner, linux-mm, rientjes, oleg On Mon, Dec 22, 2014 at 05:57:36PM +0100, Michal Hocko wrote: > On Mon 22-12-14 07:42:49, Dave Chinner wrote: > [...] > > "memory reclaim gave up"? So why the hell isn't it returning a > > failure to the caller? > > > > i.e. We have a perfectly good page cache allocation failure error > > path here all the way back to userspace, but we're invoking the > > OOM-killer to kill random processes rather than returning ENOMEM to > > the processes that are generating the memory demand? > > > > Further: when did the oom-killer become the primary method > > of handling situations when memory allocation needs to fail? > > __GFP_WAIT does *not* mean memory allocation can't fail - that's what > > __GFP_NOFAIL means. And none of the page cache allocations use > > __GFP_NOFAIL, so why aren't we getting an allocation failure before > > the oom-killer is kicked? > > Well, it has been an unwritten rule that GFP_KERNEL allocations for > low-order (<=PAGE_ALLOC_COSTLY_ORDER) never fail. This is a long ago > decision which would be tricky to fix now without silently breaking a > lot of code. Sad... Wow. We have *always* been told memory allocations are not guaranteed to succeed, ever, unless __GFP_NOFAIL is set, but that's deprecated and nobody is allowed to use it any more. Lots of code has dependencies on memory allocation making progress or failing for the system to work in low memory situations. The page cache is one of them, which means all filesystems have that dependency. We don't explicitly ask memory allocations to fail, we *expect* the memory allocation failures will occur in low memory conditions. 
We've been designing and writing code with this in mind for the past 15 years.

How did we get so far away from the message of "the memory allocator never guarantees success" that it will never fail to allocate memory even if it means we livelock the entire system?

> Nevertheless the caller can prevent an endless loop by using
> __GFP_NORETRY so this could be used as a workaround.

That's just a never-ending game of whack-a-mole that we will continually lose. It's not a workable solution.

> The default should be the opposite IMO and only those who really
> require some guarantee should use a special flag for that purpose.

Yup, totally agree.

Cheers, Dave.
--
Dave Chinner
david@fromorbit.com
* Re: How to handle TIF_MEMDIE stalls? 2014-12-22 21:30 ` Dave Chinner @ 2014-12-23 9:41 ` Johannes Weiner 2014-12-24 1:06 ` Dave Chinner 0 siblings, 1 reply; 276+ messages in thread From: Johannes Weiner @ 2014-12-23 9:41 UTC (permalink / raw) To: Dave Chinner Cc: Michal Hocko, Tetsuo Handa, dchinner, linux-mm, rientjes, oleg, Andrew Morton, Linus Torvalds On Tue, Dec 23, 2014 at 08:30:58AM +1100, Dave Chinner wrote: > On Mon, Dec 22, 2014 at 05:57:36PM +0100, Michal Hocko wrote: > > On Mon 22-12-14 07:42:49, Dave Chinner wrote: > > [...] > > > "memory reclaim gave up"? So why the hell isn't it returning a > > > failure to the caller? > > > > > > i.e. We have a perfectly good page cache allocation failure error > > > path here all the way back to userspace, but we're invoking the > > > OOM-killer to kill random processes rather than returning ENOMEM to > > > the processes that are generating the memory demand? > > > > > > Further: when did the oom-killer become the primary method > > > of handling situations when memory allocation needs to fail? > > > __GFP_WAIT does *not* mean memory allocation can't fail - that's what > > > __GFP_NOFAIL means. And none of the page cache allocations use > > > __GFP_NOFAIL, so why aren't we getting an allocation failure before > > > the oom-killer is kicked? > > > > Well, it has been an unwritten rule that GFP_KERNEL allocations for > > low-order (<=PAGE_ALLOC_COSTLY_ORDER) never fail. This is a long ago > > decision which would be tricky to fix now without silently breaking a > > lot of code. Sad... > > Wow. > > We have *always* been told memory allocations are not guaranteed to > succeed, ever, unless __GFP_NOFAIL is set, but that's deprecated and > nobody is allowed to use it any more. > > Lots of code has dependencies on memory allocation making progress > or failing for the system to work in low memory situations. The page > cache is one of them, which means all filesystems have that > dependency. 
> We don't explicitly ask memory allocations to fail, we
> *expect* the memory allocation failures will occur in low memory
> conditions. We've been designing and writing code with this in mind
> for the past 15 years.
>
> How did we get so far away from the message of "the memory allocator
> never guarantees success" that it will never fail to allocate memory
> even if it means we livelock the entire system?

I think this isn't as much an allocation guarantee as it is based on the thought that once we can't satisfy such low orders anymore the system is so entirely unusable that the only remaining thing to do is to kill processes one by one until the situation is resolved. Hard to say, though, because this has been the behavior for longer than the initial git import of the tree, without any code comment.

And yes, it's flawed, because the allocating task looping might be what's holding up progress, as we can see here.

> > Nevertheless the caller can prevent an endless loop by using
> > __GFP_NORETRY so this could be used as a workaround.
>
> That's just a never-ending game of whack-a-mole that we will
> continually lose. It's not a workable solution.

Agreed.

> > The default should be the opposite IMO and only those who really
> > require some guarantee should use a special flag for that purpose.
>
> Yup, totally agree.

So how about something like the following change? It restricts the allocator's endless OOM killing loop to __GFP_NOFAIL contexts, which are annotated at the call site and thus easier to review for locks etc. Otherwise, the allocator tries only as long as page reclaim makes progress, the idea being that failures are handled gracefully in the call sites, and page faults restart automatically anyway. The OOM killing in that case is deferred to the end of the exception handler. Preliminary testing confirms that the system is indeed trying just as hard before OOM killing in the page fault case.
However, it doesn't look like all callsites are prepared for failing smaller allocations: [ 55.553822] Out of memory: Kill process 240 (anonstress) score 158 or sacrifice child [ 55.561787] Killed process 240 (anonstress) total-vm:1540044kB, anon-rss:1284068kB, file-rss:468kB [ 55.571083] BUG: unable to handle kernel paging request at 00000000004006bd [ 55.578156] IP: [<00000000004006bd>] 0x4006bd [ 55.582584] PGD c8f3f067 PUD c8f48067 PMD c8f15067 PTE 0 [ 55.588016] Oops: 0014 [#1] SMP [ 55.591337] CPU: 1 PID: 240 Comm: anonstress Not tainted 3.18.0-mm1-00081-gf6137925fc97-dirty #188 [ 55.600435] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./H61M-DGS, BIOS P1.30 05/10/2012 [ 55.610030] task: ffff8802139b9a10 ti: ffff8800c8f64000 task.ti: ffff8800c8f64000 [ 55.617623] RIP: 0033:[<00000000004006bd>] [<00000000004006bd>] 0x4006bd [ 55.624512] RSP: 002b:00007fffd43b7220 EFLAGS: 00010206 [ 55.629901] RAX: 00007f87e6e01000 RBX: 0000000000000000 RCX: 00007f87f64fe25a [ 55.637104] RDX: 00007f879881a000 RSI: 000000005dc00000 RDI: 0000000000000000 [ 55.644331] RBP: 00007fffd43b7240 R08: 00000000ffffffff R09: 0000000000000000 [ 55.651569] R10: 0000000000000022 R11: 0000000000000283 R12: 0000000000400570 [ 55.658796] R13: 00007fffd43b7340 R14: 0000000000000000 R15: 0000000000000000 [ 55.666040] FS: 00007f87f69d1700(0000) GS:ffff88021f280000(0000) knlGS:0000000000000000 [ 55.674221] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 55.680055] CR2: 00007fdd676ad480 CR3: 00000000c8f3e000 CR4: 00000000000407e0 [ 55.687272] [ 55.688780] RIP [<00000000004006bd>] 0x4006bd [ 55.693304] RSP <00007fffd43b7220> [ 55.696850] CR2: 00000000004006bd [ 55.700207] ---[ end trace b9cb4f44f8e47bc3 ]--- [ 55.704903] Kernel panic - not syncing: Fatal exception [ 55.710208] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff) [ 55.720517] Rebooting in 30 seconds.. 
Obvious bugs aside, though, the thought of failing order-0 allocations after such a long time is scary... --- ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2014-12-23 9:41 ` Johannes Weiner @ 2014-12-24 1:06 ` Dave Chinner 2014-12-24 2:40 ` Linus Torvalds 0 siblings, 1 reply; 276+ messages in thread From: Dave Chinner @ 2014-12-24 1:06 UTC (permalink / raw) To: Johannes Weiner Cc: Michal Hocko, Tetsuo Handa, dchinner, linux-mm, rientjes, oleg, Andrew Morton, Linus Torvalds On Tue, Dec 23, 2014 at 04:41:32AM -0500, Johannes Weiner wrote: > On Tue, Dec 23, 2014 at 08:30:58AM +1100, Dave Chinner wrote: > > On Mon, Dec 22, 2014 at 05:57:36PM +0100, Michal Hocko wrote: > > > On Mon 22-12-14 07:42:49, Dave Chinner wrote: > > > [...] > > > > "memory reclaim gave up"? So why the hell isn't it returning a > > > > failure to the caller? > > > > > > > > i.e. We have a perfectly good page cache allocation failure error > > > > path here all the way back to userspace, but we're invoking the > > > > OOM-killer to kill random processes rather than returning ENOMEM to > > > > the processes that are generating the memory demand? > > > > > > > > Further: when did the oom-killer become the primary method > > > > of handling situations when memory allocation needs to fail? > > > > __GFP_WAIT does *not* mean memory allocation can't fail - that's what > > > > __GFP_NOFAIL means. And none of the page cache allocations use > > > > __GFP_NOFAIL, so why aren't we getting an allocation failure before > > > > the oom-killer is kicked? > > > > > > Well, it has been an unwritten rule that GFP_KERNEL allocations for > > > low-order (<=PAGE_ALLOC_COSTLY_ORDER) never fail. This is a long ago > > > decision which would be tricky to fix now without silently breaking a > > > lot of code. Sad... > > > > Wow. > > > > We have *always* been told memory allocations are not guaranteed to > > succeed, ever, unless __GFP_NOFAIL is set, but that's deprecated and > > nobody is allowed to use it any more. 
> > > > Lots of code has dependencies on memory allocation making progress > > or failing for the system to work in low memory situations. The page > > cache is one of them, which means all filesystems have that > > dependency. We don't explicitly ask memory allocations to fail, we > > *expect* the memory allocation failures will occur in low memory > > conditions. We've been designing and writing code with this in mind > > for the past 15 years. > > > > How did we get so far away from the message of "the memory allocator > > never guarantees success" that it will never fail to allocate memory > > even if it means we livelock the entire system? > > I think this isn't as much an allocation guarantee as it is based on > the thought that once we can't satisfy such low orders anymore the > system is so entirely unusable that the only remaining thing to do is > to kill processes one by one until the situation is resolved. > > Hard to say, though, because this has been the behavior for longer > than the initial git import of the tree, without any code comment. > > And yes, it's flawed, because the allocating task looping might be > what's holding up progress, as we can see here. Worse, it can be the task that is consuming all the memory, as can be seen by this failure on xfs/084 on my single-CPU, 1GB RAM VM. 
This test has been failing like this about 30% of the time since 3.18-rc1: [ 4083.059309] Mem-Info: [ 4083.059693] Node 0 DMA per-cpu: [ 4083.060246] CPU 0: hi: 0, btch: 1 usd: 0 [ 4083.061041] Node 0 DMA32 per-cpu: [ 4083.061612] CPU 0: hi: 186, btch: 31 usd: 50 [ 4083.062407] active_anon:119604 inactive_anon:119575 isolated_anon:0 [ 4083.062407] active_file:29 inactive_file:58 isolated_file:0 [ 4083.062407] unevictable:0 dirty:0 writeback:0 unstable:0 [ 4083.062407] free:1953 slab_reclaimable:2881 slab_unreclaimable:2484 [ 4083.062407] mapped:27 shmem:2 pagetables:928 bounce:0 [ 4083.062407] free_cma:0 [ 4083.067475] Node 0 DMA free:3924kB min:60kB low:72kB high:88kB active_anon:5612kB inactive_anon:5792kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(as [ 4083.073986] lowmem_reserve[]: 0 966 966 966 [ 4083.074808] Node 0 DMA32 free:3888kB min:3944kB low:4928kB high:5916kB active_anon:472804kB inactive_anon:472508kB active_file:116kB inactive_file:232kB unevictabls [ 4083.081570] lowmem_reserve[]: 0 0 0 0 [ 4083.082268] Node 0 DMA: 7*4kB (U) 9*8kB (UM) 7*16kB (UM) 4*32kB (U) 4*64kB (U) 2*128kB (U) 2*256kB (UM) 1*512kB (M) 0*1024kB 1*2048kB (R) 0*4096kB = 3924kB [ 4083.084829] Node 0 DMA32: 16*4kB (U) 0*8kB 1*16kB (R) 1*32kB (R) 1*64kB (R) 1*128kB (R) 0*256kB 1*512kB (R) 1*1024kB (R) 1*2048kB (R) 0*4096kB = 3888kB [ 4083.087287] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 4083.088657] 47956 total pagecache pages [ 4083.089275] 47858 pages in swap cache [ 4083.089856] Swap cache stats: add 416328, delete 368470, find 818589/929518 [ 4083.090941] Free swap = 0kB [ 4083.091398] Total swap = 497976kB [ 4083.091923] 262044 pages RAM [ 4083.092405] 0 pages HighMem/MovableOnly [ 4083.093016] 10167 pages reserved [ 4083.093528] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name [ 4083.094749] [ 1195] 0 1195 5992 24 16 152 -1000 udevd [ 4083.095981] [ 1326] 0 1326 5991 50 15 128 -1000 udevd [ 4083.097224] [ 
3835] 0 3835 2529 0 6 573 -1000 dhclient [ 4083.098497] [ 3886] 0 3886 13099 0 27 153 -1000 sshd [ 4083.099716] [ 3892] 0 3892 25770 1 52 233 -1000 sshd [ 4083.100939] [ 3970] 1000 3970 25770 8 50 227 -1000 sshd [ 4083.102164] [ 3971] 1000 3971 5276 1 14 493 -1000 bash [ 4083.103386] [ 4062] 0 4062 16887 1 36 118 -1000 sudo [ 4083.104667] [ 4063] 0 4063 3044 192 10 162 -1000 check [ 4083.105952] [ 6708] 0 6708 5991 35 15 143 -1000 udevd [ 4083.107244] [18113] 0 18113 2584 1 9 288 -1000 084 [ 4083.108517] [18317] 0 18317 316605 191037 623 121971 -1000 resvtest [ 4083.109852] [18318] 0 18318 2584 0 9 288 -1000 084 [ 4083.111117] [18319] 0 18319 2584 0 9 288 -1000 084 [ 4083.112431] [18320] 0 18320 3258 0 11 36 -1000 sed [ 4083.113692] [18321] 0 18321 3258 0 11 36 -1000 sed [ 4083.114950] Kernel panic - not syncing: Out of memory and no killable processes... [ 4083.114950] [ 4083.116420] CPU: 0 PID: 18317 Comm: resvtest Not tainted 3.19.0-rc1-dgc+ #650 [ 4083.116423] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 [ 4083.116423] ffffffff823357a0 ffff88003d98faa8 ffffffff81d87acb 0000000000008686 [ 4083.116423] ffffffff8219b348 ffff88003d98fb28 ffffffff81d813c1 000000000000000b [ 4083.116423] 0000000000000008 ffff88003d98fb38 ffff88003d98fad8 0000000000000000 [ 4083.116423] Call Trace: [ 4083.116423] [<ffffffff81d87acb>] dump_stack+0x45/0x57 [ 4083.116423] [<ffffffff81d813c1>] panic+0xc1/0x1eb [ 4083.116423] [<ffffffff81174dea>] out_of_memory+0x4fa/0x500 [ 4083.116423] [<ffffffff81179969>] __alloc_pages_nodemask+0x7a9/0x8a0 [ 4083.116423] [<ffffffff811b8c77>] alloc_pages_vma+0x97/0x160 [ 4083.116423] [<ffffffff8119b0c3>] handle_mm_fault+0x963/0xc20 [ 4083.116423] [<ffffffff814ec802>] ? xfs_file_buffered_aio_write+0x1e2/0x240 [ 4083.116423] [<ffffffff8108bf24>] __do_page_fault+0x1b4/0x570 [ 4083.116423] [<ffffffff8119f5e1>] ? vma_merge+0x211/0x330 [ 4083.116423] [<ffffffff811a0808>] ? 
do_brk+0x268/0x350 [ 4083.116423] [<ffffffff8108c395>] trace_do_page_fault+0x45/0x100 [ 4083.116423] [<ffffffff8108778e>] do_async_page_fault+0x1e/0xd0 [ 4083.116423] [<ffffffff81d946f8>] async_page_fault+0x28/0x30 [ 4083.116423] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff) This needs to fail the allocation so that the process consuming all the memory fails the page fault and SEGVs. Otherwise the OOM-killer just runs wild killing everything else in the system until there's nothing left to kill and the system panics. > > > The default should be opposite IMO and only those who really > > > require some guarantee should use a special flag for that purpose. > > > > Yup, totally agree. > > So how about something like the following change? It restricts the > allocator's endless OOM killing loop to __GFP_NOFAIL contexts, which > are annotated in the callsite and thus easier to review for locks etc. > Otherwise, the allocator tries only as long as page reclaim makes > progress, the idea being that failures are handled gracefully in the > callsites, and page faults restarting automatically anyway. The OOM > killing in that case is deferred to the end of the exception handler. > > Preliminary testing confirms that the system is indeed trying just as > hard before OOM killing in the page fault case. However, it doesn't > look like all callsites are prepared for failing smaller allocations: Then we need to fix those bugs. > [ 55.553822] Out of memory: Kill process 240 (anonstress) score 158 or sacrifice child > [ 55.561787] Killed process 240 (anonstress) total-vm:1540044kB, anon-rss:1284068kB, file-rss:468kB > [ 55.571083] BUG: unable to handle kernel paging request at 00000000004006bd > [ 55.578156] IP: [<00000000004006bd>] 0x4006bd That's an offset of >4MB from a null pointer. Doesn't seem likely that it's caused by a failure of a order 0 allocation. The lack of a stack trace is worrying, though.... 
> Obvious bugs aside, though, the thought of failing order-0 allocations > after such a long time is scary... The reliance on the OOM-killer to save the system from memory starvation when users put the page cache under pressure via write(2) is even scarier, IMO. > --- > From 0b204ee379aa5502a1c4dce5df51de96448b5163 Mon Sep 17 00:00:00 2001 > From: Johannes Weiner <hannes@cmpxchg.org> > Date: Mon, 22 Dec 2014 17:16:43 -0500 > Subject: [patch] mm: page_alloc: avoid page allocation vs. OOM killing > deadlock Remind me to test whatever you've come up with in a couple of weeks after the xmas break, though it's more likely to be late January before I'll get to it given LCA will be keeping me busy in the new year... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org 
* Re: How to handle TIF_MEMDIE stalls? 2014-12-24 1:06 ` Dave Chinner @ 2014-12-24 2:40 ` Linus Torvalds 0 siblings, 0 replies; 276+ messages in thread From: Linus Torvalds @ 2014-12-24 2:40 UTC (permalink / raw) To: Dave Chinner Cc: Johannes Weiner, Michal Hocko, Tetsuo Handa, Dave Chinner, linux-mm, David Rientjes, Oleg Nesterov, Andrew Morton On Tue, Dec 23, 2014 at 5:06 PM, Dave Chinner <david@fromorbit.com> wrote: > > Worse, it can be the task that is consuming all the memory, as canbe > seen by this failure on xfs/084 on my single CPU. 1GB RAM VM. This > test has been failing like this about 30% of the time since 3.18-rc1: Quite frankly, if you can reliably handle memory allocation failures and they won't cause problems for other processes, you should use GFP_USER, not GFP_KERNEL. GFP_KERNEL does mean "try really hard". That has *always* been true. We used to have a __GFP_HIGH set in GFP_KERNEL exactly for that reason. We seem to have lost that distinction between GFP_USER and GFP_KERNEL long ago, and then re-grew it in a weaker form as GFP_HARDWALL. That may be part of the problem: the kernel cannot easily distinguish between "we should try really hard to satisfy this allocation" and "we can easily fail it". Maybe we could just use that GFP_HARDWALL bit for it. Possibly rename it, but for *testing* it somebody could try this trivial/minimal test-patch. 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 7633c503a116..7cacd45b47ce 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2307,6 +2307,10 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order, if (!did_some_progress && pm_suspended_storage()) return 0; + /* GFP_USER allocations don't re-try */ + if (gfp_mask & __GFP_HARDWALL) + return 0; + /* * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER * means __GFP_NOFAIL, but that may not be true in other which is intentionally whitespace-damaged, because it really is meant as a "this is a starting point for experimentation by VM people" rather than as an "apply this patch and you're good to go" patch. Hmm? Linus 
* Re: How to handle TIF_MEMDIE stalls? 2014-12-21 8:45 ` Tetsuo Handa 2014-12-21 20:42 ` Dave Chinner @ 2014-12-29 18:19 ` Michal Hocko 2014-12-30 6:42 ` Tetsuo Handa 1 sibling, 1 reply; 276+ messages in thread From: Michal Hocko @ 2014-12-29 18:19 UTC (permalink / raw) To: Tetsuo Handa Cc: david, dchinner, linux-mm, rientjes, oleg, Andrew Morton, Mel Gorman, Johannes Weiner, Linus Torvalds On Sun 21-12-14 17:45:32, Tetsuo Handa wrote: [...] > Traces from uptime > 484 seconds of > http://I-love.SAKURA.ne.jp/tmp/serial-20141221.txt.xz is a stalled case. [ 548.449780] Out of memory: Kill process 12718 (a.out) score 890 or sacrifice child [...] [ 954.595576] a.out D ffff8800764918a0 0 12718 1 0x00100084 [ 954.597544] ffff880077d7fca8 0000000000000086 ffff880076491470 ffff880077d7ffd8 [ 954.599565] 0000000000013640 0000000000013640 ffff8800358c8210 ffff880076491470 [ 954.601634] 0000000000000000 ffff88007c8a3e48 ffff88007c8a3e4c ffff880076491470 [ 954.604091] Call Trace: [ 954.607766] [<ffffffff81618669>] schedule_preempt_disabled+0x29/0x70 [ 954.609792] [<ffffffff8161a555>] __mutex_lock_slowpath+0xb5/0x120 [ 954.611644] [<ffffffff8161a5e3>] mutex_lock+0x23/0x37 [ 954.613256] [<ffffffffa025fb47>] xfs_file_buffered_aio_write.isra.9+0x77/0x270 [xfs] [...] 
and it seems that it is blocked by another allocator: [ 957.178207] a.out R running task 0 12804 1 0x00000084 [ 957.180304] MemAlloc: 471962 jiffies on 0x10 [ 957.181738] ffff8800355df868 0000000000000086 ffff88007be98940 ffff8800355dffd8 [ 957.183831] 0000000000013640 0000000000013640 ffff88007c4174b0 ffff88007be98940 [ 957.185916] 0000000000000000 ffff8800355df940 0000000000000000 ffffffff81a621e8 [ 957.188067] Call Trace: [ 957.189130] [<ffffffff81618509>] _cond_resched+0x29/0x40 [ 957.190790] [<ffffffff8117752a>] shrink_slab+0x17a/0x1d0 [ 957.192384] [<ffffffff8117a330>] do_try_to_free_pages+0x280/0x450 [ 957.194117] [<ffffffff8117a5da>] try_to_free_pages+0xda/0x170 [ 957.195800] [<ffffffff8116db23>] __alloc_pages_nodemask+0x633/0xa50 [ 957.197615] [<ffffffff811b1ce7>] alloc_pages_current+0x97/0x110 [ 957.199314] [<ffffffff81164797>] __page_cache_alloc+0xa7/0xc0 [ 957.201026] [<ffffffff811652b0>] pagecache_get_page+0x70/0x1e0 [ 957.202724] [<ffffffff81165453>] grab_cache_page_write_begin+0x33/0x50 [ 957.204546] [<ffffffffa0252cb4>] xfs_vm_write_begin+0x34/0xe0 [xfs] but this task managed to make some progress because we can clearly see that pid 12718 (oom victim) managed to move on and get to the OOM killer many times [ 961.062042] a.out(12718) the OOM killer was skipped for 1965000 times. [...] [ 983.140589] a.out(12718) the OOM killer was skipped for 2059000 times. This shouldn't happen for the xfs pagecache allocations because they all should be !__GFP_FS and we do not trigger the OOM killer in that case and fail instead. But as already pointed out by Dave, grab_cache_page_write_begin uses GFP_KERNEL for the radix tree allocation and that would trigger the OOM killer. The rest is our hopeless attempt to not fail the allocation. I believe that the patch from http://marc.info/?l=linux-mm&m=141987483503279 should help in this particular case. There are still other cases where we can livelock but this seems to be a clear bug in grab_cache_page_write_begin. 
-- Michal Hocko SUSE Labs 
* Re: How to handle TIF_MEMDIE stalls? 2014-12-29 18:19 ` Michal Hocko @ 2014-12-30 6:42 ` Tetsuo Handa 2014-12-30 11:21 ` Michal Hocko 0 siblings, 1 reply; 276+ messages in thread From: Tetsuo Handa @ 2014-12-30 6:42 UTC (permalink / raw) To: mhocko Cc: david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, hannes, torvalds Michal Hocko wrote: > but this task managed to make some progress because we can clearly see > that pid 12718 (oom victim) managed to move on and get to OOM killer > many times > [ 961.062042] a.out(12718) the OOM killer was skipped for 1965000 times. > [...] > [ 983.140589] a.out(12718) the OOM killer was skipped for 2059000 times. > Excuse me for the confusing message. The a.out(12718) printed here is not the caller of the OOM killer but the victim keeping the OOM killer disabled. Thus, this task could not manage to make some progress and I called it "a stalled case". > There are still other cases where we can livelock but > this seems to be a clear bug in grab_cache_page_write_begin. We might want to discuss the below case as a separate topic, but it is a TIF_MEMDIE stall anyway. I retested using 3.19-rc2 with the diff shown below. If I start a.out and b.out (where b.out is a copy of a.out) with a slight delay (a few deciseconds), I can observe that a.out is unable to die due to b.out asking for memory or holding a lock. http://I-love.SAKURA.ne.jp/tmp/serial-20141230-ab-1.txt.xz is a case where I think a.out keeps the OOM killer disabled and http://I-love.SAKURA.ne.jp/tmp/serial-20141230-ab-2.txt.xz is a case where I think a.out cannot die within a reasonable duration due to b.out . I don't know whether cgroups can help or not, but I think we need to be prepared for cases where sending SIGKILL to all threads sharing the same memory does not help. 
---------- diff start ---------- mm-get-rid-of-radix-tree-gfp-mask-for-pagecache_get_page-was-re-how-to-handle-tif_memdie-stalls.patch oom-dont-count-on-mm-less-current-process.patch oom-make-sure-that-tif_memdie-is-set-under-task_lock.patch my patch for debug printk() on memory allocation stall my patch for boot failure by bd809af16e3ab1f8 "x86: Enable PAT to use cache mode translation tables" diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c index a97ee08..cab1578 100644 --- a/arch/x86/mm/init.c +++ b/arch/x86/mm/init.c @@ -718,9 +718,6 @@ void __init zone_sizes_init(void) void update_cache_mode_entry(unsigned entry, enum page_cache_mode cache) { - /* entry 0 MUST be WB (hardwired to speed up translations) */ - BUG_ON(!entry && cache != _PAGE_CACHE_MODE_WB); - __cachemode2pte_tbl[cache] = __cm_idx2pte(entry); __pte2cachemode_tbl[entry] = cache; } diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 7ea069c..4b3736f 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -251,7 +251,7 @@ pgoff_t page_cache_prev_hole(struct address_space *mapping, #define FGP_NOWAIT 0x00000020 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset, - int fgp_flags, gfp_t cache_gfp_mask, gfp_t radix_gfp_mask); + int fgp_flags, gfp_t cache_gfp_mask); /** * find_get_page - find and get a page reference @@ -266,13 +266,13 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset, static inline struct page *find_get_page(struct address_space *mapping, pgoff_t offset) { - return pagecache_get_page(mapping, offset, 0, 0, 0); + return pagecache_get_page(mapping, offset, 0, 0); } static inline struct page *find_get_page_flags(struct address_space *mapping, pgoff_t offset, int fgp_flags) { - return pagecache_get_page(mapping, offset, fgp_flags, 0, 0); + return pagecache_get_page(mapping, offset, fgp_flags, 0); } /** @@ -292,7 +292,7 @@ static inline struct page *find_get_page_flags(struct address_space *mapping, 
static inline struct page *find_lock_page(struct address_space *mapping, pgoff_t offset) { - return pagecache_get_page(mapping, offset, FGP_LOCK, 0, 0); + return pagecache_get_page(mapping, offset, FGP_LOCK, 0); } /** @@ -319,7 +319,7 @@ static inline struct page *find_or_create_page(struct address_space *mapping, { return pagecache_get_page(mapping, offset, FGP_LOCK|FGP_ACCESSED|FGP_CREAT, - gfp_mask, gfp_mask & GFP_RECLAIM_MASK); + gfp_mask); } /** @@ -340,8 +340,7 @@ static inline struct page *grab_cache_page_nowait(struct address_space *mapping, { return pagecache_get_page(mapping, index, FGP_LOCK|FGP_CREAT|FGP_NOFS|FGP_NOWAIT, - mapping_gfp_mask(mapping), - GFP_NOFS); + mapping_gfp_mask(mapping)); } struct page *find_get_entry(struct address_space *mapping, pgoff_t offset); diff --git a/include/linux/sched.h b/include/linux/sched.h index 8db31ef..69d367f 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1701,6 +1701,14 @@ struct task_struct { #ifdef CONFIG_DEBUG_ATOMIC_SLEEP unsigned long task_state_change; #endif + /* Jiffies spent since the start of outermost memory allocation */ + unsigned long gfp_start; + /* GFP flags passed to innermost memory allocation */ + gfp_t gfp_flags; + /* # of shrink_slab() calls since outermost memory allocation. */ + unsigned int shrink_slab_counter; + /* # of OOM-killer skipped. */ + atomic_t oom_killer_skip_counter; }; /* Future-safe accessor for struct task_struct's cpus_allowed. */ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index b5797b7..e7fc702 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4502,6 +4502,22 @@ out_unlock: return retval; } +static void print_memalloc_info(const struct task_struct *p) +{ + const gfp_t gfp = p->gfp_flags & __GFP_WAIT; + + /* + * __alloc_pages_nodemask() doesn't use smp_wmb() between + * updating ->gfp_start and ->gfp_flags. But reading stale + * ->gfp_start value harms nothing but printing bogus duration. 
+ * Correct duration will be printed when this function is + * called for the next time. + */ + if (unlikely(gfp)) + printk(KERN_INFO "MemAlloc: %ld jiffies on 0x%x\n", + jiffies - p->gfp_start, gfp); +} + static const char stat_nam[] = TASK_STATE_TO_CHAR_STR; void sched_show_task(struct task_struct *p) @@ -4536,6 +4552,7 @@ void sched_show_task(struct task_struct *p) task_pid_nr(p), ppid, (unsigned long)task_thread_info(p)->flags); + print_memalloc_info(p); print_worker_info(KERN_INFO, p); show_stack(p, NULL); } diff --git a/mm/filemap.c b/mm/filemap.c index bd8543c..673e458 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1046,8 +1046,7 @@ EXPORT_SYMBOL(find_lock_entry); * @mapping: the address_space to search * @offset: the page index * @fgp_flags: PCG flags - * @cache_gfp_mask: gfp mask to use for the page cache data page allocation - * @radix_gfp_mask: gfp mask to use for radix tree node allocation + * @gfp_mask: gfp mask to use for the page cache data page allocation * * Looks up the page cache slot at @mapping & @offset. * @@ -1056,11 +1055,9 @@ EXPORT_SYMBOL(find_lock_entry); * FGP_ACCESSED: the page will be marked accessed * FGP_LOCK: Page is return locked * FGP_CREAT: If page is not present then a new page is allocated using - * @cache_gfp_mask and added to the page cache and the VM's LRU - * list. If radix tree nodes are allocated during page cache - * insertion then @radix_gfp_mask is used. The page is returned - * locked and with an increased refcount. Otherwise, %NULL is - * returned. + * @gfp_mask and added to the page cache and the VM's LRU + * list. The page is returned locked and with an increased + * refcount. Otherwise, %NULL is returned. * * If FGP_LOCK or FGP_CREAT are specified then the function may sleep even * if the GFP flags specified for FGP_CREAT are atomic. @@ -1068,7 +1065,7 @@ EXPORT_SYMBOL(find_lock_entry); * If there is a page cache page, it is returned with an increased refcount. 
*/ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset, - int fgp_flags, gfp_t cache_gfp_mask, gfp_t radix_gfp_mask) + int fgp_flags, gfp_t gfp_mask) { struct page *page; @@ -1105,13 +1102,11 @@ no_page: if (!page && (fgp_flags & FGP_CREAT)) { int err; if ((fgp_flags & FGP_WRITE) && mapping_cap_account_dirty(mapping)) - cache_gfp_mask |= __GFP_WRITE; - if (fgp_flags & FGP_NOFS) { - cache_gfp_mask &= ~__GFP_FS; - radix_gfp_mask &= ~__GFP_FS; - } + gfp_mask |= __GFP_WRITE; + if (fgp_flags & FGP_NOFS) + gfp_mask &= ~__GFP_FS; - page = __page_cache_alloc(cache_gfp_mask); + page = __page_cache_alloc(gfp_mask); if (!page) return NULL; @@ -1122,7 +1117,8 @@ no_page: if (fgp_flags & FGP_ACCESSED) __SetPageReferenced(page); - err = add_to_page_cache_lru(page, mapping, offset, radix_gfp_mask); + err = add_to_page_cache_lru(page, mapping, offset, + gfp_mask & GFP_RECLAIM_MASK); if (unlikely(err)) { page_cache_release(page); page = NULL; @@ -2443,8 +2439,7 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping, fgp_flags |= FGP_NOFS; page = pagecache_get_page(mapping, index, fgp_flags, - mapping_gfp_mask(mapping), - GFP_KERNEL); + mapping_gfp_mask(mapping)); if (page) wait_for_stable_page(page); diff --git a/mm/oom_kill.c b/mm/oom_kill.c index d503e9c..2f3ece1 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -304,6 +304,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, rcu_read_lock(); for_each_process_thread(g, p) { unsigned int points; + unsigned int count; switch (oom_scan_process_thread(p, totalpages, nodemask, force_kill)) { @@ -314,6 +315,14 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, case OOM_SCAN_CONTINUE: continue; case OOM_SCAN_ABORT: + count = atomic_inc_return(&p->oom_killer_skip_counter); + if (count % 1000 == 0) + printk(KERN_INFO "%s(pid=%d,flags=0x%x) " + "waited for %s(pid=%d,flags=0x%x) for " + "%u times at select_bad_process().\n", + current->comm, current->pid, 
+ current->gfp_flags, p->comm, p->pid, + p->gfp_flags, count); rcu_read_unlock(); return (struct task_struct *)(-1UL); case OOM_SCAN_OK: @@ -438,11 +447,22 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, * If the task is already exiting, don't alarm the sysadmin or kill * its children or threads, just set TIF_MEMDIE so it can die quickly */ - if (task_will_free_mem(p)) { + task_lock(p); + if (p->mm && task_will_free_mem(p)) { + unsigned int count = + atomic_inc_return(&p->oom_killer_skip_counter); + if (count % 1000 == 0) + printk(KERN_INFO "%s(pid=%d,flags=0x%x) waited for " + "%s(pid=%d,flags=0x%x) for %u times at " + "oom_kill_process().\n", current->comm, + current->pid, current->gfp_flags, p->comm, + p->pid, p->gfp_flags, count); set_tsk_thread_flag(p, TIF_MEMDIE); + task_unlock(p); put_task_struct(p); return; } + task_unlock(p); if (__ratelimit(&oom_rs)) dump_header(p, gfp_mask, order, memcg, nodemask); @@ -492,6 +512,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, /* mm cannot safely be dereferenced after task_unlock(victim) */ mm = victim->mm; + set_tsk_thread_flag(victim, TIF_MEMDIE); pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n", task_pid_nr(victim), victim->comm, K(victim->mm->total_vm), K(get_mm_counter(victim->mm, MM_ANONPAGES)), @@ -522,7 +543,6 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, } rcu_read_unlock(); - set_tsk_thread_flag(victim, TIF_MEMDIE); do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true); put_task_struct(victim); } @@ -643,8 +663,12 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, * If current has a pending SIGKILL or is exiting, then automatically * select it. The goal is to allow it to allocate so that it may * quickly exit and free its memory. + * + * But don't select if current has already released its mm and cleared + * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur. 
*/ - if (fatal_signal_pending(current) || task_will_free_mem(current)) { + if (current->mm && + (fatal_signal_pending(current) || task_will_free_mem(current))) { set_thread_flag(TIF_MEMDIE); return; } diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 7633c50..a3b0c5a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2877,6 +2877,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, unsigned int cpuset_mems_cookie; int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR; int classzone_idx; + const gfp_t old_gfp_flags = current->gfp_flags; + + if (!old_gfp_flags) { + current->gfp_start = jiffies; + current->shrink_slab_counter = 0; + } + current->gfp_flags = gfp_mask; gfp_mask &= gfp_allowed_mask; @@ -2885,7 +2892,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, might_sleep_if(gfp_mask & __GFP_WAIT); if (should_fail_alloc_page(gfp_mask, order)) - return NULL; + goto nopage; /* * Check the zones suitable for the gfp_mask contain at least one @@ -2893,7 +2900,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, * of GFP_THISNODE and a memoryless node */ if (unlikely(!zonelist->_zonerefs->zone)) - return NULL; + goto nopage; if (IS_ENABLED(CONFIG_CMA) && migratetype == MIGRATE_MOVABLE) alloc_flags |= ALLOC_CMA; @@ -2937,6 +2944,9 @@ out: if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie))) goto retry_cpuset; +nopage: + current->gfp_flags = old_gfp_flags; + return page; } EXPORT_SYMBOL(__alloc_pages_nodemask); diff --git a/mm/vmscan.c b/mm/vmscan.c index bd9a72b..7d736d6 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -368,6 +368,7 @@ unsigned long shrink_node_slabs(gfp_t gfp_mask, int nid, { struct shrinker *shrinker; unsigned long freed = 0; + const unsigned long start = jiffies; if (nr_scanned == 0) nr_scanned = SWAP_CLUSTER_MAX; @@ -397,6 +398,13 @@ unsigned long shrink_node_slabs(gfp_t gfp_mask, int nid, up_read(&shrinker_rwsem); out: + if (++current->shrink_slab_counter % 100000 == 0) + printk(KERN_INFO 
"%s(pid=%d,flags=0x%x) called " "shrink_slab() for %u times. This time freed " "%lu object and took %lu jiffies. Spent %lu " "jiffies till now.\n", current->comm, current->pid, current->gfp_flags, current->shrink_slab_counter, freed, jiffies - start, jiffies - current->gfp_start); cond_resched(); return freed; } ---------- diff end ---------- 
* Re: How to handle TIF_MEMDIE stalls? 2014-12-30 6:42 ` Tetsuo Handa @ 2014-12-30 11:21 ` Michal Hocko 2014-12-30 13:33 ` Tetsuo Handa ` (2 more replies) 0 siblings, 3 replies; 276+ messages in thread From: Michal Hocko @ 2014-12-30 11:21 UTC (permalink / raw) To: Tetsuo Handa Cc: david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, hannes, torvalds On Tue 30-12-14 15:42:56, Tetsuo Handa wrote: [...] > We might want to discuss below case as a separate topic, but is a TIF_MEMDIE > stall anyway. I retested using 3.19-rc2 with diff shown below. If I start > a.out and b.out (where b.out is a copy of a.out) with slight delay (a few > deciseconds), I can observe that the a.out is unable to die due to b.out > asking for memory or holding lock. > http://I-love.SAKURA.ne.jp/tmp/serial-20141230-ab-1.txt.xz is a case > where I think a.out keeps the OOM killer disabled and [ 53.748454] b.out invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0 [...] [ 53.807397] active_anon:448903 inactive_anon:2082 isolated_anon:0 [ 53.807397] active_file:0 inactive_file:9 isolated_file:0 [ 53.807397] unevictable:0 dirty:3 writeback:0 unstable:0 [ 53.807397] free:13079 slab_reclaimable:1227 slab_unreclaimable:4520 [ 53.807397] mapped:380 shmem:2151 pagetables:2059 bounce:0 [ 53.807397] free_cma:0 [...] [ 53.856598] Free swap = 0kB [ 53.857908] Total swap = 0kB [ 53.859218] 524157 pages RAM This situation looks quite hopeless. We cannot swap yet we have over 80% of memory occupied by anon memory. There is still around ~50M free and few pages in the reclaimable slab which should be sufficient to help TIF_MEMDIE to make some progress on the other hand. [ 54.380517] Out of memory: Kill process 3596 (a.out) score 719 or sacrifice child [ 54.382091] Killed process 3596 (a.out) total-vm:2166864kB, anon-rss:1383880kB, file-rss:4kB [...] 
[ 348.134718] a.out D ffff880036fefcb8 0 3596 1 0x00100084 [ 348.136616] ffff880036fefcb8 ffff880036fefc88 ffff88007c204550 00000000000130c0 [ 348.138645] ffff880036feffd8 00000000000130c0 ffff88007c204550 ffff880036fefcb8 [ 348.140657] ffff88007ca45248 ffff88007ca4524c ffff88007c204550 00000000ffffffff [ 348.142672] Call Trace: [ 348.143662] [<ffffffff815bddb4>] schedule_preempt_disabled+0x24/0x70 [ 348.145379] [<ffffffff815bfb65>] __mutex_lock_slowpath+0xb5/0x120 [ 348.147153] [<ffffffff815bfbee>] mutex_lock+0x1e/0x32 [ 348.148644] [<ffffffffa02463ca>] xfs_file_buffered_aio_write.isra.15+0x6a/0x200 [xfs] [ 348.150637] [<ffffffff8100d62f>] ? __switch_to+0x15f/0x580 [ 348.152209] [<ffffffffa02465dd>] xfs_file_write_iter+0x7d/0x120 [xfs] [ 348.153961] [<ffffffff81178009>] new_sync_write+0x89/0xd0 [ 348.155506] [<ffffffff811787f2>] vfs_write+0xb2/0x1f0 [ 348.157004] [<ffffffff8101b994>] ? do_audit_syscall_entry+0x64/0x70 [ 348.158715] [<ffffffff81179440>] SyS_write+0x50/0xc0 [ 348.160188] [<ffffffff810f9ffe>] ? __audit_syscall_exit+0x22e/0x2d0 and this is the case for most a.out and b.out threads basically because all of them contend on a single file. 
The holder of the lock right now seems to be: [ 355.559722] b.out R running task 0 3843 3724 0x00000080 [ 355.561700] MemAlloc: 21916 jiffies on 0x10 [ 355.563056] ffff88007c3f3808 ffff88007c3f37d8 ffff88007c3e4d60 00000000000130c0 [ 355.565346] ffff88007c3f3fd8 00000000000130c0 ffff88007c3e4d60 ffff880036f02b48 [ 355.567440] ffffffff81848588 0000000000000400 0000000000000000 ffff88007c3f39c8 [ 355.569517] Call Trace: [ 355.570557] [<ffffffff815bdc72>] _cond_resched+0x22/0x40 [ 355.572167] [<ffffffff811249f2>] shrink_node_slabs+0x242/0x310 [ 355.573846] [<ffffffff81127155>] shrink_zone+0x175/0x1c0 [ 355.575410] [<ffffffff81127590>] do_try_to_free_pages+0x1d0/0x3e0 [ 355.577339] [<ffffffff81127834>] try_to_free_pages+0x94/0xc0 [ 355.579015] [<ffffffff8111d4c5>] __alloc_pages_nodemask+0x535/0xaa0 [ 355.580759] [<ffffffff8115cf9c>] alloc_pages_current+0x8c/0x100 [ 355.582446] [<ffffffff811148f7>] __page_cache_alloc+0xa7/0xc0 [ 355.584092] [<ffffffff81115364>] pagecache_get_page+0x54/0x1b0 [ 355.585773] [<ffffffffa025d11e>] ? xfs_trans_commit+0x13e/0x230 [xfs] [ 355.587553] [<ffffffff811154e8>] grab_cache_page_write_begin+0x28/0x50 [ 355.589349] [<ffffffffa023b04f>] xfs_vm_write_begin+0x2f/0xe0 [xfs] [ 355.591096] [<ffffffff8111465c>] generic_perform_write+0xbc/0x1c0 [ 355.592816] [<ffffffffa024634f>] ? xfs_file_aio_write_checks+0xdf/0xf0 [xfs] [ 355.594718] [<ffffffffa024642f>] xfs_file_buffered_aio_write.isra.15+0xcf/0x200 [xfs] So it is trying to reclaim at least something but it will take some time for it to realize this will not fly. The allocation will fail eventually, though, because this is !__GFP_FS allocation and the same will apply to a.out waiting for the lock as well. 
$ grep "waited for.*select_bad_process" serial-20141230-ab-1.txt | sed 's@.*\((pid=.*waited for.*\) for.*@\1@' | sort | uniq -c 1 (pid=2,flags=0x2000d0) waited for a.out(pid=3596,flags=0x0) 809 (pid=3724,flags=0x280da) waited for a.out(pid=3596,flags=0x0) [ 351.915586] b.out R running task 0 3724 3572 0x00000080 [ 351.917619] MemAlloc: 29906 jiffies on 0x10 [ 351.919012] ffff88007b8d7948 ffff88007fffc6c0 ffff88007c5751b0 00000000000130c0 [ 351.921096] ffff88007b8d7fd8 00000000000130c0 ffff88007c5751b0 0000000000000000 [ 351.923228] 0000000000000000 00000000000280da 0000000000000002 0000000000000000 [ 351.925374] Call Trace: [ 351.926466] [<ffffffff815bdc72>] _cond_resched+0x22/0x40 [ 351.928073] [<ffffffff8111d477>] __alloc_pages_nodemask+0x4e7/0xaa0 [ 351.929828] [<ffffffff8115f302>] alloc_pages_vma+0x92/0x160 [ 351.931502] [<ffffffff8113fa11>] handle_mm_fault+0xbe1/0xed0 [ 351.933171] [<ffffffff815c2847>] ? native_iret+0x7/0x7 [ 351.934719] [<ffffffff8105502c>] __do_page_fault+0x1dc/0x5b0 [ 351.936412] [<ffffffff8111d125>] ? __alloc_pages_nodemask+0x195/0xaa0 [ 351.938191] [<ffffffff81055431>] do_page_fault+0x31/0x70 [ 351.939769] [<ffffffff815c3638>] page_fault+0x28/0x30 [ 351.941322] [<ffffffff812b1940>] ? __clear_user+0x20/0x50 [ 351.942921] [<ffffffff81139538>] iov_iter_zero+0x68/0x2f0 [ 351.944503] [<ffffffff8138a4e7>] read_iter_zero+0x47/0xb0 [ 351.946135] [<ffffffff81177f46>] new_sync_read+0x86/0xc0 [ 351.947703] [<ffffffff811791b3>] __vfs_read+0x13/0x50 [ 351.949216] [<ffffffff81179271>] vfs_read+0x81/0x140 [ 351.950757] [<ffffffff81179380>] SyS_read+0x50/0xc0 [ 351.952277] [<ffffffff810f9ffe>] ? __audit_syscall_exit+0x22e/0x2d0 [ 351.953995] [<ffffffff815c1c29>] system_call_fastpath+0x12/0x17 So the OOM blocked task is sitting in the page fault caused by clearing the user buffer. According to your debugging patch this should be GFP_HIGHUSER_MOVABLE | __GFP_ZERO allocation which is the case where we retry without failing most of the time. 
I am not familiar with the VFS code much but it seems we are not sitting on any locks that would block the OOM victim later on (I am not entirely sure about FDPUT_POS_UNLOCK from fdget_pos but all tasks are past this calling it without blocking so it shouldn't matter). So even if the page fault failed with ENOMEM it wouldn't help us much here. That being said this doesn't look like a live lock or a lockup. System should recover from this state but it might take a lot of time (there are hundreds of tasks waiting on the i_mutex lock, each will try to allocate and fail and OOM victims will have to get out of the kernel and die). I am not sure we can do much about that from the allocator POV. A possible way would be refraining from the reclaim efforts when it is clear that nothing is really reclaimable. But I suspect this would be tricky to get right. > http://I-love.SAKURA.ne.jp/tmp/serial-20141230-ab-2.txt.xz is a case [ 44.588785] Out of memory: Kill process 3599 (a.out) score 773 or sacrifice child [ 44.590418] Killed process 3599 (a.out) total-vm:2166864kB, anon-rss:1488688kB, file-rss:4kB [...] [ 44.640689] a.out: page allocation failure: order:0, mode:0x280da [ 44.640690] CPU: 2 PID: 3599 Comm: a.out Not tainted 3.19.0-rc2+ #20 [...] [ 44.641125] a.out: page allocation failure: order:0, mode:0x2015a [ 44.641126] CPU: 2 PID: 3599 Comm: a.out Not tainted 3.19.0-rc2+ #20 So the OOM victim is failing the allocation because we prevent endless loops in the allocator for TIF_MEMDIE tasks and then it dies (it is not among Sysrq+t output AFAICS). We still have to wait for all the tasks sharing mm with it. 
many of them are in: [ 402.300859] a.out x ffff88007be53ce8 0 3601 1 0x00000086 [ 402.303407] ffff88007be53ce8 ffff88007c962450 ffff880078d10e60 00000000000130c0 [ 402.305478] ffff88007be53fd8 00000000000130c0 ffff880078d10e60 ffff880078d114a8 [ 402.307519] ffff880078d114a8 ffff880078d11170 ffff88007c0a9220 ffff880078d10e60 [ 402.309547] Call Trace: [ 402.310551] [<ffffffff815bd8c4>] schedule+0x24/0x70 [ 402.312040] [<ffffffff8106a4ea>] do_exit+0x6ba/0xb10 [ 402.313531] [<ffffffff8106b7da>] do_group_exit+0x3a/0xa0 [ 402.315082] [<ffffffff81075de8>] get_signal+0x188/0x690 [ 402.316629] [<ffffffff815bd43a>] ? __schedule+0x27a/0x6e0 [ 402.318196] [<ffffffff8100e4f2>] do_signal+0x32/0x750 [ 402.319744] [<ffffffffa02611c4>] ? _xfs_log_force_lsn+0xc4/0x2f0 [xfs] [ 402.321729] [<ffffffffa0245489>] ? xfs_file_fsync+0x159/0x1b0 [xfs] [ 402.323461] [<ffffffff8100ec5c>] do_notify_resume+0x4c/0x90 [ 402.325135] [<ffffffff815c1ec7>] int_signal+0x12/0x17 so they have already dropped reference to mm_struct but some of them are still waiting in the write path to fail and exit: [ 402.271983] a.out D ffff88007c047cb8 0 3600 1 0x00000084 [ 402.273866] ffff88007c047cb8 ffff88007c047c88 ffff8800793d8ba0 00000000000130c0 [ 402.275872] ffff88007c047fd8 00000000000130c0 ffff8800793d8ba0 ffff88007c047cb8 [ 402.277878] ffff88007ae56a48 ffff88007ae56a4c ffff8800793d8ba0 00000000ffffffff [ 402.279888] Call Trace: [ 402.280874] [<ffffffff815bddb4>] schedule_preempt_disabled+0x24/0x70 [ 402.282597] [<ffffffff815bfb65>] __mutex_lock_slowpath+0xb5/0x120 [ 402.284266] [<ffffffff815bfbee>] mutex_lock+0x1e/0x32 [ 402.285756] [<ffffffffa02463ca>] xfs_file_buffered_aio_write.isra.15+0x6a/0x200 [xfs] [ 402.287741] [<ffffffff8100d62f>] ? __switch_to+0x15f/0x580 [ 402.289311] [<ffffffffa02465dd>] xfs_file_write_iter+0x7d/0x120 [xfs] [ 402.291050] [<ffffffff81178009>] new_sync_write+0x89/0xd0 [ 402.292596] [<ffffffff811787f2>] vfs_write+0xb2/0x1f0 [ 402.294075] [<ffffffff8101b994>] ? 
do_audit_syscall_entry+0x64/0x70 [ 402.295774] [<ffffffff81179440>] SyS_write+0x50/0xc0 [ 402.297239] [<ffffffff810f9ffe>] ? __audit_syscall_exit+0x22e/0x2d0 [ 402.298947] [<ffffffff815c1c29>] system_call_fastpath+0x12/0x17 while one of them is holding the lock: [ 402.736525] a.out R running task 0 3617 1 0x00000084 [ 402.738452] MemAlloc: 358299 jiffies on 0x10 [ 402.739812] ffff88007ba63808 ffff88007ba637d8 ffff8800792f2510 00000000000130c0 [ 402.741972] ffff88007ba63fd8 00000000000130c0 ffff8800792f2510 ffff880078d1bb48 [ 402.744029] ffffffff81848588 0000000000000400 0000000000000000 ffff88007ba639c8 [ 402.746135] Call Trace: [ 402.747153] [<ffffffff815bdc72>] _cond_resched+0x22/0x40 [ 402.748718] [<ffffffff811249f2>] shrink_node_slabs+0x242/0x310 [ 402.750432] [<ffffffff81127155>] shrink_zone+0x175/0x1c0 [ 402.751996] [<ffffffff81127590>] do_try_to_free_pages+0x1d0/0x3e0 [ 402.753686] [<ffffffff81127834>] try_to_free_pages+0x94/0xc0 [ 402.755325] [<ffffffff8111d4c5>] __alloc_pages_nodemask+0x535/0xaa0 [ 402.757057] [<ffffffff8115cf9c>] alloc_pages_current+0x8c/0x100 [ 402.758725] [<ffffffff811148f7>] __page_cache_alloc+0xa7/0xc0 [ 402.760362] [<ffffffff81115364>] pagecache_get_page+0x54/0x1b0 [ 402.762004] [<ffffffff811154e8>] grab_cache_page_write_begin+0x28/0x50 [ 402.763787] [<ffffffffa023b04f>] xfs_vm_write_begin+0x2f/0xe0 [xfs] [ 402.765516] [<ffffffff8111465c>] generic_perform_write+0xbc/0x1c0 [ 402.767203] [<ffffffffa024634f>] ? xfs_file_aio_write_checks+0xdf/0xf0 [xfs] [ 402.769078] [<ffffffffa024642f>] xfs_file_buffered_aio_write.isra.15+0xcf/0x200 [xfs] So this is basically the same as the previous one we just see it in a slightly better shape because many threads managed to exit already. > where I think a.out cannot die within reasonable duration due to b.out . I am not sure you can have any reasonable time expectation with such a huge contention on a single file. Even killing the task manually would take quite some time I suspect. 
Sure, memory pressure makes it all much worse.

> I don't know whether cgroups can help or not,

Memory cgroups would help you limit the amount of anon memory, but you would have to be really careful about potential overcommit due to other allocations from outside of the restricted group. Not having any swap doesn't help here either; it just moves all the reclaim pressure to the file pages and slabs, which struggle already.

> but I think we need to be prepared for cases where sending SIGKILL to
> all threads sharing the same memory does not help.

Sure, unkillable tasks are a problem which we have to handle. Having GFP_KERNEL allocations loop without a way out contributes to this, which is sad, but your current data show that sometimes it might take ages to finish even without that going on.

--
Michal Hocko
SUSE Labs
* Re: How to handle TIF_MEMDIE stalls? 2014-12-30 11:21 ` Michal Hocko @ 2014-12-30 13:33 ` Tetsuo Handa 2014-12-31 10:24 ` Tetsuo Handa 2015-02-09 11:44 ` Tetsuo Handa 2015-02-16 11:23 ` Tetsuo Handa 2 siblings, 1 reply; 276+ messages in thread
From: Tetsuo Handa @ 2014-12-30 13:33 UTC (permalink / raw)
To: mhocko
Cc: david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, hannes, torvalds

Michal Hocko wrote:
> So the OOM blocked task is sitting in the page fault caused by clearing
> the user buffer. According to your debugging patch this should be
> GFP_HIGHUSER_MOVABLE | __GFP_ZERO allocation which is the case where we
> retry without failing most of the time.

Oops, my debugging patch had a bug. I wanted to print p->gfp_flags but was printing (p->gfp_flags & __GFP_WAIT). Retested with a fix, and the result is at http://I-love.SAKURA.ne.jp/tmp/serial-20141230-ab-3.txt.xz .

static void print_memalloc_info(const struct task_struct *p)
{
	const gfp_t gfp = p->gfp_flags;

	/*
	 * __alloc_pages_nodemask() doesn't use smp_wmb() between
	 * updating ->gfp_start and ->gfp_flags. But reading stale
	 * ->gfp_start value harms nothing but printing bogus duration.
	 * Correct duration will be printed when this function is
	 * called for the next time.
	 */
	if (unlikely(gfp & __GFP_WAIT))
		printk(KERN_INFO "MemAlloc: %ld jiffies on 0x%x\n",
		       jiffies - p->gfp_start, gfp);
}

> That being said this doesn't look like a live lock or a lockup. System
> should recover from this state but it might take a lot of time (there
> are hundreds of tasks waiting on the i_mutex lock, each will try to
> allocate and fail and OOM victims will have to get out of the kernel and
> die). I am not sure we can do much about that from the allocator POV. A
> possible way would be refraining from the reclaim efforts when it is
> clear that nothing is really reclaimable. But I suspect this would be
> tricky to get right.
Indeed, this is not a livelock, since the task holding the mutex is doing a !__GFP_FS allocation and is making progress (albeit too slowly to be worth waiting for), and the "waited for" lines eventually go away.

[  121.017797] b.out R running task 0 9999 9982 0x00000088
[  121.019750] MemAlloc: 30542 jiffies on 0x102005a
[  223.486701] b.out R running task 0 10008 9982 0x00000080
[  223.488642] MemAlloc: 12242 jiffies on 0x102005a
[  415.695635] b.out R running task 0 10013 9982 0x00000080
[  415.697578] MemAlloc: 108210 jiffies on 0x102005a
[  960.228134] b.out R running task 0 10013 9982 0x00000080
[  960.230179] MemAlloc: 652090 jiffies on 0x102005a

> > where I think a.out cannot die within reasonable duration due to b.out .
>
> I am not sure you can have any reasonable time expectation with such a
> huge contention on a single file. Even killing the task manually would
> take quite some time I suspect. Sure, memory pressure makes it all much
> worse.

Not specific to the OOM-killer case, but I wish the stall would end within 10 seconds, because my customers use a watchdog timeout of 11 seconds with a watchdog keep-alive interval of 2 seconds.

I wish there were a way to record that the process that is supposed to perform the watchdog keep-alive operation was unexpectedly blocked for many seconds in memory allocation. My gfp_start patch works for that purpose.

> > but I think we need to be prepared for cases where sending SIGKILL to
> > all threads sharing the same memory does not help.
>
> Sure, unkillable tasks are a problem which we have to handle. Having
> GFP_KERNEL allocations looping without way out contributes to this which
> is sad but your current data just show that sometimes it might take ages
> to finish even without that going on.

Can't we replace mutex_lock() / wait_for_completion() with killable versions where it is safe (in order to reduce the number of locations with unkillable waits)?
I think replacing the mutex_lock() in xfs_file_buffered_aio_write() with the killable version is possible, because data written by a buffered write is not guaranteed to be flushed until sync() / fsync() / fdatasync() returns.

And can't we detect unkillable TIF_MEMDIE tasks (e.g. by checking the task's ->state a while after TIF_MEMDIE was set)? My sysctl_memdie_timeout_jiffies patch works for that purpose.
* Re: How to handle TIF_MEMDIE stalls? 2014-12-30 13:33 ` Tetsuo Handa @ 2014-12-31 10:24 ` Tetsuo Handa 0 siblings, 0 replies; 276+ messages in thread From: Tetsuo Handa @ 2014-12-31 10:24 UTC (permalink / raw) To: mhocko Cc: david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, hannes, torvalds Tetsuo Handa wrote: > > > where I think a.out cannot die within reasonable duration due to b.out . > > > > I am not sure you can have any reasonable time expectation with such a > > huge contention on a single file. Even killing the task manually would > > take quite some time I suspect. Sure, memory pressure makes it all much > > worse. > > Not specific to OOM-killer case, but I wish that the stall ends within 10 > seconds, for my customers are using watchdog timeout of 11 seconds with > watchdog keep-alive interval of 2 seconds. > > I wish that there is a way to record that the process who is supposed to do > watchdog keep-alive operation was unexpectedly blocked for many seconds at > memory allocation. My gfp_start patch works for that purpose. > > > > but I think we need to be prepared for cases where sending SIGKILL to > > > all threads sharing the same memory does not help. > > > > Sure, unkillable tasks are a problem which we have to handle. Having > > GFP_KERNEL allocations looping without way out contributes to this which > > is sad but your current data just show that sometimes it might take ages > > to finish even without that going on. > > Can't we replace mutex_lock() / wait_for_completion() with killable versions > where it is safe (in order to reduce locations of unkillable waits)? > I think replacing mutex_lock() in xfs_file_buffered_aio_write() with killable > version is possible because data written by buffered write is not guaranteed > to be flushed until sync() / fsync() / fdatasync() returns. > > And can't we detect unkillable TIF_MEMDIE tasks (like checking task's ->state > after a while after TIF_MEMDIE was set)? 
> My sysctl_memdie_timeout_jiffies patch works for that purpose.

I was testing the patch below on the current linux.git tree. To my surprise, I can no longer reproduce the "stall by a.out + b.out" case, because setting TIF_MEMDIE on all threads sharing the same memory (without granting access to memory reserves) made it possible to resolve the stalled state immediately (console log is at http://I-love.SAKURA.ne.jp/tmp/serial-20141231-ab.txt.xz ). Given that low-order (<= PAGE_ALLOC_COSTLY_ORDER) allocations are allowed to fail immediately upon OOM, maybe we can let ongoing memory allocations fail without granting access to memory reserves?
----------------------------------------
From 9212fb2bc96579c0dd0e1f121f5c089c683e12c0 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 31 Dec 2014 17:50:24 +0900
Subject: [RFC PATCH] oom: Introduce sysctl-tunable MEMDIE timeout.

When there is a thread with the TIF_MEMDIE flag set, the OOM killer is disabled. However, the victim process containing that thread could get stuck due to a dependency which is invisible to the OOM killer. As a result, the system will stall for an unpredictable duration, because the OOM killer is kept disabled while one of the threads in the victim process is stuck.

This situation is easily reproduced by multi-threaded programs where thread1 tries to allocate memory while thread2 tries to perform a file I/O operation. The OOM killer sets the TIF_MEMDIE flag only on thread1, but the thread which really needs the TIF_MEMDIE flag (the one blocking thread2 via an unkillable wait, e.g. mutex_lock() on "struct inode"->i_mutex) can be thread3, which is doing a memory allocation. And thread3 can be outside of the victim process containing thread1. But in order to avoid depletion of memory reserves via the TIF_MEMDIE flag, we don't want to set the TIF_MEMDIE flag on all threads which might be preventing thread2 from terminating. Moreover, we can't know which threads are holding the lock that thread2 depends on.
While converting unkillable waits (e.g. mutex_lock()) to killable waits (e.g. mutex_lock_killable()) helps thread2 die quickly (upon SIGKILL not only from the OOM killer but also from user operations), we can't afford to convert all unkillable waits. So, we want to be prepared for unkillable threads anyway. This patch does the following things.

(1) Let ongoing memory allocations fail without access to memory reserves via the TIF_MEMDIE flag.

(2) Let the OOM killer set the TIF_MEMDIE flag on all threads sharing the same memory.

(3) Let the OOM killer record the current time when setting the TIF_MEMDIE flag.

(4) Let the OOM killer treat threads which did not die within a sysctl-tunable timeout as unkillable.

We can avoid depletion of memory reserves via the TIF_MEMDIE flag by (1). While (1) might retard termination of thread1 in cases where allowing access to memory reserves would help the victim process containing thread1 die quickly, (4) will prevent thread1 from being unable to die forever, by killing other threads after the timeout. If the OOM killer cannot find threads to kill after the timeout, something is absolutely wrong; therefore, a kernel panic followed by automatic reboot (with kdump as an option for analyzing the cause) should be OK.

(4) introduces the /proc/sys/vm/memdie_task_{skip|panic}_secs interfaces, which control the timeout for waiting for the threads with the TIF_MEMDIE flag set. When the timeout expires, the former enables the OOM killer again and the latter triggers a kernel panic.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/oom.h   |  3 +++
 include/linux/sched.h |  1 +
 kernel/cpuset.c       |  5 ++--
 kernel/exit.c         |  1 +
 kernel/sysctl.c       | 19 +++++++++++++
 mm/oom_kill.c         | 77 ++++++++++++++++++++++++++++++++++++++++++++-------
 mm/page_alloc.c       |  4 +--
 7 files changed, 95 insertions(+), 15 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 853698c..642e4ae 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -68,6 +68,7 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 		unsigned long totalpages, const nodemask_t *nodemask,
 		bool force_kill);
+extern bool is_killable_memdie_task(struct task_struct *p);
 extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *mask, bool force_kill);
 extern int register_oom_notifier(struct notifier_block *nb);
@@ -107,4 +108,6 @@ static inline bool task_will_free_mem(struct task_struct *task)
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
 extern int sysctl_panic_on_oom;
+extern unsigned long sysctl_memdie_task_skip_secs;
+extern unsigned long sysctl_memdie_task_panic_secs;
 #endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8db31ef..58ad56a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1701,6 +1701,7 @@ struct task_struct {
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 	unsigned long task_state_change;
 #endif
+	unsigned long memdie_start;
 };

 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 64b257f..aea9d712 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -35,6 +35,7 @@
 #include <linux/kmod.h>
 #include <linux/list.h>
 #include <linux/mempolicy.h>
+#include <linux/oom.h>
 #include <linux/mm.h>
 #include <linux/memory.h>
 #include <linux/export.h>
@@ -1008,7 +1009,7 @@ static void cpuset_change_task_nodemask(struct task_struct *tsk,
 	 * Allow tasks that have access to memory reserves because they have
 	 * been OOM killed to get memory anywhere.
 	 */
-	if (unlikely(test_thread_flag(TIF_MEMDIE)))
+	if (unlikely(is_killable_memdie_task(current)))
 		return;
 	if (current->flags & PF_EXITING) /* Let dying task have memory */
 		return;
@@ -2515,7 +2516,7 @@ int __cpuset_node_allowed(int node, gfp_t gfp_mask)
 	 * Allow tasks that have access to memory reserves because they have
 	 * been OOM killed to get memory anywhere.
 	 */
-	if (unlikely(test_thread_flag(TIF_MEMDIE)))
+	if (unlikely(is_killable_memdie_task(current)))
 		return 1;
 	if (gfp_mask & __GFP_HARDWALL)	/* If hardwall request, stop here */
 		return 0;
diff --git a/kernel/exit.c b/kernel/exit.c
index 1ea4369..de5efe5 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -436,6 +436,7 @@ static void exit_mm(struct task_struct *tsk)
 	mm_update_next_owner(mm);
 	mmput(mm);
 	clear_thread_flag(TIF_MEMDIE);
+	current->memdie_start = 0;
 }

 static struct task_struct *find_alive_thread(struct task_struct *p)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 137c7f6..dab9b31 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -145,6 +145,9 @@ static const int cap_last_cap = CAP_LAST_CAP;
 static unsigned long hung_task_timeout_max = (LONG_MAX/HZ);
 #endif

+/* Used by proc_doulongvec_minmax of sysctl_memdie_task_*_secs */
+static unsigned long memdie_task_timeout_max = (LONG_MAX/HZ);
+
 #ifdef CONFIG_INOTIFY_USER
 #include <linux/inotify.h>
 #endif
@@ -1502,6 +1505,22 @@ static struct ctl_table vm_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_doulongvec_minmax,
 	},
+	{
+		.procname	= "memdie_task_skip_secs",
+		.data		= &sysctl_memdie_task_skip_secs,
+		.maxlen		= sizeof(sysctl_memdie_task_skip_secs),
+		.mode		= 0644,
+		.proc_handler	= proc_doulongvec_minmax,
+		.extra2		= &memdie_task_timeout_max,
+	},
+	{
+		.procname	= "memdie_task_panic_secs",
+		.data		= &sysctl_memdie_task_panic_secs,
+		.maxlen		= sizeof(sysctl_memdie_task_panic_secs),
+		.mode		= 0644,
+		.proc_handler	= proc_doulongvec_minmax,
+		.extra2		= &memdie_task_timeout_max,
+	},
 	{ }
 };
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d503e9c..dbff730 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -43,6 +43,8 @@ int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks = 1;
 static DEFINE_SPINLOCK(zone_scan_lock);
+unsigned long sysctl_memdie_task_skip_secs;
+unsigned long sysctl_memdie_task_panic_secs;

 #ifdef CONFIG_NUMA
 /**
@@ -117,6 +119,61 @@ found:
 	return t;
 }

+/**
+ * set_memdie_flag - set TIF_MEMDIE flag and record current time.
+ * @p: Pointer to "struct task_struct".
+ */
+static void set_memdie_flag(struct task_struct *p)
+{
+	/* For avoiding race condition, current time must not be 0. */
+	if (!p->memdie_start) {
+		const unsigned long start = jiffies;
+
+		p->memdie_start = start ? start : 1;
+	}
+	set_tsk_thread_flag(p, TIF_MEMDIE);
+}
+
+/**
+ * is_killable_memdie_task - check task is not stuck with TIF_MEMDIE flag set.
+ * @p: Pointer to "struct task_struct".
+ *
+ * Setting TIF_MEMDIE flag to @p disables the OOM killer. However, @p could get
+ * stuck due to dependency which is invisible to the OOM killer. When @p got
+ * stuck, the system will stall for unpredictable duration (presumably forever)
+ * because the OOM killer is kept disabled.
+ *
+ * If @p remained stuck for /proc/sys/vm/memdie_task_skip_secs seconds, this
+ * function returns false as if TIF_MEMDIE flag was not set to @p. As a result,
+ * the OOM killer will try to find other killable processes at the risk of
+ * kernel panic when there is no other killable processes.
+ * If @p remained stuck for /proc/sys/vm/memdie_task_panic_secs seconds, this
+ * function triggers kernel panic (for optionally taking vmcore for analysis).
+ * Setting 0 to these interfaces disables this check.
+ */
+bool is_killable_memdie_task(struct task_struct *p)
+{
+	unsigned long start, timeout;
+
+	/* If task does not have TIF_MEMDIE flag, there is nothing to do.*/
+	if (!test_tsk_thread_flag(p, TIF_MEMDIE))
+		return false;
+	/* Handle cases where TIF_MEMDIE was set outside of this file. */
+	start = p->memdie_start;
+	if (!start) {
+		set_memdie_flag(p);
+		return true;
+	}
+	/* Trigger kernel panic after timeout. */
+	timeout = sysctl_memdie_task_panic_secs;
+	if (timeout && time_after(jiffies, start + timeout * HZ))
+		panic("Out of memory: %s (%d) did not die within %lu seconds.\n",
+		      p->comm, p->pid, timeout);
+	/* Return true before timeout. */
+	timeout = sysctl_memdie_task_skip_secs;
+	return !timeout || time_before(jiffies, start + timeout * HZ);
+}
+
 /* return true if the task is not adequate as candidate victim task. */
 static bool oom_unkillable_task(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask)
@@ -134,7 +191,7 @@ static bool oom_unkillable_task(struct task_struct *p,
 	if (!has_intersects_mems_allowed(p, nodemask))
 		return true;

-	return false;
+	return is_killable_memdie_task(p);
 }

 /**
@@ -439,7 +496,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
 	if (task_will_free_mem(p)) {
-		set_tsk_thread_flag(p, TIF_MEMDIE);
+		set_memdie_flag(p);
 		put_task_struct(p);
 		return;
 	}
@@ -500,12 +557,11 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,

 	/*
 	 * Kill all user processes sharing victim->mm in other thread groups, if
-	 * any. They don't get access to memory reserves, though, to avoid
-	 * depletion of all memory. This prevents mm->mmap_sem livelock when an
-	 * oom killed thread cannot exit because it requires the semaphore and
-	 * its contended by another thread trying to allocate memory itself.
-	 * That thread will now get access to memory reserves since it has a
-	 * pending fatal signal.
+	 * any. This mitigates mm->mmap_sem livelock when an oom killed thread
+	 * cannot exit because it requires the semaphore and its contended by
+	 * another thread trying to allocate memory itself. Note that this does
+	 * not help if the contended process does not share victim->mm. In that
+	 * case, is_killable_memdie_task() will detect it and take actions.
 	 */
 	rcu_read_lock();
 	for_each_process(p)
@@ -518,11 +574,12 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		pr_err("Kill process %d (%s) sharing same memory\n",
 			task_pid_nr(p), p->comm);
 		task_unlock(p);
+		set_memdie_flag(p);
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();

-	set_tsk_thread_flag(victim, TIF_MEMDIE);
+	set_memdie_flag(victim);
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 	put_task_struct(victim);
 }
@@ -645,7 +702,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 * quickly exit and free its memory.
 	 */
 	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
-		set_thread_flag(TIF_MEMDIE);
+		set_memdie_flag(current);
 		return;
 	}

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7633c50..3799139 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2603,9 +2603,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 		else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
-		else if (!in_interrupt() &&
-				((current->flags & PF_MEMALLOC) ||
-				 unlikely(test_thread_flag(TIF_MEMDIE))))
+		else if (!in_interrupt() && (current->flags & PF_MEMALLOC))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
 #ifdef CONFIG_CMA
--
1.8.3.1
* Re: How to handle TIF_MEMDIE stalls? 2014-12-30 11:21 ` Michal Hocko 2014-12-30 13:33 ` Tetsuo Handa @ 2015-02-09 11:44 ` Tetsuo Handa 2015-02-10 13:58 ` Tetsuo Handa 2015-02-17 14:37 ` Michal Hocko 2015-02-16 11:23 ` Tetsuo Handa 2 siblings, 2 replies; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-09 11:44 UTC (permalink / raw)
To: mhocko
Cc: david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, hannes, torvalds

Hello.

Today I tested Linux 3.19 and noticed unexpected behaviors (A) and (B) shown below.

(A) The order-0 __GFP_WAIT allocation fails immediately upon OOM condition even though we didn't remove the

	/*
	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
	 * means __GFP_NOFAIL, but that may not be true in other
	 * implementations.
	 */
	if (order <= PAGE_ALLOC_COSTLY_ORDER)
		return 1;

check in should_alloc_retry(). Is this what you expected?

(B) When coredump to a pipe is configured, the system stalls under OOM condition due to memory allocation on the coredump's reader side. How should we handle this "expected to terminate shortly but unable to terminate due to invisible dependency" case? What approaches other than applying a timeout on the coredump's writer side are possible? (Running inside a memory cgroup is not the answer I want.)

Console log is at http://I-love.SAKURA.ne.jp/tmp/serial-20150209.txt.xz and kernel config is at http://I-love.SAKURA.ne.jp/tmp/config-3.19 .

To reproduce these behaviors, you can run the reproducer program shown below on a system with 4 CPUs / 2GB RAM / no swap. (A too-small stack is passed to clone() because I did so by mistake when trying to reproduce OOM-stall situations caused by memory allocations inside unkillable down_write("struct mm_struct"->mmap_sem) calls.)
---------- reproducer program start ----------
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/mman.h>

static int file_mapper(void *unused)
{
	const int fd = open("/proc/self/exe", O_RDONLY);
	void *ptr[10000]; /* Will cause SIGSEGV due to stack overflow */
	int i;
	while (1) {
		for (i = 0; i < 10000; i++)
			ptr[i] = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
		for (i = 0; i < 10000; i++)
			munmap(ptr[i], 4096);
	}
	return 0;
}

static void child(void)
{
	const int fd = open("/proc/self/oom_score_adj", O_WRONLY);
	int i;
	write(fd, "999", 3);
	close(fd);
	for (i = 0; i < 10; i++) {
		char *cp = malloc(4 * 1024);
		if (!cp || clone(file_mapper, cp + 4 * 1024,
				 CLONE_SIGHAND | CLONE_VM, NULL) == -1)
			break;
	}
	while (1)
		pause();
}

static void memory_consumer(void)
{
	const int fd = open("/dev/zero", O_RDONLY);
	unsigned long size;
	char *buf = NULL;
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	while (1)
		read(fd, buf, size); /* Will cause OOM due to overcommit */
}

int main(int argc, char *argv[])
{
	if (fork() == 0)
		child();
	memory_consumer();
	return 0;
}
---------- reproducer program end ----------
Logs for (A) [ 98.933472] kworker/1:2: page allocation failure: order:0, mode:0x10 [ 98.935374] CPU: 1 PID: 363 Comm: kworker/1:2 Not tainted 3.19.0 #329 [ 98.937271] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013 [ 98.940026] Workqueue: events_freezable_power_ disk_events_workfn [ 98.942084] 0000000000000000 00000000f967a090 0000000000000000 ffffffff81576f4e [ 98.944511] 0000000000000010 ffffffff8110d26e ffff88007fffdb00 0000000000000000 [ 98.946873] 0000000236945e30 0000000000000002 0000000000000000 00000000f967a090 [ 98.949121] Call Trace: [ 98.950318] [<ffffffff81576f4e>] ? dump_stack+0x40/0x50 [ 98.952054] [<ffffffff8110d26e>] ?
warn_alloc_failed+0xee/0x150 [ 98.953935] [<ffffffff811108e2>] ? __alloc_pages_nodemask+0x6a2/0xa70 [ 98.955912] [<ffffffff811501d7>] ? alloc_pages_current+0x87/0x100 [ 98.957812] [<ffffffff812467c6>] ? bio_copy_user_iov+0x1c6/0x380 [ 98.959709] [<ffffffff81246a1a>] ? bio_copy_kern+0x4a/0xf0 [ 98.961518] [<ffffffff8125053a>] ? blk_rq_map_kern+0x6a/0x150 [ 98.963346] [<ffffffff8124a856>] ? blk_get_request+0x76/0x120 [ 98.965208] [<ffffffff8139d39c>] ? scsi_execute+0x12c/0x160 [ 98.967093] [<ffffffff8139d4ab>] ? scsi_execute_req_flags+0x8b/0x100 [ 98.969088] [<ffffffffa01fca20>] ? sr_check_events+0xc0/0x300 [sr_mod] [ 98.971076] [<ffffffff81579152>] ? __schedule+0x272/0x760 [ 98.972838] [<ffffffffa01f017f>] ? cdrom_check_events+0xf/0x30 [cdrom] [ 98.974856] [<ffffffff8125a5ba>] ? disk_check_events+0x5a/0x1e0 [ 98.976753] [<ffffffff8107b0b1>] ? process_one_work+0x131/0x360 [ 98.978650] [<ffffffff8107b863>] ? worker_thread+0x113/0x590 [ 98.980489] [<ffffffff8107b750>] ? rescuer_thread+0x470/0x470 [ 98.982330] [<ffffffff810804d1>] ? kthread+0xd1/0xf0 [ 98.984068] [<ffffffff81080400>] ? kthread_create_on_node+0x190/0x190 [ 98.986049] [<ffffffff8157d27c>] ? ret_from_fork+0x7c/0xb0 [ 98.987845] [<ffffffff81080400>] ? kthread_create_on_node+0x190/0x190 [ 101.495212] kworker/1:2: page allocation failure: order:0, mode:0x10 [ 101.497410] CPU: 1 PID: 363 Comm: kworker/1:2 Not tainted 3.19.0 #329 [ 101.499581] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013 [ 101.502603] Workqueue: events_freezable_power_ disk_events_workfn [ 101.504775] 0000000000000000 00000000f967a090 0000000000000000 ffffffff81576f4e [ 101.507283] 0000000000000010 ffffffff8110d26e ffff88007fffdb00 0000000000000000 [ 101.509800] 0000000236945e30 0000000000000002 0000000000000000 00000000f967a090 [ 101.512324] Call Trace: [ 101.513767] [<ffffffff81576f4e>] ? dump_stack+0x40/0x50 [ 101.515748] [<ffffffff8110d26e>] ? 
warn_alloc_failed+0xee/0x150 [ 101.517897] [<ffffffff811108e2>] ? __alloc_pages_nodemask+0x6a2/0xa70 [ 101.520140] [<ffffffff811501d7>] ? alloc_pages_current+0x87/0x100 [ 101.522352] [<ffffffff812467c6>] ? bio_copy_user_iov+0x1c6/0x380 [ 101.524534] [<ffffffff81246a1a>] ? bio_copy_kern+0x4a/0xf0 [ 101.526619] [<ffffffff8125053a>] ? blk_rq_map_kern+0x6a/0x150 [ 101.528743] [<ffffffff8124a856>] ? blk_get_request+0x76/0x120 [ 101.530870] [<ffffffff8139d39c>] ? scsi_execute+0x12c/0x160 [ 101.532971] [<ffffffff8139d4ab>] ? scsi_execute_req_flags+0x8b/0x100 [ 101.535250] [<ffffffffa01fca20>] ? sr_check_events+0xc0/0x300 [sr_mod] [ 101.537641] [<ffffffff81579152>] ? __schedule+0x272/0x760 [ 101.539713] [<ffffffffa01f017f>] ? cdrom_check_events+0xf/0x30 [cdrom] [ 101.542015] [<ffffffff8125a5ba>] ? disk_check_events+0x5a/0x1e0 [ 101.544189] [<ffffffff8107b0b1>] ? process_one_work+0x131/0x360 [ 101.546370] [<ffffffff8107b863>] ? worker_thread+0x113/0x590 [ 101.548488] [<ffffffff8107b750>] ? rescuer_thread+0x470/0x470 [ 101.550575] [<ffffffff810804d1>] ? kthread+0xd1/0xf0 [ 101.552492] [<ffffffff81080400>] ? kthread_create_on_node+0x190/0x190 [ 101.554657] [<ffffffff8157d27c>] ? ret_from_fork+0x7c/0xb0 [ 101.556628] [<ffffffff81080400>] ? kthread_create_on_node+0x190/0x190 [ 104.052500] kworker/1:2: page allocation failure: order:0, mode:0x10 [ 104.054694] CPU: 1 PID: 363 Comm: kworker/1:2 Not tainted 3.19.0 #329 [ 104.056897] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013 [ 104.059887] Workqueue: events_freezable_power_ disk_events_workfn [ 104.062061] 0000000000000000 00000000f967a090 0000000000000000 ffffffff81576f4e [ 104.064611] 0000000000000010 ffffffff8110d26e ffff88007fffdb00 0000000000000000 [ 104.067119] 0000000236945e30 0000000000000002 0000000000000000 00000000f967a090 [ 104.069657] Call Trace: [ 104.071074] [<ffffffff81576f4e>] ? dump_stack+0x40/0x50 [ 104.073080] [<ffffffff8110d26e>] ? 
warn_alloc_failed+0xee/0x150 [ 104.075194] [<ffffffff811108e2>] ? __alloc_pages_nodemask+0x6a2/0xa70 [ 104.077424] [<ffffffff811501d7>] ? alloc_pages_current+0x87/0x100 [ 104.079626] [<ffffffff812467c6>] ? bio_copy_user_iov+0x1c6/0x380 [ 104.081800] [<ffffffff81246a1a>] ? bio_copy_kern+0x4a/0xf0 [ 104.083868] [<ffffffff8125053a>] ? blk_rq_map_kern+0x6a/0x150 [ 104.085988] [<ffffffff8124a856>] ? blk_get_request+0x76/0x120 [ 104.088119] [<ffffffff8139d39c>] ? scsi_execute+0x12c/0x160 [ 104.090206] [<ffffffff8139d4ab>] ? scsi_execute_req_flags+0x8b/0x100 [ 104.092497] [<ffffffffa01fca20>] ? sr_check_events+0xc0/0x300 [sr_mod] [ 104.094781] [<ffffffff81579152>] ? __schedule+0x272/0x760 [ 104.096843] [<ffffffffa01f017f>] ? cdrom_check_events+0xf/0x30 [cdrom] [ 104.099147] [<ffffffff8125a5ba>] ? disk_check_events+0x5a/0x1e0 [ 104.101306] [<ffffffff8107b0b1>] ? process_one_work+0x131/0x360 [ 104.103470] [<ffffffff8107b863>] ? worker_thread+0x113/0x590 [ 104.105600] [<ffffffff8107b750>] ? rescuer_thread+0x470/0x470 [ 104.107710] [<ffffffff810804d1>] ? kthread+0xd1/0xf0 [ 104.109607] [<ffffffff81080400>] ? kthread_create_on_node+0x190/0x190 [ 104.111781] [<ffffffff8157d27c>] ? ret_from_fork+0x7c/0xb0 [ 104.113733] [<ffffffff81080400>] ? kthread_create_on_node+0x190/0x190 [ 106.608783] kworker/1:2: page allocation failure: order:0, mode:0x10 [ 106.610960] CPU: 1 PID: 363 Comm: kworker/1:2 Not tainted 3.19.0 #329 [ 106.613123] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013 [ 106.616159] Workqueue: events_freezable_power_ disk_events_workfn [ 106.618337] 0000000000000000 00000000f967a090 0000000000000000 ffffffff81576f4e [ 106.621153] 0000000000000010 ffffffff8110d26e ffff88007fffdb00 0000000000000000 [ 106.623823] 0000000236945e30 0000000000000002 0000000000000000 00000000f967a090 [ 106.626386] Call Trace: [ 106.627810] [<ffffffff81576f4e>] ? dump_stack+0x40/0x50 [ 106.629800] [<ffffffff8110d26e>] ? 
warn_alloc_failed+0xee/0x150 [ 106.632128] [<ffffffff811108e2>] ? __alloc_pages_nodemask+0x6a2/0xa70 [ 106.634460] [<ffffffff811501d7>] ? alloc_pages_current+0x87/0x100 [ 106.636638] [<ffffffff812467c6>] ? bio_copy_user_iov+0x1c6/0x380 [ 106.638856] [<ffffffff81246a1a>] ? bio_copy_kern+0x4a/0xf0 [ 106.640929] [<ffffffff8125053a>] ? blk_rq_map_kern+0x6a/0x150 [ 106.643053] [<ffffffff8124a856>] ? blk_get_request+0x76/0x120 [ 106.645209] [<ffffffff8139d39c>] ? scsi_execute+0x12c/0x160 [ 106.647293] [<ffffffff8139d4ab>] ? scsi_execute_req_flags+0x8b/0x100 [ 106.649573] [<ffffffffa01fca20>] ? sr_check_events+0xc0/0x300 [sr_mod] [ 106.651921] [<ffffffff81579152>] ? __schedule+0x272/0x760 [ 106.654008] [<ffffffffa01f017f>] ? cdrom_check_events+0xf/0x30 [cdrom] [ 106.656297] [<ffffffff8125a5ba>] ? disk_check_events+0x5a/0x1e0 [ 106.658466] [<ffffffff8107b0b1>] ? process_one_work+0x131/0x360 [ 106.660610] [<ffffffff8107b863>] ? worker_thread+0x113/0x590 [ 106.662744] [<ffffffff8107b750>] ? rescuer_thread+0x470/0x470 [ 106.664849] [<ffffffff810804d1>] ? kthread+0xd1/0xf0 [ 106.666759] [<ffffffff81080400>] ? kthread_create_on_node+0x190/0x190 [ 106.668930] [<ffffffff8157d27c>] ? ret_from_fork+0x7c/0xb0 [ 106.670889] [<ffffffff81080400>] ? kthread_create_on_node+0x190/0x190 Logs for (B) [ 145.078502] a.out S ffff88007fc92d00 0 2643 2641 0x00000080 [ 145.078503] ffff88003681c480 0000000000012d00 ffff88007a51bfd8 0000000000012d00 [ 145.078504] ffff88003681c480 ffff88003681c480 000200d20000000f 0000000000000001 [ 145.078504] ffff88003681c480 ffff88003681c480 00007fb700000001 ffff88007adcc508 [ 145.078505] Call Trace: [ 145.078506] [<ffffffff8112af4e>] ? copy_from_iter+0x10e/0x2d0 [ 145.078507] [<ffffffff8112af4e>] ? copy_from_iter+0x10e/0x2d0 [ 145.078508] [<ffffffff8117ba17>] ? pipe_wait+0x67/0xb0 [ 145.078509] [<ffffffff8109ced0>] ? wait_woken+0x90/0x90 [ 145.078510] [<ffffffff8117bb48>] ? pipe_write+0x88/0x450 [ 145.078511] [<ffffffff811732a3>] ? 
new_sync_write+0x83/0xd0 [ 145.078512] [<ffffffff81173417>] ? __kernel_write+0x57/0x140 [ 145.078513] [<ffffffff811c615e>] ? dump_emit+0x8e/0xd0 [ 145.078515] [<ffffffff811c002f>] ? elf_core_dump+0x146f/0x15d0 [ 145.078516] [<ffffffff811c6a09>] ? do_coredump+0x769/0xe80 [ 145.078517] [<ffffffff8101634d>] ? native_sched_clock+0x2d/0x80 [ 145.078518] [<ffffffff8106fd2b>] ? __send_signal+0x16b/0x3a0 [ 145.078520] [<ffffffff810717f2>] ? get_signal+0x192/0x770 [ 145.078521] [<ffffffff8100d451>] ? do_signal+0x31/0x6d0 [ 145.078522] [<ffffffff8100db5c>] ? do_notify_resume+0x6c/0x90 [ 145.078523] [<ffffffff8157e022>] ? retint_signal+0x48/0x86 [ 145.078625] abrt-hook-ccpp D 0000000000000002 0 2650 347 0x00000080 [ 145.078626] ffff88007b364d10 0000000000012d00 ffff88007ae3ffd8 0000000000012d00 [ 145.078627] ffff88007b364d10 ffff88007fffc000 ffffffff8111a6a5 0000000000000000 [ 145.078628] 0000000000000000 000088007ae3f9e8 ffff88007b364d10 ffffffff81015df5 [ 145.078628] Call Trace: [ 145.078629] [<ffffffff8111a6a5>] ? shrink_zone+0x105/0x2a0 [ 145.078630] [<ffffffff81015df5>] ? read_tsc+0x5/0x10 [ 145.078631] [<ffffffff810c0270>] ? ktime_get+0x30/0x90 [ 145.078632] [<ffffffff810f73b9>] ? delayacct_end+0x39/0x70 [ 145.078633] [<ffffffff8111ae45>] ? do_try_to_free_pages+0x3e5/0x480 [ 145.078634] [<ffffffff8157c013>] ? schedule_timeout+0x113/0x1b0 [ 145.078635] [<ffffffff810b9800>] ? migrate_timer_list+0x60/0x60 [ 145.078636] [<ffffffff811109ee>] ? __alloc_pages_nodemask+0x7ae/0xa70 [ 145.078638] [<ffffffff811501d7>] ? alloc_pages_current+0x87/0x100 [ 145.078640] [<ffffffff8110a240>] ? filemap_fault+0x1c0/0x400 [ 145.078641] [<ffffffff8112e7c6>] ? __do_fault+0x46/0xd0 [ 145.078642] [<ffffffff81131128>] ? do_read_fault.isra.62+0x228/0x310 [ 145.078643] [<ffffffff8113380e>] ? handle_mm_fault+0x7ae/0x10e0 [ 145.078644] [<ffffffff81138145>] ? vma_set_page_prot+0x35/0x60 [ 145.078645] [<ffffffff8105194e>] ? __do_page_fault+0x17e/0x540 [ 145.078646] [<ffffffff811399ac>] ? 
do_mmap_pgoff+0x33c/0x3f0 [ 145.078647] [<ffffffff8112180b>] ? vm_mmap_pgoff+0xbb/0xf0 [ 145.078648] [<ffffffff81051d40>] ? do_page_fault+0x30/0x70 [ 145.078649] [<ffffffff8157ed38>] ? page_fault+0x28/0x30 [ 232.113394] a.out S ffff88007fc92d00 0 2643 2641 0x00000080 [ 232.115926] ffff88003681c480 0000000000012d00 ffff88007a51bfd8 0000000000012d00 [ 232.118630] ffff88003681c480 ffff88003681c480 000200d20000000f 0000000000000001 [ 232.121312] ffff88003681c480 ffff88003681c480 00007fb700000001 ffff88007adcc508 [ 232.124004] Call Trace: [ 232.125242] [<ffffffff8112af4e>] ? copy_from_iter+0x10e/0x2d0 [ 232.127506] [<ffffffff8112af4e>] ? copy_from_iter+0x10e/0x2d0 [ 232.129972] [<ffffffff8117ba17>] ? pipe_wait+0x67/0xb0 [ 232.131960] [<ffffffff8109ced0>] ? wait_woken+0x90/0x90 [ 232.133928] [<ffffffff8117bb48>] ? pipe_write+0x88/0x450 [ 232.135901] [<ffffffff811732a3>] ? new_sync_write+0x83/0xd0 [ 232.137956] [<ffffffff81173417>] ? __kernel_write+0x57/0x140 [ 232.140033] [<ffffffff811c615e>] ? dump_emit+0x8e/0xd0 [ 232.141958] [<ffffffff811c002f>] ? elf_core_dump+0x146f/0x15d0 [ 232.144161] [<ffffffff811c6a09>] ? do_coredump+0x769/0xe80 [ 232.146178] [<ffffffff8101634d>] ? native_sched_clock+0x2d/0x80 [ 232.148343] [<ffffffff8106fd2b>] ? __send_signal+0x16b/0x3a0 [ 232.150441] [<ffffffff810717f2>] ? get_signal+0x192/0x770 [ 232.152468] [<ffffffff8100d451>] ? do_signal+0x31/0x6d0 [ 232.154441] [<ffffffff8100db5c>] ? do_notify_resume+0x6c/0x90 [ 232.156552] [<ffffffff8157e022>] ? retint_signal+0x48/0x86 [ 232.340460] abrt-hook-ccpp D 0000000000000002 0 2650 347 0x00000080 [ 232.343038] ffff88007b364d10 0000000000012d00 ffff88007ae3ffd8 0000000000012d00 [ 232.345779] ffff88007b364d10 ffff88007fffc000 ffffffff8111a6a5 0000000000000000 [ 232.348626] 0000000000000000 000088007ae3f9e8 ffff88007b364d10 ffffffff81015df5 [ 232.351400] Call Trace: [ 232.352798] [<ffffffff8111a6a5>] ? shrink_zone+0x105/0x2a0 [ 232.355177] [<ffffffff81015df5>] ? 
read_tsc+0x5/0x10 [ 232.357260] [<ffffffff810c0270>] ? ktime_get+0x30/0x90 [ 232.359321] [<ffffffff810f73b9>] ? delayacct_end+0x39/0x70 [ 232.361597] [<ffffffff8111ae45>] ? do_try_to_free_pages+0x3e5/0x480 [ 232.364151] [<ffffffff81089ac1>] ? try_to_wake_up+0x221/0x2b0 [ 232.366364] [<ffffffff8110af07>] ? oom_badness+0x17/0x130 [ 232.368410] [<ffffffff8109ced9>] ? autoremove_wake_function+0x9/0x30 [ 232.370694] [<ffffffff8157992f>] ? _cond_resched+0x1f/0x40 [ 232.372765] [<ffffffff811106d0>] ? __alloc_pages_nodemask+0x490/0xa70 [ 232.375082] [<ffffffff811501d7>] ? alloc_pages_current+0x87/0x100 [ 232.377416] [<ffffffff8110a240>] ? filemap_fault+0x1c0/0x400 [ 232.379542] [<ffffffff8112e7c6>] ? __do_fault+0x46/0xd0 [ 232.381624] [<ffffffff81131128>] ? do_read_fault.isra.62+0x228/0x310 [ 232.383984] [<ffffffff8113380e>] ? handle_mm_fault+0x7ae/0x10e0 [ 232.386198] [<ffffffff81138145>] ? vma_set_page_prot+0x35/0x60 [ 232.388386] [<ffffffff8105194e>] ? __do_page_fault+0x17e/0x540 [ 232.390592] [<ffffffff811399ac>] ? do_mmap_pgoff+0x33c/0x3f0 [ 232.392762] [<ffffffff8112180b>] ? vm_mmap_pgoff+0xbb/0xf0 [ 232.395259] [<ffffffff81051d40>] ? do_page_fault+0x30/0x70 [ 232.397472] [<ffffffff8157ed38>] ? page_fault+0x28/0x30 [ 328.225954] a.out S ffff88007fc92d00 0 2643 2641 0x00000080 [ 328.228262] ffff88003681c480 0000000000012d00 ffff88007a51bfd8 0000000000012d00 [ 328.230731] ffff88003681c480 ffff88003681c480 000200d20000000f 0000000000000001 [ 328.233188] ffff88003681c480 ffff88003681c480 00007fb700000001 ffff88007adcc508 [ 328.235701] Call Trace: [ 328.236851] [<ffffffff8112af4e>] ? copy_from_iter+0x10e/0x2d0 [ 328.238826] [<ffffffff8112af4e>] ? copy_from_iter+0x10e/0x2d0 [ 328.240792] [<ffffffff8117ba17>] ? pipe_wait+0x67/0xb0 [ 328.242598] [<ffffffff8109ced0>] ? wait_woken+0x90/0x90 [ 328.244426] [<ffffffff8117bb48>] ? pipe_write+0x88/0x450 [ 328.246284] [<ffffffff811732a3>] ? new_sync_write+0x83/0xd0 [ 328.248208] [<ffffffff81173417>] ? 
__kernel_write+0x57/0x140 [ 328.250159] [<ffffffff811c615e>] ? dump_emit+0x8e/0xd0 [ 328.251967] [<ffffffff811c002f>] ? elf_core_dump+0x146f/0x15d0 [ 328.253930] [<ffffffff811c6a09>] ? do_coredump+0x769/0xe80 [ 328.255811] [<ffffffff8101634d>] ? native_sched_clock+0x2d/0x80 [ 328.257806] [<ffffffff8106fd2b>] ? __send_signal+0x16b/0x3a0 [ 328.259714] [<ffffffff810717f2>] ? get_signal+0x192/0x770 [ 328.261552] [<ffffffff8100d451>] ? do_signal+0x31/0x6d0 [ 328.263369] [<ffffffff8100db5c>] ? do_notify_resume+0x6c/0x90 [ 328.265292] [<ffffffff8157e022>] ? retint_signal+0x48/0x86 [ 328.444215] abrt-hook-ccpp D 0000000000000002 0 2650 347 0x00000080 [ 328.446549] ffff88007b364d10 0000000000012d00 ffff88007ae3ffd8 0000000000012d00 [ 328.449029] ffff88007b364d10 ffff88007fffc000 ffffffff8111a6a5 0000000000000000 [ 328.451689] 0000000000000000 000088007ae3f9e8 ffff88007b364d10 ffffffff81015df5 [ 328.454187] Call Trace: [ 328.455408] [<ffffffff8111a6a5>] ? shrink_zone+0x105/0x2a0 [ 328.457406] [<ffffffff81015df5>] ? read_tsc+0x5/0x10 [ 328.459289] [<ffffffff810c0270>] ? ktime_get+0x30/0x90 [ 328.461368] [<ffffffff810f73b9>] ? delayacct_end+0x39/0x70 [ 328.464191] [<ffffffff8111ae45>] ? do_try_to_free_pages+0x3e5/0x480 [ 328.466419] [<ffffffff8157c013>] ? schedule_timeout+0x113/0x1b0 [ 328.468506] [<ffffffff810b9800>] ? migrate_timer_list+0x60/0x60 [ 328.470672] [<ffffffff811109ee>] ? __alloc_pages_nodemask+0x7ae/0xa70 [ 328.472883] [<ffffffff811501d7>] ? alloc_pages_current+0x87/0x100 [ 328.475087] [<ffffffff8110a240>] ? filemap_fault+0x1c0/0x400 [ 328.477089] [<ffffffff8112e7c6>] ? __do_fault+0x46/0xd0 [ 328.478960] [<ffffffff81131128>] ? do_read_fault.isra.62+0x228/0x310 [ 328.481116] [<ffffffff8113380e>] ? handle_mm_fault+0x7ae/0x10e0 [ 328.483454] [<ffffffff81138145>] ? vma_set_page_prot+0x35/0x60 [ 328.485613] [<ffffffff8105194e>] ? __do_page_fault+0x17e/0x540 [ 328.487634] [<ffffffff811399ac>] ? do_mmap_pgoff+0x33c/0x3f0 [ 328.489611] [<ffffffff8112180b>] ? 
vm_mmap_pgoff+0xbb/0xf0 [ 328.491539] [<ffffffff81051d40>] ? do_page_fault+0x30/0x70 [ 328.493441] [<ffffffff8157ed38>] ? page_fault+0x28/0x30
* Re: How to handle TIF_MEMDIE stalls? 2015-02-09 11:44 ` Tetsuo Handa @ 2015-02-10 13:58 ` Tetsuo Handa 2015-02-10 15:19 ` Johannes Weiner 2015-02-17 14:37 ` Michal Hocko 1 sibling, 1 reply; 276+ messages in thread From: Tetsuo Handa @ 2015-02-10 13:58 UTC (permalink / raw) To: hannes, mhocko Cc: david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, torvalds (Michal is offline, asking Johannes instead.) Tetsuo Handa wrote: > (A) The order-0 __GFP_WAIT allocation fails immediately upon OOM condition > despite we didn't remove the > > /* > * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER > * means __GFP_NOFAIL, but that may not be true in other > * implementations. > */ > if (order <= PAGE_ALLOC_COSTLY_ORDER) > return 1; > > check in should_alloc_retry(). Is this what you expected? This behavior is caused by commit 9879de7373fcfb46 "mm: page_alloc: embed OOM killing naturally into allocation slowpath". Did you apply that commit with agreement to let GFP_NOIO / GFP_NOFS allocations fail upon memory pressure and permit filesystems to take fs error actions? /* The OOM killer does not compensate for light reclaim */ if (!(gfp_mask & __GFP_FS)) goto out;
* Re: How to handle TIF_MEMDIE stalls? 2015-02-10 13:58 ` Tetsuo Handa @ 2015-02-10 15:19 ` Johannes Weiner 2015-02-11 2:23 ` Tetsuo Handa 2015-02-17 14:50 ` Michal Hocko 0 siblings, 2 replies; 276+ messages in thread From: Johannes Weiner @ 2015-02-10 15:19 UTC (permalink / raw) To: Tetsuo Handa Cc: mhocko, david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, torvalds On Tue, Feb 10, 2015 at 10:58:46PM +0900, Tetsuo Handa wrote: > (Michal is offline, asking Johannes instead.) > > Tetsuo Handa wrote: > > (A) The order-0 __GFP_WAIT allocation fails immediately upon OOM condition > > despite we didn't remove the > > > > /* > > * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER > > * means __GFP_NOFAIL, but that may not be true in other > > * implementations. > > */ > > if (order <= PAGE_ALLOC_COSTLY_ORDER) > > return 1; > > > > check in should_alloc_retry(). Is this what you expected? > > This behavior is caused by commit 9879de7373fcfb46 "mm: page_alloc: > embed OOM killing naturally into allocation slowpath". Did you apply > that commit with agreement to let GFP_NOIO / GFP_NOFS allocations fail > upon memory pressure and permit filesystems to take fs error actions? > > /* The OOM killer does not compensate for light reclaim */ > if (!(gfp_mask & __GFP_FS)) > goto out; The model behind the refactored code is to continue retrying the allocation as long as the allocator has the ability to free memory, i.e. if page reclaim makes progress, or the OOM killer can be used. That being said, I missed that GFP_NOFS were able to loop endlessly even without page reclaim making progress or the OOM killer working, and since it didn't fit the model I dropped it by accident. Is this a real workload you are having trouble with or an artificial stresstest? Because I'd certainly be willing to revert that part of the patch and make GFP_NOFS looping explicit if it helps you. 
But I do think the new behavior makes more sense, so I'd prefer to keep it if it's merely a stress test you use to test allocator performance.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8e20f9c2fa5a..f77c58ebbcfa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	if (high_zoneidx < ZONE_NORMAL)
 		goto out;
 	/* The OOM killer does not compensate for light reclaim */
-	if (!(gfp_mask & __GFP_FS))
+	if (!(gfp_mask & __GFP_FS)) {
+		/*
+		 * XXX: Page reclaim didn't yield anything,
+		 * and the OOM killer can't be invoked, but
+		 * keep looping as per should_alloc_retry().
+		 */
+		*did_some_progress = 1;
 		goto out;
+	}
 	/*
 	 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
 	 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
* Re: How to handle TIF_MEMDIE stalls? 2015-02-10 15:19 ` Johannes Weiner @ 2015-02-11 2:23 ` Tetsuo Handa 2015-02-11 13:37 ` Tetsuo Handa 2015-02-17 12:23 ` Tetsuo Handa 2015-02-17 14:50 ` Michal Hocko 1 sibling, 2 replies; 276+ messages in thread From: Tetsuo Handa @ 2015-02-11 2:23 UTC (permalink / raw) To: hannes Cc: mhocko, david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, torvalds Johannes Weiner wrote: > On Tue, Feb 10, 2015 at 10:58:46PM +0900, Tetsuo Handa wrote: > > (Michal is offline, asking Johannes instead.) > > > > Tetsuo Handa wrote: > > > (A) The order-0 __GFP_WAIT allocation fails immediately upon OOM condition > > > despite we didn't remove the > > > > > > /* > > > * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER > > > * means __GFP_NOFAIL, but that may not be true in other > > > * implementations. > > > */ > > > if (order <= PAGE_ALLOC_COSTLY_ORDER) > > > return 1; > > > > > > check in should_alloc_retry(). Is this what you expected? > > > > This behavior is caused by commit 9879de7373fcfb46 "mm: page_alloc: > > embed OOM killing naturally into allocation slowpath". Did you apply > > that commit with agreement to let GFP_NOIO / GFP_NOFS allocations fail > > upon memory pressure and permit filesystems to take fs error actions? > > > > /* The OOM killer does not compensate for light reclaim */ > > if (!(gfp_mask & __GFP_FS)) > > goto out; > > The model behind the refactored code is to continue retrying the > allocation as long as the allocator has the ability to free memory, > i.e. if page reclaim makes progress, or the OOM killer can be used. > > That being said, I missed that GFP_NOFS were able to loop endlessly > even without page reclaim making progress or the OOM killer working, > and since it didn't fit the model I dropped it by accident. > > Is this a real workload you are having trouble with or an artificial > stresstest? 
Because I'd certainly be willing to revert that part of > the patch and make GFP_NOFS looping explicit if it helps you. But I > do think the new behavior makes more sense, so I'd prefer to keep it > if it's merely a stress test you use to test allocator performance. I'm working for troubleshooting RHEL systems. This is an artificial stresstest which I developed for trying to reproduce various low memory troubles occurred on customer's systems. > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 8e20f9c2fa5a..f77c58ebbcfa 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, > if (high_zoneidx < ZONE_NORMAL) > goto out; > /* The OOM killer does not compensate for light reclaim */ > - if (!(gfp_mask & __GFP_FS)) > + if (!(gfp_mask & __GFP_FS)) { > + /* > + * XXX: Page reclaim didn't yield anything, > + * and the OOM killer can't be invoked, but > + * keep looping as per should_alloc_retry(). > + */ > + *did_some_progress = 1; > goto out; > + } Why do you omit out_of_memory() call for GFP_NOIO / GFP_NOFS allocations? Thread2 doing GFP_FS / GFP_KERNEL allocation might be waiting for Thread1 doing GFP_NOIO / GFP_NOFS allocation to call out_of_memory() on behalf of Thread2, as mutexed by /* * Acquire the per-zone oom lock for each zone. If that * fails, somebody else is making progress for us. */ if (!oom_zonelist_trylock(zonelist, gfp_mask)) { *did_some_progress = 1; schedule_timeout_uninterruptible(1); return NULL; } lock. If Thread1 calls oom_zonelist_trylock() / oom_zonelist_unlock() without sleep while Thread2 calls oom_zonelist_trylock() / oom_zonelist_unlock() with sleep, Thread2 is unlikely able to call out_of_memory() because Thread2 likely fails at oom_zonelist_trylock(). > /* > * GFP_THISNODE contains __GFP_NORETRY and we never hit this. > * Sanity check for bare calls of __GFP_THISNODE, not real OOM. 
> Though, more serious behavior with this reproducer is (B) where the system stalls forever without kernel messages being saved to /var/log/messages . out_of_memory() does not select victims until the coredump to pipe can make progress whereas the coredump to pipe can't make progress until memory allocation succeeds or fails.
* Re: How to handle TIF_MEMDIE stalls? 2015-02-11 2:23 ` Tetsuo Handa @ 2015-02-11 13:37 ` Tetsuo Handa 2015-02-11 18:50 ` Oleg Nesterov 2015-02-17 12:23 ` Tetsuo Handa 1 sibling, 1 reply; 276+ messages in thread From: Tetsuo Handa @ 2015-02-11 13:37 UTC (permalink / raw) To: oleg, mhocko Cc: hannes, david, dchinner, linux-mm, rientjes, akpm, mgorman, torvalds (Asking Oleg this time.) Tetsuo Handa wrote: > Though, more serious behavior with this reproducer is (B) where the system > stalls forever without kernel messages being saved to /var/log/messages . > out_of_memory() does not select victims until the coredump to pipe can make > progress whereas the coredump to pipe can't make progress until memory > allocation succeeds or fails. This behavior is related to commit d003f371b2701635 ("oom: don't assume that a coredumping thread will exit soon"). That commit tried to take SIGNAL_GROUP_COREDUMP into account, but actually it is failing to do so. I tested with debug printk() and got the result shown below. 
---------- diff --git a/mm/oom_kill.c b/mm/oom_kill.c index d503e9c..1f684df 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -268,8 +268,12 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task, if (test_tsk_thread_flag(task, TIF_MEMDIE)) { if (unlikely(frozen(task))) __thaw_task(task); - if (!force_kill) + if (!force_kill) { + printk_ratelimited(KERN_INFO "OOM: Waiting for %s(%u) " + ": TIF_MEMDIE\n", task->comm, + task->pid); return OOM_SCAN_ABORT; + } } if (!task->mm) return OOM_SCAN_CONTINUE; @@ -281,8 +285,12 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task, if (oom_task_origin(task)) return OOM_SCAN_SELECT; - if (task_will_free_mem(task) && !force_kill) + if (task_will_free_mem(task) && !force_kill) { + printk_ratelimited(KERN_INFO "OOM: Waiting for %s(%u) " + ": will_free_mem\n", task->comm, + task->pid); return OOM_SCAN_ABORT; + } return OOM_SCAN_OK; } @@ -439,6 +447,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, * its children or threads, just set TIF_MEMDIE so it can die quickly */ if (task_will_free_mem(p)) { + printk(KERN_INFO "OOM: Waiting for %s(%u) : WILL_FREE_MEM\n", + p->comm, p->pid); set_tsk_thread_flag(p, TIF_MEMDIE); put_task_struct(p); return; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8e20f9c..4a2b19b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2381,9 +2381,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, /* The OOM killer does not needlessly kill tasks for lowmem */ if (high_zoneidx < ZONE_NORMAL) goto out; - /* The OOM killer does not compensate for light reclaim */ - if (!(gfp_mask & __GFP_FS)) - goto out; /* * GFP_THISNODE contains __GFP_NORETRY and we never hit this. * Sanity check for bare calls of __GFP_THISNODE, not real OOM. 
---------- ---------- [ 66.374198] a.out[9918]: segfault at 2591768 ip 000000000040091e sp 0000000002591770 error 6[ 66.374220] a.out[9919]: segfault at 2592778 ip 000000000040091e sp 0000000002592780 error 6 in a.out[400000+1000] [ 66.378705] in a.out[400000+1000] [ 67.997279] OOM: Waiting for a.out(9917) : will_free_mem (...snipped...) [ 90.952640] a.out D 0000000000000002 0 9916 7303 0x00000080 [ 90.954478] ffff88007a4ca240 0000000000012f80 ffff88007bcc7fd8 0000000000012f80 [ 90.956468] ffff88007a4ca240 ffff88007fffc000 ffffffff8111a945 0000000000000000 [ 90.958475] 0000000000000000 000088007bcc7908 ffff88007a4ca240 ffffffff81015df5 [ 90.960471] Call Trace: [ 90.961420] [<ffffffff8111a945>] ? shrink_zone+0x105/0x2a0 [ 90.962939] [<ffffffff81015df5>] ? read_tsc+0x5/0x10 [ 90.964364] [<ffffffff810c0270>] ? ktime_get+0x30/0x90 [ 90.965816] [<ffffffff810f73b9>] ? delayacct_end+0x39/0x70 [ 90.967322] [<ffffffff8111b0e5>] ? do_try_to_free_pages+0x3e5/0x480 [ 90.969115] [<ffffffff815f23f3>] ? schedule_timeout+0x113/0x1b0 [ 90.970796] [<ffffffff810b9800>] ? migrate_timer_list+0x60/0x60 [ 90.972380] [<ffffffff81110c9e>] ? __alloc_pages_nodemask+0x7ae/0xa60 [ 90.974090] [<ffffffff81151eb2>] ? alloc_pages_vma+0x92/0x1a0 [ 90.975643] [<ffffffff81134037>] ? handle_mm_fault+0xd37/0x10e0 [ 90.977212] [<ffffffff8105194e>] ? __do_page_fault+0x17e/0x540 [ 90.978753] [<ffffffff81092fac>] ? update_curr+0xac/0x100 [ 90.980228] [<ffffffff810946cb>] ? put_prev_entity+0x5b/0x2c0 [ 90.981763] [<ffffffff8108ef1d>] ? pick_next_entity+0x9d/0x170 [ 90.983305] [<ffffffff8109157e>] ? set_next_entity+0x4e/0x60 [ 90.984824] [<ffffffff81097953>] ? pick_next_task_fair+0x453/0x520 [ 90.986446] [<ffffffff8100c6e0>] ? __switch_to+0x240/0x570 [ 90.987943] [<ffffffff81051d40>] ? do_page_fault+0x30/0x70 [ 90.989453] [<ffffffff815f5138>] ? page_fault+0x28/0x30 [ 90.990987] [<ffffffff812ed0bc>] ? __clear_user+0x1c/0x40 [ 90.992481] [<ffffffff8112cb16>] ? 
iov_iter_zero+0x66/0x2d0 [ 90.993991] [<ffffffff813c09d7>] ? read_iter_zero+0x37/0xa0 [ 90.995515] [<ffffffff81173470>] ? new_sync_read+0x80/0xd0 [ 90.997027] [<ffffffff81174678>] ? vfs_read+0x78/0x130 [ 90.998492] [<ffffffff8117477d>] ? SyS_read+0x4d/0xc0 [ 90.999913] [<ffffffff815f3729>] ? system_call_fastpath+0x12/0x17 [ 91.001616] a.out D ffff88007fc52f80 0 9917 9916 0x00000080 [ 91.003485] ffff880020b10000 0000000000012f80 ffff8800786d7fd8 0000000000012f80 [ 91.005443] ffff880020b10000 000000000000000a 0000000000000400 0000000100000001 [ 91.007427] 0000000100000000 0000000000000000 0000000000000000 ffff8800786d7cc8 [ 91.009348] Call Trace: [ 91.010281] [<ffffffff8108d98d>] ? task_cputime+0x3d/0x80 [ 91.011759] [<ffffffff81066d8c>] ? do_exit+0x1dc/0xb40 [ 91.013176] [<ffffffff8106776a>] ? do_group_exit+0x3a/0x100 [ 91.014661] [<ffffffff810717fb>] ? get_signal+0x19b/0x770 [ 91.016128] [<ffffffff8100d451>] ? do_signal+0x31/0x6d0 [ 91.017551] [<ffffffff815ef532>] ? __schedule+0x272/0x760 [ 91.019007] [<ffffffff81087408>] ? check_preempt_curr+0x78/0xa0 [ 91.020569] [<ffffffff81089c98>] ? wake_up_new_task+0xf8/0x140 [ 91.022094] [<ffffffff81063bd8>] ? do_fork+0x138/0x340 [ 91.023526] [<ffffffff810ede7c>] ? __audit_syscall_entry+0xac/0xf0 [ 91.025171] [<ffffffff8100db5c>] ? do_notify_resume+0x6c/0x90 [ 91.026700] [<ffffffff815f39c7>] ? int_signal+0x12/0x17 [ 91.028136] a.out D ffff88007c6a66c0 0 9918 9917 0x00000084 [ 91.029945] ffff88007c6a66c0 0000000000012f80 ffff88007c6cbfd8 0000000000012f80 [ 91.031886] ffff88007c6a66c0 0000000000000003 ffff88007c6a759a 0000000000000046 [ 91.033830] 0000000000000046 ffff88007c6a6f50 ffffffff81089a55 ffff88007c6cbcc8 [ 91.035913] Call Trace: [ 91.036848] [<ffffffff81089a55>] ? try_to_wake_up+0x1b5/0x2b0 [ 91.038382] [<ffffffff8109c7ef>] ? __wake_up_common+0x4f/0x80 [ 91.039944] [<ffffffff81066d8c>] ? do_exit+0x1dc/0xb40 [ 91.041420] [<ffffffff8106776a>] ? do_group_exit+0x3a/0x100 [ 91.042931] [<ffffffff810717fb>] ? 
get_signal+0x19b/0x770 [ 91.044416] [<ffffffff8109157e>] ? set_next_entity+0x4e/0x60 [ 91.045941] [<ffffffff8100d451>] ? do_signal+0x31/0x6d0 [ 91.047376] [<ffffffff815ef532>] ? __schedule+0x272/0x760 [ 91.048836] [<ffffffff81067282>] ? do_exit+0x6d2/0xb40 [ 91.050251] [<ffffffff8100db5c>] ? do_notify_resume+0x6c/0x90 [ 91.051786] [<ffffffff815f4422>] ? retint_signal+0x48/0x86 [ 91.053256] a.out S ffff88007fcd2f80 0 9919 9917 0x00000080 [ 91.055081] ffff88007c6a6f50 0000000000012f80 ffff88007c04bfd8 0000000000012f80 [ 91.057026] ffff88007c6a6f50 ffff88007c6a6f50 000200d27fffc6c0 0000000000000001 [ 91.059006] ffff88007c6a6f50 ffff88007c6a6f50 0000014100000001 0000000000000000 [ 91.060952] Call Trace: [ 91.061893] [<ffffffff8112b1ee>] ? copy_from_iter+0x10e/0x2d0 [ 91.063456] [<ffffffff8112b1ee>] ? copy_from_iter+0x10e/0x2d0 [ 91.065025] [<ffffffff8117bcb7>] ? pipe_wait+0x67/0xb0 [ 91.066491] [<ffffffff8109ced0>] ? wait_woken+0x90/0x90 [ 91.068160] [<ffffffff8117bde8>] ? pipe_write+0x88/0x450 [ 91.069787] [<ffffffff81173543>] ? new_sync_write+0x83/0xd0 [ 91.071302] [<ffffffff811736b7>] ? __kernel_write+0x57/0x140 [ 91.072813] [<ffffffff811c63fe>] ? dump_emit+0x8e/0xd0 [ 91.074293] [<ffffffff811c02cf>] ? elf_core_dump+0x146f/0x15d0 [ 91.075848] [<ffffffff811c6ca9>] ? do_coredump+0x769/0xe80 [ 91.077308] [<ffffffff8101634d>] ? native_sched_clock+0x2d/0x80 [ 91.078861] [<ffffffff8106fd2b>] ? __send_signal+0x16b/0x3a0 [ 91.080384] [<ffffffff810717f2>] ? get_signal+0x192/0x770 [ 91.081831] [<ffffffff8100d451>] ? do_signal+0x31/0x6d0 [ 91.083234] [<ffffffff8100db5c>] ? do_notify_resume+0x6c/0x90 [ 91.084747] [<ffffffff815f4422>] ? 
retint_signal+0x48/0x86 [ 91.086210] a.out D ffff88007c6a0000 0 9920 9917 0x00000080 [ 91.088001] ffff88007c6a0000 0000000000012f80 ffff88007b7affd8 0000000000012f80 [ 91.089996] ffff88007c6a0000 ffffea0001df9780 ffffffff81a5ba00 0000000000000200 [ 91.091953] ffff880036d8c480 0000000000000000 0000000000000000 ffff88007b7afcc8 [ 91.093899] Call Trace: [ 91.094823] [<ffffffff8108d98d>] ? task_cputime+0x3d/0x80 [ 91.096310] [<ffffffff81066d8c>] ? do_exit+0x1dc/0xb40 [ 91.097785] [<ffffffff8106776a>] ? do_group_exit+0x3a/0x100 [ 91.099291] [<ffffffff810717fb>] ? get_signal+0x19b/0x770 [ 91.100773] [<ffffffff8100d451>] ? do_signal+0x31/0x6d0 [ 91.102311] [<ffffffff810faf95>] ? task_function_call+0x55/0x80 [ 91.103978] [<ffffffff81067282>] ? do_exit+0x6d2/0xb40 [ 91.105413] [<ffffffff810ede7c>] ? __audit_syscall_entry+0xac/0xf0 [ 91.107047] [<ffffffff8100db5c>] ? do_notify_resume+0x6c/0x90 [ 91.108568] [<ffffffff815f39c7>] ? int_signal+0x12/0x17 [ 91.110069] a.out D ffff88007c6a2240 0 9921 9917 0x00000080 [ 91.111869] ffff88007c6a2240 0000000000012f80 ffff88007b883fd8 0000000000012f80 [ 91.113795] ffff88007c6a2240 0000000000000001 ffffffff81a5ba00 0000000000000200 [ 91.115708] ffff880036d8d5a0 0000000000000000 0000000000000000 ffff88007b883cc8 [ 91.117627] Call Trace: [ 91.118546] [<ffffffff8108d98d>] ? task_cputime+0x3d/0x80 [ 91.120012] [<ffffffff81066d8c>] ? do_exit+0x1dc/0xb40 [ 91.121432] [<ffffffff8106776a>] ? do_group_exit+0x3a/0x100 [ 91.122928] [<ffffffff810717fb>] ? get_signal+0x19b/0x770 [ 91.124469] [<ffffffff8100d451>] ? do_signal+0x31/0x6d0 [ 91.125915] [<ffffffff810faf95>] ? task_function_call+0x55/0x80 [ 91.127487] [<ffffffff81067282>] ? do_exit+0x6d2/0xb40 [ 91.128906] [<ffffffff810ede7c>] ? __audit_syscall_entry+0xac/0xf0 [ 91.130518] [<ffffffff8100db5c>] ? do_notify_resume+0x6c/0x90 [ 91.132053] [<ffffffff815f39c7>] ? 
int_signal+0x12/0x17 [ 91.133505] a.out D ffff88007c6a3360 0 9922 9917 0x00000080 [ 91.135450] ffff88007c6a3360 0000000000012f80 ffff88007861bfd8 0000000000012f80 [ 91.137395] ffff88007c6a3360 0000000000000001 ffffffff81a5ba00 0000000000000200 [ 91.139332] ffff88007a4cbbf0 0000000000000000 0000000000000000 ffff88007861bcc8 [ 91.141356] Call Trace: [ 91.142290] [<ffffffff8108d98d>] ? task_cputime+0x3d/0x80 [ 91.143781] [<ffffffff81066d8c>] ? do_exit+0x1dc/0xb40 [ 91.145212] [<ffffffff8106776a>] ? do_group_exit+0x3a/0x100 [ 91.146724] [<ffffffff810717fb>] ? get_signal+0x19b/0x770 [ 91.148204] [<ffffffff8100d451>] ? do_signal+0x31/0x6d0 [ 91.149657] [<ffffffff810faf95>] ? task_function_call+0x55/0x80 [ 91.151242] [<ffffffff81067282>] ? do_exit+0x6d2/0xb40 [ 91.152682] [<ffffffff810ede7c>] ? __audit_syscall_entry+0xac/0xf0 [ 91.154309] [<ffffffff8100db5c>] ? do_notify_resume+0x6c/0x90 [ 91.155855] [<ffffffff815f39c7>] ? int_signal+0x12/0x17 [ 91.157334] a.out D ffff88007c6a0890 0 9923 9917 0x00000080 [ 91.159214] ffff88007c6a0890 0000000000012f80 ffff88007c62bfd8 0000000000012f80 [ 91.161219] ffff88007c6a0890 0000000000000400 ffffffff810969d2 0000000000000200 [ 91.163193] ffff88007f804a80 ffff88007fc12f80 0000000000000000 ffff88007c62bcc8 [ 91.165161] Call Trace: [ 91.166115] [<ffffffff810969d2>] ? load_balance+0x1d2/0x8a0 [ 91.167678] [<ffffffff8108d98d>] ? task_cputime+0x3d/0x80 [ 91.169293] [<ffffffff81066d8c>] ? do_exit+0x1dc/0xb40 [ 91.170755] [<ffffffff810163a5>] ? sched_clock+0x5/0x10 [ 91.172208] [<ffffffff8106776a>] ? do_group_exit+0x3a/0x100 [ 91.173798] [<ffffffff810717fb>] ? get_signal+0x19b/0x770 [ 91.175282] [<ffffffff8100d451>] ? do_signal+0x31/0x6d0 [ 91.176736] [<ffffffff81067282>] ? do_exit+0x6d2/0xb40 [ 91.178167] [<ffffffff810ede7c>] ? __audit_syscall_entry+0xac/0xf0 [ 91.179789] [<ffffffff8100db5c>] ? do_notify_resume+0x6c/0x90 [ 91.181319] [<ffffffff815f39c7>] ? 
int_signal+0x12/0x17 [ 91.182769] a.out D ffff88007c6a19b0 0 9924 9917 0x00000080 [ 91.184597] ffff88007c6a19b0 0000000000012f80 ffff88007bf27fd8 0000000000012f80 [ 91.186552] ffff88007c6a19b0 0000000000000001 ffffffff81a5ba00 0000000000000200 [ 91.188483] ffff880020b11120 0000000000000000 0000000000000000 ffff88007bf27cc8 [ 91.190517] Call Trace: [ 91.191462] [<ffffffff8108d98d>] ? task_cputime+0x3d/0x80 [ 91.192961] [<ffffffff81066d8c>] ? do_exit+0x1dc/0xb40 [ 91.194409] [<ffffffff8106776a>] ? do_group_exit+0x3a/0x100 [ 91.195926] [<ffffffff810717fb>] ? get_signal+0x19b/0x770 [ 91.197418] [<ffffffff8100d451>] ? do_signal+0x31/0x6d0 [ 91.198884] [<ffffffff810faf95>] ? task_function_call+0x55/0x80 [ 91.200504] [<ffffffff81067282>] ? do_exit+0x6d2/0xb40 [ 91.202034] [<ffffffff810ede7c>] ? __audit_syscall_entry+0xac/0xf0 [ 91.203757] [<ffffffff8100db5c>] ? do_notify_resume+0x6c/0x90 [ 91.205293] [<ffffffff815f39c7>] ? int_signal+0x12/0x17 [ 91.206774] a.out D ffff88007c6a2ad0 0 9925 9917 0x00000080 [ 91.208641] ffff88007c6a2ad0 0000000000012f80 ffff88007cb8bfd8 0000000000012f80 [ 91.210592] ffff88007c6a2ad0 0000000000000400 ffffffff810969d2 0000000000000200 [ 91.212538] ffff88007f804a80 ffff88007fc12f80 0000000000000000 ffff88007cb8bcc8 [ 91.214486] Call Trace: [ 91.215428] [<ffffffff810969d2>] ? load_balance+0x1d2/0x8a0 [ 91.216949] [<ffffffff8108d98d>] ? task_cputime+0x3d/0x80 [ 91.218437] [<ffffffff81066d8c>] ? do_exit+0x1dc/0xb40 [ 91.219861] [<ffffffff810163a5>] ? sched_clock+0x5/0x10 [ 91.221301] [<ffffffff8106776a>] ? do_group_exit+0x3a/0x100 [ 91.222833] [<ffffffff810717fb>] ? get_signal+0x19b/0x770 [ 91.224362] [<ffffffff8100d451>] ? do_signal+0x31/0x6d0 [ 91.225860] [<ffffffff810faf95>] ? task_function_call+0x55/0x80 [ 91.227442] [<ffffffff81067282>] ? do_exit+0x6d2/0xb40 [ 91.228891] [<ffffffff810ede7c>] ? __audit_syscall_entry+0xac/0xf0 [ 91.230543] [<ffffffff8100db5c>] ? do_notify_resume+0x6c/0x90 [ 91.232107] [<ffffffff815f39c7>] ? 
int_signal+0x12/0x17 [ 91.233565] a.out D ffff88007c6a4d10 0 9926 9917 0x00000080 [ 91.235432] ffff88007c6a4d10 0000000000012f80 ffff88007860bfd8 0000000000012f80 [ 91.237477] ffff88007c6a4d10 0000000000000001 ffffffff81a5ba00 0000000000000200 [ 91.239430] ffff880020b12240 0000000000000000 0000000000000000 ffff88007860bcc8 [ 91.241388] Call Trace: [ 91.242322] [<ffffffff8108d98d>] ? task_cputime+0x3d/0x80 [ 91.243815] [<ffffffff81066d8c>] ? do_exit+0x1dc/0xb40 [ 91.245241] [<ffffffff8106776a>] ? do_group_exit+0x3a/0x100 [ 91.246753] [<ffffffff810717fb>] ? get_signal+0x19b/0x770 [ 91.248232] [<ffffffff8100d451>] ? do_signal+0x31/0x6d0 [ 91.249687] [<ffffffff810faf95>] ? task_function_call+0x55/0x80 [ 91.251271] [<ffffffff81067282>] ? do_exit+0x6d2/0xb40 [ 91.252709] [<ffffffff810ede7c>] ? __audit_syscall_entry+0xac/0xf0 [ 91.254334] [<ffffffff8100db5c>] ? do_notify_resume+0x6c/0x90 [ 91.255910] [<ffffffff815f39c7>] ? int_signal+0x12/0x17 [ 91.257441] a.out D ffff88007fcd2f80 0 9927 9917 0x00000080 [ 91.259308] ffff88007c6a4480 0000000000012f80 ffff88007c67bfd8 0000000000012f80 [ 91.261306] ffff88007c6a4480 ffff88007c67bd40 ffff88007d119440 ffff88007c67bd18 [ 91.263283] 000000001fe3d887 ffff88007c67bd18 ffffffff811fa4f4 ffff88007c67bcc8 [ 91.265259] Call Trace: [ 91.266206] [<ffffffff811fa4f4>] ? xfs_bmap_search_multi_extents+0x94/0x130 [ 91.268011] [<ffffffff8108d98d>] ? task_cputime+0x3d/0x80 [ 91.269636] [<ffffffff81066d8c>] ? do_exit+0x1dc/0xb40 [ 91.271113] [<ffffffff8106776a>] ? do_group_exit+0x3a/0x100 [ 91.272636] [<ffffffff810717fb>] ? get_signal+0x19b/0x770 [ 91.274184] [<ffffffff8100d451>] ? do_signal+0x31/0x6d0 [ 91.275659] [<ffffffff810faf95>] ? task_function_call+0x55/0x80 [ 91.277250] [<ffffffff81067282>] ? do_exit+0x6d2/0xb40 [ 91.278699] [<ffffffff810ede7c>] ? __audit_syscall_entry+0xac/0xf0 [ 91.280342] [<ffffffff8100db5c>] ? do_notify_resume+0x6c/0x90 [ 91.281901] [<ffffffff815f39c7>] ? 
int_signal+0x12/0x17 [ 91.283368] abrt-hook-ccpp D 0000000000000002 0 9928 345 0x00000080 [ 91.285222] ffff880020b10890 0000000000012f80 ffff88007c68bfd8 0000000000012f80 [ 91.287200] ffff880020b10890 ffff88007fffc000 ffffffff8111a945 0000000000000000 [ 91.289187] 0000000000000000 000088007c68b9e8 ffff880020b10890 ffffffff81015df5 [ 91.291215] Call Trace: [ 91.292155] [<ffffffff8111a945>] ? shrink_zone+0x105/0x2a0 [ 91.293682] [<ffffffff81015df5>] ? read_tsc+0x5/0x10 [ 91.295117] [<ffffffff810c0270>] ? ktime_get+0x30/0x90 [ 91.296574] [<ffffffff810f73b9>] ? delayacct_end+0x39/0x70 [ 91.298096] [<ffffffff8111b0e5>] ? do_try_to_free_pages+0x3e5/0x480 [ 91.299768] [<ffffffff815f23f3>] ? schedule_timeout+0x113/0x1b0 [ 91.301384] [<ffffffff810b9800>] ? migrate_timer_list+0x60/0x60 [ 91.303092] [<ffffffff81110c9e>] ? __alloc_pages_nodemask+0x7ae/0xa60 [ 91.304858] [<ffffffff81150477>] ? alloc_pages_current+0x87/0x100 [ 91.306497] [<ffffffff8110a240>] ? filemap_fault+0x1c0/0x400 [ 91.308054] [<ffffffff8112ea66>] ? __do_fault+0x46/0xd0 [ 91.309531] [<ffffffff811313c8>] ? do_read_fault.isra.62+0x228/0x310 [ 91.311204] [<ffffffff81133aae>] ? handle_mm_fault+0x7ae/0x10e0 [ 91.312800] [<ffffffff81182762>] ? path_openat+0xa2/0x660 [ 91.314298] [<ffffffff8105194e>] ? __do_page_fault+0x17e/0x540 [ 91.315884] [<ffffffff81183c9e>] ? do_filp_open+0x3e/0xa0 [ 91.317367] [<ffffffff81051d40>] ? do_page_fault+0x30/0x70 [ 91.318879] [<ffffffff815f5138>] ? page_fault+0x28/0x30 (...snipped...) [ 93.038908] oom_scan_process_thread: 244092 callbacks suppressed [ 93.040655] OOM: Waiting for a.out(9917) : will_free_mem ---------- PID 9916 is the parent process doing read() from /dev/zero . PID 9917 is the child process waiting at pause(). PIDs from 9918 to 9927 are the child thread of PID 9917 sharing the MM. PID 9919 is the thread doing coredump to pipe and PID 9928 is the process doing read from pipe. 
Since will_free_mem() for PID 9917 is true, oom_scan_process_thread() does not choose a victim. PID 9917 is waiting for PID 9919 to complete the coredump. PID 9919 is waiting for PID 9928 to read from pipe. PID 9928 is waiting for PID 9917 to release memory.
----------
static void exit_mm(struct task_struct *tsk)
{
(...snipped...)
	if (core_state) {
		struct core_thread self;

		up_read(&mm->mmap_sem);

		self.task = tsk;
		self.next = xchg(&core_state->dumper.next, &self);
		/*
		 * Implies mb(), the result of xchg() must be visible
		 * to core_state->dumper.
		 */
		if (atomic_dec_and_test(&core_state->nr_threads))
			complete(&core_state->startup);

		for (;;) {
			set_task_state(tsk, TASK_UNINTERRUPTIBLE);
			if (!self.task)	/* see coredump_finish() */
				break;
			freezable_schedule(); /* <ffffffff81066d8c> is here. */
		}
		__set_task_state(tsk, TASK_RUNNING);
		down_read(&mm->mmap_sem);
	}
(...snipped...)
}
----------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-11 13:37 ` Tetsuo Handa @ 2015-02-11 18:50 ` Oleg Nesterov 2015-02-11 18:59 ` Oleg Nesterov 0 siblings, 1 reply; 276+ messages in thread From: Oleg Nesterov @ 2015-02-11 18:50 UTC (permalink / raw) To: Tetsuo Handa Cc: mhocko, hannes, david, dchinner, linux-mm, rientjes, akpm, mgorman, torvalds On 02/11, Tetsuo Handa wrote: > > (Asking Oleg this time.) Well, sorry, I ignored the previous discussion, not sure I understand you correctly. > > Though, more serious behavior with this reproducer is (B) where the system > > stalls forever without kernel messages being saved to /var/log/messages . > > out_of_memory() does not select victims until the coredump to pipe can make > > progress whereas the coredump to pipe can't make progress until memory > > allocation succeeds or fails. > > This behavior is related to commit d003f371b2701635 ("oom: don't assume > that a coredumping thread will exit soon"). That commit tried to take > SIGNAL_GROUP_COREDUMP into account, but actually it is failing to do so. Heh. Please see the changelog. This "fix" is obviously very limited, it does not even try to solve all problems (even with coredump in particular). Note also that SIGNAL_GROUP_COREDUMP is not even set if the process (not a sub-thread) shares the memory with the coredumping task. It would be better to check mm->core_state != NULL instead, but this needs the locking. Plus that process likely sleeps in D state in exit_mm(), so this can't help. And that is why we set SIGNAL_GROUP_COREDUMP in zap_threads(), not in zap_process(). We probably want to make that "wait for coredump_finish()" sleep in exit_mm() killable, but this is not simple. Sorry for noise if the above is not relevant. Oleg. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
* Re: How to handle TIF_MEMDIE stalls? 2015-02-11 18:50 ` Oleg Nesterov @ 2015-02-11 18:59 ` Oleg Nesterov 2015-03-14 13:03 ` Tetsuo Handa 0 siblings, 1 reply; 276+ messages in thread From: Oleg Nesterov @ 2015-02-11 18:59 UTC (permalink / raw) To: Tetsuo Handa Cc: mhocko, hannes, david, dchinner, linux-mm, rientjes, akpm, mgorman, torvalds On 02/11, Oleg Nesterov wrote: > > On 02/11, Tetsuo Handa wrote: > > > > (Asking Oleg this time.) > > Well, sorry, I ignored the previous discussion, not sure I understand you > correctly. > > > > Though, more serious behavior with this reproducer is (B) where the system > > > stalls forever without kernel messages being saved to /var/log/messages . > > > out_of_memory() does not select victims until the coredump to pipe can make > > > progress whereas the coredump to pipe can't make progress until memory > > > allocation succeeds or fails. > > > > This behavior is related to commit d003f371b2701635 ("oom: don't assume > > that a coredumping thread will exit soon"). That commit tried to take > > SIGNAL_GROUP_COREDUMP into account, but actually it is failing to do so. > > Heh. Please see the changelog. This "fix" is obviously very limited, it does > not even try to solve all problems (even with coredump in particular). > > Note also that SIGNAL_GROUP_COREDUMP is not even set if the process (not a > sub-thread) shares the memory with the coredumping task. It would be better > to check mm->core_state != NULL instead, but this needs the locking. Plus > that process likely sleeps in D state in exit_mm(), so this can't help. > > And that is why we set SIGNAL_GROUP_COREDUMP in zap_threads(), not in > zap_process(). We probably want to make that "wait for coredump_finish()" > sleep in exit_mm() killable, but this is not simple. on a cecond thought, perhaps it makes sense to set SIGNAL_GROUP_COREDUMP anyway, even if a CLONE_VM process participating in coredump is not killable. I'll recheck tomorrow. 
> Sorry for noise if the above is not relevant.
>
> Oleg.
* Re: How to handle TIF_MEMDIE stalls? 2015-02-11 18:59 ` Oleg Nesterov @ 2015-03-14 13:03 ` Tetsuo Handa 0 siblings, 0 replies; 276+ messages in thread From: Tetsuo Handa @ 2015-03-14 13:03 UTC (permalink / raw) To: oleg Cc: mhocko, hannes, david, dchinner, linux-mm, rientjes, akpm, mgorman, torvalds Oleg Nesterov wrote: > On 02/11, Oleg Nesterov wrote: > > > > On 02/11, Tetsuo Handa wrote: > > > > > > (Asking Oleg this time.) > > > > Well, sorry, I ignored the previous discussion, not sure I understand you > > correctly. > > > > > > Though, more serious behavior with this reproducer is (B) where the system > > > > stalls forever without kernel messages being saved to /var/log/messages . > > > > out_of_memory() does not select victims until the coredump to pipe can make > > > > progress whereas the coredump to pipe can't make progress until memory > > > > allocation succeeds or fails. > > > > > > This behavior is related to commit d003f371b2701635 ("oom: don't assume > > > that a coredumping thread will exit soon"). That commit tried to take > > > SIGNAL_GROUP_COREDUMP into account, but actually it is failing to do so. > > > > Heh. Please see the changelog. This "fix" is obviously very limited, it does > > not even try to solve all problems (even with coredump in particular). > > > > Note also that SIGNAL_GROUP_COREDUMP is not even set if the process (not a > > sub-thread) shares the memory with the coredumping task. It would be better > > to check mm->core_state != NULL instead, but this needs the locking. Plus > > that process likely sleeps in D state in exit_mm(), so this can't help. > > > > And that is why we set SIGNAL_GROUP_COREDUMP in zap_threads(), not in > > zap_process(). We probably want to make that "wait for coredump_finish()" > > sleep in exit_mm() killable, but this is not simple. > > on a cecond thought, perhaps it makes sense to set SIGNAL_GROUP_COREDUMP > anyway, even if a CLONE_VM process participating in coredump is not killable. 
> I'll recheck tomorrow. Ping? > > > Sorry for noise if the above is not relevant. > > > > Oleg. > > I tried https://lkml.org/lkml/2015/3/11/707 with retry_allocation_attempts == 1 (with http://marc.info/?l=linux-mm&m=141671829611143&w=2 for debug printk() ). Although 0x2015a (which is !__GFP_FS) allocation likely fails within a few jiffies under TIF_MEMDIE condition, TIF_MEMDIE condition itself cannot be solved until SIGNAL_GROUP_COREDUMP patch is proposed. ---------- XFS: possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250) XFS: possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250) warn_alloc_failed: 212565 callbacks suppressed crond: page allocation failure: order:0, mode:0x2015a rngd: page allocation failure: order:0, mode:0x2015a CPU: 3 PID: 1667 Comm: rngd Not tainted 4.0.0-rc3+ #37 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013 0000000000000000 00000000ce4cec53 0000000000000000 ffffffff815f30c4 000000000002015a ffffffff8111063e ffff88007fffdb00 0000000000000000 0000000000000040 ffff88007c223db0 0000000000000000 00000000ce4cec53 Call Trace: [<ffffffff815f30c4>] ? dump_stack+0x40/0x50 [<ffffffff8111063e>] ? warn_alloc_failed+0xee/0x150 [<ffffffff81113b03>] ? __alloc_pages_nodemask+0x623/0xa10 [<ffffffff81150c57>] ? alloc_pages_current+0x87/0x100 [<ffffffff8110d30d>] ? filemap_fault+0x1bd/0x400 [<ffffffff812e3dbc>] ? radix_tree_next_chunk+0x5c/0x240 [<ffffffff8112f85b>] ? __do_fault+0x4b/0xe0 [<ffffffff81134465>] ? handle_mm_fault+0xc85/0x1640 [<ffffffff81051c9a>] ? __do_page_fault+0x16a/0x430 [<ffffffff81051f90>] ? do_page_fault+0x30/0x70 [<ffffffff815fb03f>] ? error_exit+0x1f/0x60 [<ffffffff815fae18>] ? page_fault+0x28/0x30 ---------- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
* Re: How to handle TIF_MEMDIE stalls? 2015-02-11 2:23 ` Tetsuo Handa 2015-02-11 13:37 ` Tetsuo Handa @ 2015-02-17 12:23 ` Tetsuo Handa 2015-02-17 12:53 ` Johannes Weiner 2015-02-17 14:59 ` Michal Hocko 1 sibling, 2 replies; 276+ messages in thread From: Tetsuo Handa @ 2015-02-17 12:23 UTC (permalink / raw) To: hannes Cc: mhocko, david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, torvalds Tetsuo Handa wrote: > Johannes Weiner wrote: > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index 8e20f9c2fa5a..f77c58ebbcfa 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, > > if (high_zoneidx < ZONE_NORMAL) > > goto out; > > /* The OOM killer does not compensate for light reclaim */ > > - if (!(gfp_mask & __GFP_FS)) > > + if (!(gfp_mask & __GFP_FS)) { > > + /* > > + * XXX: Page reclaim didn't yield anything, > > + * and the OOM killer can't be invoked, but > > + * keep looping as per should_alloc_retry(). > > + */ > > + *did_some_progress = 1; > > goto out; > > + } > > Why do you omit out_of_memory() call for GFP_NOIO / GFP_NOFS allocations? I can see "possible memory allocation deadlock in %s (mode:0x%x)" warnings at kmem_alloc() in fs/xfs/kmem.c . I think commit 9879de7373fcfb46 "mm: page_alloc: embed OOM killing naturally into allocation slowpath" introduced a regression and below one is the fix. --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2381,9 +2381,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, /* The OOM killer does not needlessly kill tasks for lowmem */ if (high_zoneidx < ZONE_NORMAL) goto out; - /* The OOM killer does not compensate for light reclaim */ - if (!(gfp_mask & __GFP_FS)) - goto out; /* * GFP_THISNODE contains __GFP_NORETRY and we never hit this. * Sanity check for bare calls of __GFP_THISNODE, not real OOM. 
BTW, I think commit c32b3cbe0d067a9c "oom, PM: make OOM detection in the freezer path raceless" opened a race window for __alloc_pages_may_oom(__GFP_NOFAIL) allocation to fail when OOM killer is disabled. I think something like --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -789,7 +789,7 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, bool ret = false; down_read(&oom_sem); - if (!oom_killer_disabled) { + if (!oom_killer_disabled || (gfp_mask & __GFP_NOFAIL)) { __out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill); ret = true; } is needed. But such change can race with up_write() and wait_event() in oom_killer_disable(). While the comment of oom_killer_disable() says "The function cannot be called when there are runnable user tasks because the userspace would see unexpected allocation failures as a result.", aren't there still kernel threads which might do __GFP_NOFAIL allocations? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-17 12:23 ` Tetsuo Handa @ 2015-02-17 12:53 ` Johannes Weiner 2015-02-17 15:38 ` Michal Hocko 2015-02-17 22:54 ` Dave Chinner 2015-02-17 14:59 ` Michal Hocko 1 sibling, 2 replies; 276+ messages in thread From: Johannes Weiner @ 2015-02-17 12:53 UTC (permalink / raw) To: Tetsuo Handa Cc: mhocko, david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, torvalds On Tue, Feb 17, 2015 at 09:23:26PM +0900, Tetsuo Handa wrote: > Tetsuo Handa wrote: > > Johannes Weiner wrote: > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > > index 8e20f9c2fa5a..f77c58ebbcfa 100644 > > > --- a/mm/page_alloc.c > > > +++ b/mm/page_alloc.c > > > @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, > > > if (high_zoneidx < ZONE_NORMAL) > > > goto out; > > > /* The OOM killer does not compensate for light reclaim */ > > > - if (!(gfp_mask & __GFP_FS)) > > > + if (!(gfp_mask & __GFP_FS)) { > > > + /* > > > + * XXX: Page reclaim didn't yield anything, > > > + * and the OOM killer can't be invoked, but > > > + * keep looping as per should_alloc_retry(). > > > + */ > > > + *did_some_progress = 1; > > > goto out; > > > + } > > > > Why do you omit out_of_memory() call for GFP_NOIO / GFP_NOFS allocations? > > I can see "possible memory allocation deadlock in %s (mode:0x%x)" warnings > at kmem_alloc() in fs/xfs/kmem.c . I think commit 9879de7373fcfb46 "mm: > page_alloc: embed OOM killing naturally into allocation slowpath" introduced > a regression and below one is the fix. > > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -2381,9 +2381,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, > /* The OOM killer does not needlessly kill tasks for lowmem */ > if (high_zoneidx < ZONE_NORMAL) > goto out; > - /* The OOM killer does not compensate for light reclaim */ > - if (!(gfp_mask & __GFP_FS)) > - goto out; > /* > * GFP_THISNODE contains __GFP_NORETRY and we never hit this. 
> * Sanity check for bare calls of __GFP_THISNODE, not real OOM. Again, we don't want to OOM kill on behalf of allocations that can't initiate IO, or even actively prevent others from doing it. Not per default anyway, because most callers can deal with the failure without having to resort to killing tasks, and NOFS reclaim *can* easily fail. It's the exceptions that should be annotated instead: void * kmem_alloc(size_t size, xfs_km_flags_t flags) { int retries = 0; gfp_t lflags = kmem_flags_convert(flags); void *ptr; do { ptr = kmalloc(size, lflags); if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP))) return ptr; if (!(++retries % 100)) xfs_err(NULL, "possible memory allocation deadlock in %s (mode:0x%x)", __func__, lflags); congestion_wait(BLK_RW_ASYNC, HZ/50); } while (1); } This should use __GFP_NOFAIL, which is not only designed to annotate broken code like this, but also recognizes that endless looping on a GFP_NOFS allocation needs the OOM killer after all to make progress. diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c index a7a3a63bb360..17ced1805d3a 100644 --- a/fs/xfs/kmem.c +++ b/fs/xfs/kmem.c @@ -45,20 +45,12 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize) void * kmem_alloc(size_t size, xfs_km_flags_t flags) { - int retries = 0; gfp_t lflags = kmem_flags_convert(flags); - void *ptr; - do { - ptr = kmalloc(size, lflags); - if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP))) - return ptr; - if (!(++retries % 100)) - xfs_err(NULL, - "possible memory allocation deadlock in %s (mode:0x%x)", - __func__, lflags); - congestion_wait(BLK_RW_ASYNC, HZ/50); - } while (1); + if (!(flags & (KM_MAYFAIL | KM_NOSLEEP))) + lflags |= __GFP_NOFAIL; + + return kmalloc(size, lflags); } void * -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
* Re: How to handle TIF_MEMDIE stalls? 2015-02-17 12:53 ` Johannes Weiner @ 2015-02-17 15:38 ` Michal Hocko 2015-02-17 22:54 ` Dave Chinner 1 sibling, 0 replies; 276+ messages in thread From: Michal Hocko @ 2015-02-17 15:38 UTC (permalink / raw) To: Johannes Weiner Cc: Tetsuo Handa, david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, torvalds On Tue 17-02-15 07:53:15, Johannes Weiner wrote: [...] > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c > index a7a3a63bb360..17ced1805d3a 100644 > --- a/fs/xfs/kmem.c > +++ b/fs/xfs/kmem.c > @@ -45,20 +45,12 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize) > void * > kmem_alloc(size_t size, xfs_km_flags_t flags) > { > - int retries = 0; > gfp_t lflags = kmem_flags_convert(flags); > - void *ptr; > > - do { > - ptr = kmalloc(size, lflags); > - if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP))) > - return ptr; > - if (!(++retries % 100)) > - xfs_err(NULL, > - "possible memory allocation deadlock in %s (mode:0x%x)", > - __func__, lflags); > - congestion_wait(BLK_RW_ASYNC, HZ/50); > - } while (1); > + if (!(flags & (KM_MAYFAIL | KM_NOSLEEP))) > + lflags |= __GFP_NOFAIL; > + > + return kmalloc(size, lflags); > } > > void * Yes, I think this is the right thing to do (care to send a patch with the full changelog?). We really want to have __GFP_NOFAIL explicit. If for nothing else I hope we can get lockdep checks for this flag. I am hopelessly unfamiliar with lockdep but even warning from __lockdep_trace_alloc for this flag and any lock held in the current's context might be helpful to identify those places and try to fix them. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-17 12:53 ` Johannes Weiner @ 2015-02-17 22:54 ` Dave Chinner 2015-02-17 22:54 ` Dave Chinner 1 sibling, 0 replies; 276+ messages in thread From: Dave Chinner @ 2015-02-17 22:54 UTC (permalink / raw) To: Johannes Weiner Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds [ cc xfs list - experienced kernel devs should not have to be reminded to do this ] On Tue, Feb 17, 2015 at 07:53:15AM -0500, Johannes Weiner wrote: > On Tue, Feb 17, 2015 at 09:23:26PM +0900, Tetsuo Handa wrote: > > Tetsuo Handa wrote: > > > Johannes Weiner wrote: > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > > > index 8e20f9c2fa5a..f77c58ebbcfa 100644 > > > > --- a/mm/page_alloc.c > > > > +++ b/mm/page_alloc.c > > > > @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, > > > > if (high_zoneidx < ZONE_NORMAL) > > > > goto out; > > > > /* The OOM killer does not compensate for light reclaim */ > > > > - if (!(gfp_mask & __GFP_FS)) > > > > + if (!(gfp_mask & __GFP_FS)) { > > > > + /* > > > > + * XXX: Page reclaim didn't yield anything, > > > > + * and the OOM killer can't be invoked, but > > > > + * keep looping as per should_alloc_retry(). > > > > + */ > > > > + *did_some_progress = 1; > > > > goto out; > > > > + } > > > > > > Why do you omit out_of_memory() call for GFP_NOIO / GFP_NOFS allocations? > > > > I can see "possible memory allocation deadlock in %s (mode:0x%x)" warnings > > at kmem_alloc() in fs/xfs/kmem.c . I think commit 9879de7373fcfb46 "mm: > > page_alloc: embed OOM killing naturally into allocation slowpath" introduced > > a regression and below one is the fix. 
> > > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -2381,9 +2381,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, > > /* The OOM killer does not needlessly kill tasks for lowmem */ > > if (high_zoneidx < ZONE_NORMAL) > > goto out; > > - /* The OOM killer does not compensate for light reclaim */ > > - if (!(gfp_mask & __GFP_FS)) > > - goto out; > > /* > > * GFP_THISNODE contains __GFP_NORETRY and we never hit this. > > * Sanity check for bare calls of __GFP_THISNODE, not real OOM. > > Again, we don't want to OOM kill on behalf of allocations that can't > initiate IO, or even actively prevent others from doing it. Not per > default anyway, because most callers can deal with the failure without > having to resort to killing tasks, and NOFS reclaim *can* easily fail. > It's the exceptions that should be annotated instead: > > void * > kmem_alloc(size_t size, xfs_km_flags_t flags) > { > int retries = 0; > gfp_t lflags = kmem_flags_convert(flags); > void *ptr; > > do { > ptr = kmalloc(size, lflags); > if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP))) > return ptr; > if (!(++retries % 100)) > xfs_err(NULL, > "possible memory allocation deadlock in %s (mode:0x%x)", > __func__, lflags); > congestion_wait(BLK_RW_ASYNC, HZ/50); > } while (1); > } > > This should use __GFP_NOFAIL, which is not only designed to annotate > broken code like this, but also recognizes that endless looping on a > GFP_NOFS allocation needs the OOM killer after all to make progress. 
> > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c > index a7a3a63bb360..17ced1805d3a 100644 > --- a/fs/xfs/kmem.c > +++ b/fs/xfs/kmem.c > @@ -45,20 +45,12 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize) > void * > kmem_alloc(size_t size, xfs_km_flags_t flags) > { > - int retries = 0; > gfp_t lflags = kmem_flags_convert(flags); > - void *ptr; > > - do { > - ptr = kmalloc(size, lflags); > - if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP))) > - return ptr; > - if (!(++retries % 100)) > - xfs_err(NULL, > - "possible memory allocation deadlock in %s (mode:0x%x)", > - __func__, lflags); > - congestion_wait(BLK_RW_ASYNC, HZ/50); > - } while (1); > + if (!(flags & (KM_MAYFAIL | KM_NOSLEEP))) > + lflags |= __GFP_NOFAIL; > + > + return kmalloc(size, lflags); > } Hmmm - the only reason there is a focus on this loop is that it emits warnings about allocations failing. It's obvious that the problem being dealt with here is a fundamental design issue w.r.t. locking and the OOM killer, but the proposed special casing hack^H^H^H^Hband aid^W^Wsolution is not "working" because some code in XFS started emitting warnings about allocations failing more often. So the answer is to remove the warning? That's like killing the canary to stop the methane leak in the coal mine. No canary? No problems! Right now, the oom killer is a liability. Over the past 6 months I've slowly had to exclude filesystem regression tests from running on small memory machines because the OOM killer is now so unreliable that it kills the test harness regularly rather than the process generating memory pressure. That's a big red flag to me that all this hacking around the edges is not solving the underlying problem, but instead is breaking things that did once work. And, well, then there's this (gfp.h): * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller * cannot handle allocation failures. This modifier is deprecated and no new * users should be added.
So, is this another policy revelation from the mm developers about the kmalloc API? i.e. that __GFP_NOFAIL is no longer deprecated? Or just another symptom of frantic thrashing because nobody actually understands the problem or those that do are unwilling to throw out the broken crap and redesign it? If you are changing allocator behaviour and constraints, then you better damn well think those changes through fully, then document those changes, change all the relevant code to use the new API (not just those that throw warnings in your face) and make sure *everyone* knows about it. e.g. an LWN article explaining the changes and how memory allocation is going to work into the future would be a good start. Otherwise, this just looks like another knee-jerk band aid for an architectural problem that needs more than special case hacks to solve. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-17 22:54 ` Dave Chinner @ 2015-02-17 23:32 ` Dave Chinner -1 siblings, 0 replies; 276+ messages in thread From: Dave Chinner @ 2015-02-17 23:32 UTC (permalink / raw) To: Johannes Weiner Cc: Tetsuo Handa, rientjes, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, akpm, torvalds On Wed, Feb 18, 2015 at 09:54:30AM +1100, Dave Chinner wrote: > On Tue, Feb 17, 2015 at 07:53:15AM -0500, Johannes Weiner wrote: > > On Tue, Feb 17, 2015 at 09:23:26PM +0900, Tetsuo Handa wrote: > > > --- a/mm/page_alloc.c > > > +++ b/mm/page_alloc.c > > > @@ -2381,9 +2381,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, > > > /* The OOM killer does not needlessly kill tasks for lowmem */ > > > if (high_zoneidx < ZONE_NORMAL) > > > goto out; > > > - /* The OOM killer does not compensate for light reclaim */ > > > - if (!(gfp_mask & __GFP_FS)) > > > - goto out; > > > /* > > > * GFP_THISNODE contains __GFP_NORETRY and we never hit this. > > > * Sanity check for bare calls of __GFP_THISNODE, not real OOM. > > > > Again, we don't want to OOM kill on behalf of allocations that can't > > initiate IO, or even actively prevent others from doing it. Not per > > default anyway, because most callers can deal with the failure without > > having to resort to killing tasks, and NOFS reclaim *can* easily fail. 
> > It's the exceptions that should be annotated instead: > > > > void * > > kmem_alloc(size_t size, xfs_km_flags_t flags) > > { > > int retries = 0; > > gfp_t lflags = kmem_flags_convert(flags); > > void *ptr; > > > > do { > > ptr = kmalloc(size, lflags); > > if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP))) > > return ptr; > > if (!(++retries % 100)) > > xfs_err(NULL, > > "possible memory allocation deadlock in %s (mode:0x%x)", > > __func__, lflags); > > congestion_wait(BLK_RW_ASYNC, HZ/50); > > } while (1); > > } > > > > This should use __GFP_NOFAIL, which is not only designed to annotate > > broken code like this, but also recognizes that endless looping on a > > GFP_NOFS allocation needs the OOM killer after all to make progress. > > > > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c > > index a7a3a63bb360..17ced1805d3a 100644 > > --- a/fs/xfs/kmem.c > > +++ b/fs/xfs/kmem.c > > @@ -45,20 +45,12 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize) > > void * > > kmem_alloc(size_t size, xfs_km_flags_t flags) > > { > > - int retries = 0; > > gfp_t lflags = kmem_flags_convert(flags); > > - void *ptr; > > > > - do { > > - ptr = kmalloc(size, lflags); > > - if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP))) > > - return ptr; > > - if (!(++retries % 100)) > > - xfs_err(NULL, > > - "possible memory allocation deadlock in %s (mode:0x%x)", > > - __func__, lflags); > > - congestion_wait(BLK_RW_ASYNC, HZ/50); > > - } while (1); > > + if (!(flags & (KM_MAYFAIL | KM_NOSLEEP))) > > + lflags |= __GFP_NOFAIL; > > + > > + return kmalloc(size, lflags); > > } > > Hmmm - the only reason there is a focus on this loop is that it > emits warnings about allocations failing. It's obvious that the > problem being dealt with here is a fundamental design issue w.r.t. > to locking and the OOM killer, but the proposed special casing > hack^H^H^H^Hband aid^W^Wsolution is not "working" because some code > in XFS started emitting warnings about allocations failing more > often. 
> > So the answer is to remove the warning? That's like killing the > canary to stop the methane leak in the coal mine. No canary? No > problems! I'll also point out that there are two other identical allocation loops in XFS, one of which is only 30 lines below this one. That's further indication that this is a "silence the warning" patch rather than something that actually fixes a problem.... Cheers, Dave. -- Dave Chinner david@fromorbit.com
* Re: How to handle TIF_MEMDIE stalls? 2015-02-17 22:54 ` Dave Chinner @ 2015-02-18 8:25 ` Michal Hocko -1 siblings, 0 replies; 276+ messages in thread From: Michal Hocko @ 2015-02-18 8:25 UTC (permalink / raw) To: Dave Chinner Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm, mgorman, rientjes, akpm, torvalds On Wed 18-02-15 09:54:30, Dave Chinner wrote: > [ cc xfs list - experienced kernel devs should not have to be > reminded to do this ] > > On Tue, Feb 17, 2015 at 07:53:15AM -0500, Johannes Weiner wrote: [...] > > void * > > kmem_alloc(size_t size, xfs_km_flags_t flags) > > { > > int retries = 0; > > gfp_t lflags = kmem_flags_convert(flags); > > void *ptr; > > > > do { > > ptr = kmalloc(size, lflags); > > if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP))) > > return ptr; > > if (!(++retries % 100)) > > xfs_err(NULL, > > "possible memory allocation deadlock in %s (mode:0x%x)", > > __func__, lflags); > > congestion_wait(BLK_RW_ASYNC, HZ/50); > > } while (1); > > } > > > > This should use __GFP_NOFAIL, which is not only designed to annotate > > broken code like this, but also recognizes that endless looping on a > > GFP_NOFS allocation needs the OOM killer after all to make progress. 
> > > > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c > > index a7a3a63bb360..17ced1805d3a 100644 > > --- a/fs/xfs/kmem.c > > +++ b/fs/xfs/kmem.c > > @@ -45,20 +45,12 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize) > > void * > > kmem_alloc(size_t size, xfs_km_flags_t flags) > > { > > - int retries = 0; > > gfp_t lflags = kmem_flags_convert(flags); > > - void *ptr; > > > > - do { > > - ptr = kmalloc(size, lflags); > > - if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP))) > > - return ptr; > > - if (!(++retries % 100)) > > - xfs_err(NULL, > > - "possible memory allocation deadlock in %s (mode:0x%x)", > > - __func__, lflags); > > - congestion_wait(BLK_RW_ASYNC, HZ/50); > > - } while (1); > > + if (!(flags & (KM_MAYFAIL | KM_NOSLEEP))) > > + lflags |= __GFP_NOFAIL; > > + > > + return kmalloc(size, lflags); > > } > > Hmmm - the only reason there is a focus on this loop is that it > emits warnings about allocations failing. Such a warning should be part of the allocator, and the whole reason I like the patch is that we should really warn in a single place. I was thinking about a simple warning (e.g. like the above) and having something more sophisticated when lockdep is enabled. > It's obvious that the > problem being dealt with here is a fundamental design issue w.r.t. > to locking and the OOM killer, but the proposed special casing > hack^H^H^H^Hband aid^W^Wsolution is not "working" because some code > in XFS started emitting warnings about allocations failing more > often. > > So the answer is to remove the warning? That's like killing the > canary to stop the methane leak in the coal mine. No canary? No > problems! Not at all. I cannot speak for Johannes but I am pretty sure his motivation wasn't to simply silence the warning. The thing is that kernel code paths other than the page allocator shouldn't emulate behavior for which we have a gfp flag.
Over the past 6 months > I've slowly had to exclude filesystem regression tests from running > on small memory machines because the OOM killer is now so unreliable > that it kills the test harness regularly rather than the process > generating memory pressure. It would be great to get bug reports. > That's a big red flag to me that all > this hacking around the edges is not solving the underlying problem, > but instead is breaking things that did once work. I am heavily trying to discourage people from adding random hacks to the already complicated and subtle OOM code. > And, well, then there's this (gfp.h): > > * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller > * cannot handle allocation failures. This modifier is deprecated and no new > * users should be added. > > So, is this another policy relevation from the mm developers about > the kmalloc API? i.e. that __GFP_NOFAIL is no longer deprecated? It is deprecated and shouldn't be used. But that doesn't mean that users should work around this by developing their own alternative. I agree the wording could be clearer and mention that if the allocation failure is absolutely unacceptable then the flag can be used rather than looping around the allocator. What do you think about the following? diff --git a/include/linux/gfp.h b/include/linux/gfp.h index b840e3b2770d..ee6440ccb75d 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -57,8 +57,12 @@ struct vm_area_struct; * _might_ fail. This depends upon the particular VM implementation. * * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller - * cannot handle allocation failures. This modifier is deprecated and no new - * users should be added. + * cannot handle allocation failures. This modifier is deprecated for allocation + * with order > 1. Besides that this modifier is very dangerous when allocation + * happens under a lock because it creates a lock dependency invisible for the + * OOM killer so it can livelock.
If the allocation failure is _absolutely_ + * unacceptable then the flag has to be used rather than looping around + * the allocator. * * __GFP_NORETRY: The VM implementation must not retry indefinitely. * > Or just another symptom of frantic thrashing because nobody actually > understands the problem or those that do are unwilling to throw out > the broken crap and redesign it? > > If you are changing allocator behaviour and constraints, then you > better damn well think through that changes fully, then document > those changes, change all the relevant code to use the new API (not > just those that throw warnings in your face) and make sure > *everyone* knows about it. e.g. a LWN article explaining the changes > and how memory allocation is going to work into the future would be > a good start. Well, I think the first step is to change the users of the allocator to not lie about gfp flags. So if the code retries infinitely then it really should use the __GFP_NOFAIL flag. In the meantime the page allocator should develop a proper diagnostic to help identify all the potential dependencies. Next we should start thinking about whether all the existing __GFP_NOFAIL paths are really necessary or the code can be refactored/reimplemented to accept allocation failures. > Otherwise, this just looks like another knee-jerk band aid for an > architectural problem that needs more than special case hacks to > solve. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com -- Michal Hocko SUSE Labs
* Re: How to handle TIF_MEMDIE stalls? 2015-02-18 8:25 ` Michal Hocko @ 2015-02-18 10:48 ` Dave Chinner -1 siblings, 0 replies; 276+ messages in thread From: Dave Chinner @ 2015-02-18 10:48 UTC (permalink / raw) To: Michal Hocko Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm, mgorman, rientjes, akpm, torvalds On Wed, Feb 18, 2015 at 09:25:02AM +0100, Michal Hocko wrote: > On Wed 18-02-15 09:54:30, Dave Chinner wrote: > > [ cc xfs list - experienced kernel devs should not have to be > > reminded to do this ] > > > > On Tue, Feb 17, 2015 at 07:53:15AM -0500, Johannes Weiner wrote: > [...] > > > void * > > > kmem_alloc(size_t size, xfs_km_flags_t flags) > > > { > > > int retries = 0; > > > gfp_t lflags = kmem_flags_convert(flags); > > > void *ptr; > > > > > > do { > > > ptr = kmalloc(size, lflags); > > > if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP))) > > > return ptr; > > > if (!(++retries % 100)) > > > xfs_err(NULL, > > > "possible memory allocation deadlock in %s (mode:0x%x)", > > > __func__, lflags); > > > congestion_wait(BLK_RW_ASYNC, HZ/50); > > > } while (1); > > > } > > > > > > This should use __GFP_NOFAIL, which is not only designed to annotate > > > broken code like this, but also recognizes that endless looping on a > > > GFP_NOFS allocation needs the OOM killer after all to make progress. 
> > > > > > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c > > > index a7a3a63bb360..17ced1805d3a 100644 > > > --- a/fs/xfs/kmem.c > > > +++ b/fs/xfs/kmem.c > > > @@ -45,20 +45,12 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize) > > > void * > > > kmem_alloc(size_t size, xfs_km_flags_t flags) > > > { > > > - int retries = 0; > > > gfp_t lflags = kmem_flags_convert(flags); > > > - void *ptr; > > > > > > - do { > > > - ptr = kmalloc(size, lflags); > > > - if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP))) > > > - return ptr; > > > - if (!(++retries % 100)) > > > - xfs_err(NULL, > > > - "possible memory allocation deadlock in %s (mode:0x%x)", > > > - __func__, lflags); > > > - congestion_wait(BLK_RW_ASYNC, HZ/50); > > > - } while (1); > > > + if (!(flags & (KM_MAYFAIL | KM_NOSLEEP))) > > > + lflags |= __GFP_NOFAIL; > > > + > > > + return kmalloc(size, lflags); > > > } > > > > Hmmm - the only reason there is a focus on this loop is that it > > emits warnings about allocations failing. > > Such a warning should be part of the allocator and the whole point why > I like the patch is that we should really warn at a single place. I > was thinking about a simple warning (e.g. like the above) and having > something more sophisticated when lockdep is enabled. > > > It's obvious that the > > problem being dealt with here is a fundamental design issue w.r.t. > > to locking and the OOM killer, but the proposed special casing > > hack^H^H^H^Hband aid^W^Wsolution is not "working" because some code > > in XFS started emitting warnings about allocations failing more > > often. > > > > So the answer is to remove the warning? That's like killing the > > canary to stop the methane leak in the coal mine. No canary? No > > problems! > > Not at all. I cannot speak for Johannes but I am pretty sure his > motivation wasn't to simply silence the warning. The thing is that > kernel code paths other than the page allocator shouldn't emulate > behavior for which we have a gfp flag. 
> > > Right now, the oom killer is a liability. Over the past 6 months > > I've slowly had to exclude filesystem regression tests from running > > on small memory machines because the OOM killer is now so unreliable > > that it kills the test harness regularly rather than the process > > generating memory pressure. > > It would be great to get bug reports. I thought we were talking about a manifestation of the problems I've been seeing.... > > That's a big red flag to me that all > > this hacking around the edges is not solving the underlying problem, > > but instead is breaking things that did once work. > > I am heavily trying to discourage people from adding random hacks to > the already complicated and subtle OOM code. > > > And, well, then there's this (gfp.h): > > > > * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller > > * cannot handle allocation failures. This modifier is deprecated and no new > > * users should be added. > > > > So, is this another policy revelation from the mm developers about > > the kmalloc API? i.e. that __GFP_NOFAIL is no longer deprecated? > > It is deprecated and shouldn't be used. But that doesn't mean that users > should work around this by developing their own alternative. I'm kinda sick of hearing that, as if saying it enough times will make reality change. We have a *hard requirement* for memory allocation to make forwards progress, otherwise we *fail catastrophically*. History lesson - June 2004: http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=commitdiff;h=b30a2f7bf90593b12dbc912e4390b1b8ee133ea9 So, we're hardly working around the deprecation of GFP_NOFAIL when the code existed 5 years before GFP_NOFAIL was deprecated. Indeed, GFP_NOFAIL was shiny and new back then, having been introduced by Andrew Morton back in 2003. 
> I agree the > wording could be more clear and mention that if the allocation failure > is absolutely unacceptable then the flags can be used rather than > creating the loop around. What do you think about the following? > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index b840e3b2770d..ee6440ccb75d 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -57,8 +57,12 @@ struct vm_area_struct; > * _might_ fail. This depends upon the particular VM implementation. > * > * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller > - * cannot handle allocation failures. This modifier is deprecated and no new > - * users should be added. > + * cannot handle allocation failures. This modifier is deprecated for allocation > + * with order > 1. Besides that this modifier is very dangerous when allocation > + * happens under a lock because it creates a lock dependency invisible for the > + * OOM killer so it can livelock. If the allocation failure is _absolutely_ > + * unacceptable then the flags has to be used rather than looping around > + * allocator. Doesn't change anything from an XFS point of view. We do order >1 allocations through the kmem_alloc() wrapper, and so we are still doing something that is "not supported" even if we use GFP_NOFAIL rather than our own loop. Also, this reads as an excuse for the OOM killer being broken and not fixing it. Keep in mind that we tell the memory alloc/reclaim subsystem that *we hold locks* when we call into it. That's what GFP_NOFS originally meant, and it's what it still means today in an XFS context. If the OOM killer is not obeying GFP_NOFS and deadlocking on locks that the invoking context holds, then that is an OOM killer bug, not a bug in the subsystem calling kmalloc(GFP_NOFS). > * > * __GFP_NORETRY: The VM implementation must not retry indefinitely. 
> * > > > Or just another symptom of frantic thrashing because nobody actually > > understands the problem or those that do are unwilling to throw out > > the broken crap and redesign it? > > > > If you are changing allocator behaviour and constraints, then you > > better damn well think through that changes fully, then document > > those changes, change all the relevant code to use the new API (not > > just those that throw warnings in your face) and make sure > > *everyone* knows about it. e.g. a LWN article explaining the changes > > and how memory allocation is going to work into the future would be > > a good start. > > Well, I think the first step is to change the users of the allocator > to not lie about gfp flags. So if the code is infinitely trying then > it really should use GFP_NOFAIL flag. That's a complete non-issue when it comes to deciding whether it is safe to invoke the OOM killer or not! > In the meantime page allocator > should develop a proper diagnostic to help identify all the potential > dependencies. Next we should start thinking whether all the existing > GFP_NOFAIL paths are really necessary or the code can be > refactored/reimplemented to accept allocation failures. Last time the "just make filesystems handle memory allocation failures" argument came up, I pointed out what that meant for XFS: dirty transaction rollback is required. That's freakin' complex, will double the memory footprint of transactions, roughly double the CPU cost, and greatly increase the complexity of the transaction subsystem. It's a *major* rework of a significant amount of the XFS codebase and will take at least a couple of years to design, test and stabilise before it could be rolled out to production. I'm not about to spend a couple of years rewriting XFS just so the VM can get rid of a GFP_NOFAIL user. Especially as we already tell the Hammer of Last Resort the context in which it can work. Move the OOM killer to kswapd - get it out of the direct reclaim path altogether. 
If the system is that backed up on locks that it cannot free any memory and has no reserves to satisfy the allocation that kicked the OOM killer, then the OOM killer was not invoked soon enough. Hell, if you want a better way to proceed, then how about you allow us to tell the MM subsystem how much memory reserve a specific set of operations is going to require to complete? That's something that we can do rough calculations for, and it integrates straight into the existing transaction reservation system we already use for log space and disk space, and we can tell the mm subsystem when the reserve is no longer needed (i.e. last thing in transaction commit). That way we don't start a transaction until the mm subsystem has reserved enough pages for us to work with, and the reserve only needs to be used when normal allocation has already failed. i.e. rather than looping we get a page allocated from the reserve pool. The reservations wouldn't be perfect, but the majority of the time we'd be able to make progress and not need the OOM killer. And best of all, there's no responsibility on the MM subsystem for preventing OOM - getting the reservations right is the responsibility of the subsystem using them. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-18 10:48 ` Dave Chinner @ 2015-02-18 12:16 ` Michal Hocko -1 siblings, 0 replies; 276+ messages in thread From: Michal Hocko @ 2015-02-18 12:16 UTC (permalink / raw) To: Dave Chinner Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm, mgorman, rientjes, akpm, torvalds On Wed 18-02-15 21:48:59, Dave Chinner wrote: > On Wed, Feb 18, 2015 at 09:25:02AM +0100, Michal Hocko wrote: > > On Wed 18-02-15 09:54:30, Dave Chinner wrote: [...] > Also, this reads as an excuse for the OOM killer being broken and > not fixing it. Keep in mind that we tell the memory alloc/reclaim > subsystem that *we hold locks* when we call into it. That's what > GFP_NOFS originally meant, and it's what it still means today in an > XFS context. Sure, and OOM killer will not be invoked in NOFS context. See __alloc_pages_may_oom and the __GFP_FS check in there. So I do not see where the OOM killer is broken. The crucial problem we are dealing with is not GFP_NOFAIL triggering the OOM killer but a lock dependency introduced by the following sequence:

    taskA                         taskB        taskC
    lock(A)                                    alloc()
    alloc(gfp | __GFP_NOFAIL)     lock(A)        out_of_memory
      # looping for ever if we                     select_bad_process
      # cannot make any progress                     victim = taskB

There is no way the OOM killer can tell taskB is blocked and that there is a dependency between A and B (without lockdep). That is why I call NOFAIL under a lock dangerous and a bug. > If the OOM killer is not obeying GFP_NOFS and deadlocking on locks > that the invoking context holds, then that is an OOM killer bug, not > a bug in the subsystem calling kmalloc(GFP_NOFS). I guess we are talking about different things here or what am I missing? [...] > > In the meantime page allocator > > should develop a proper diagnostic to help identify all the potential > > dependencies. 
Next we should start thinking whether all the existing > > GFP_NOFAIL paths are really necessary or the code can be > > refactored/reimplemented to accept allocation failures. > > Last time the "just make filesystems handle memory allocation > failures" argument came up, I pointed out what that meant for XFS: dirty transaction > rollback is required. That's freakin' complex, will double the > memory footprint of transactions, roughly double the CPU cost, and > greatly increase the complexity of the transaction subsystem. It's a > *major* rework of a significant amount of the XFS codebase and will > take at least a couple of years to design, test and stabilise before > it could be rolled out to production. > > I'm not about to spend a couple of years rewriting XFS just so the > VM can get rid of a GFP_NOFAIL user. Especially as we already > tell the Hammer of Last Resort the context in which it can work. > > Move the OOM killer to kswapd - get it out of the direct reclaim > path altogether. This doesn't change anything as explained in another email. The triggering path doesn't wait for the victim to die. > If the system is that backed up on locks that it > cannot free any memory and has no reserves to satisfy the allocation > that kicked the OOM killer, then the OOM killer was not invoked soon > enough. > > Hell, if you want a better way to proceed, then how about you allow > us to tell the MM subsystem how much memory reserve a specific set > of operations is going to require to complete? That's something that > we can do rough calculations for, and it integrates straight into > the existing transaction reservation system we already use for log > space and disk space, and we can tell the mm subsystem when the > reserve is no longer needed (i.e. last thing in transaction commit). > > That way we don't start a transaction until the mm subsystem has > reserved enough pages for us to work with, and the reserve only > needs to be used when normal allocation has already failed. 
i.e > rather than looping we get a page allocated from the reserve pool. I am not sure I understand the above but aren't mempools a tool for this purpose? > The reservations wouldn't be perfect, but the majority of the time > we'd be able to make progress and not need the OOM killer. And best > of all, there's no responsibility on the MM subsystem for preventing > OOM - getting the reservations right is the responsibility of the > subsystem using them. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-18 12:16 ` Michal Hocko @ 2015-02-18 21:31 ` Dave Chinner -1 siblings, 0 replies; 276+ messages in thread From: Dave Chinner @ 2015-02-18 21:31 UTC (permalink / raw) To: Michal Hocko Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm, mgorman, rientjes, akpm, torvalds On Wed, Feb 18, 2015 at 01:16:02PM +0100, Michal Hocko wrote: > On Wed 18-02-15 21:48:59, Dave Chinner wrote: > > On Wed, Feb 18, 2015 at 09:25:02AM +0100, Michal Hocko wrote: > > > On Wed 18-02-15 09:54:30, Dave Chinner wrote: > [...] > > Also, this reads as an excuse for the OOM killer being broken and > > not fixing it. Keep in mind that we tell the memory alloc/reclaim > > subsystem that *we hold locks* when we call into it. That's what > > GFP_NOFS originally meant, and it's what it still means today in an > > XFS context. > > Sure, and OOM killer will not be invoked in NOFS context. See > __alloc_pages_may_oom and the __GFP_FS check in there. So I do not see > where the OOM killer is broken. I suspect that the page cache missing the correct GFP_NOFS was one of the sources of the problems I've been seeing. However, the oom killer exceptions are not checked if __GFP_NOFAIL is present and so if we start using __GFP_NOFAIL then it will be called in GFP_NOFS contexts... > The crucial problem we are dealing with is not GFP_NOFAIL triggering the > OOM killer but a lock dependency introduced by the following sequence: >
>     taskA                         taskB        taskC
>     lock(A)                                    alloc()
>     alloc(gfp | __GFP_NOFAIL)     lock(A)        out_of_memory
>       # looping for ever if we                     select_bad_process
>       # cannot make any progress                     victim = taskB
>
> There is no way the OOM killer can tell taskB is blocked and that there is > a dependency between A and B (without lockdep). That is why I call NOFAIL > under a lock dangerous and a bug. Sure. However, eventually the OOM killer will select taskA to be killed because nothing else is working. 
That, at least, marks taskA with TIF_MEMDIE and gives us a potential way to break the deadlock. But the bigger problem is this:

    taskA                         taskB
    lock(A)
    alloc(GFP_NOFS|GFP_NOFAIL)    lock(A)
      out_of_memory
      select_bad_process
        victim = taskB

There is no way to *ever* resolve that dependency, because taskA never leaves the allocator. Even if the oom killer selects taskA and sets TIF_MEMDIE on it, the allocator ignores TIF_MEMDIE because GFP_NOFAIL is set and continues to loop. This is why GFP_NOFAIL is not a solution to the "never fail" allocation problem. The caller doing the "no fail" allocation _must be able to set failure policy_. i.e. the choice of aborting and shutting down because progress cannot be made, or continuing and hoping for forwards progress, is owned by the allocating context, not the allocator. The memory allocation subsystem cannot make that choice for us as it has no concept of the failure characteristics of the allocating context. The situations in which this actually matters are extremely *rare* - we've had these allocation loops in XFS for > 13 years, and we might get one or two reports a year of these "possible allocation deadlock" messages occurring. Changing *everything* for such a rare, unusual event is not an efficient use of time or resources. > > If the OOM killer is not obeying GFP_NOFS and deadlocking on locks > > that the invoking context holds, then that is an OOM killer bug, not > > a bug in the subsystem calling kmalloc(GFP_NOFS). > > I guess we are talking about different things here or what am I missing? From my perspective, you are tightly focussed on one aspect of the problem and hence are not seeing the bigger picture: this is a corner case of behaviour in a "last hope", brute force memory reclaim technique that no production machine relies on for correct or performant operation. > [...] > > > In the meantime page allocator > > > should develop a proper diagnostic to help identify all the potential > > > dependencies. 
Next we should start thinking whether all the existing > > > GFP_NOFAIL paths are really necessary or the code can be > > > refactored/reimplemented to accept allocation failures. > > > > Last time the "just make filesystems handle memory allocation > > failures" I pointed out what that meant for XFS: dirty transaction > > rollback is required. That's freakin' complex, will double the > > memory footprint of transactions, roughly double the CPU cost, and > > greatly increase the complexity of the transaction subsystem. It's a > > *major* rework of a significant amount of the XFS codebase and will > > take at least a couple of years design, test and stabilise before > > it could be rolled out to production. > > > > I'm not about to spend a couple of years rewriting XFS just so the > > VM can get rid of a GFP_NOFAIL user. Especially as the we already > > tell the Hammer of Last Resort the context in which it can work. > > > > Move the OOM killer to kswapd - get it out of the direct reclaim > > path altogether. > > This doesn't change anything as explained in other email. The triggering > path doesn't wait for the victim to die. But it does - we wouldn't be talking about deadlocks if there were no blocking dependencies. In this case, allocation keeps retrying until the memory freed by the killed tasks enables it to make forward progress. That's a side effect of the last relevation that was made in this thread that low order allocations never fail... > > If the system is that backed up on locks that it > > cannot free any memory and has no reserves to satisfy the allocation > > that kicked the OOM killer, then the OOM killer was not invoked soon > > enough. > > > > Hell, if you want a better way to proceed, then how about you allow > > us to tell the MM subsystem how much memory reserve a specific set > > of operations is going to require to complete? 
That's something that > > we can do rough calculations for, and it integrates straight into > > the existing transaction reservation system we already use for log > > space and disk space, and we can tell the mm subsystem when the > > reserve is no longer needed (i.e. last thing in transaction commit). > > > > That way we don't start a transaction until the mm subsystem has > > reserved enough pages for us to work with, and the reserve only > > needs to be used when normal allocation has already failed. i.e > > rather than looping we get a page allocated from the reserve pool. > > I am not sure I understand the above but isn't the mempools a tool for > this purpose? I knew this question would be the next one - I even deleted a one line comment from my last email that said "And no, mempools are not a solution" because that needs a more thorough explanation than a dismissive one-liner. As you know, mempools require a forward progress guarantee on a single type of object and the objects must be slab based. In transaction context we allocate from inode slabs, xfs_buf slabs, log item slabs (6 different ones, IIRC), btree cursor slabs, etc, but then we also have direct page allocations for buffers, vm_map_ram() for mapping multi-page buffers, uncounted heap allocations, etc. We cannot make all of these mempools, nor can me meet the forwards progress requirements of a mempool because other allocations can block and prevent progress. Further, the object have lifetimes that don't correspond to the transaction life cycles, and hence even if we complete the transaction there is no guarantee that the objects allocated within a transaction are going to be returned to the mempool at it's completion. 
IOWs, we have need for forward allocation progress guarantees on (potentially) several megabytes of allocations from slab caches, the heap and the page allocator, with all allocations all in unpredictable order, with objects of different life times and life cycles, and at which may, at any time, get stuck behind objects locked in other transactions and hence can randomly block until some other thread makes forward progress and completes a transaction and unlocks the object. The reservation would only need to cover the memory we need to allocate and hold in the transaction (i.e. dirtied objects). There is potentially unbound amounts of memory required through demand paging of buffers to find the metadata we need to modify, but demand paged metadata that is read and then released is recoverable. i.e the shrinkers will free it as other memory demand requires, so it's not included in reservation pools because it doesn't deplete the amount of free memory. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls?
  2015-02-18 21:31 ` Dave Chinner
@ 2015-02-19 9:40 ` Michal Hocko
  -1 siblings, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2015-02-19 9:40 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm,
      mgorman, rientjes, akpm, torvalds

On Thu 19-02-15 08:31:18, Dave Chinner wrote:
> On Wed, Feb 18, 2015 at 01:16:02PM +0100, Michal Hocko wrote:
> > On Wed 18-02-15 21:48:59, Dave Chinner wrote:
> > > On Wed, Feb 18, 2015 at 09:25:02AM +0100, Michal Hocko wrote:
> > > > On Wed 18-02-15 09:54:30, Dave Chinner wrote:
> > [...]
> > > Also, this reads as an excuse for the OOM killer being broken and
> > > not fixing it. Keep in mind that we tell the memory alloc/reclaim
> > > subsystem that *we hold locks* when we call into it. That's what
> > > GFP_NOFS originally meant, and it's what it still means today in an
> > > XFS context.
> >
> > Sure, and OOM killer will not be invoked in NOFS context. See
> > __alloc_pages_may_oom and __GFP_FS check in there. So I do not see
> > where is the OOM killer broken.
>
> I suspect that the page cache missing the correct GFP_NOFS was one
> of the sources of the problems I've been seeing.
>
> However, the oom killer exceptions are not checked if __GFP_NOFAIL

Yes, this is true. This is an effect of 9879de7373fc (mm: page_alloc:
embed OOM killing naturally into allocation slowpath) and IMO a
desirable one. Requiring infinite retrying with a seriously restricted
reclaim context calls for trouble (e.g. a livelock with no way out
because regular reclaim cannot make any progress and the OOM killer as
the last resort will not happen).

> is present and so if we start using __GFP_NOFAIL then it will be
> called in GFP_NOFS contexts...
>
> > The crucial problem we are dealing with is not GFP_NOFAIL triggering
> > the OOM killer but a lock dependency introduced by the following
> > sequence:
> >
> >   taskA                       taskB       taskC
> >   lock(A)                     alloc()
> >   alloc(gfp | __GFP_NOFAIL)   lock(A)     out_of_memory
> >   # looping for ever if we                  select_bad_process
> >   # cannot make any progress                  victim = taskB
> >
> > There is no way OOM killer can tell taskB is blocked and that there is
> > dependency between A and B (without lockdep). That is why I call NOFAIL
> > under a lock as dangerous and a bug.
>
> Sure. However, eventually the OOM killer will select taskA to be
> killed because nothing else is working.

That would require the OOM killer to be able to select another victim
while the current one is still alive. There were time based heuristics
suggested to do this but I do not think they are the right way to
handle the problem and they should be considered only if all other
options fail. One potential way would be giving GFP_NOFAIL contexts
access to memory reserves when the allocation domain
(global/memcg/cpuset) is OOM. Andrea was suggesting something like
that IIRC.

> That, at least, marks taskA with TIF_MEMDIE and gives us a potential
> way to break the deadlock.
>
> But the bigger problem is this:
>
>   taskA                         taskB
>   lock(A)
>   alloc(GFP_NOFS|GFP_NOFAIL)    lock(A)
>   out_of_memory
>     select_bad_process
>       victim = taskB
>
> Because there is no way to *ever* resolve that dependency: taskA
> never leaves the allocator. Even if the oom killer selects taskA and
> sets TIF_MEMDIE on it, the allocator ignores TIF_MEMDIE because
> GFP_NOFAIL is set and continues to loop.

TIF_MEMDIE will at least give the task access to memory reserves.
Anyway, this is essentially the same category of livelock as above.

> This is why GFP_NOFAIL is not a solution to the "never fail"
> allocation problem. The caller doing the "no fail" allocation _must
> be able to set failure policy_. i.e. the choice of aborting and
> shutting down because progress cannot be made, or continuing and
> hoping for forward progress, is owned by the allocating context, not
> the allocator.

I completely agree that the failure policy is the caller's
responsibility and I would have no objections to something like:

	do {
		ptr = kmalloc(size, GFP_NOFS);
		if (ptr)
			return ptr;
		if (fatal_signal_pending(current))
			break;
		if (looping_too_long())
			break;
	} while (1);
	fallback_solution();

But this is not the case in kmem_alloc, which is essentially a
GFP_NOFAIL allocation with a warning and congestion_wait. There is no
failure policy defined there. The warning should be part of the
allocator and the NOFAIL policy should be explicit. So why exactly do
you oppose changing kmem_alloc (and others which are doing essentially
the same)?

> The memory allocation subsystem cannot make that
> choice for us as it has no concept of the failure characteristics of
> the allocating context.

Of course. I wasn't arguing we should change allocation loops which
have a fallback policy as well. That is an entirely different thing.
My point was that we want to turn GFP_NOFAIL equivalents into explicit
GFP_NOFAIL so that the allocator can prevent livelocks if possible.

> The situations in which this actually matters are extremely *rare* -
> we've had these allocation loops in XFS for > 13 years, and we might
> get one or two reports a year of these "possible allocation
> deadlock" messages occurring. Changing *everything* for such a rare,
> unusual event is not an efficient use of time or resources.
>
> > > If the OOM killer is not obeying GFP_NOFS and deadlocking on locks
> > > that the invoking context holds, then that is an OOM killer bug, not
> > > a bug in the subsystem calling kmalloc(GFP_NOFS).
> >
> > I guess we are talking about different things here or what am I
> > missing?
>
> From my perspective, you are tightly focussed on one aspect of the
> problem and hence are not seeing the bigger picture: this is a
> corner case of behaviour in a "last hope", brute force memory
> reclaim technique that no production machine relies on for correct
> or performant operation.

Of course this is a corner case. And I am trying to prevent heuristics
which would optimize for such a corner case (multiple of them were
suggested in this thread). The reason I care about GFP_NOFAIL is that
there are apparently code paths which do not tell the allocator they
are basically GFP_NOFAIL without any fallback. This leads to two main
problems: 1) we do not have a good overview of how many code paths
have such strong requirements and so cannot estimate e.g. how big
memory reserves should be, and 2) the allocator cannot help those
paths (e.g. by giving them access to reserves to break out of the
livelock).

> > [...]
> > > > In the meantime page allocator
> > > > should develop a proper diagnostic to help identify all the
> > > > potential dependencies. Next we should start thinking whether all
> > > > the existing GFP_NOFAIL paths are really necessary or the code
> > > > can be refactored/reimplemented to accept allocation failures.
> > >
> > > Last time the "just make filesystems handle memory allocation
> > > failures" discussion came up, I pointed out what that meant for
> > > XFS: dirty transaction rollback is required. That's freakin'
> > > complex, will double the memory footprint of transactions, roughly
> > > double the CPU cost, and greatly increase the complexity of the
> > > transaction subsystem. It's a *major* rework of a significant
> > > amount of the XFS codebase and will take at least a couple of years
> > > to design, test and stabilise before it could be rolled out to
> > > production.
> > >
> > > I'm not about to spend a couple of years rewriting XFS just so the
> > > VM can get rid of a GFP_NOFAIL user. Especially as we already
> > > tell the Hammer of Last Resort the context in which it can work.
> > >
> > > Move the OOM killer to kswapd - get it out of the direct reclaim
> > > path altogether.
> >
> > This doesn't change anything as explained in other email. The
> > triggering path doesn't wait for the victim to die.
>
> But it does - we wouldn't be talking about deadlocks if there were
> no blocking dependencies. In this case, allocation keeps retrying
> until the memory freed by the killed tasks enables it to make
> forward progress. That's a side effect of the last revelation that
> was made in this thread that low order allocations never fail...

Sure, low order allocations being almost GFP_NOFAIL makes things much
worse of course. And this should be changed. We just have to think
about how to do it without breaking the universe. I hope we can
discuss this at LSF. But even then I do not see how triggering the OOM
killer from kswapd would help here. Victims would be looping in the
allocator whether the actual killing happens from their own or any
other context.

> > > If the system is that backed up on locks that it
> > > cannot free any memory and has no reserves to satisfy the
> > > allocation that kicked the OOM killer, then the OOM killer was not
> > > invoked soon enough.
> > >
> > > Hell, if you want a better way to proceed, then how about you allow
> > > us to tell the MM subsystem how much memory reserve a specific set
> > > of operations is going to require to complete? That's something
> > > that we can do rough calculations for, and it integrates straight
> > > into the existing transaction reservation system we already use for
> > > log space and disk space, and we can tell the mm subsystem when the
> > > reserve is no longer needed (i.e. last thing in transaction
> > > commit).
> > >
> > > That way we don't start a transaction until the mm subsystem has
> > > reserved enough pages for us to work with, and the reserve only
> > > needs to be used when normal allocation has already failed. i.e.
> > > rather than looping we get a page allocated from the reserve pool.
> >
> > I am not sure I understand the above but aren't mempools a tool for
> > this purpose?
>
> I knew this question would be the next one - I even deleted a one
> line comment from my last email that said "And no, mempools are not
> a solution" because that needs a more thorough explanation than a
> dismissive one-liner.
>
> As you know, mempools require a forward progress guarantee on a
> single type of object and the objects must be slab based.
>
> In transaction context we allocate from inode slabs, xfs_buf slabs,
> log item slabs (6 different ones, IIRC), btree cursor slabs, etc,
> but then we also have direct page allocations for buffers,
> vm_map_ram() for mapping multi-page buffers, uncounted heap
> allocations, etc. We cannot make all of these mempools, nor can we
> meet the forward progress requirements of a mempool because other
> allocations can block and prevent progress.
>
> Further, the objects have lifetimes that don't correspond to the
> transaction life cycles, and hence even if we complete the
> transaction there is no guarantee that the objects allocated within
> a transaction are going to be returned to the mempool at its
> completion.
>
> IOWs, we have need for forward allocation progress guarantees on
> (potentially) several megabytes of allocations from slab caches, the
> heap and the page allocator, with all allocations in unpredictable
> order, with objects of different life times and life cycles, and
> which may, at any time, get stuck behind objects locked in other
> transactions and hence can randomly block until some other thread
> makes forward progress and completes a transaction and unlocks the
> object.

Thanks for the clarification, I have to think about it some more,
though. My thinking was that mempools could be used as an emergency
pool of pre-allocated memory for the non-failing contexts.

> The reservation would only need to cover the memory we need to
> allocate and hold in the transaction (i.e. dirtied objects). There
> is a potentially unbounded amount of memory required through demand
> paging of buffers to find the metadata we need to modify, but demand
> paged metadata that is read and then released is recoverable, i.e.
> the shrinkers will free it as other memory demand requires, so it's
> not included in reservation pools because it doesn't deplete the
> amount of free memory.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

--
Michal Hocko
SUSE Labs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? @ 2015-02-19 9:40 ` Michal Hocko 0 siblings, 0 replies; 276+ messages in thread From: Michal Hocko @ 2015-02-19 9:40 UTC (permalink / raw) To: Dave Chinner Cc: Johannes Weiner, Tetsuo Handa, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, torvalds, xfs On Thu 19-02-15 08:31:18, Dave Chinner wrote: > On Wed, Feb 18, 2015 at 01:16:02PM +0100, Michal Hocko wrote: > > On Wed 18-02-15 21:48:59, Dave Chinner wrote: > > > On Wed, Feb 18, 2015 at 09:25:02AM +0100, Michal Hocko wrote: > > > > On Wed 18-02-15 09:54:30, Dave Chinner wrote: > > [...] > > > Also, this reads as an excuse for the OOM killer being broken and > > > not fixing it. Keep in mind that we tell the memory alloc/reclaim > > > subsystem that *we hold locks* when we call into it. That's what > > > GFP_NOFS originally meant, and it's what it still means today in an > > > XFS context. > > > > Sure, and OOM killer will not be invoked in NOFS context. See > > __alloc_pages_may_oom and __GFP_FS check in there. So I do not see where > > is the OOM killer broken. > > I suspect that the page cache missing the correct GFP_NOFS was one > of the sources of the problems I've been seeing. > > However, the oom killer exceptions are not checked if __GFP_NOFAIL Yes this is true. This is an effect of 9879de7373fc (mm: page_alloc: embed OOM killing naturally into allocation slowpath) and IMO a desirable one. Requiring infinite retrying with a seriously restricted reclaim context calls for troubles (e.g. livelock without no way out because regular reclaim cannot make any progress and OOM killer as the last resort will not happen). > is present and so if we start using __GFP_NOFAIL then it will be > called in GFP_NOFS contexts... 
> > > The crucial problem we are dealing with is not GFP_NOFAIL triggering the > > OOM killer but a lock dependency introduced by the following sequence: > > > > taskA taskB taskC > > lock(A) alloc() > > alloc(gfp | __GFP_NOFAIL) lock(A) out_of_memory > > # looping for ever if we select_bad_process > > # cannot make any progress victim = taskB > > > > There is no way OOM killer can tell taskB is blocked and that there is > > dependency between A and B (without lockdep). That is why I call NOFAIL > > under a lock as dangerous and a bug. > > Sure. However, eventually the OOM killer with select task A to be > killed because nothing else is working. That would require OOM killer to be able to select another victim while the current one is still alive. There were time based heuristics suggested to do this but I do not think they are the right way to handle the problem and should be considered only if all other options fail. One potential way would be giving access to give GFP_NOFAIL context access to memory reserves when the allocation domain (global/memcg/cpuset) is OOM. Andrea was suggesting something like that IIRC. > That, at least, marks > taskA with TIF_MEMDIE and gives us a potential way to break the > deadlock. > > But the bigger problem is this: > > taskA taskB > lock(A) > alloc(GFP_NOFS|GFP_NOFAIL) lock(A) > out_of_memory > select_bad_process > victim = taskB > > Because there is no way to *ever* resolve that dependency because > taskA never leaves the allocator. Even if the oom killer selects > taskA and set TIF_MEMDIE on it, the allocator ignores TIF_MEMDIE > because GFP_NOFAIL is set and continues to loop. TIF_MEMDIE will at least give the task access to memory reserves. Anyway this is essentially the same category of livelock as above. > This is why GFP_NOFAIL is not a solution to the "never fail" > alloation problem. The caller doing the "no fail" allocation _must > be able to set failure policy_. i.e. 
the choice of aborting and > shutting down because progress cannot be made, or continuing and > hoping for forwards progress is owned by the allocating context, no > the allocator. I completely agree that the failure policy is the caller responsibility and I would have no objections to something like: do { ptr = kmalloc(size, GFP_NOFS); if (ptr) return ptr; if (fatal_signal_pending(current)) break; if (looping_too_long()) break; } while (1); fallback_solution(); But this is not the case in kmem_alloc which is essentially GFP_NOFAIL allocation with a warning and congestion_wait. There is no failure policy defined there. The warning should be part of the allocator and the NOFAIL policy should be explicit. So why exactly do you oppose to changing kmem_alloc (and others which are doing essentially the same)? > The memory allocation subsystem cannot make that > choice for us as it has no concept of the failure characteristics of > the allocating context. Of course. I wasn't arguing we should change allocation loops which have a fallback policy as well. That is an entirely different thing. My point was we want to turn GFP_NOFAIL equivalents to use GFP_NOFAIL so that the allocator can prevent from livelocks if possible. > The situations in which this actually matters are extremely *rare* - > we've had these allocaiton loops in XFS for > 13 years, and we might > get a one or two reports a year of these "possible allocation > deadlock" messages occurring. Changing *everything* for such a rare, > unusual event is not an efficient use of time or resources. > > > > If the OOM killer is not obeying GFP_NOFS and deadlocking on locks > > > that the invoking context holds, then that is a OOM killer bug, not > > > a bug in the subsystem calling kmalloc(GFP_NOFS). > > > > I guess we are talking about different things here or what am I missing? 
> > From my perspective, you are tightly focussed on one aspect of the > problem and hence are not seeing the bigger picture: this is a > corner case of behaviour in a "last hope", brute force memory > reclaim technique that no production machine relies on for correct > or performant operation. Of course this is a corner case. And I am trying to prevent heuristics which would optimize for such a corner case (there were multiple of them suggested in this thread). The reason I care about GFP_NOFAIL is that there are apparently code paths which do not tell allocator they are basically GFP_NOFAIL without any fallback. This leads to two main problems 1) we do not have a good overview how many code paths have such a strong requirements and so cannot estimate e.g. how big memory reserves should be and 2) allocator cannot help those paths (e.g. by giving them access to reserves to break out of the livelock). > > [...] > > > > In the meantime page allocator > > > > should develop a proper diagnostic to help identify all the potential > > > > dependencies. Next we should start thinking whether all the existing > > > > GFP_NOFAIL paths are really necessary or the code can be > > > > refactored/reimplemented to accept allocation failures. > > > > > > Last time the "just make filesystems handle memory allocation > > > failures" I pointed out what that meant for XFS: dirty transaction > > > rollback is required. That's freakin' complex, will double the > > > memory footprint of transactions, roughly double the CPU cost, and > > > greatly increase the complexity of the transaction subsystem. It's a > > > *major* rework of a significant amount of the XFS codebase and will > > > take at least a couple of years design, test and stabilise before > > > it could be rolled out to production. > > > > > > I'm not about to spend a couple of years rewriting XFS just so the > > > VM can get rid of a GFP_NOFAIL user. 
Especially as we already > > > tell the Hammer of Last Resort the context in which it can work. > > > > > > Move the OOM killer to kswapd - get it out of the direct reclaim > > > path altogether. > > > > This doesn't change anything as explained in the other email. The triggering > > path doesn't wait for the victim to die. > > But it does - we wouldn't be talking about deadlocks if there were > no blocking dependencies. In this case, allocation keeps retrying > until the memory freed by the killed tasks enables it to make > forward progress. That's a side effect of the last revelation that > was made in this thread that low order allocations never fail... Sure, low order allocations being almost GFP_NOFAIL makes things much worse of course. And this should be changed. We just have to think about how to do it without breaking the universe. I hope we can discuss this at LSF. But even then I do not see how triggering the OOM killer from kswapd would help here. Victims would be looping in the allocator whether the actual killing happens from their own or any other context. > > > If the system is that backed up on locks that it > > > cannot free any memory and has no reserves to satisfy the allocation > > > that kicked the OOM killer, then the OOM killer was not invoked soon > > > enough. > > > > > > Hell, if you want a better way to proceed, then how about you allow > > > us to tell the MM subsystem how much memory reserve a specific set > > > of operations is going to require to complete? That's something that > > > we can do rough calculations for, and it integrates straight into > > > the existing transaction reservation system we already use for log > > > space and disk space, and we can tell the mm subsystem when the > > > reserve is no longer needed (i.e. last thing in transaction commit). 
> > > That way we don't start a transaction until the mm subsystem has > > > reserved enough pages for us to work with, and the reserve only > > > needs to be used when normal allocation has already failed. i.e > > > rather than looping we get a page allocated from the reserve pool. > > > > I am not sure I understand the above but aren't mempools a tool for > > this purpose? > > I knew this question would be the next one - I even deleted a one > line comment from my last email that said "And no, mempools are not > a solution" because that needs a more thorough explanation than a > dismissive one-liner. > > As you know, mempools require a forward progress guarantee on a > single type of object and the objects must be slab based. > > In transaction context we allocate from inode slabs, xfs_buf slabs, > log item slabs (6 different ones, IIRC), btree cursor slabs, etc, > but then we also have direct page allocations for buffers, vm_map_ram() > for mapping multi-page buffers, uncounted heap allocations, etc. > We cannot make all of these mempools, nor can we meet the forwards > progress requirements of a mempool because other allocations can > block and prevent progress. > > Further, the objects have lifetimes that don't correspond to the > transaction life cycles, and hence even if we complete the > transaction there is no guarantee that the objects allocated within > a transaction are going to be returned to the mempool at its > completion. > > IOWs, we have need for forward allocation progress guarantees on > (potentially) several megabytes of allocations from slab caches, the > heap and the page allocator, with all allocations in > unpredictable order, with objects of different life times and life > cycles, and which may, at any time, get stuck behind > objects locked in other transactions and hence can randomly block > until some other thread makes forward progress and completes a > transaction and unlocks the object. 
Thanks for the clarification, I have to think about it some more, though. My thinking was that mempools could be used for an emergency pool with pre-allocated memory which would be used in the non-failing contexts. > The reservation would only need to cover the memory we need to > allocate and hold in the transaction (i.e. dirtied objects). There > is potentially unbound amounts of memory required through demand > paging of buffers to find the metadata we need to modify, but demand > paged metadata that is read and then released is recoverable. i.e > the shrinkers will free it as other memory demand requires, so it's > not included in reservation pools because it doesn't deplete the > amount of free memory. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-19 9:40 ` Michal Hocko @ 2015-02-19 22:03 ` Dave Chinner -1 siblings, 0 replies; 276+ messages in thread From: Dave Chinner @ 2015-02-19 22:03 UTC (permalink / raw) To: Michal Hocko Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm, mgorman, rientjes, akpm, torvalds On Thu, Feb 19, 2015 at 10:40:20AM +0100, Michal Hocko wrote: > On Thu 19-02-15 08:31:18, Dave Chinner wrote: > > On Wed, Feb 18, 2015 at 01:16:02PM +0100, Michal Hocko wrote: > > > On Wed 18-02-15 21:48:59, Dave Chinner wrote: > > > > On Wed, Feb 18, 2015 at 09:25:02AM +0100, Michal Hocko wrote: > > This is why GFP_NOFAIL is not a solution to the "never fail" > > allocation problem. The caller doing the "no fail" allocation _must > > be able to set failure policy_. i.e. the choice of aborting and > > shutting down because progress cannot be made, or continuing and > > hoping for forwards progress is owned by the allocating context, not > > the allocator. > > I completely agree that the failure policy is the caller's responsibility > and I would have no objections to something like: > > do { > ptr = kmalloc(size, GFP_NOFS); > if (ptr) > return ptr; > if (fatal_signal_pending(current)) > break; > if (looping_too_long()) > break; > } while (1); > > fallback_solution(); > > But this is not the case in kmem_alloc which is essentially GFP_NOFAIL > allocation with a warning and congestion_wait. There is no failure > policy defined there. The warning should be part of the allocator and > the NOFAIL policy should be explicit. So why exactly do you oppose > changing kmem_alloc (and others which are doing essentially the same)? I'm opposing changing kmem_alloc() to GFP_NOFAIL precisely because doing so is *broken*, *and* it removes the policy decision from the calling context where it belongs. We are in the process of discussing - at an XFS level - how to handle errors in a configurable manner. 
See, for example, this discussion: http://oss.sgi.com/archives/xfs/2015-02/msg00343.html Where we are trying to decide how to expose failure policy to admins to make decisions about error handling behaviour: http://oss.sgi.com/archives/xfs/2015-02/msg00346.html There is little doubt in my mind that this stretches to ENOMEM handling; it is another case where we consider ENOMEM to be a transient error and hence retry forever until it succeeds. But some people are going to want to configure that behaviour, and the API above allows people to configure exactly how many repeated memory allocations we'd fail before considering the situation hopeless, failing, and risking a filesystem shutdown.... Converting the code to use GFP_NOFAIL takes us in exactly the opposite direction to our current line of development w.r.t. filesystem error handling. > The reason I care about GFP_NOFAIL is that there are apparently code > paths which do not tell the allocator they are basically GFP_NOFAIL without > any fallback. This leads to two main problems 1) we do not have a good > overview how many code paths have such strong requirements and so > cannot estimate e.g. how big memory reserves should be and Right, when GFP_NOFAIL got deprecated we lost the ability to document such behaviour and find it easily. People just put retry loops in instead of using GFP_NOFAIL. Good luck finding them all :/ > 2) allocator > cannot help those paths (e.g. by giving them access to reserves to break > out of the livelock). Allocator should not help. Global reserves are unreliable - make the allocation context reserve the amount it needs before it enters the context where it can't back out.... 
> > IOWs, we have need for forward allocation progress guarantees on > (potentially) several megabytes of allocations from slab caches, the > heap and the page allocator, with all allocations in > unpredictable order, with objects of different life times and life > cycles, and which may, at any time, get stuck behind > objects locked in other transactions and hence can randomly block > until some other thread makes forward progress and completes a > transaction and unlocks the object. > > Thanks for the clarification, I have to think about it some more, > though. My thinking was that mempools could be used for an emergency > pool with a pre-allocated memory which would be used in the non failing > contexts. The other problem with mempools is that they aren't exclusive to the context that needs the reservation. i.e. we can preallocate to the mempool, but then when the preallocating context goes to allocate, that preallocation may have already been drained by other contexts. The memory reservation needs to follow the transaction - we can pass them between tasks, and they need to persist across sleeping locks, IO, etc, and mempools are simply too constrained to be usable in this environment. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-19 22:03 ` Dave Chinner @ 2015-02-20 9:27 ` Michal Hocko -1 siblings, 0 replies; 276+ messages in thread From: Michal Hocko @ 2015-02-20 9:27 UTC (permalink / raw) To: Dave Chinner Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm, mgorman, rientjes, akpm, torvalds On Fri 20-02-15 09:03:55, Dave Chinner wrote: [...] > Converting the code to use GFP_NOFAIL takes us in exactly the > opposite direction to our current line of development w.r.t. to > filesystem error handling. Fair enough. If there are plans to have a failure policy rather than GFP_NOFAIL-like behavior then I have, of course, no objections. Quite the opposite. This is exactly what I would like to see. GFP_NOFAIL should be rarely used, really. The whole point of this discussion, and I am sorry if I didn't make it clear, is that _if_ there is really a GFP_NOFAIL requirement hidden from the allocator then it should be changed to use GFP_NOFAIL so that the allocator knows about this requirement. > > The reason I care about GFP_NOFAIL is that there are apparently code > > paths which do not tell the allocator they are basically GFP_NOFAIL without > > any fallback. This leads to two main problems 1) we do not have a good > > overview how many code paths have such strong requirements and so > > cannot estimate e.g. how big memory reserves should be and > > Right, when GFP_NOFAIL got deprecated we lost the ability to document > such behaviour and find it easily. People just put retry loops in > instead of using GFP_NOFAIL. Good luck finding them all :/ That will be a PITA, all right, but I guess the deprecation was a mistake and we should stop this tendency. > > 2) allocator > > cannot help those paths (e.g. by giving them access to reserves to break > > out of the livelock). > > Allocator should not help. Global reserves are unreliable - make the > allocation context reserve the amount it needs before it enters the > context where it can't back out.... 
Sure, pre-allocation is preferable. But once somebody asks for GFP_NOFAIL then it is too late and the allocator only has memory reclaim and potentially reserves. [...] -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-18 12:16 ` Michal Hocko @ 2015-02-19 11:01 ` Johannes Weiner -1 siblings, 0 replies; 276+ messages in thread From: Johannes Weiner @ 2015-02-19 11:01 UTC (permalink / raw) To: Michal Hocko Cc: Tetsuo Handa, dchinner, oleg, xfs, linux-mm, mgorman, rientjes, akpm, torvalds On Wed, Feb 18, 2015 at 01:16:02PM +0100, Michal Hocko wrote: > On Wed 18-02-15 21:48:59, Dave Chinner wrote: > > On Wed, Feb 18, 2015 at 09:25:02AM +0100, Michal Hocko wrote: > > > On Wed 18-02-15 09:54:30, Dave Chinner wrote: > [...] > > Also, this reads as an excuse for the OOM killer being broken and > > not fixing it. Keep in mind that we tell the memory alloc/reclaim > > subsystem that *we hold locks* when we call into it. That's what > > GFP_NOFS originally meant, and it's what it still means today in an > > XFS context. > > Sure, and OOM killer will not be invoked in NOFS context. See > __alloc_pages_may_oom and __GFP_FS check in there. So I do not see where > the OOM killer is broken. > > The crucial problem we are dealing with is not GFP_NOFAIL triggering the > OOM killer but a lock dependency introduced by the following sequence: > > taskA taskB taskC > lock(A) alloc() > alloc(gfp | __GFP_NOFAIL) lock(A) out_of_memory > # looping for ever if we select_bad_process > # cannot make any progress victim = taskB You don't even need taskC here. taskA could invoke the OOM killer with lock(A) held, and taskB could get selected as the victim while trying to acquire lock(A). It'll get the signal and TIF_MEMDIE and then wait for lock(A) while taskA is waiting for it to exit. But it doesn't matter who is doing the OOM killing - if the allocating task with the lock/state is waiting for the OOM victim to free memory, and the victim is waiting for the same lock/state, we have a deadlock. > There is no way OOM killer can tell taskB is blocked and that there is > dependency between A and B (without lockdep). 
That is why I call NOFAIL > under a lock as dangerous and a bug. You keep ignoring that it's also one of the main usecases of this flag. The caller has state that it can't unwind and thus needs the allocation to succeed. Chances are somebody else can get blocked up on that same state. And when that somebody else is the first choice of the OOM killer, we're screwed. This is exactly why I'm proposing that the OOM killer should not wait indefinitely for its first choice to exit, but ultimately move on and try other tasks. There is no other way to resolve this deadlock. Preferably, we'd get rid of all nofail allocations and replace them with preallocated reserves. But this is not going to happen anytime soon, so what other option do we have than resolving this on the OOM killer side? ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-19 11:01 ` Johannes Weiner @ 2015-02-19 12:29 ` Michal Hocko -1 siblings, 0 replies; 276+ messages in thread From: Michal Hocko @ 2015-02-19 12:29 UTC (permalink / raw) To: Johannes Weiner Cc: Tetsuo Handa, dchinner, oleg, xfs, linux-mm, mgorman, rientjes, akpm, torvalds On Thu 19-02-15 06:01:24, Johannes Weiner wrote: [...] > Preferably, we'd get rid of all nofail allocations and replace them > with preallocated reserves. But this is not going to happen anytime > soon, so what other option do we have than resolving this on the OOM > killer side? As I've mentioned in another email, we might give GFP_NOFAIL allocations access to memory reserves (by giving them __GFP_HIGH). This is still not a 100% solution because reserves could get depleted but this risk is there even with multiple oom victims. I would still argue that this would be a better approach because selecting more victims might hit a pathological case more easily (other victims might be blocked on the very same lock e.g.). Something like the following: diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8d52ab18fe0d..4b5cf28a13f4 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2599,6 +2599,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, enum migrate_mode migration_mode = MIGRATE_ASYNC; bool deferred_compaction = false; int contended_compaction = COMPACT_CONTENDED_NONE; + int oom = 0; /* * In the slowpath, we sanity check order to avoid ever trying to @@ -2628,6 +2629,15 @@ retry: wake_all_kswapds(order, ac); /* + * __GFP_NOFAIL allocations cannot fail but yet the current context + * might be blocking resources needed by the OOM victim to terminate. + * Allow the caller to dive into memory reserves to succeed the + * allocation and break out from a potential deadlock. + */ + if (oom > 10 && (gfp_mask & __GFP_NOFAIL)) + gfp_mask |= __GFP_HIGH; + + /* * OK, we're below the kswapd watermark and have kicked background * reclaim. 
Now things get more complex, so set up alloc_flags according * to how we want to proceed. @@ -2759,6 +2769,8 @@ retry: goto got_pg; if (!did_some_progress) goto nopage; + + oom++; } /* Wait for some write requests to complete then retry */ wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50); -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-19 12:29 ` Michal Hocko @ 2015-02-19 12:58 ` Michal Hocko -1 siblings, 0 replies; 276+ messages in thread From: Michal Hocko @ 2015-02-19 12:58 UTC (permalink / raw) To: Johannes Weiner Cc: Tetsuo Handa, dchinner, oleg, xfs, linux-mm, mgorman, rientjes, akpm, torvalds On Thu 19-02-15 13:29:14, Michal Hocko wrote: [...] > Something like the following. __GFP_HIGH doesn't seem to be sufficient so we would need something slightly different but the idea is still the same: diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8d52ab18fe0d..2d224bbdf8e8 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2599,6 +2599,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, enum migrate_mode migration_mode = MIGRATE_ASYNC; bool deferred_compaction = false; int contended_compaction = COMPACT_CONTENDED_NONE; + int oom = 0; /* * In the slowpath, we sanity check order to avoid ever trying to @@ -2635,6 +2636,15 @@ retry: alloc_flags = gfp_to_alloc_flags(gfp_mask); /* + * __GFP_NOFAIL allocations cannot fail but yet the current context + * might be blocking resources needed by the OOM victim to terminate. + * Allow the caller to dive into memory reserves to succeed the + * allocation and break out from a potential deadlock. + */ + if (oom > 10 && (gfp_mask & __GFP_NOFAIL)) + alloc_flags |= ALLOC_NO_WATERMARKS; + + /* * Find the true preferred zone if the allocation is unconstrained by * cpusets. */ @@ -2759,6 +2769,8 @@ retry: goto got_pg; if (!did_some_progress) goto nopage; + + oom++; } /* Wait for some write requests to complete then retry */ wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50); -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls?
In reply to: Michal Hocko, 2015-02-19 12:58
From: Tetsuo Handa @ 2015-02-19 15:29 UTC
To: mhocko, hannes
Cc: david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, torvalds, xfs, linux-fsdevel, fernando_b1

Michal Hocko wrote:
> On Thu 19-02-15 13:29:14, Michal Hocko wrote:
> [...]
> > Something like the following.
> __GFP_HIGH doesn't seem to be sufficient so we would need something
> slightly else but the idea is still the same:
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8d52ab18fe0d..2d224bbdf8e8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2599,6 +2599,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	enum migrate_mode migration_mode = MIGRATE_ASYNC;
>  	bool deferred_compaction = false;
>  	int contended_compaction = COMPACT_CONTENDED_NONE;
> +	int oom = 0;
>
>  	/*
>  	 * In the slowpath, we sanity check order to avoid ever trying to
> @@ -2635,6 +2636,15 @@ retry:
>  	alloc_flags = gfp_to_alloc_flags(gfp_mask);
>
>  	/*
> +	 * __GFP_NOFAIL allocations cannot fail but yet the current context
> +	 * might be blocking resources needed by the OOM victim to terminate.
> +	 * Allow the caller to dive into memory reserves to succeed the
> +	 * allocation and break out from a potential deadlock.
> +	 */

We don't know how many callers will pass __GFP_NOFAIL. But if 1000
threads are doing the same operation, each requiring a __GFP_NOFAIL
allocation with a lock held, wouldn't the memory reserves be depleted?

This heuristic can't continue once the memory reserves are depleted or
contiguous pages of the requested order cannot be found.

> +	if (oom > 10 && (gfp_mask & __GFP_NOFAIL))
> +		alloc_flags |= ALLOC_NO_WATERMARKS;
> +
> +	/*
>  	 * Find the true preferred zone if the allocation is unconstrained by
>  	 * cpusets.
>  	 */
> @@ -2759,6 +2769,8 @@ retry:
>  			goto got_pg;
>  		if (!did_some_progress)
>  			goto nopage;
> +
> +		oom++;
>  	}
>  	/* Wait for some write requests to complete then retry */
>  	wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
> --
> Michal Hocko
> SUSE Labs
* Re: How to handle TIF_MEMDIE stalls?
In reply to: Tetsuo Handa, 2015-02-19 15:29
From: Tetsuo Handa @ 2015-02-19 21:53 UTC
To: mhocko, hannes
Cc: david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, torvalds, xfs, linux-fsdevel, fernando_b1

Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 19-02-15 13:29:14, Michal Hocko wrote:
> > [...]
> > > Something like the following.
> > __GFP_HIGH doesn't seem to be sufficient so we would need something
> > slightly else but the idea is still the same:
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 8d52ab18fe0d..2d224bbdf8e8 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2599,6 +2599,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  	enum migrate_mode migration_mode = MIGRATE_ASYNC;
> >  	bool deferred_compaction = false;
> >  	int contended_compaction = COMPACT_CONTENDED_NONE;
> > +	int oom = 0;
> >
> >  	/*
> >  	 * In the slowpath, we sanity check order to avoid ever trying to
> > @@ -2635,6 +2636,15 @@ retry:
> >  	alloc_flags = gfp_to_alloc_flags(gfp_mask);
> >
> >  	/*
> > +	 * __GFP_NOFAIL allocations cannot fail but yet the current context
> > +	 * might be blocking resources needed by the OOM victim to terminate.
> > +	 * Allow the caller to dive into memory reserves to succeed the
> > +	 * allocation and break out from a potential deadlock.
> > +	 */
>
> We don't know how many callers will pass __GFP_NOFAIL. But if 1000
> threads are doing the same operation which requires __GFP_NOFAIL
> allocation with a lock held, wouldn't memory reserves deplete?
>
> This heuristic can't continue if memory reserves depleted or
> continuous pages of requested order cannot be found.

Even if the system seems to be stalled, a deadlock may not actually
have occurred. If the cause is, for example, a virtio disk being stuck
for some unknown reason rather than a deadlock, nobody should start
consuming the memory reserves merely because some amount of time has
passed.

The memory reserves are something like a balloon. To guarantee forward
progress, the balloon must never become empty. Therefore, I think that
throttling heuristics on the memory requester side (the deflator of the
balloon: the processes that receive SIGKILL) should be avoided, and
throttling heuristics on the memory releaser side (the inflator of the
balloon: the OOM killer, which sends SIGKILL) should be used instead.
If the heuristic sits on the deflator side, the memory allocator may
deliver the final blow via ALLOC_NO_WATERMARKS. If the heuristic sits
on the inflator side, the OOM killer can act as a watchdog when nobody
volunteers memory within a reasonable period.

> > +	if (oom > 10 && (gfp_mask & __GFP_NOFAIL))
> > +		alloc_flags |= ALLOC_NO_WATERMARKS;
> > +
> > +	/*
> >  	 * Find the true preferred zone if the allocation is unconstrained by
> >  	 * cpusets.
> >  	 */
> > @@ -2759,6 +2769,8 @@ retry:
> >  			goto got_pg;
> >  		if (!did_some_progress)
> >  			goto nopage;
> > +
> > +		oom++;
> >  	}
> >  	/* Wait for some write requests to complete then retry */
> >  	wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
> > --
> > Michal Hocko
> > SUSE Labs
* Re: How to handle TIF_MEMDIE stalls?
In reply to: Tetsuo Handa, 2015-02-19 15:29
From: Michal Hocko @ 2015-02-20 9:13 UTC
To: Tetsuo Handa
Cc: dchinner, oleg, xfs, hannes, linux-mm, mgorman, rientjes, linux-fsdevel, akpm, fernando_b1, torvalds

On Fri 20-02-15 00:29:29, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 19-02-15 13:29:14, Michal Hocko wrote:
> > [...]
> > > Something like the following.
> > __GFP_HIGH doesn't seem to be sufficient so we would need something
> > slightly else but the idea is still the same:
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 8d52ab18fe0d..2d224bbdf8e8 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2599,6 +2599,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  	enum migrate_mode migration_mode = MIGRATE_ASYNC;
> >  	bool deferred_compaction = false;
> >  	int contended_compaction = COMPACT_CONTENDED_NONE;
> > +	int oom = 0;
> >
> >  	/*
> >  	 * In the slowpath, we sanity check order to avoid ever trying to
> > @@ -2635,6 +2636,15 @@ retry:
> >  	alloc_flags = gfp_to_alloc_flags(gfp_mask);
> >
> >  	/*
> > +	 * __GFP_NOFAIL allocations cannot fail but yet the current context
> > +	 * might be blocking resources needed by the OOM victim to terminate.
> > +	 * Allow the caller to dive into memory reserves to succeed the
> > +	 * allocation and break out from a potential deadlock.
> > +	 */
>
> We don't know how many callers will pass __GFP_NOFAIL. But if 1000
> threads are doing the same operation which requires __GFP_NOFAIL
> allocation with a lock held, wouldn't memory reserves deplete?

We shouldn't have an unbounded number of GFP_NOFAIL allocations in
flight at the same time; that would be even more broken. If a load is
known to use such allocations excessively, the administrator can
enlarge the memory reserves.

> This heuristic can't continue if memory reserves depleted or
> continuous pages of requested order cannot be found.

Once the memory reserves are depleted we are screwed anyway and we
might as well panic.
--
Michal Hocko
SUSE Labs
* Re: How to handle TIF_MEMDIE stalls?
In reply to: Michal Hocko, 2015-02-20 9:13
From: Stefan Ring @ 2015-02-20 13:37 UTC
To: Michal Hocko
Cc: Tetsuo Handa, dchinner, oleg, Linux fs XFS, hannes, linux-mm, mgorman, rientjes, linux-fsdevel, akpm, fernando_b1, torvalds

>> We don't know how many callers will pass __GFP_NOFAIL. But if 1000
>> threads are doing the same operation which requires __GFP_NOFAIL
>> allocation with a lock held, wouldn't memory reserves deplete?
>
> We shouldn't have an unbounded number of GFP_NOFAIL allocations at the
> same time. This would be even more broken. If a load is known to use
> such allocations excessively then the administrator can enlarge the
> memory reserves.
>
>> This heuristic can't continue if memory reserves depleted or
>> continuous pages of requested order cannot be found.
>
> Once memory reserves are depleted we are screwed anyway and we might
> panic.

This discussion reminds me of a situation I've seen somewhat regularly,
which I have described here:

http://oss.sgi.com/pipermail/xfs/2014-April/035793.html

I've actually seen it more often on another box with OpenVZ and
VirtualBox installed, where it would almost always happen during
startup of a VirtualBox guest machine. That machine is also running
XFS. I originally blamed OpenVZ or VirtualBox, but having seen the same
thing happen on a machine with neither of them, the next candidate for
blame is XFS.

Is this behavior something that can be attributed to these memory
allocation retry loops?
* Re: How to handle TIF_MEMDIE stalls?
In reply to: Michal Hocko, 2015-02-19 12:29
From: Tetsuo Handa @ 2015-02-19 13:29 UTC
To: mhocko, hannes
Cc: david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, torvalds, xfs, linux-fsdevel, fernando_b1

Michal Hocko wrote:
> On Thu 19-02-15 06:01:24, Johannes Weiner wrote:
> [...]
> > Preferrably, we'd get rid of all nofail allocations and replace them
> > with preallocated reserves. But this is not going to happen anytime
> > soon, so what other option do we have than resolving this on the OOM
> > killer side?
>
> As I've mentioned in other email, we might give GFP_NOFAIL allocator
> access to memory reserves (by giving it __GFP_HIGH). This is still not a
> 100% solution because reserves could get depleted but this risk is there
> even with multiple oom victims. I would still argue that this would be a
> better approach because selecting more victims might hit pathological
> case more easily (other victims might be blocked on the very same lock
> e.g.).

Does "multiple OOM victims" mean "select the next victim if the first
does not die"? Then I think my timeout patch
http://marc.info/?l=linux-mm&m=142002495532320&w=2 does not deplete the
memory reserves. ;-)

If we change the allocator to permit invocation of the OOM killer for
GFP_NOFS / GFP_NOIO, will those who do not want to fail (e.g. a journal
transaction) start passing __GFP_NOFAIL?
* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19 13:29 ` Tetsuo Handa
@ 2015-02-20  9:10 ` Michal Hocko
  -1 siblings, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2015-02-20 9:10 UTC (permalink / raw)
To: Tetsuo Handa
Cc: hannes, david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman,
    torvalds, xfs, linux-fsdevel, fernando_b1

On Thu 19-02-15 22:29:37, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 19-02-15 06:01:24, Johannes Weiner wrote:
> > [...]
> > > Preferrably, we'd get rid of all nofail allocations and replace them
> > > with preallocated reserves. But this is not going to happen anytime
> > > soon, so what other option do we have than resolving this on the OOM
> > > killer side?
> >
> > As I've mentioned in other email, we might give GFP_NOFAIL allocator
> > access to memory reserves (by giving it __GFP_HIGH). This is still not a
> > 100% solution because reserves could get depleted but this risk is there
> > even with multiple oom victims. I would still argue that this would be a
> > better approach because selecting more victims might hit pathological
> > case more easily (other victims might be blocked on the very same lock
> > e.g.).
> >
> Does "multiple OOM victims" mean "select next if first does not die"?
> Then, I think my timeout patch
> http://marc.info/?l=linux-mm&m=142002495532320&w=2
> does not deplete memory reserves. ;-)

It doesn't because

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2603,9 +2603,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 		alloc_flags |= ALLOC_NO_WATERMARKS;
 	else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
 		alloc_flags |= ALLOC_NO_WATERMARKS;
-	else if (!in_interrupt() &&
-			((current->flags & PF_MEMALLOC) ||
-			 unlikely(test_thread_flag(TIF_MEMDIE))))
+	else if (!in_interrupt() && (current->flags & PF_MEMALLOC))
 		alloc_flags |= ALLOC_NO_WATERMARKS;

you disabled the TIF_MEMDIE heuristic and use it only for OOM exclusion
and for breaking out of the allocator. An exiting task might need memory
to exit, and you basically make all of its allocations fail. How do you
know this is not going to blow up?

> If we change to permit invocation of the OOM killer for GFP_NOFS / GFP_NOIO,
> those who do not want to fail (e.g. journal transaction) will start passing
> __GFP_NOFAIL?

-- 
Michal Hocko
SUSE Labs
* Re: How to handle TIF_MEMDIE stalls?
  2015-02-20  9:10 ` Michal Hocko
@ 2015-02-20 12:20 ` Tetsuo Handa
  -1 siblings, 0 replies; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-20 12:20 UTC (permalink / raw)
To: mhocko
Cc: hannes, david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman,
    torvalds, xfs, linux-fsdevel, fernando_b1

Michal Hocko wrote:
> On Thu 19-02-15 22:29:37, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Thu 19-02-15 06:01:24, Johannes Weiner wrote:
> > > [...]
> > > > Preferrably, we'd get rid of all nofail allocations and replace them
> > > > with preallocated reserves. But this is not going to happen anytime
> > > > soon, so what other option do we have than resolving this on the OOM
> > > > killer side?
> > >
> > > As I've mentioned in other email, we might give GFP_NOFAIL allocator
> > > access to memory reserves (by giving it __GFP_HIGH). This is still not a
> > > 100% solution because reserves could get depleted but this risk is there
> > > even with multiple oom victims. I would still argue that this would be a
> > > better approach because selecting more victims might hit pathological
> > > case more easily (other victims might be blocked on the very same lock
> > > e.g.).
> > >
> > Does "multiple OOM victims" mean "select next if first does not die"?
> > Then, I think my timeout patch http://marc.info/?l=linux-mm&m=142002495532320&w=2
> > does not deplete memory reserves. ;-)
>
> It doesn't because
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2603,9 +2603,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  		alloc_flags |= ALLOC_NO_WATERMARKS;
>  	else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
>  		alloc_flags |= ALLOC_NO_WATERMARKS;
> -	else if (!in_interrupt() &&
> -			((current->flags & PF_MEMALLOC) ||
> -			 unlikely(test_thread_flag(TIF_MEMDIE))))
> +	else if (!in_interrupt() && (current->flags & PF_MEMALLOC))
>  		alloc_flags |= ALLOC_NO_WATERMARKS;
>
> you disabled the TIF_MEMDIE heuristic and use it only for OOM exclusion
> and break out from the allocator. Exiting task might need a memory to do
> so and you make all those allocations fail basically. How do you know
> this is not going to blow up?
>
Well, should we treat exiting tasks as implying __GFP_NOFAIL for cleanup?

We cannot determine the correct task to kill (and to grant access to the
memory reserves) based on lock dependencies. Therefore, this patch
uniformly denies all tasks access to the memory reserves.

An exiting task might need some memory to exit, and denying it access to
the memory reserves can delay its exit. But that task will eventually get
memory released by other tasks killed by the timeout-based kill-more
mechanism. If no killable tasks remain, or the panic timeout expires, the
result is the same as depleting the memory reserves.

I think this situation (automatically making forward progress, as if the
administrator were periodically pressing SysRq-f until the OOM condition
is resolved, or SysRq-c if no killable tasks remain or we have stalled
too long) is better than the current one (making no forward progress
because the exiting task cannot exit due to a lock dependency, after we
failed to pick the correct task to kill and grant reserve access).

> > If we change to permit invocation of the OOM killer for GFP_NOFS / GFP_NOIO,
> > those who do not want to fail (e.g. journal transaction) will start passing
> > __GFP_NOFAIL?
* Re: How to handle TIF_MEMDIE stalls?
  2015-02-20 12:20 ` Tetsuo Handa
@ 2015-02-20 12:38 ` Michal Hocko
  -1 siblings, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2015-02-20 12:38 UTC (permalink / raw)
To: Tetsuo Handa
Cc: hannes, david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman,
    torvalds, xfs, linux-fsdevel, fernando_b1

On Fri 20-02-15 21:20:58, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 19-02-15 22:29:37, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Thu 19-02-15 06:01:24, Johannes Weiner wrote:
> > > > [...]
> > > > > Preferrably, we'd get rid of all nofail allocations and replace them
> > > > > with preallocated reserves. But this is not going to happen anytime
> > > > > soon, so what other option do we have than resolving this on the OOM
> > > > > killer side?
> > > >
> > > > As I've mentioned in other email, we might give GFP_NOFAIL allocator
> > > > access to memory reserves (by giving it __GFP_HIGH). This is still not a
> > > > 100% solution because reserves could get depleted but this risk is there
> > > > even with multiple oom victims. I would still argue that this would be a
> > > > better approach because selecting more victims might hit pathological
> > > > case more easily (other victims might be blocked on the very same lock
> > > > e.g.).
> > > >
> > > Does "multiple OOM victims" mean "select next if first does not die"?
> > > Then, I think my timeout patch http://marc.info/?l=linux-mm&m=142002495532320&w=2
> > > does not deplete memory reserves. ;-)
> >
> > It doesn't because
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2603,9 +2603,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> >  		alloc_flags |= ALLOC_NO_WATERMARKS;
> >  	else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
> >  		alloc_flags |= ALLOC_NO_WATERMARKS;
> > -	else if (!in_interrupt() &&
> > -			((current->flags & PF_MEMALLOC) ||
> > -			 unlikely(test_thread_flag(TIF_MEMDIE))))
> > +	else if (!in_interrupt() && (current->flags & PF_MEMALLOC))
> >  		alloc_flags |= ALLOC_NO_WATERMARKS;
> >
> > you disabled the TIF_MEMDIE heuristic and use it only for OOM exclusion
> > and break out from the allocator. Exiting task might need a memory to do
> > so and you make all those allocations fail basically. How do you know
> > this is not going to blow up?
> >
> Well, treat exiting tasks to imply __GFP_NOFAIL for clean up?
>
> We cannot determine correct task to kill + allow access to memory reserves
> based on lock dependency. Therefore, this patch evenly allow no tasks to
> access to memory reserves.
>
> Exiting task might need some memory to exit, and not allowing access to
> memory reserves can retard exit of that task. But that task will eventually
> get memory released by other tasks killed by timeout-based kill-more
> mechanism. If no more killable tasks or expired panic-timeout, it is
> the same result with depletion of memory reserves.
>
> I think that this situation (automatically making forward progress as if
> the administrator is periodically doing SysRq-f until the OOM condition
> is solved, or is doing SysRq-c if no more killable tasks or stalled too
> long) is better than current situation (not making forward progress since
> the exiting task cannot exit due to lock dependency, caused by failing to
> determine correct task to kill + allow access to memory reserves).

If you really believe this is an improvement then send a proper patch
with justification. But I am _really_ skeptical about such a change to
be honest.

-- 
Michal Hocko
SUSE Labs
* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19 12:29 ` Michal Hocko
@ 2015-02-19 21:43 ` Dave Chinner
  -1 siblings, 0 replies; 276+ messages in thread
From: Dave Chinner @ 2015-02-19 21:43 UTC (permalink / raw)
To: Michal Hocko
Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm,
    mgorman, rientjes, akpm, torvalds

On Thu, Feb 19, 2015 at 01:29:14PM +0100, Michal Hocko wrote:
> On Thu 19-02-15 06:01:24, Johannes Weiner wrote:
> [...]
> > Preferrably, we'd get rid of all nofail allocations and replace them
> > with preallocated reserves. But this is not going to happen anytime
> > soon, so what other option do we have than resolving this on the OOM
> > killer side?
>
> As I've mentioned in other email, we might give GFP_NOFAIL allocator
> access to memory reserves (by giving it __GFP_HIGH).

Won't work when you have thousands of concurrent transactions running
in XFS and they are all doing GFP_NOFAIL allocations.

That's why I suggested the per-transaction reserve pool - we can use
that to throttle the number of concurrent contexts demanding memory for
forward progress, just the same way we throttle the number of
concurrent processes based on the maximum log space requirements of the
transactions and the amount of unreserved log space available.

No log space? Transaction reservations wait on an ordered queue for
space to become available. No memory available? Transaction
reservations wait on an ordered queue for memory to become available.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19 21:43 ` Dave Chinner
@ 2015-02-20 12:48 ` Michal Hocko
  -1 siblings, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2015-02-20 12:48 UTC (permalink / raw)
To: Dave Chinner
Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm,
    mgorman, rientjes, akpm, torvalds

On Fri 20-02-15 08:43:56, Dave Chinner wrote:
> On Thu, Feb 19, 2015 at 01:29:14PM +0100, Michal Hocko wrote:
> > On Thu 19-02-15 06:01:24, Johannes Weiner wrote:
> > [...]
> > > Preferrably, we'd get rid of all nofail allocations and replace them
> > > with preallocated reserves. But this is not going to happen anytime
> > > soon, so what other option do we have than resolving this on the OOM
> > > killer side?
> >
> > As I've mentioned in other email, we might give GFP_NOFAIL allocator
> > access to memory reserves (by giving it __GFP_HIGH).
>
> Won't work when you have thousands of concurrent transactions
> running in XFS and they are all doing GFP_NOFAIL allocations.

Is there any bound on how many transactions can run at the same time?

> That's why I suggested the per-transaction reserve pool - we can use
> that

I am still not sure what you mean by reserve pool (API wise). How does
it differ from pre-allocating memory before the "may not fail context"?
Could you elaborate on it, please?

> to throttle the number of concurent contexts demanding memory for
> forwards progress, just the same was we throttle the number of
> concurrent processes based on maximum log space requirements of the
> transactions and the amount of unreserved log space available.
>
> No log space, transaction reservations waits on an ordered queue for
> space to become available. No memory available, transaction
> reservation waits on an ordered queue for memory to become
> available.

-- 
Michal Hocko
SUSE Labs
* Re: How to handle TIF_MEMDIE stalls? 2015-02-20 12:48 ` Michal Hocko @ 2015-02-20 23:09 ` Dave Chinner -1 siblings, 0 replies; 276+ messages in thread From: Dave Chinner @ 2015-02-20 23:09 UTC (permalink / raw) To: Michal Hocko Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm, mgorman, rientjes, akpm, torvalds On Fri, Feb 20, 2015 at 01:48:49PM +0100, Michal Hocko wrote: > On Fri 20-02-15 08:43:56, Dave Chinner wrote: > > On Thu, Feb 19, 2015 at 01:29:14PM +0100, Michal Hocko wrote: > > > On Thu 19-02-15 06:01:24, Johannes Weiner wrote: > > > [...] > > > > Preferrably, we'd get rid of all nofail allocations and replace them > > > > with preallocated reserves. But this is not going to happen anytime > > > > soon, so what other option do we have than resolving this on the OOM > > > > killer side? > > > > > > As I've mentioned in other email, we might give GFP_NOFAIL allocator > > > access to memory reserves (by giving it __GFP_HIGH). > > > > Won't work when you have thousands of concurrent transactions > > running in XFS and they are all doing GFP_NOFAIL allocations. > > Is there any bound on how many transactions can run at the same time? Yes. As many reservations that can fit in the available log space. The log can be sized up to 2GB, and for filesystems larger than 4TB will default to 2GB. Log space reservations depend on the operation being done - an inode timestamp update requires about 5kB of reservation, and rename requires about 200kB. Hence we can easily have thousands of active transactions, even in the worst case log space reversation cases. You're saying it would be insane to have hundreds or thousands of threads doing GFP_NOFAIL allocations concurrently. Reality check: XFS has been operating successfully under such workload conditions in production systems for many years. > > That's why I suggested the per-transaction reserve pool - we can use > > that > > I am still not sure what you mean by reserve pool (API wise). 
> How does it differ from pre-allocating memory before the "may not fail
> context"? Could you elaborate on it, please?

It is preallocating memory: into a reserve pool associated with the transaction, done as part of the transaction reservation mechanism we already have in XFS. The allocator then uses that reserve pool to allocate from if an allocation would otherwise fail.

There is no way we can preallocate specific objects before the transaction - that's just insane, especially given the unbounded demand-paged object requirement. Hence the need for a "preallocated reserve pool" that the allocator can dip into, covering the memory we need to *allocate and can't reclaim* during the course of the transaction.

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
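The idea Dave describes can be sketched outside the kernel. This is a hypothetical, simplified userspace model of a per-transaction reserve pool; the names `resv_pool` and `resv_alloc` are invented here, not XFS API, and real kernel code would need locking, accounting, and per-transaction lifetime management:

```c
#include <stdlib.h>

/* Hypothetical reserve pool: a single slab preallocated at
 * "transaction reservation" time, used only when the normal
 * allocator fails. */
struct resv_pool {
    char   *base;  /* preallocated backing memory  */
    size_t  size;  /* total bytes reserved         */
    size_t  used;  /* bump-pointer high-water mark */
};

/* Carve out the reserve up front, when failure is still tolerable. */
static int resv_pool_init(struct resv_pool *p, size_t size)
{
    p->base = malloc(size);
    if (!p->base)
        return -1;
    p->size = size;
    p->used = 0;
    return 0;
}

/* Try the normal allocator first; on failure, fall back to the
 * transaction's reserve so forward progress is guaranteed. */
static void *resv_alloc(struct resv_pool *p, size_t n)
{
    void *ptr = malloc(n);
    if (ptr)
        return ptr;
    if (p->used + n > p->size)
        return NULL;  /* reservation was undersized: a bug */
    ptr = p->base + p->used;
    p->used += n;
    return ptr;
}

/* Release the reserve when the "transaction" commits. */
static void resv_pool_release(struct resv_pool *p)
{
    free(p->base);
    p->base = NULL;
}
```

The key design point is that the reserve is sized by the reservation mechanism, not per object, which is how it can cover an unbounded set of demand-paged allocations during the transaction.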
* Re: How to handle TIF_MEMDIE stalls? 2015-02-17 22:54 ` Dave Chinner @ 2015-02-19 10:24 ` Johannes Weiner; 276+ messages in thread
From: Johannes Weiner @ 2015-02-19 10:24 UTC (permalink / raw)
To: Dave Chinner
Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds

On Wed, Feb 18, 2015 at 09:54:30AM +1100, Dave Chinner wrote:
> [ cc xfs list - experienced kernel devs should not have to be
> reminded to do this ]
>
> On Tue, Feb 17, 2015 at 07:53:15AM -0500, Johannes Weiner wrote:
> > On Tue, Feb 17, 2015 at 09:23:26PM +0900, Tetsuo Handa wrote:
> > > Tetsuo Handa wrote:
> > > > Johannes Weiner wrote:
> > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > > index 8e20f9c2fa5a..f77c58ebbcfa 100644
> > > > > --- a/mm/page_alloc.c
> > > > > +++ b/mm/page_alloc.c
> > > > > @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> > > > >  		if (high_zoneidx < ZONE_NORMAL)
> > > > >  			goto out;
> > > > >  		/* The OOM killer does not compensate for light reclaim */
> > > > > -		if (!(gfp_mask & __GFP_FS))
> > > > > +		if (!(gfp_mask & __GFP_FS)) {
> > > > > +			/*
> > > > > +			 * XXX: Page reclaim didn't yield anything,
> > > > > +			 * and the OOM killer can't be invoked, but
> > > > > +			 * keep looping as per should_alloc_retry().
> > > > > +			 */
> > > > > +			*did_some_progress = 1;
> > > > >  			goto out;
> > > > > +		}
> > > >
> > > > Why do you omit out_of_memory() call for GFP_NOIO / GFP_NOFS allocations?
> > >
> > > I can see "possible memory allocation deadlock in %s (mode:0x%x)" warnings
> > > at kmem_alloc() in fs/xfs/kmem.c. I think commit 9879de7373fcfb46 "mm:
> > > page_alloc: embed OOM killing naturally into allocation slowpath" introduced
> > > a regression and below one is the fix.
> > >
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -2381,9 +2381,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> > >  	/* The OOM killer does not needlessly kill tasks for lowmem */
> > >  	if (high_zoneidx < ZONE_NORMAL)
> > >  		goto out;
> > > -	/* The OOM killer does not compensate for light reclaim */
> > > -	if (!(gfp_mask & __GFP_FS))
> > > -		goto out;
> > >  	/*
> > >  	 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
> > >  	 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
> >
> > Again, we don't want to OOM kill on behalf of allocations that can't
> > initiate IO, or even actively prevent others from doing it. Not per
> > default anyway, because most callers can deal with the failure without
> > having to resort to killing tasks, and NOFS reclaim *can* easily fail.
> > It's the exceptions that should be annotated instead:
> >
> > void *
> > kmem_alloc(size_t size, xfs_km_flags_t flags)
> > {
> > 	int retries = 0;
> > 	gfp_t lflags = kmem_flags_convert(flags);
> > 	void *ptr;
> >
> > 	do {
> > 		ptr = kmalloc(size, lflags);
> > 		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
> > 			return ptr;
> > 		if (!(++retries % 100))
> > 			xfs_err(NULL,
> > 	"possible memory allocation deadlock in %s (mode:0x%x)",
> > 					__func__, lflags);
> > 		congestion_wait(BLK_RW_ASYNC, HZ/50);
> > 	} while (1);
> > }
> >
> > This should use __GFP_NOFAIL, which is not only designed to annotate
> > broken code like this, but also recognizes that endless looping on a
> > GFP_NOFS allocation needs the OOM killer after all to make progress.
> >
> > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > index a7a3a63bb360..17ced1805d3a 100644
> > --- a/fs/xfs/kmem.c
> > +++ b/fs/xfs/kmem.c
> > @@ -45,20 +45,12 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> >  void *
> >  kmem_alloc(size_t size, xfs_km_flags_t flags)
> >  {
> > -	int retries = 0;
> >  	gfp_t lflags = kmem_flags_convert(flags);
> > -	void *ptr;
> >
> > -	do {
> > -		ptr = kmalloc(size, lflags);
> > -		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
> > -			return ptr;
> > -		if (!(++retries % 100))
> > -			xfs_err(NULL,
> > -	"possible memory allocation deadlock in %s (mode:0x%x)",
> > -				__func__, lflags);
> > -		congestion_wait(BLK_RW_ASYNC, HZ/50);
> > -	} while (1);
> > +	if (!(flags & (KM_MAYFAIL | KM_NOSLEEP)))
> > +		lflags |= __GFP_NOFAIL;
> > +
> > +	return kmalloc(size, lflags);
> >  }

> Hmmm - the only reason there is a focus on this loop is that it
> emits warnings about allocations failing. It's obvious that the
> problem being dealt with here is a fundamental design issue w.r.t.
> locking and the OOM killer, but the proposed special casing
> hack^H^H^H^Hband aid^W^Wsolution is not "working" because some code
> in XFS started emitting warnings about allocations failing more
> often.
>
> So the answer is to remove the warning? That's like killing the
> canary to stop the methane leak in the coal mine. No canary? No
> problems!

That's not what happened. The patch that affected behavior here transformed code that was an incoherent collection of conditions into something that has an actual model. That model is that we don't loop in the allocator if there are no means of making forward progress. In this case, it was GFP_NOFS triggering an early exit from the allocator because it's not allowed to invoke the OOM killer per default, and there is little point in looping and waiting for times to get better on their own.
So these deadlock warnings happen, ironically, because the page allocator now bails out of a locked-up state in which it's not making forward progress. They don't strike me as a very useful canary in this case.

> Right now, the oom killer is a liability. Over the past 6 months
> I've slowly had to exclude filesystem regression tests from running
> on small memory machines because the OOM killer is now so unreliable
> that it kills the test harness regularly rather than the process
> generating memory pressure. That's a big red flag to me that all
> this hacking around the edges is not solving the underlying problem,
> but instead is breaking things that did once work.
>
> And, well, then there's this (gfp.h):
>
> * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
> * cannot handle allocation failures. This modifier is deprecated and no new
> * users should be added.
>
> So, is this another policy revelation from the mm developers about
> the kmalloc API? i.e. that __GFP_NOFAIL is no longer deprecated?
> Or just another symptom of frantic thrashing because nobody actually
> understands the problem or those that do are unwilling to throw out
> the broken crap and redesign it?

Well, understand our dilemma here. __GFP_NOFAIL is a liability because it can trap tasks with unknown state and locks in a potentially never-ending loop, and we don't want people to start using it as a convenient solution to get out of having a fallback strategy.

However, if your entire architecture around a particular allocation is that failure is not an option at this point, and you can't reasonably preallocate - although that would always be preferable - then please do not open-code an endless loop around the call to the allocator but use __GFP_NOFAIL instead so that these callsites are annotated and can be reviewed.
By giving the allocator this information, it can then also adjust its behavior, as is the case right here: we don't usually want to OOM kill for regular GFP_NOFS allocations because their reclaim powers are weak and we don't want to kill tasks prematurely. But if your NOFS allocation cannot fail under any circumstances, then the OOM killer should very much be employed to make any kind of forward progress at all for this allocation. It's just that the allocator needs to be made aware of this requirement.

So yes, we are wary of __GFP_NOFAIL allocations, but this is an instance where it's the right way to communicate with the allocator: it was introduced to replace exactly such open-coded endless loops, and to place the liability of making progress with the allocator, not the caller.

And please understand that this callsite blowing up is a chance to better the code and behavior here. Where previously it would just endlessly loop in the allocator without any means to make progress, converting it to a __GFP_NOFAIL allocation tells the allocator that it's fine to use the OOM killer in such an instance, improving the chances that this caller will actually make headway under heavy load.
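The shape of the conversion Johannes argues for, moving the retry policy out of the caller's open-coded loop and behind a flag the allocator understands, can be illustrated with a hypothetical userspace wrapper (invented names and flags, not the kernel's gfp API):

```c
#include <stdlib.h>
#include <unistd.h>

#define ALLOC_MAYFAIL  0x1   /* caller has a fallback path   */
#define ALLOC_NOFAIL   0x2   /* caller cannot handle failure */

/* Instead of each caller open-coding a retry loop, the caller's
 * intent is passed down as a flag so the allocator itself owns the
 * retry policy (and, in the kernel, can decide when to invoke the
 * OOM killer on the caller's behalf). */
static void *flagged_alloc(size_t size, int flags)
{
    for (;;) {
        void *ptr = malloc(size);
        if (ptr || (flags & ALLOC_MAYFAIL))
            return ptr;
        /* NOFAIL: back off briefly rather than busy-spinning,
         * analogous to the congestion_wait() in the original
         * kmem_alloc() loop. */
        usleep(20 * 1000);
    }
}
```

The point is not the loop itself but who owns it: once the no-fail requirement is visible at the allocation site, the allocator can review, adjust, or escalate its behavior centrally instead of every subsystem rolling its own.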
* Re: How to handle TIF_MEMDIE stalls? 2015-02-19 10:24 ` Johannes Weiner @ 2015-02-19 22:52 ` Dave Chinner; 276+ messages in thread
From: Dave Chinner @ 2015-02-19 22:52 UTC (permalink / raw)
To: Johannes Weiner
Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds

On Thu, Feb 19, 2015 at 05:24:31AM -0500, Johannes Weiner wrote:
> On Wed, Feb 18, 2015 at 09:54:30AM +1100, Dave Chinner wrote:
> > [ cc xfs list - experienced kernel devs should not have to be
> > reminded to do this ]
> >
> > On Tue, Feb 17, 2015 at 07:53:15AM -0500, Johannes Weiner wrote:
> > > -	do {
> > > -		ptr = kmalloc(size, lflags);
> > > -		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
> > > -			return ptr;
> > > -		if (!(++retries % 100))
> > > -			xfs_err(NULL,
> > > -	"possible memory allocation deadlock in %s (mode:0x%x)",
> > > -				__func__, lflags);
> > > -		congestion_wait(BLK_RW_ASYNC, HZ/50);
> > > -	} while (1);
> > > +	if (!(flags & (KM_MAYFAIL | KM_NOSLEEP)))
> > > +		lflags |= __GFP_NOFAIL;
> > > +
> > > +	return kmalloc(size, lflags);
> > > }
> >
> > Hmmm - the only reason there is a focus on this loop is that it
> > emits warnings about allocations failing. It's obvious that the
> > problem being dealt with here is a fundamental design issue w.r.t.
> > locking and the OOM killer, but the proposed special casing
> > hack^H^H^H^Hband aid^W^Wsolution is not "working" because some code
> > in XFS started emitting warnings about allocations failing more
> > often.
> >
> > So the answer is to remove the warning? That's like killing the
> > canary to stop the methane leak in the coal mine. No canary? No
> > problems!
>
> That's not what happened. The patch that affected behavior here
> transformed code that was an incoherent collection of conditions into
> something that has an actual model.

Which is entirely undocumented.
If you have a model, the first thing to do is document it and communicate that model to everyone who needs to know about it. I have no idea what that model is. Keeping it in your head and changing code that other people maintain without giving them any means of understanding WTF you are doing is a really bad engineering practice. And yes, I have had a bit to say about this in public recently. Go watch my recent LCA talk, for example....

And, FWIW, email discussions on a list are no substitute for a properly documented design that people can take their time to understand and digest.

> That model is that we don't loop
> in the allocator if there are no means of making forward progress. In
> this case, it was GFP_NOFS triggering an early exit from the allocator
> because it's not allowed to invoke the OOM killer per default, and
> there is little point in looping and waiting for times to get better
> on their own.

So you keep saying....

> So these deadlock warnings happen, ironically, because the page allocator
> now bails out of a locked-up state in which it's not making forward
> progress. They don't strike me as a very useful canary in this case.

... yet we *rarely* see the canary warnings we emit when we do too many allocation retries; the code has been that way for 13-odd years. Hence, despite your protestations that your way is *better*, we have code that is tried, tested and proven in rugged production environments. That's far more convincing evidence that the *code should not change* than your assertions that it is broken and needs to be fixed.

> > Right now, the oom killer is a liability. Over the past 6 months
> > I've slowly had to exclude filesystem regression tests from running
> > on small memory machines because the OOM killer is now so unreliable
> > that it kills the test harness regularly rather than the process
> > generating memory pressure.
> > That's a big red flag to me that all
> > this hacking around the edges is not solving the underlying problem,
> > but instead is breaking things that did once work.
> >
> > And, well, then there's this (gfp.h):
> >
> > * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
> > * cannot handle allocation failures. This modifier is deprecated and no new
> > * users should be added.
> >
> > So, is this another policy revelation from the mm developers about
> > the kmalloc API? i.e. that __GFP_NOFAIL is no longer deprecated?
> > Or just another symptom of frantic thrashing because nobody actually
> > understands the problem or those that do are unwilling to throw out
> > the broken crap and redesign it?
>
> Well, understand our dilemma here. __GFP_NOFAIL is a liability
> because it can trap tasks with unknown state and locks in a
> potentially never-ending loop, and we don't want people to start using
> it as a convenient solution to get out of having a fallback strategy.
>
> However, if your entire architecture around a particular allocation is
> that failure is not an option at this point, and you can't reasonably
> preallocate - although that would always be preferable - then please
> do not open-code an endless loop around the call to the allocator but
> use __GFP_NOFAIL instead so that these callsites are annotated and can
> be reviewed.

I will actively work around anything that causes filesystem memory pressure to increase the chance of OOM killer invocations. The OOM killer is not a solution - it is, by definition, a loose cannon, and so we should be reducing dependencies on it. I really don't care about the OOM killer corner cases - it's completely the wrong line of development to be spending time on, and you aren't going to convince me otherwise. The OOM killer is a crutch used to justify having a memory allocation subsystem that can't provide forward progress guarantee mechanisms to callers that need them.
I've proposed a method of providing this forward progress guarantee for subsystems of arbitrary complexity, and this removes the dependency on the OOM killer for forward allocation progress in such contexts (e.g. filesystems). We should be discussing how to implement that, not what bandaids we need to apply to the OOM killer. I want to fix the underlying problems, not push them under the OOM-killer bus...

> And please understand that this callsite blowing up is a chance to
> better the code and behavior here. Where previously it would just
> endlessly loop in the allocator without any means to make progress,

Again, this statement ignores the fact that we have *no credible evidence* that this is actually a problem in production environments. And, besides, even if you do force through changing the XFS code to GFP_NOFAIL, it'll get changed back to a retry loop in the near future when we add admin-configurable error handling behaviour to XFS, as I pointed Michal to.... (http://oss.sgi.com/archives/xfs/2015-02/msg00346.html)

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: How to handle TIF_MEMDIE stalls? @ 2015-02-19 22:52 ` Dave Chinner 0 siblings, 0 replies; 276+ messages in thread From: Dave Chinner @ 2015-02-19 22:52 UTC (permalink / raw) To: Johannes Weiner Cc: Tetsuo Handa, mhocko, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, torvalds, xfs On Thu, Feb 19, 2015 at 05:24:31AM -0500, Johannes Weiner wrote: > On Wed, Feb 18, 2015 at 09:54:30AM +1100, Dave Chinner wrote: > > [ cc xfs list - experienced kernel devs should not have to be > > reminded to do this ] > > > > On Tue, Feb 17, 2015 at 07:53:15AM -0500, Johannes Weiner wrote: > > > - do { > > > - ptr = kmalloc(size, lflags); > > > - if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP))) > > > - return ptr; > > > - if (!(++retries % 100)) > > > - xfs_err(NULL, > > > - "possible memory allocation deadlock in %s (mode:0x%x)", > > > - __func__, lflags); > > > - congestion_wait(BLK_RW_ASYNC, HZ/50); > > > - } while (1); > > > + if (!(flags & (KM_MAYFAIL | KM_NOSLEEP))) > > > + lflags |= __GFP_NOFAIL; > > > + > > > + return kmalloc(size, lflags); > > > } > > > > Hmmm - the only reason there is a focus on this loop is that it > > emits warnings about allocations failing. It's obvious that the > > problem being dealt with here is a fundamental design issue w.r.t. > > to locking and the OOM killer, but the proposed special casing > > hack^H^H^H^Hband aid^W^Wsolution is not "working" because some code > > in XFS started emitting warnings about allocations failing more > > often. > > > > So the answer is to remove the warning? That's like killing the > > canary to stop the methane leak in the coal mine. No canary? No > > problems! > > That's not what happened. The patch that affected behavior here > transformed code that an incoherent collection of conditions to > something that has an actual model. Which is entirely undocumented. If you have a model, the first thing to do is document it and communicate that model to everyone who needs to know about that new model. 
I have no idea what that model is. Keeping it in your head and changing code that other people maintain without giving them any means of understanding WTF you are doing is a really bad engineering practice. And yes, I have had a bit to say about this in public recently. Go watch my recent LCA talk, for example.... And, FWIW, email discussions on a list is no substitute for a properly documented design that people can take their time to understand and digest. > That model is that we don't loop > in the allocator if there are no means to making forward progress. In > this case, it was GFP_NOFS triggering an early exit from the allocator > because it's not allowed to invoke the OOM killer per default, and > there is little point in looping for times to better on their own. So you keep saying.... > So these deadlock warnings happen, ironically, by the page allocator > now bailing out of a locked-up state in which it's not making forward > progress. They don't strike me as a very useful canary in this case. ... yet we *rarely* see the canary warnings we emit when we do too many allocation retries, the code has been that way for 13-odd years. Hence, despite your protestations that your way is *better*, we have code that is tried, tested and proven in rugged production environments. That's far more convincing evidence that the *code should not change* than your assertions that it is broken and needs to be fixed. > > Right now, the oom killer is a liability. Over the past 6 months > > I've slowly had to exclude filesystem regression tests from running > > on small memory machines because the OOM killer is now so unreliable > > that it kills the test harness regularly rather than the process > > generating memory pressure. That's a big red flag to me that all > > this hacking around the edges is not solving the underlying problem, > > but instead is breaking things that did once work. 
> > > > And, well, then there's this (gfp.h): > > > > * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller > > * cannot handle allocation failures. This modifier is deprecated and no new > > * users should be added. > > > > So, is this another policy relevation from the mm developers about > > the kmalloc API? i.e. that __GFP_NOFAIL is no longer deprecated? > > Or just another symptom of frantic thrashing because nobody actually > > understands the problem or those that do are unwilling to throw out > > the broken crap and redesign it? > > Well, understand our dilemma here. __GFP_NOFAIL is a liability > because it can trap tasks with unknown state and locks in a > potentially never ending loop, and we don't want people to start using > it as a convenient solution to get out of having a fallback strategy. > > However, if your entire architecture around a particular allocation is > that failure is not an option at this point, and you can't reasonably > preallocate - although that would always be preferrable - then please > do not open code an endless loop around the call to the allocator but > use __GFP_NOFAIL instead so that these callsites are annotated and can > be reviewed. I will actively work around aanything that causes filesystem memory pressure to increase the chance of oom killer invocations. The OOM killer is not a solution - it is, by definition, a loose cannon and so we should be reducing dependencies on it. I really don't care about the OOM Killer corner cases - it's completely the wrong way line of development to be spending time on and you aren't going to convince me otherwise. The OOM killer a crutch used to justify having a memory allocation subsystem that can't provide forward progress guarantee mechanisms to callers that need it. 
I've proposed a method of providing this forward progress guarantee for subsystems of arbitrary complexity, and this removes the dependency on the OOM killer for forward allocation progress in such contexts (e.g. filesystems). We should be discussing how to implement that, not what bandaids we need to apply to the OOM killer. I want to fix the underlying problems, not push them under the OOM-killer bus...

> And please understand that this callsite blowing up is a chance to
> better the code and behavior here. Where previously it would just
> endlessly loop in the allocator without any means to make progress,

Again, this statement ignores the fact we have *no credible evidence* that this is actually a problem in production environments.

And, besides, even if you do force through changing the XFS code to GFP_NOFAIL, it'll get changed back to a retry loop in the near future when we add admin configurable error handling behaviour to XFS, as I pointed Michal to.... (http://oss.sgi.com/archives/xfs/2015-02/msg00346.html)

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org
* Re: How to handle TIF_MEMDIE stalls?
From: Tetsuo Handa @ 2015-02-20 10:36 UTC
To: david, hannes
Cc: dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds

Dave Chinner wrote:
> I really don't care about the OOM Killer corner cases - it's
> completely the wrong line of development to be spending time on
> and you aren't going to convince me otherwise. The OOM killer is a
> crutch used to justify having a memory allocation subsystem that
> can't provide forward progress guarantee mechanisms to callers that
> need it.

I really care about the OOM Killer corner cases, for I'm

(1) seeing trouble cases which occurred in enterprise systems under OOM conditions

(2) trying to downgrade OOM "Deadlock or Genocide" attacks (which an unprivileged user with a login shell can trivially trigger since Linux 2.0) to OOM "Genocide" attacks, in order to allow OOM-unkillable daemons to restart OOM-killed processes

(3) waiting for a bandaid for (2) in order to propose changes for mitigating OOM "Genocide" attacks (as bad guys will find how to trigger OOM "Deadlock or Genocide" attacks from changes for mitigating OOM "Genocide" attacks)

I started posting to the linux-mm ML in order to make forward progress on (1) and (2). I don't want the memory allocation subsystem to lock up an entire system by indefinitely disabling the memory releasing mechanism provided by the OOM killer.

> I've proposed a method of providing this forward progress guarantee
> for subsystems of arbitrary complexity, and this removes the
> dependency on the OOM killer for forward allocation progress in such
> contexts (e.g. filesystems). We should be discussing how to
> implement that, not what bandaids we need to apply to the OOM
> killer. I want to fix the underlying problems, not push them under
> the OOM-killer bus...
I'm fine with that direction for new kernels, provided that a simple bandaid which can be backported to distributor kernels for making OOM "Deadlock" attacks impossible is implemented. Therefore, I'm discussing what bandaids we need to apply to the OOM killer.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: How to handle TIF_MEMDIE stalls?
From: Dave Chinner @ 2015-02-20 23:15 UTC
To: Tetsuo Handa
Cc: hannes, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds

On Fri, Feb 20, 2015 at 07:36:33PM +0900, Tetsuo Handa wrote:
> Dave Chinner wrote:
> > I really don't care about the OOM Killer corner cases - it's
> > completely the wrong line of development to be spending time on
> > and you aren't going to convince me otherwise. The OOM killer is a
> > crutch used to justify having a memory allocation subsystem that
> > can't provide forward progress guarantee mechanisms to callers that
> > need it.
>
> I really care about the OOM Killer corner cases, for I'm
>
> (1) seeing trouble cases which occurred in enterprise systems
> under OOM conditions

You reach OOM, then your SLAs are dead and buried. Reboot the box - it's a much more reliable way of returning to a working system than playing Russian Roulette with the OOM killer.

> (2) trying to downgrade OOM "Deadlock or Genocide" attacks (which
> an unprivileged user with a login shell can trivially trigger
> since Linux 2.0) to OOM "Genocide" attacks in order to allow
> OOM-unkillable daemons to restart OOM-killed processes
>
> (3) waiting for a bandaid for (2) in order to propose changes for
> mitigating OOM "Genocide" attacks (as bad guys will find how to
> trigger OOM "Deadlock or Genocide" attacks from changes for
> mitigating OOM "Genocide" attacks)

Which is yet another indication that the OOM killer is the wrong solution to the "lack of forward progress" problem. Anyone can generate enough memory pressure to trigger the OOM killer; we can't prevent that from occurring when the OOM killer can be invoked by user processes.

> I started posting to linux-mm ML in order to make forward progress
> about (1) and (2).
> I don't want the memory allocation subsystem to
> lock up an entire system by indefinitely disabling the memory releasing
> mechanism provided by the OOM killer.
>
> > I've proposed a method of providing this forward progress guarantee
> > for subsystems of arbitrary complexity, and this removes the
> > dependency on the OOM killer for forward allocation progress in such
> > contexts (e.g. filesystems). We should be discussing how to
> > implement that, not what bandaids we need to apply to the OOM
> > killer. I want to fix the underlying problems, not push them under
> > the OOM-killer bus...
>
> I'm fine with that direction for new kernels, provided that a simple
> bandaid which can be backported to distributor kernels for making
> OOM "Deadlock" attacks impossible is implemented. Therefore, I'm
> discussing what bandaids we need to apply to the OOM killer.

The band-aids being proposed are worse than the problem they are intended to cover up, in which case the band-aids should not be applied.

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: How to handle TIF_MEMDIE stalls?
From: Theodore Ts'o @ 2015-02-21 3:20 UTC
To: Dave Chinner
Cc: Tetsuo Handa, hannes, mhocko, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, torvalds, xfs, linux-ext4

+akpm

So I'm arriving late to this discussion since I've been in conference mode for the past week, and I'm only now catching up on this thread.

I'll note that this whole question of whether or not file systems should use GFP_NOFAIL is one where the mm developers are not of one mind. In fact, search for the subject line "fs/reiserfs/journal.c: Remove obsolete __GFP_NOFAIL", where we recapitulated many of these arguments; Andrew Morton said that it was better to use GFP_NOFAIL over the alternatives of (a) panicking the kernel because the file system has no way to move forward other than leaving the file system corrupted, or (b) looping in the file system to retry the memory allocation to avoid the unfortunate effects of (a). So based on akpm's sage advice and wisdom, I added back GFP_NOFAIL to ext4/jbd2.

It sounds like 9879de7373fc is causing massive file system errors, and it seems **really** unfortunate it was added so late in the day (between -rc6 and -rc7). So at this point, it seems we have two choices. We can either revert 9879de7373fc, or I can add a whole lot more GFP_NOFAIL flags to ext4's memory allocations and submit them as stable bug fixes.

Linux MM developers, this is your call. I will liberally be adding GFP_NOFAIL to ext4 if you won't revert the commit, because that's the only way I can fix things with minimal risk of adding additional, potentially more serious regressions.

- Ted
* Re: How to handle TIF_MEMDIE stalls?
From: Andrew Morton @ 2015-02-21 9:19 UTC
To: Theodore Ts'o
Cc: Dave Chinner, Tetsuo Handa, hannes, mhocko, dchinner, linux-mm, rientjes, oleg, mgorman, torvalds, xfs, linux-ext4

On Fri, 20 Feb 2015 22:20:00 -0500 "Theodore Ts'o" <tytso@mit.edu> wrote:
> +akpm

I was hoping not to have to read this thread ;)

afaict there are two (main) issues:

a) whether to oom-kill when __GFP_FS is not set. The kernel hasn't been doing this for ages and nothing has changed recently.

b) whether to keep looping when __GFP_NOFAIL is not set and __GFP_FS is not set and we can't oom-kill anything (which goes without saying, because __GFP_FS isn't set!). And 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into allocation slowpath") somewhat inadvertently changed this policy - the allocation attempt will now promptly return ENOMEM if !__GFP_NOFAIL and !__GFP_FS.

Correct enough? Question a) seems a bit of a red herring and we can park it for now.

What I'm not really understanding is why the pre-3.19 implementation actually worked. We've exhausted the free pages, we're not succeeding at reclaiming anything, we aren't able to oom-kill anyone. Yet it *does* work - we eventually find that memory and everything proceeds. How come? Where did that memory come from?

Short term, we need to fix 3.19.x and 3.20 and that appears to be by applying Johannes's akpm-doesnt-know-why-it-works patch:

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	if (high_zoneidx < ZONE_NORMAL)
 		goto out;
 	/* The OOM killer does not compensate for light reclaim */
-	if (!(gfp_mask & __GFP_FS))
+	if (!(gfp_mask & __GFP_FS)) {
+		/*
+		 * XXX: Page reclaim didn't yield anything,
+		 * and the OOM killer can't be invoked, but
+		 * keep looping as per should_alloc_retry().
+		 */
+		*did_some_progress = 1;
 		goto out;
+	}
 	/*
 	 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
 	 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.

Have people adequately confirmed that this gets us out of trouble?

And yes, I agree that sites such as xfs's kmem_alloc() should be passing __GFP_NOFAIL to tell the page allocator what's going on. I don't think it matters a lot whether kmem_alloc() retains its retry loop. If __GFP_NOFAIL is working correctly then it will never loop anyway...

Also, this:

On Wed, 18 Feb 2015 09:54:30 +1100 Dave Chinner <david@fromorbit.com> wrote:
> Right now, the oom killer is a liability. Over the past 6 months
> I've slowly had to exclude filesystem regression tests from running
> on small memory machines because the OOM killer is now so unreliable
> that it kills the test harness regularly rather than the process
> generating memory pressure.

David, I did not know this! If you've been telling us about this then perhaps it wasn't loud enough.
* Re: How to handle TIF_MEMDIE stalls?
From: Tetsuo Handa @ 2015-02-21 13:48 UTC
To: akpm
Cc: tytso, david, hannes, mhocko, dchinner, linux-mm, rientjes, oleg, mgorman, torvalds, xfs, linux-ext4

Andrew Morton wrote:
> On Fri, 20 Feb 2015 22:20:00 -0500 "Theodore Ts'o" <tytso@mit.edu> wrote:
> > +akpm
>
> I was hoping not to have to read this thread ;)

Sorry for making this so complicated.

> What I'm not really understanding is why the pre-3.19 implementation
> actually worked. We've exhausted the free pages, we're not succeeding
> at reclaiming anything, we aren't able to oom-kill anyone. Yet it
> *does* work - we eventually find that memory and everything proceeds.
>
> How come? Where did that memory come from?

Even without __GFP_NOFAIL, GFP_NOFS / GFP_NOIO allocations retried forever (without invoking the OOM killer) if order <= PAGE_ALLOC_COSTLY_ORDER and TIF_MEMDIE is not set. Somebody else volunteered that memory while we were retrying. This implies a silent hang-up forever if nobody volunteers memory.

> And yes, I agree that sites such as xfs's kmem_alloc() should be
> passing __GFP_NOFAIL to tell the page allocator what's going on. I
> don't think it matters a lot whether kmem_alloc() retains its retry
> loop. If __GFP_NOFAIL is working correctly then it will never loop
> anyway...

Commit 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into allocation slowpath") inadvertently changed GFP_NOFS / GFP_NOIO allocations not to retry unless __GFP_NOFAIL is specified. Therefore, either applying Johannes's akpm-doesnt-know-why-it-works patch or passing __GFP_NOFAIL will restore the pre-3.19 behavior (with the possibility of silent hang-up).
* Re: How to handle TIF_MEMDIE stalls?
From: Dave Chinner @ 2015-02-21 21:38 UTC
To: Andrew Morton
Cc: Theodore Ts'o, Tetsuo Handa, hannes, mhocko, dchinner, linux-mm, rientjes, oleg, mgorman, torvalds, xfs, linux-ext4

On Sat, Feb 21, 2015 at 01:19:07AM -0800, Andrew Morton wrote:
> On Fri, 20 Feb 2015 22:20:00 -0500 "Theodore Ts'o" <tytso@mit.edu> wrote:
> > +akpm
>
> I was hoping not to have to read this thread ;)

ditto....

> And yes, I agree that sites such as xfs's kmem_alloc() should be
> passing __GFP_NOFAIL to tell the page allocator what's going on. I
> don't think it matters a lot whether kmem_alloc() retains its retry
> loop. If __GFP_NOFAIL is working correctly then it will never loop
> anyway...

I'm not about to change behaviour "just because". Any sort of change like this requires a *lot* of low memory regression testing because we'd be replacing long-standing known behaviour with behaviour that changes without warning, e.g. the ext4 low memory failures starting because of changes made in 3.19-rc6 due to changes in oom-killer behaviour. Those changes *did not affect XFS*, and that's the way I'd like things to remain.

Put simply: right now I don't trust the mm subsystem to get low memory behaviour right, and this thread has done nothing to convince me that it's going to improve any time soon.

> Also, this:
>
> On Wed, 18 Feb 2015 09:54:30 +1100 Dave Chinner <david@fromorbit.com> wrote:
>
> > Right now, the oom killer is a liability. Over the past 6 months
> > I've slowly had to exclude filesystem regression tests from running
> > on small memory machines because the OOM killer is now so unreliable
> > that it kills the test harness regularly rather than the process
> > generating memory pressure.
>
> David, I did not know this! If you've been telling us about this then
> perhaps it wasn't loud enough.

IME, such bug reports get ignored. Instead, over the past few months I have been pointing out bugs and problems in the oom-killer in threads like this because it seems to be the only way to get any attention to the issues I'm seeing. Bug reports simply get ignored.

From this process, I've managed to learn that low order memory allocation now never fails (contrary to documentation and long-standing behavioural expectations) and pointed out bugs that cause the oom killer to get invoked when the filesystem is saying "I can handle ENOMEM!" (commit 45f87de ("mm: get rid of radix tree gfp mask for pagecache_get_page")).

And yes, I've definitely mentioned in these discussions that, for example, xfstests::generic/224 is triggering the oom killer far more often than it used to on my 1GB RAM vm. The only fix that has been made recently that's made any difference is 45f87de, so it's a slow process of raising awareness and trying to ensure things don't get worse before they get better....

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: How to handle TIF_MEMDIE stalls? 2015-02-21 9:19 ` Andrew Morton @ 2015-02-22 0:20 ` Johannes Weiner -1 siblings, 0 replies; 276+ messages in thread From: Johannes Weiner @ 2015-02-22 0:20 UTC (permalink / raw) To: Andrew Morton Cc: Theodore Ts'o, Dave Chinner, Tetsuo Handa, mhocko, dchinner, linux-mm, rientjes, oleg, mgorman, torvalds, xfs, linux-ext4 On Sat, Feb 21, 2015 at 01:19:07AM -0800, Andrew Morton wrote: > Short term, we need to fix 3.19.x and 3.20 and that appears to be by > applying Johannes's akpm-doesnt-know-why-it-works patch: > > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, > if (high_zoneidx < ZONE_NORMAL) > goto out; > /* The OOM killer does not compensate for light reclaim */ > - if (!(gfp_mask & __GFP_FS)) > + if (!(gfp_mask & __GFP_FS)) { > + /* > + * XXX: Page reclaim didn't yield anything, > + * and the OOM killer can't be invoked, but > + * keep looping as per should_alloc_retry(). > + */ > + *did_some_progress = 1; > goto out; > + } > /* > * GFP_THISNODE contains __GFP_NORETRY and we never hit this. > * Sanity check for bare calls of __GFP_THISNODE, not real OOM. > > Have people adequately confirmed that this gets us out of trouble? I'd be interested in this too. Who is seeing these failures? Andrew, can you please use the following changelog for this patch? --- From: Johannes Weiner <hannes@cmpxchg.org> mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change Historically, !__GFP_FS allocations were not allowed to invoke the OOM killer once reclaim had failed, but nevertheless kept looping in the allocator. 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into allocation slowpath"), which should have been a simple cleanup patch, accidentally changed the behavior to aborting the allocation at that point. This creates problems with filesystem callers (?) that currently rely on the allocator waiting for other tasks to intervene. 
Revert the behavior as it shouldn't have been changed as part of a cleanup patch. Fixes: 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into allocation slowpath") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> --- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-22 0:20 ` Johannes Weiner (?) @ 2015-02-23 10:48 ` Michal Hocko -1 siblings, 0 replies; 276+ messages in thread From: Michal Hocko @ 2015-02-23 10:48 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, Theodore Ts'o, Dave Chinner, Tetsuo Handa, dchinner, linux-mm, rientjes, oleg, mgorman, torvalds, xfs, linux-ext4 On Sat 21-02-15 19:20:58, Johannes Weiner wrote: > On Sat, Feb 21, 2015 at 01:19:07AM -0800, Andrew Morton wrote: > > Short term, we need to fix 3.19.x and 3.20 and that appears to be by > > applying Johannes's akpm-doesnt-know-why-it-works patch: > > > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, > > if (high_zoneidx < ZONE_NORMAL) > > goto out; > > /* The OOM killer does not compensate for light reclaim */ > > - if (!(gfp_mask & __GFP_FS)) > > + if (!(gfp_mask & __GFP_FS)) { > > + /* > > + * XXX: Page reclaim didn't yield anything, > > + * and the OOM killer can't be invoked, but > > + * keep looping as per should_alloc_retry(). > > + */ > > + *did_some_progress = 1; > > goto out; > > + } > > /* > > * GFP_THISNODE contains __GFP_NORETRY and we never hit this. > > * Sanity check for bare calls of __GFP_THISNODE, not real OOM. > > > > Have people adequately confirmed that this gets us out of trouble? > > I'd be interested in this too. Who is seeing these failures? > > Andrew, can you please use the following changelog for this patch? > > --- > From: Johannes Weiner <hannes@cmpxchg.org> > > mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change > > Historically, !__GFP_FS allocations were not allowed to invoke the OOM > killer once reclaim had failed, but nevertheless kept looping in the > allocator. 
9879de7373fc ("mm: page_alloc: embed OOM killing naturally > into allocation slowpath"), which should have been a simple cleanup > patch, accidentally changed the behavior to aborting the allocation at > that point. This creates problems with filesystem callers (?) that > currently rely on the allocator waiting for other tasks to intervene. > > Revert the behavior as it shouldn't have been changed as part of a > cleanup patch. OK, if this is a _short term_ change. I really think that all the requests except for __GFP_NOFAIL should be able to fail. I would argue that it should be the caller who should be fixed, but it is true that the patch was introduced too late (rc7) and so it caught other subsystems unprepared, so backporting to stable makes sense to me. But can we please move on and stop pretending that allocations do not fail for the upcoming release? > Fixes: 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into allocation slowpath") > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-23 10:48 ` Michal Hocko (?) @ 2015-02-23 11:23 ` Tetsuo Handa -1 siblings, 0 replies; 276+ messages in thread From: Tetsuo Handa @ 2015-02-23 11:23 UTC (permalink / raw) To: mhocko, hannes Cc: akpm, tytso, david, dchinner, linux-mm, rientjes, oleg, mgorman, torvalds, xfs, linux-ext4 Michal Hocko wrote: > On Sat 21-02-15 19:20:58, Johannes Weiner wrote: > > On Sat, Feb 21, 2015 at 01:19:07AM -0800, Andrew Morton wrote: > > > Short term, we need to fix 3.19.x and 3.20 and that appears to be by > > > applying Johannes's akpm-doesnt-know-why-it-works patch: > > > > > > --- a/mm/page_alloc.c > > > +++ b/mm/page_alloc.c > > > @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, > > > if (high_zoneidx < ZONE_NORMAL) > > > goto out; > > > /* The OOM killer does not compensate for light reclaim */ > > > - if (!(gfp_mask & __GFP_FS)) > > > + if (!(gfp_mask & __GFP_FS)) { > > > + /* > > > + * XXX: Page reclaim didn't yield anything, > > > + * and the OOM killer can't be invoked, but > > > + * keep looping as per should_alloc_retry(). > > > + */ > > > + *did_some_progress = 1; > > > goto out; > > > + } > > > /* > > > * GFP_THISNODE contains __GFP_NORETRY and we never hit this. > > > * Sanity check for bare calls of __GFP_THISNODE, not real OOM. > > > > > > Have people adequately confirmed that this gets us out of trouble? > > > > I'd be interested in this too. Who is seeing these failures? So far ext4 and xfs. I don't have environment to test other filesystems. > > > > Andrew, can you please use the following changelog for this patch? > > > > --- > > From: Johannes Weiner <hannes@cmpxchg.org> > > > > mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change > > > > Historically, !__GFP_FS allocations were not allowed to invoke the OOM > > killer once reclaim had failed, but nevertheless kept looping in the > > allocator. 
9879de7373fc ("mm: page_alloc: embed OOM killing naturally > > into allocation slowpath"), which should have been a simple cleanup > > patch, accidentally changed the behavior to aborting the allocation at > > that point. This creates problems with filesystem callers (?) that > > currently rely on the allocator waiting for other tasks to intervene. > > > > Revert the behavior as it shouldn't have been changed as part of a > > cleanup patch. > > OK, if this a _short term_ change. I really think that all the requests > except for __GFP_NOFAIL should be able to fail. I would argue that it > should be the caller who should be fixed but it is true that the patch > was introduced too late (rc7) and so it caught other subsystems > unprepared so backporting to stable makes sense to me. But can we please > move on and stop pretending that allocations do not fail for the > upcoming release? > > > Fixes: 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into allocation slowpath") > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > > Acked-by: Michal Hocko <mhocko@suse.cz> > Without this patch, I think the system becomes unusable under OOM. However, with this patch, I know the system may become unusable under OOM. Please do write patches for handling the condition below. Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Johannes's patch will get us out of filesystem error troubles, at the cost of getting us into stall troubles (as was the case until 3.19-rc6). I retested http://marc.info/?l=linux-ext4&m=142443125221571&w=2 with the debug printk patch shown below.
---------- debug printk patch ---------- diff --git a/mm/oom_kill.c b/mm/oom_kill.c index d503e9c..5144506 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -610,6 +610,8 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask) spin_unlock(&zone_scan_lock); } +atomic_t oom_killer_skipped_count = ATOMIC_INIT(0); + /** * out_of_memory - kill the "best" process when we run out of memory * @zonelist: zonelist pointer @@ -679,6 +681,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, nodemask, "Out of memory"); killed = 1; } + else + atomic_inc(&oom_killer_skipped_count); out: /* * Give the killed threads a good chance of exiting before trying to diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8e20f9c..eaea16b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, if (high_zoneidx < ZONE_NORMAL) goto out; /* The OOM killer does not compensate for light reclaim */ - if (!(gfp_mask & __GFP_FS)) + if (!(gfp_mask & __GFP_FS)) { + /* + * XXX: Page reclaim didn't yield anything, + * and the OOM killer can't be invoked, but + * keep looping as per should_alloc_retry(). + */ + *did_some_progress = 1; goto out; + } /* * GFP_THISNODE contains __GFP_NORETRY and we never hit this. * Sanity check for bare calls of __GFP_THISNODE, not real OOM. 
@@ -2635,6 +2642,8 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask) return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS); } +extern atomic_t oom_killer_skipped_count; + static inline struct page * __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist, enum zone_type high_zoneidx, @@ -2649,6 +2658,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, enum migrate_mode migration_mode = MIGRATE_ASYNC; bool deferred_compaction = false; int contended_compaction = COMPACT_CONTENDED_NONE; + unsigned long first_retried_time = 0; + unsigned long next_warn_time = 0; /* * In the slowpath, we sanity check order to avoid ever trying to @@ -2821,6 +2832,19 @@ retry: if (!did_some_progress) goto nopage; } + if (!first_retried_time) { + first_retried_time = jiffies; + if (!first_retried_time) + first_retried_time = 1; + next_warn_time = first_retried_time + 5 * HZ; + } else if (time_after(jiffies, next_warn_time)) { + printk(KERN_INFO "%d (%s) : gfp 0x%X : %lu seconds : " + "OOM-killer skipped %u\n", current->pid, + current->comm, gfp_mask, + (jiffies - first_retried_time) / HZ, + atomic_read(&oom_killer_skipped_count)); + next_warn_time = jiffies + 5 * HZ; + } /* Wait for some write requests to complete then retry */ wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50); goto retry; ---------- debug printk patch ---------- GFP_NOFS allocations stalled for 10 minutes waiting for somebody else to volunteer memory. GFP_FS allocations stalled for 10 minutes waiting for the OOM killer to kill somebody. The OOM killer stalled for 10 minutes waiting for GFP_NOFS allocations to complete. I guess the system made forward progress because the number of remaining a.out processes decreased over time. 
(From http://I-love.SAKURA.ne.jp/tmp/serial-20150223-3.19-ext4-patched.txt.xz ) ---------- ext4 / Linux 3.19 + patch ---------- [ 1335.187579] Out of memory: Kill process 14156 (a.out) score 760 or sacrifice child [ 1335.189604] Killed process 14156 (a.out) total-vm:2167392kB, anon-rss:1360196kB, file-rss:0kB [ 1335.191920] Kill process 14177 (a.out) sharing same memory [ 1335.193465] Kill process 14178 (a.out) sharing same memory [ 1335.195013] Kill process 14179 (a.out) sharing same memory [ 1335.196580] Kill process 14180 (a.out) sharing same memory [ 1335.198128] Kill process 14181 (a.out) sharing same memory [ 1335.199674] Kill process 14182 (a.out) sharing same memory [ 1335.201217] Kill process 14183 (a.out) sharing same memory [ 1335.202768] Kill process 14184 (a.out) sharing same memory [ 1335.204316] Kill process 14185 (a.out) sharing same memory [ 1335.205871] Kill process 14186 (a.out) sharing same memory [ 1335.207420] Kill process 14187 (a.out) sharing same memory [ 1335.208974] Kill process 14188 (a.out) sharing same memory [ 1335.210515] Kill process 14189 (a.out) sharing same memory [ 1335.212063] Kill process 14190 (a.out) sharing same memory [ 1335.213611] Kill process 14191 (a.out) sharing same memory [ 1335.215165] Kill process 14192 (a.out) sharing same memory [ 1335.216715] Kill process 14193 (a.out) sharing same memory [ 1335.218286] Kill process 14194 (a.out) sharing same memory [ 1335.219836] Kill process 14195 (a.out) sharing same memory [ 1335.221378] Kill process 14196 (a.out) sharing same memory [ 1335.222918] Kill process 14197 (a.out) sharing same memory [ 1335.224461] Kill process 14198 (a.out) sharing same memory [ 1335.225999] Kill process 14199 (a.out) sharing same memory [ 1335.227545] Kill process 14200 (a.out) sharing same memory [ 1335.229095] Kill process 14201 (a.out) sharing same memory [ 1335.230643] Kill process 14202 (a.out) sharing same memory [ 1335.232184] Kill process 14203 (a.out) sharing same memory [ 1335.233738] 
Kill process 14204 (a.out) sharing same memory [ 1335.235293] Kill process 14205 (a.out) sharing same memory [ 1335.236834] Kill process 14206 (a.out) sharing same memory [ 1335.238387] Kill process 14207 (a.out) sharing same memory [ 1335.239930] Kill process 14208 (a.out) sharing same memory [ 1335.241471] Kill process 14209 (a.out) sharing same memory [ 1335.243011] Kill process 14210 (a.out) sharing same memory [ 1335.244554] Kill process 14211 (a.out) sharing same memory [ 1335.246101] Kill process 14212 (a.out) sharing same memory [ 1335.247645] Kill process 14213 (a.out) sharing same memory [ 1335.249182] Kill process 14214 (a.out) sharing same memory [ 1335.250718] Kill process 14215 (a.out) sharing same memory [ 1335.252305] Kill process 14216 (a.out) sharing same memory [ 1335.253899] Kill process 14217 (a.out) sharing same memory [ 1335.255443] Kill process 14218 (a.out) sharing same memory [ 1335.256993] Kill process 14219 (a.out) sharing same memory [ 1335.258531] Kill process 14220 (a.out) sharing same memory [ 1335.260066] Kill process 14221 (a.out) sharing same memory [ 1335.261616] Kill process 14222 (a.out) sharing same memory [ 1335.263143] Kill process 14223 (a.out) sharing same memory [ 1335.264647] Kill process 14224 (a.out) sharing same memory [ 1335.266121] Kill process 14225 (a.out) sharing same memory [ 1335.267598] Kill process 14226 (a.out) sharing same memory [ 1335.269077] Kill process 14227 (a.out) sharing same memory [ 1335.270560] Kill process 14228 (a.out) sharing same memory [ 1335.272038] Kill process 14229 (a.out) sharing same memory [ 1335.273508] Kill process 14230 (a.out) sharing same memory [ 1335.274999] Kill process 14231 (a.out) sharing same memory [ 1335.276469] Kill process 14232 (a.out) sharing same memory [ 1335.277947] Kill process 14233 (a.out) sharing same memory [ 1335.279428] Kill process 14234 (a.out) sharing same memory [ 1335.280894] Kill process 14235 (a.out) sharing same memory [ 1335.282361] Kill process 
14236 (a.out) sharing same memory [ 1335.283832] Kill process 14237 (a.out) sharing same memory [ 1335.285304] Kill process 14238 (a.out) sharing same memory [ 1335.286768] Kill process 14239 (a.out) sharing same memory [ 1335.288242] Kill process 14240 (a.out) sharing same memory [ 1335.289714] Kill process 14241 (a.out) sharing same memory [ 1335.291196] Kill process 14242 (a.out) sharing same memory [ 1335.292731] Kill process 14243 (a.out) sharing same memory [ 1335.294258] Kill process 14244 (a.out) sharing same memory [ 1335.295734] Kill process 14245 (a.out) sharing same memory [ 1335.297215] Kill process 14246 (a.out) sharing same memory [ 1335.298710] Kill process 14247 (a.out) sharing same memory [ 1335.300188] Kill process 14248 (a.out) sharing same memory [ 1335.301672] Kill process 14249 (a.out) sharing same memory [ 1335.303157] Kill process 14250 (a.out) sharing same memory [ 1335.304655] Kill process 14251 (a.out) sharing same memory [ 1335.306141] Kill process 14252 (a.out) sharing same memory [ 1335.307621] Kill process 14253 (a.out) sharing same memory [ 1335.309107] Kill process 14254 (a.out) sharing same memory [ 1335.310573] Kill process 14255 (a.out) sharing same memory [ 1335.312052] Kill process 14256 (a.out) sharing same memory [ 1335.313528] Kill process 14257 (a.out) sharing same memory [ 1335.315039] Kill process 14258 (a.out) sharing same memory [ 1335.316522] Kill process 14259 (a.out) sharing same memory [ 1335.317992] Kill process 14260 (a.out) sharing same memory [ 1335.319462] Kill process 14261 (a.out) sharing same memory [ 1335.320965] Kill process 14262 (a.out) sharing same memory [ 1335.322459] Kill process 14263 (a.out) sharing same memory [ 1335.323958] Kill process 14264 (a.out) sharing same memory [ 1335.325472] Kill process 14265 (a.out) sharing same memory [ 1335.326966] Kill process 14266 (a.out) sharing same memory [ 1335.328454] Kill process 14267 (a.out) sharing same memory [ 1335.329945] Kill process 14268 (a.out) 
* Re: How to handle TIF_MEMDIE stalls?
@ 2015-02-23 11:23 ` Tetsuo Handa
  0 siblings, 0 replies; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-23 11:23 UTC (permalink / raw)
  To: mhocko, hannes
  Cc: akpm, tytso, david, dchinner, linux-mm, rientjes, oleg, mgorman, torvalds, xfs, linux-ext4

Michal Hocko wrote:
> On Sat 21-02-15 19:20:58, Johannes Weiner wrote:
> > On Sat, Feb 21, 2015 at 01:19:07AM -0800, Andrew Morton wrote:
> > > Short term, we need to fix 3.19.x and 3.20 and that appears to be by
> > > applying Johannes's akpm-doesnt-know-why-it-works patch:
> > >
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> > >  	if (high_zoneidx < ZONE_NORMAL)
> > >  		goto out;
> > >  	/* The OOM killer does not compensate for light reclaim */
> > > -	if (!(gfp_mask & __GFP_FS))
> > > +	if (!(gfp_mask & __GFP_FS)) {
> > > +		/*
> > > +		 * XXX: Page reclaim didn't yield anything,
> > > +		 * and the OOM killer can't be invoked, but
> > > +		 * keep looping as per should_alloc_retry().
> > > +		 */
> > > +		*did_some_progress = 1;
> > >  		goto out;
> > > +	}
> > >  	/*
> > >  	 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
> > >  	 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
> > >
> > > Have people adequately confirmed that this gets us out of trouble?
> >
> > I'd be interested in this too. Who is seeing these failures?

So far ext4 and xfs. I don't have an environment to test other filesystems.

> > Andrew, can you please use the following changelog for this patch?
> >
> > ---
> > From: Johannes Weiner <hannes@cmpxchg.org>
> >
> > mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change
> >
> > Historically, !__GFP_FS allocations were not allowed to invoke the OOM
> > killer once reclaim had failed, but nevertheless kept looping in the
> > allocator.
> > 9879de7373fc ("mm: page_alloc: embed OOM killing naturally
> > into allocation slowpath"), which should have been a simple cleanup
> > patch, accidentally changed the behavior to aborting the allocation at
> > that point. This creates problems with filesystem callers (?) that
> > currently rely on the allocator waiting for other tasks to intervene.
> >
> > Revert the behavior as it shouldn't have been changed as part of a
> > cleanup patch.
>
> OK, if this a _short term_ change. I really think that all the requests
> except for __GFP_NOFAIL should be able to fail. I would argue that it
> should be the caller who should be fixed but it is true that the patch
> was introduced too late (rc7) and so it caught other subsystems
> unprepared so backporting to stable makes sense to me. But can we please
> move on and stop pretending that allocations do not fail for the
> upcoming release?
>
> > Fixes: 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into allocation slowpath")
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > Acked-by: Michal Hocko <mhocko@suse.cz>
>

Without this patch, I think the system becomes unusable under OOM.
However, with this patch, I know the system may become unusable under OOM.
Please do write patches for handling the condition below.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

Johannes's patch will get us out of filesystem error troubles, at the cost
of getting us into stall troubles (as was the case until 3.19-rc6). I
retested http://marc.info/?l=linux-ext4&m=142443125221571&w=2 with the
debug printk patch shown below.
---------- debug printk patch ----------
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d503e9c..5144506 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -610,6 +610,8 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
 	spin_unlock(&zone_scan_lock);
 }
 
+atomic_t oom_killer_skipped_count = ATOMIC_INIT(0);
+
 /**
  * out_of_memory - kill the "best" process when we run out of memory
  * @zonelist: zonelist pointer
@@ -679,6 +681,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 				 nodemask, "Out of memory");
 		killed = 1;
 	}
+	else
+		atomic_inc(&oom_killer_skipped_count);
 out:
 	/*
 	 * Give the killed threads a good chance of exiting before trying to
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8e20f9c..eaea16b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	if (high_zoneidx < ZONE_NORMAL)
 		goto out;
 	/* The OOM killer does not compensate for light reclaim */
-	if (!(gfp_mask & __GFP_FS))
+	if (!(gfp_mask & __GFP_FS)) {
+		/*
+		 * XXX: Page reclaim didn't yield anything,
+		 * and the OOM killer can't be invoked, but
+		 * keep looping as per should_alloc_retry().
+		 */
+		*did_some_progress = 1;
 		goto out;
+	}
 	/*
 	 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
 	 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
@@ -2635,6 +2642,8 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
 }
 
+extern atomic_t oom_killer_skipped_count;
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -2649,6 +2658,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
 	bool deferred_compaction = false;
 	int contended_compaction = COMPACT_CONTENDED_NONE;
+	unsigned long first_retried_time = 0;
+	unsigned long next_warn_time = 0;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -2821,6 +2832,19 @@ retry:
 		if (!did_some_progress)
 			goto nopage;
 	}
+	if (!first_retried_time) {
+		first_retried_time = jiffies;
+		if (!first_retried_time)
+			first_retried_time = 1;
+		next_warn_time = first_retried_time + 5 * HZ;
+	} else if (time_after(jiffies, next_warn_time)) {
+		printk(KERN_INFO "%d (%s) : gfp 0x%X : %lu seconds : "
+		       "OOM-killer skipped %u\n", current->pid,
+		       current->comm, gfp_mask,
+		       (jiffies - first_retried_time) / HZ,
+		       atomic_read(&oom_killer_skipped_count));
+		next_warn_time = jiffies + 5 * HZ;
+	}
 	/* Wait for some write requests to complete then retry */
 	wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
 	goto retry;
---------- debug printk patch ----------

GFP_NOFS allocations stalled for 10 minutes waiting for somebody else to
volunteer memory. GFP_FS allocations stalled for 10 minutes waiting for the
OOM killer to kill somebody. The OOM killer stalled for 10 minutes waiting
for GFP_NOFS allocations to complete. I guess the system made forward
progress because the number of remaining a.out processes decreased over
time.
(From http://I-love.SAKURA.ne.jp/tmp/serial-20150223-3.19-ext4-patched.txt.xz ) ---------- ext4 / Linux 3.19 + patch ---------- [ 1335.187579] Out of memory: Kill process 14156 (a.out) score 760 or sacrifice child [ 1335.189604] Killed process 14156 (a.out) total-vm:2167392kB, anon-rss:1360196kB, file-rss:0kB [ 1335.191920] Kill process 14177 (a.out) sharing same memory [ 1335.193465] Kill process 14178 (a.out) sharing same memory [ 1335.195013] Kill process 14179 (a.out) sharing same memory [ 1335.196580] Kill process 14180 (a.out) sharing same memory [ 1335.198128] Kill process 14181 (a.out) sharing same memory [ 1335.199674] Kill process 14182 (a.out) sharing same memory [ 1335.201217] Kill process 14183 (a.out) sharing same memory [ 1335.202768] Kill process 14184 (a.out) sharing same memory [ 1335.204316] Kill process 14185 (a.out) sharing same memory [ 1335.205871] Kill process 14186 (a.out) sharing same memory [ 1335.207420] Kill process 14187 (a.out) sharing same memory [ 1335.208974] Kill process 14188 (a.out) sharing same memory [ 1335.210515] Kill process 14189 (a.out) sharing same memory [ 1335.212063] Kill process 14190 (a.out) sharing same memory [ 1335.213611] Kill process 14191 (a.out) sharing same memory [ 1335.215165] Kill process 14192 (a.out) sharing same memory [ 1335.216715] Kill process 14193 (a.out) sharing same memory [ 1335.218286] Kill process 14194 (a.out) sharing same memory [ 1335.219836] Kill process 14195 (a.out) sharing same memory [ 1335.221378] Kill process 14196 (a.out) sharing same memory [ 1335.222918] Kill process 14197 (a.out) sharing same memory [ 1335.224461] Kill process 14198 (a.out) sharing same memory [ 1335.225999] Kill process 14199 (a.out) sharing same memory [ 1335.227545] Kill process 14200 (a.out) sharing same memory [ 1335.229095] Kill process 14201 (a.out) sharing same memory [ 1335.230643] Kill process 14202 (a.out) sharing same memory [ 1335.232184] Kill process 14203 (a.out) sharing same memory [ 1335.233738] 
Kill process 14204 (a.out) sharing same memory [ 1335.235293] Kill process 14205 (a.out) sharing same memory [ 1335.236834] Kill process 14206 (a.out) sharing same memory [ 1335.238387] Kill process 14207 (a.out) sharing same memory [ 1335.239930] Kill process 14208 (a.out) sharing same memory [ 1335.241471] Kill process 14209 (a.out) sharing same memory [ 1335.243011] Kill process 14210 (a.out) sharing same memory [ 1335.244554] Kill process 14211 (a.out) sharing same memory [ 1335.246101] Kill process 14212 (a.out) sharing same memory [ 1335.247645] Kill process 14213 (a.out) sharing same memory [ 1335.249182] Kill process 14214 (a.out) sharing same memory [ 1335.250718] Kill process 14215 (a.out) sharing same memory [ 1335.252305] Kill process 14216 (a.out) sharing same memory [ 1335.253899] Kill process 14217 (a.out) sharing same memory [ 1335.255443] Kill process 14218 (a.out) sharing same memory [ 1335.256993] Kill process 14219 (a.out) sharing same memory [ 1335.258531] Kill process 14220 (a.out) sharing same memory [ 1335.260066] Kill process 14221 (a.out) sharing same memory [ 1335.261616] Kill process 14222 (a.out) sharing same memory [ 1335.263143] Kill process 14223 (a.out) sharing same memory [ 1335.264647] Kill process 14224 (a.out) sharing same memory [ 1335.266121] Kill process 14225 (a.out) sharing same memory [ 1335.267598] Kill process 14226 (a.out) sharing same memory [ 1335.269077] Kill process 14227 (a.out) sharing same memory [ 1335.270560] Kill process 14228 (a.out) sharing same memory [ 1335.272038] Kill process 14229 (a.out) sharing same memory [ 1335.273508] Kill process 14230 (a.out) sharing same memory [ 1335.274999] Kill process 14231 (a.out) sharing same memory [ 1335.276469] Kill process 14232 (a.out) sharing same memory [ 1335.277947] Kill process 14233 (a.out) sharing same memory [ 1335.279428] Kill process 14234 (a.out) sharing same memory [ 1335.280894] Kill process 14235 (a.out) sharing same memory [ 1335.282361] Kill process 
14236 (a.out) sharing same memory [ 1335.283832] Kill process 14237 (a.out) sharing same memory [ 1335.285304] Kill process 14238 (a.out) sharing same memory [ 1335.286768] Kill process 14239 (a.out) sharing same memory [ 1335.288242] Kill process 14240 (a.out) sharing same memory [ 1335.289714] Kill process 14241 (a.out) sharing same memory [ 1335.291196] Kill process 14242 (a.out) sharing same memory [ 1335.292731] Kill process 14243 (a.out) sharing same memory [ 1335.294258] Kill process 14244 (a.out) sharing same memory [ 1335.295734] Kill process 14245 (a.out) sharing same memory [ 1335.297215] Kill process 14246 (a.out) sharing same memory [ 1335.298710] Kill process 14247 (a.out) sharing same memory [ 1335.300188] Kill process 14248 (a.out) sharing same memory [ 1335.301672] Kill process 14249 (a.out) sharing same memory [ 1335.303157] Kill process 14250 (a.out) sharing same memory [ 1335.304655] Kill process 14251 (a.out) sharing same memory [ 1335.306141] Kill process 14252 (a.out) sharing same memory [ 1335.307621] Kill process 14253 (a.out) sharing same memory [ 1335.309107] Kill process 14254 (a.out) sharing same memory [ 1335.310573] Kill process 14255 (a.out) sharing same memory [ 1335.312052] Kill process 14256 (a.out) sharing same memory [ 1335.313528] Kill process 14257 (a.out) sharing same memory [ 1335.315039] Kill process 14258 (a.out) sharing same memory [ 1335.316522] Kill process 14259 (a.out) sharing same memory [ 1335.317992] Kill process 14260 (a.out) sharing same memory [ 1335.319462] Kill process 14261 (a.out) sharing same memory [ 1335.320965] Kill process 14262 (a.out) sharing same memory [ 1335.322459] Kill process 14263 (a.out) sharing same memory [ 1335.323958] Kill process 14264 (a.out) sharing same memory [ 1335.325472] Kill process 14265 (a.out) sharing same memory [ 1335.326966] Kill process 14266 (a.out) sharing same memory [ 1335.328454] Kill process 14267 (a.out) sharing same memory [ 1335.329945] Kill process 14268 (a.out) 
sharing same memory [ 1335.331444] Kill process 14269 (a.out) sharing same memory [ 1335.332944] Kill process 14270 (a.out) sharing same memory [ 1335.334435] Kill process 14271 (a.out) sharing same memory [ 1335.335930] Kill process 14272 (a.out) sharing same memory [ 1335.337437] Kill process 14273 (a.out) sharing same memory [ 1335.338927] Kill process 14274 (a.out) sharing same memory [ 1335.340400] Kill process 14275 (a.out) sharing same memory [ 1335.341890] Kill process 14276 (a.out) sharing same memory [ 1339.640500] 464 (systemd-journal) : gfp 0x201DA : 5 seconds : OOM-killer skipped 22459181 [ 1339.649374] 615 (vmtoolsd) : gfp 0x201DA : 5 seconds : OOM-killer skipped 22459438 [ 1339.649611] 4079 (pool) : gfp 0x201DA : 5 seconds : OOM-killer skipped 22459447 [ 1340.343322] 14258 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478275 [ 1340.343331] 14194 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478275 [ 1340.343345] 14210 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478276 [ 1340.343360] 14179 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478277 [ 1340.345290] 14154 (su) : gfp 0x201DA : 5 seconds : OOM-killer skipped 22478339 [ 1340.345312] 14180 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478339 [ 1340.345319] 14260 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478339 [ 1340.345337] 14178 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478340 [ 1340.345345] 14245 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478340 [ 1340.345361] 14226 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478341 [ 1340.346119] 14256 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478368 [ 1340.346139] 14181 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478369 [ 1340.347082] 14274 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478402 [ 1340.347091] 14267 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478402 [ 1340.347095] 14189 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 
22478402 [ 1340.347099] 14238 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478402 [ 1340.347107] 14276 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478403 [ 1340.347112] 14183 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478403 [ 1340.347397] 14254 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478413 [ 1340.347402] 14228 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478413 [ 1340.347414] 14185 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478414 [ 1340.347419] 14261 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478414 [ 1340.347423] 14217 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478414 [ 1340.347427] 14203 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478414 [ 1340.347439] 14234 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478415 [ 1340.347452] 14269 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478415 [ 1340.347461] 14255 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478416 [ 1340.347465] 14192 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478416 [ 1340.347473] 14259 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478416 [ 1340.347492] 14232 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478417 [ 1340.347497] 14223 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478417 [ 1340.347505] 14220 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478417 [ 1340.347523] 14252 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478418 [ 1340.347531] 14193 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478418 (...snipped...) 
[ 1949.672951] 43 (kworker/1:1) : gfp 0x10 : 90 seconds : OOM-killer skipped 41315348 [ 1949.993045] 4079 (pool) : gfp 0x201DA : 615 seconds : OOM-killer skipped 41325108 [ 1950.694909] 14269 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41346727 [ 1950.703945] 14181 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41347003 [ 1950.742087] 14254 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41348208 [ 1950.744937] 14193 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41348299 [ 1950.748884] 2 (kthreadd) : gfp 0x2000D0 : 10 seconds : OOM-killer skipped 41348418 [ 1950.751565] 14203 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41348502 [ 1950.756955] 14232 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41348656 [ 1950.776918] 14185 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41349279 [ 1950.791214] 14217 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41349720 [ 1950.798961] 14179 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41349957 [ 1950.806551] 14255 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41350209 [ 1950.810860] 14234 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41350356 [ 1950.813821] 14258 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41350450 [ 1950.860422] 14261 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41351919 [ 1950.864015] 14210 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41352033 [ 1950.866636] 14226 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41352107 [ 1950.905003] 14238 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41353303 [ 1950.907813] 14180 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41353381 [ 1950.913963] 14276 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41353567 [ 1952.238344] 649 (chronyd) : gfp 0x201DA : 25 seconds : OOM-killer skipped 41393388 [ 1952.243228] 4030 (gnome-shell) : gfp 0x201DA : 25 seconds : OOM-killer skipped 41393566 [ 1952.247225] 592 (audispd) : gfp 0x201DA : 25 
seconds : OOM-killer skipped 41393701 [ 1952.258265] 1 (systemd) : gfp 0x201DA : 35 seconds : OOM-killer skipped 41394041 [ 1952.269296] 1691 (rpcbind) : gfp 0x201DA : 35 seconds : OOM-killer skipped 41394365 [ 1952.299073] 702 (rtkit-daemon) : gfp 0x201DA : 95 seconds : OOM-killer skipped 41395288 [ 1952.301231] 627 (lsmd) : gfp 0x201DA : 105 seconds : OOM-killer skipped 41395385 [ 1952.350200] 464 (systemd-journal) : gfp 0x201DA : 165 seconds : OOM-killer skipped 41396935 [ 1952.472040] 543 (auditd) : gfp 0x201DA : 95 seconds : OOM-killer skipped 41400669 [ 1952.475211] 14154 (su) : gfp 0x201DA : 95 seconds : OOM-killer skipped 41400795 [ 1952.527084] 3514 (smbd) : gfp 0x201DA : 35 seconds : OOM-killer skipped 41402412 [ 1952.543205] 613 (irqbalance) : gfp 0x201DA : 35 seconds : OOM-killer skipped 41402892 [ 1952.568276] 12672 (pickup) : gfp 0x201DA : 35 seconds : OOM-killer skipped 41403656 [ 1952.572329] 770 (tuned) : gfp 0x201DA : 95 seconds : OOM-killer skipped 41403784 [ 1952.578076] 3392 (master) : gfp 0x201DA : 35 seconds : OOM-killer skipped 41403955 [ 1952.597273] 615 (vmtoolsd) : gfp 0x201DA : 105 seconds : OOM-killer skipped 41404520 [ 1952.619187] 14146 (sleep) : gfp 0x201DA : 105 seconds : OOM-killer skipped 41405206 [ 1952.621214] 811 (NetworkManager) : gfp 0x201DA : 105 seconds : OOM-killer skipped 41405265 [ 1952.765035] 3700 (gnome-settings-) : gfp 0x201DA : 315 seconds : OOM-killer skipped 41409551 [ 1952.776099] 603 (alsactl) : gfp 0x201DA : 315 seconds : OOM-killer skipped 41409856 [ 1952.823163] 661 (crond) : gfp 0x201DA : 325 seconds : OOM-killer skipped 41411303 [ 1953.201269] SysRq : Resetting ---------- ext4 / Linux 3.19 + patch ---------- I also tested on XFS. One is Linux 3.19 and the other is Linux 3.19 with debug printk patch shown above. According to console logs, oom_kill_process() is trivially called via pagefault_out_of_memory() for the former kernel. Due to giving up !GFP_FS allocations immediately? 
(From http://I-love.SAKURA.ne.jp/tmp/serial-20150223-3.19-xfs-unpatched.txt.xz )
---------- xfs / Linux 3.19 ----------
[  793.283099] su invoked oom-killer: gfp_mask=0x0, order=0, oom_score_adj=0
[  793.283102] su cpuset=/ mems_allowed=0
[  793.283104] CPU: 3 PID: 9552 Comm: su Not tainted 3.19.0 #40
[  793.283159] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  793.283161]  0000000000000000 ffff88007ac03bf8 ffffffff816ae9d4 000000000000bebe
[  793.283162]  ffff880078b0d740 ffff88007ac03c98 ffffffff816ac7ac 0000000000000206
[  793.283163]  0000000481f30298 ffff880073e55850 ffff88007ac03c88 ffff88007a20bef8
[  793.283164] Call Trace:
[  793.283169]  [<ffffffff816ae9d4>] dump_stack+0x45/0x57
[  793.283171]  [<ffffffff816ac7ac>] dump_header+0x7f/0x1f1
[  793.283174]  [<ffffffff8114b36b>] oom_kill_process+0x22b/0x390
[  793.283177]  [<ffffffff810776d0>] ? has_capability_noaudit+0x20/0x30
[  793.283178]  [<ffffffff8114bb72>] out_of_memory+0x4b2/0x500
[  793.283179]  [<ffffffff8114bc37>] pagefault_out_of_memory+0x77/0x90
[  793.283180]  [<ffffffff816aab2c>] mm_fault_error+0x67/0x140
[  793.283182]  [<ffffffff8105a9f6>] __do_page_fault+0x3f6/0x580
[  793.283185]  [<ffffffff810aed1d>] ? remove_wait_queue+0x4d/0x60
[  793.283186]  [<ffffffff81070fcb>] ? do_wait+0x12b/0x240
[  793.283187]  [<ffffffff8105abb1>] do_page_fault+0x31/0x70
[  793.283189]  [<ffffffff816b83e8>] page_fault+0x28/0x30
---------- xfs / Linux 3.19 ----------

On the other hand, a stall is observed for the latter kernel. I guess that this time the system failed to make forward progress, because oom_killer_skipped_count kept increasing over time while the number of remaining a.out processes remained unchanged.
(From http://I-love.SAKURA.ne.jp/tmp/serial-20150223-3.19-xfs-patched.txt.xz ) ---------- xfs / Linux 3.19 + patch ---------- [ 2062.847965] 505 (abrt-watch-log) : gfp 0x2015A : 682 seconds : OOM-killer skipped 22388568 [ 2062.850270] 515 (lsmd) : gfp 0x2015A : 674 seconds : OOM-killer skipped 22388662 [ 2062.850389] 491 (audispd) : gfp 0x2015A : 666 seconds : OOM-killer skipped 22388667 [ 2062.850400] 346 (systemd-journal) : gfp 0x2015A : 683 seconds : OOM-killer skipped 22388667 [ 2062.850402] 610 (rtkit-daemon) : gfp 0x2015A : 677 seconds : OOM-killer skipped 22388667 [ 2062.850424] 494 (alsactl) : gfp 0x2015A : 546 seconds : OOM-killer skipped 22388668 [ 2062.850446] 558 (crond) : gfp 0x2015A : 645 seconds : OOM-killer skipped 22388669 [ 2062.850451] 25532 (su) : gfp 0x2015A : 682 seconds : OOM-killer skipped 22388669 [ 2062.850456] 516 (vmtoolsd) : gfp 0x2015A : 683 seconds : OOM-killer skipped 22388669 [ 2062.850494] 741 (NetworkManager) : gfp 0x2015A : 530 seconds : OOM-killer skipped 22388670 [ 2062.850503] 3132 (master) : gfp 0x2015A : 644 seconds : OOM-killer skipped 22388671 [ 2062.850508] 3144 (pickup) : gfp 0x2015A : 604 seconds : OOM-killer skipped 22388671 [ 2062.850512] 3145 (qmgr) : gfp 0x2015A : 526 seconds : OOM-killer skipped 22388671 [ 2062.850540] 25653 (a.out) : gfp 0x102005A : 683 seconds : OOM-killer skipped 22388672 [ 2062.850561] 655 (tuned) : gfp 0x2015A : 682 seconds : OOM-killer skipped 22388673 [ 2062.852404] 10429 (kworker/0:14) : gfp 0x2040D0 : 683 seconds : OOM-killer skipped 22388748 [ 2062.852430] 543 (chronyd) : gfp 0x2015A : 293 seconds : OOM-killer skipped 22388749 [ 2062.852436] 13012 (goa-daemon) : gfp 0x2015A : 679 seconds : OOM-killer skipped 22388749 [ 2062.852449] 1454 (rpcbind) : gfp 0x2015A : 662 seconds : OOM-killer skipped 22388749 [ 2062.854288] 466 (auditd) : gfp 0x2015A : 626 seconds : OOM-killer skipped 22388751 [ 2062.854305] 25622 (a.out) : gfp 0x102005A : 683 seconds : OOM-killer skipped 22388751 [ 
2062.854426] 1419 (dhclient) : gfp 0x2015A : 388 seconds : OOM-killer skipped 22388751 [ 2062.854443] 25638 (a.out) : gfp 0x204250 : 683 seconds : OOM-killer skipped 22388751 [ 2062.854450] 25582 (a.out) : gfp 0x102005A : 683 seconds : OOM-killer skipped 22388751 [ 2062.854462] 25400 (sleep) : gfp 0x2015A : 635 seconds : OOM-killer skipped 22388751 [ 2062.854469] 532 (smartd) : gfp 0x2015A : 246 seconds : OOM-killer skipped 22388751 [ 2062.854486] 2 (kthreadd) : gfp 0x2040D0 : 682 seconds : OOM-killer skipped 22388752 [ 2062.854497] 3867 (gnome-shell) : gfp 0x2015A : 683 seconds : OOM-killer skipped 22388752 [ 2062.854502] 3562 (gnome-settings-) : gfp 0x2015A : 676 seconds : OOM-killer skipped 22388752 [ 2062.854524] 25641 (a.out) : gfp 0x102005A : 683 seconds : OOM-killer skipped 22388753 [ 2062.854536] 25566 (a.out) : gfp 0x102005A : 683 seconds : OOM-killer skipped 22388753 [ 2062.908915] 61 (kworker/3:1) : gfp 0x2040D0 : 682 seconds : OOM-killer skipped 22390715 [ 2062.913407] 531 (irqbalance) : gfp 0x2015A : 679 seconds : OOM-killer skipped 22390894 [ 2064.988155] SysRq : Resetting ---------- xfs / Linux 3.19 + patch ----------

The current code gives too few hints to tell whether forward progress is being made, because no kernel messages are printed when the OOM victim fails to die immediately. I wish we had the debug printk patch shown above and/or something like http://marc.info/?l=linux-mm&m=141671829611143&w=2 .

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-22 0:20 ` Johannes Weiner (?) @ 2015-02-23 21:33 ` David Rientjes -1 siblings, 0 replies; 276+ messages in thread From: David Rientjes @ 2015-02-23 21:33 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, Theodore Ts'o, Dave Chinner, Tetsuo Handa, mhocko, dchinner, linux-mm, oleg, mgorman, torvalds, xfs, linux-ext4 On Sat, 21 Feb 2015, Johannes Weiner wrote: > From: Johannes Weiner <hannes@cmpxchg.org> > > mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change > > Historically, !__GFP_FS allocations were not allowed to invoke the OOM > killer once reclaim had failed, but nevertheless kept looping in the > allocator. 9879de7373fc ("mm: page_alloc: embed OOM killing naturally > into allocation slowpath"), which should have been a simple cleanup > patch, accidentally changed the behavior to aborting the allocation at > that point. This creates problems with filesystem callers (?) that > currently rely on the allocator waiting for other tasks to intervene. > > Revert the behavior as it shouldn't have been changed as part of a > cleanup patch. > > Fixes: 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into allocation slowpath") > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: stable@vger.kernel.org [3.19] Acked-by: David Rientjes <rientjes@google.com> ^ permalink raw reply [flat|nested] 276+ messages in thread
* __GFP_NOFAIL and oom_killer_disabled? 2015-02-21 9:19 ` Andrew Morton ` (3 preceding siblings ...) (?) @ 2015-02-22 14:48 ` Tetsuo Handa 2015-02-23 10:21 ` Michal Hocko -1 siblings, 1 reply; 276+ messages in thread From: Tetsuo Handa @ 2015-02-22 14:48 UTC (permalink / raw) To: mhocko Cc: akpm, tytso, david, hannes, dchinner, linux-mm, rientjes, oleg, mgorman, torvalds Andrew Morton wrote: > And yes, I agree that sites such as xfs's kmem_alloc() should be > passing __GFP_NOFAIL to tell the page allocator what's going on. I > don't think it matters a lot whether kmem_alloc() retains its retry > loop. If __GFP_NOFAIL is working correctly then it will never loop > anyway... __GFP_NOFAIL fails to work correctly if oom_killer_disabled == true. I'm wondering how oom_killer_disable() interferes with __GFP_NOFAIL allocation. We had race check after setting oom_killer_disabled to true in 3.19. ---------- linux-3.19/kernel/power/process.c ---------- int freeze_processes(void) { (...snipped...) pm_wakeup_clear(); printk("Freezing user space processes ... "); pm_freezing = true; oom_kills_saved = oom_kills_count(); error = try_to_freeze_tasks(true); if (!error) { __usermodehelper_set_disable_depth(UMH_DISABLED); oom_killer_disable(); /* * There might have been an OOM kill while we were * freezing tasks and the killed task might be still * on the way out so we have to double check for race. */ if (oom_kills_count() != oom_kills_saved && !check_frozen_processes()) { __usermodehelper_set_disable_depth(UMH_ENABLED); printk("OOM in progress."); error = -EBUSY; } else { printk("done."); } } (...snipped...) } ---------- linux-3.19/kernel/power/process.c ---------- I worry that commit c32b3cbe0d067a9c "oom, PM: make OOM detection in the freezer path raceless" might have opened a race window for __alloc_pages_may_oom(__GFP_NOFAIL) allocation to fail when OOM killer is disabled. 
I think something like --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -789,7 +789,7 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, bool ret = false; down_read(&oom_sem); - if (!oom_killer_disabled) { + if (!oom_killer_disabled || (gfp_mask & __GFP_NOFAIL)) { __out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill); ret = true; } is needed. But such change can race with up_write() and wait_event() in oom_killer_disable(). While the comment of oom_killer_disable() says "The function cannot be called when there are runnable user tasks because the userspace would see unexpected allocation failures as a result.", aren't there still kernel threads which might do __GFP_NOFAIL allocations? After all, don't we need to recheck after setting oom_killer_disabled to true? ---------- linux.git/kernel/power/process.c ---------- int freeze_processes(void) { (...snipped...) pm_wakeup_clear(); pr_info("Freezing user space processes ... "); pm_freezing = true; error = try_to_freeze_tasks(true); if (!error) { __usermodehelper_set_disable_depth(UMH_DISABLED); pr_cont("done."); } pr_cont("\n"); BUG_ON(in_atomic()); /* * Now that the whole userspace is frozen we need to disbale * the OOM killer to disallow any further interference with * killable tasks. */ if (!error && !oom_killer_disable()) error = -EBUSY; (...snipped...) } ---------- linux.git/kernel/power/process.c ---------- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 276+ messages in thread
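The gate Tetsuo proposes can be modelled in ordinary userspace C. This is a sketch only: the flag values, the `oom_killer_disabled` toggle, and the function name merely mirror the kernel's, nothing here is kernel code.

```c
#include <stdbool.h>

/* Illustrative flag values -- not the kernel's actual gfp bit layout. */
#define __GFP_FS      0x1u
#define __GFP_NOFAIL  0x2u

static bool oom_killer_disabled;

/*
 * The gate at the top of out_of_memory() with the proposed change
 * applied: a __GFP_NOFAIL request may still invoke the OOM killer
 * even after oom_killer_disable().  Returns true when the killer ran.
 */
static bool out_of_memory_sketch(unsigned int gfp_mask)
{
	if (!oom_killer_disabled || (gfp_mask & __GFP_NOFAIL)) {
		/* __out_of_memory(zonelist, gfp_mask, ...) would run here */
		return true;
	}
	return false;
}
```

With the killer enabled every caller gets through; once it is disabled, only __GFP_NOFAIL callers do, which is exactly why such a change can race with up_write()/wait_event() in oom_killer_disable(), as noted above.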
* Re: __GFP_NOFAIL and oom_killer_disabled? 2015-02-22 14:48 ` __GFP_NOFAIL and oom_killer_disabled? Tetsuo Handa @ 2015-02-23 10:21 ` Michal Hocko 2015-02-23 13:03 ` Tetsuo Handa 0 siblings, 1 reply; 276+ messages in thread From: Michal Hocko @ 2015-02-23 10:21 UTC (permalink / raw) To: Tetsuo Handa Cc: akpm, tytso, david, hannes, dchinner, linux-mm, rientjes, oleg, mgorman, torvalds On Sun 22-02-15 23:48:01, Tetsuo Handa wrote: > Andrew Morton wrote: > > And yes, I agree that sites such as xfs's kmem_alloc() should be > > passing __GFP_NOFAIL to tell the page allocator what's going on. I > > don't think it matters a lot whether kmem_alloc() retains its retry > > loop. If __GFP_NOFAIL is working correctly then it will never loop > > anyway... > > __GFP_NOFAIL fails to work correctly if oom_killer_disabled == true. > I'm wondering how oom_killer_disable() interferes with __GFP_NOFAIL > allocation. We had race check after setting oom_killer_disabled to true > in 3.19. [...] > I worry that commit c32b3cbe0d067a9c "oom, PM: make OOM detection in > the freezer path raceless" might have opened a race window for > __alloc_pages_may_oom(__GFP_NOFAIL) allocation to fail when OOM killer > is disabled. This commit hasn't introduced any behavior changes. GFP_NOFAIL allocations fail when OOM killer is disabled since beginning 7f33d49a2ed5 (mm, PM/Freezer: Disable OOM killer when tasks are frozen). > I think something like > > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -789,7 +789,7 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, > bool ret = false; > > down_read(&oom_sem); > - if (!oom_killer_disabled) { > + if (!oom_killer_disabled || (gfp_mask & __GFP_NOFAIL)) { > __out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill); > ret = true; > } > > is needed. > But such change can race with up_write() and wait_event() in > oom_killer_disable(). 
Not only does it race with the above, it also breaks the core assumption that no userspace task may interact with later stages of the suspend. > While the comment of oom_killer_disable() says > "The function cannot be called when there are runnable user tasks because > the userspace would see unexpected allocation failures as a result.", > aren't there still kernel threads which might do __GFP_NOFAIL allocations? OK, this is a fair point. My assumption was that kernel threads rarely do __GFP_NOFAIL allocations. It seems I was wrong here. This makes the logic much trickier. I can see 3 possible ways to handle this: 1) move oom_killer_disable() after kernel threads are frozen. This has the risk that the OOM victim wouldn't be able to finish because it would depend on an already frozen kernel thread. This would be really tricky to debug. 2) do not fail a GFP_NOFAIL allocation no matter what and risk a potential (and silent) endless loop during suspend. On the other hand, the chances that __GFP_NOFAIL comes from a freezable kernel thread rather than from the deep PM suspend path are considerably higher. So now that I think about it, it indeed makes more sense to simply warn when OOM is disabled and retry the allocation. Freezable kernel threads will loop and fail the suspend. Incidental allocations after kernel threads are frozen will at least dump a warning - if we are lucky and the serial console is still active, of course... 3) do nothing ;) But whatever we do, there is simply no way to guarantee __GFP_NOFAIL after the OOM killer has been disabled. So we are weighing endless loops against possible crashes due to unexpected allocation failures. Not a nice choice. We can only choose the less risky way, and it sounds like 2) is that option. Considering that we haven't seen any crashes with the current behavior, I would be tempted to simply declare this a corner case which doesn't need any action, but well, I hate debugging nasty issues, so better be prepared...
What about something like the following? --- ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: __GFP_NOFAIL and oom_killer_disabled? 2015-02-23 10:21 ` Michal Hocko @ 2015-02-23 13:03 ` Tetsuo Handa 2015-02-24 18:14 ` Michal Hocko 0 siblings, 1 reply; 276+ messages in thread From: Tetsuo Handa @ 2015-02-23 13:03 UTC (permalink / raw) To: mhocko Cc: akpm, tytso, david, hannes, dchinner, linux-mm, rientjes, oleg, mgorman, torvalds Michal Hocko wrote: > What about something like the following? I'm fine with whatever approaches as long as retry is guaranteed. But maybe we can use memory reserves like below? I think there will be little risk because userspace processes are already frozen... diff --git a/mm/page_alloc.c b/mm/page_alloc.c index a47f0b2..cea0a1b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2760,8 +2760,17 @@ retry: &did_some_progress); if (page) goto got_pg; - if (!did_some_progress) + if (!did_some_progress && !(gfp_mask & __GFP_NOFAIL)) goto nopage; + /* + * What!? __GFP_NOFAIL allocation failed to invoke + * the OOM killer due to oom_killer_disabled == true? + * Then, pretend ALLOC_NO_WATERMARKS request and let + * __alloc_pages_high_priority() retry forever... + */ + WARN(1, "Retrying GFP_NOFAIL allocation...\n"); + gfp_mask &= ~__GFP_NOMEMALLOC; + gfp_mask |= __GFP_MEMALLOC; } /* Wait for some write requests to complete then retry */ wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 276+ messages in thread
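The flag rewrite in Tetsuo's hunk above reduces to two bit operations. A minimal userspace sketch (flag values and the helper name are illustrative, not the kernel's):

```c
/* Illustrative flag values -- not the kernel's actual gfp bit layout. */
#define __GFP_NOMEMALLOC 0x1u
#define __GFP_MEMALLOC   0x2u
#define __GFP_NOFAIL     0x4u

/*
 * The core of the patch above: when a __GFP_NOFAIL allocation cannot
 * invoke the OOM killer, rewrite the mask so the retry may dip into
 * memory reserves (the ALLOC_NO_WATERMARKS behaviour).
 */
static unsigned int grant_reserves(unsigned int gfp_mask)
{
	gfp_mask &= ~__GFP_NOMEMALLOC;	/* allow reserve access */
	gfp_mask |= __GFP_MEMALLOC;	/* and request it explicitly */
	return gfp_mask;
}
```

The result is that __alloc_pages_high_priority() would then retry the request against the reserves forever rather than failing it.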
* Re: __GFP_NOFAIL and oom_killer_disabled? 2015-02-23 13:03 ` Tetsuo Handa @ 2015-02-24 18:14 ` Michal Hocko 2015-02-25 11:22 ` Tetsuo Handa 0 siblings, 1 reply; 276+ messages in thread From: Michal Hocko @ 2015-02-24 18:14 UTC (permalink / raw) To: Tetsuo Handa Cc: akpm, tytso, david, hannes, dchinner, linux-mm, rientjes, oleg, mgorman, torvalds On Mon 23-02-15 22:03:25, Tetsuo Handa wrote: > Michal Hocko wrote: > > What about something like the following? > > I'm fine with whatever approaches as long as retry is guaranteed. > > But maybe we can use memory reserves like below? This sounds too risky to me and not really necessary. GFP_NOFAIL allocations shouldn't be called while the system is not running any tasks (aka from pm/device code). So we are primarily trying to help those nofail allocations which come from kernel threads and their retry will fail the suspend rather than blow up because of an unexpected allocation failure. > I think there will be little risk because userspace processes are > already frozen... > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index a47f0b2..cea0a1b 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -2760,8 +2760,17 @@ retry: > &did_some_progress); > if (page) > goto got_pg; > - if (!did_some_progress) > + if (!did_some_progress && !(gfp_mask & __GFP_NOFAIL)) > goto nopage; > + /* > + * What!? __GFP_NOFAIL allocation failed to invoke > + * the OOM killer due to oom_killer_disabled == true? > + * Then, pretend ALLOC_NO_WATERMARKS request and let > + * __alloc_pages_high_priority() retry forever... > + */ > + WARN(1, "Retrying GFP_NOFAIL allocation...\n"); > + gfp_mask &= ~__GFP_NOMEMALLOC; > + gfp_mask |= __GFP_MEMALLOC; > } > /* Wait for some write requests to complete then retry */ > wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50); -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
^ permalink raw reply [flat|nested] 276+ messages in thread
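Option 2) above — never fail a __GFP_NOFAIL request, warn when the OOM killer is disabled, and simply keep retrying — can be sketched in userspace C. The `try_alloc()` hook and its counter are test scaffolding standing in for memory becoming available; they are not kernel interfaces.

```c
#include <stdbool.h>
#include <stdio.h>

static bool oom_killer_disabled;

/* Test scaffolding (hypothetical): number of failed attempts before the
 * simulated allocation succeeds, standing in for memory being freed. */
static int attempts_until_success;

static bool try_alloc(void)
{
	return attempts_until_success-- <= 0;
}

/*
 * Option 2) in miniature: a __GFP_NOFAIL request never returns failure;
 * when the OOM killer is disabled we warn once and keep retrying.
 * Returns the number of retries that were needed.
 */
static int nofail_alloc_sketch(void)
{
	bool warned = false;
	int loops = 0;

	while (!try_alloc()) {
		if (oom_killer_disabled && !warned) {
			fprintf(stderr,
				"__GFP_NOFAIL allocation with OOM killer disabled\n");
			warned = true;
		}
		loops++;
	}
	return loops;
}
```

A freezable kernel thread stuck in this loop would fail the suspend, which is the intended failure mode; an incidental allocation after freezing at least dumps the warning.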
* Re: __GFP_NOFAIL and oom_killer_disabled? 2015-02-24 18:14 ` Michal Hocko @ 2015-02-25 11:22 ` Tetsuo Handa 2015-02-25 16:02 ` Michal Hocko 0 siblings, 1 reply; 276+ messages in thread From: Tetsuo Handa @ 2015-02-25 11:22 UTC (permalink / raw) To: mhocko Cc: akpm, tytso, david, hannes, dchinner, linux-mm, rientjes, oleg, mgorman, torvalds Michal Hocko wrote: > This commit hasn't introduced any behavior changes. GFP_NOFAIL > allocations fail when OOM killer is disabled since beginning > 7f33d49a2ed5 (mm, PM/Freezer: Disable OOM killer when tasks are frozen). I thought that - out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false); - *did_some_progress = 1; + if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)) + *did_some_progress = 1; in commit c32b3cbe0d067a9c "oom, PM: make OOM detection in the freezer path raceless" introduced a code path which fails to set *did_some_progress to non 0 value. > " > We haven't seen any bug reports since 2009 so I haven't marked the patch > for stable. I have no problem to backport it to stable trees though if > people think it is a good precaution. > " Until 3.18, GFP_NOFAIL for GFP_NOFS / GFP_NOIO did not fail with oom_killer_disabled == true because of ---------- if (!did_some_progress) { if (oom_gfp_allowed(gfp_mask)) { if (oom_killer_disabled) goto nopage; (...snipped...) goto restart; } } (...snipped...) goto rebalance; ---------- and that might be the reason you did not see bug reports. In 3.19, GFP_NOFAIL for GFP_NOFS / GFP_NOIO started to fail with oom_killer_disabled == true because of ---------- if (should_alloc_retry(gfp_mask, order, did_some_progress, pages_reclaimed)) { /* * If we fail to make progress by freeing individual * pages, but the allocation wants us to keep going, * start OOM killing tasks. 
*/ if (!did_some_progress) { page = __alloc_pages_may_oom(gfp_mask, order, zonelist, high_zoneidx, nodemask, preferred_zone, classzone_idx, migratetype,&did_some_progress); if (page) goto got_pg; if (!did_some_progress) goto nopage; } /* Wait for some write requests to complete then retry */ wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50); goto retry; } else ---------- ---------- static inline struct page * __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist, enum zone_type high_zoneidx, nodemask_t *nodemask, struct zone *preferred_zone, int classzone_idx, int migratetype, unsigned long *did_some_progress) { struct page *page; *did_some_progress = 0; if (oom_killer_disabled) return NULL; ---------- and thus you might start seeing bug reports. So, it is commit 9879de7373fc "mm: page_alloc: embed OOM killing naturally into allocation slowpath" than commit c32b3cbe0d067a9c "oom, PM: make OOM detection in the freezer path raceless" that introduced behavior changes? > On Mon 23-02-15 22:03:25, Tetsuo Handa wrote: > > Michal Hocko wrote: > > > What about something like the following? > > > > I'm fine with whatever approaches as long as retry is guaranteed. > > > > But maybe we can use memory reserves like below? > > This sounds too risky to me and not really necessary. GFP_NOFAIL > allocations shouldn't be called while the system is not running any > tasks (aka from pm/device code). So we are primarily trying to help > those nofail allocations which come from kernel threads and their retry > will fail the suspend rather than blow up because of an unexpected > allocation failure. I meant "After all, don't we need to recheck after setting oom_killer_disabled to true?" as "their retry will fail the suspend". -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
^ permalink raw reply [flat|nested] 276+ messages in thread
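The behavioural difference Tetsuo traces through the two slowpaths can be condensed into a userspace sketch. Function names, flag values, and the result enum are illustrative; only the branch structure mirrors the quoted 3.18 and 3.19 code.

```c
#include <stdbool.h>

/* Illustrative flag value -- not the kernel's actual gfp bit layout. */
#define __GFP_FS 0x1u

static bool oom_killer_disabled;

enum slowpath_result { ALLOC_RETRY, ALLOC_FAIL };

/* 3.18 behaviour as quoted above: only __GFP_FS allocations consult the
 * OOM killer; with the killer disabled they fail, everything else loops. */
static enum slowpath_result slowpath_3_18(unsigned int gfp_mask)
{
	if (gfp_mask & __GFP_FS) {
		if (oom_killer_disabled)
			return ALLOC_FAIL;	/* goto nopage */
		return ALLOC_RETRY;		/* goto restart after OOM kill */
	}
	return ALLOC_RETRY;			/* goto rebalance */
}

/* 3.19 behaviour: __alloc_pages_may_oom() bails out early when the OOM
 * killer is disabled, so even !__GFP_FS requests now fail. */
static enum slowpath_result slowpath_3_19(unsigned int gfp_mask)
{
	(void)gfp_mask;
	if (oom_killer_disabled)
		return ALLOC_FAIL;		/* did_some_progress stays 0 */
	return ALLOC_RETRY;
}
```

The divergence for !__GFP_FS with the killer disabled is exactly the case 9879de7373fc changed.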
* Re: __GFP_NOFAIL and oom_killer_disabled? 2015-02-25 11:22 ` Tetsuo Handa @ 2015-02-25 16:02 ` Michal Hocko 2015-02-25 21:48 ` Tetsuo Handa 0 siblings, 1 reply; 276+ messages in thread From: Michal Hocko @ 2015-02-25 16:02 UTC (permalink / raw) To: Tetsuo Handa Cc: akpm, tytso, david, hannes, dchinner, linux-mm, rientjes, oleg, mgorman, torvalds On Wed 25-02-15 20:22:22, Tetsuo Handa wrote: > Michal Hocko wrote: > > This commit hasn't introduced any behavior changes. GFP_NOFAIL > > allocations fail when OOM killer is disabled since beginning > > 7f33d49a2ed5 (mm, PM/Freezer: Disable OOM killer when tasks are frozen). > > I thought that > > - out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false); > - *did_some_progress = 1; > + if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)) > + *did_some_progress = 1; > > in commit c32b3cbe0d067a9c "oom, PM: make OOM detection in the freezer > path raceless" introduced a code path which fails to set > *did_some_progress to non 0 value. But this commit had also the following hunk: @@ -2317,9 +2315,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, *did_some_progress = 0; - if (oom_killer_disabled) - return NULL; - so we even wouldn't get down to out_of_memory and returned with did_some_progress=0 right away. So the patch hasn't changed the logic. > > " > > We haven't seen any bug reports since 2009 so I haven't marked the patch > > for stable. I have no problem to backport it to stable trees though if > > people think it is a good precaution. > > " > > Until 3.18, GFP_NOFAIL for GFP_NOFS / GFP_NOIO did not fail with > oom_killer_disabled == true because of > > ---------- > if (!did_some_progress) { > if (oom_gfp_allowed(gfp_mask)) { > if (oom_killer_disabled) > goto nopage; > (...snipped...) > goto restart; > } > } > (...snipped...) > goto rebalance; > ---------- > > and that might be the reason you did not see bug reports. 
> In 3.19, GFP_NOFAIL for GFP_NOFS / GFP_NOIO started to fail with > oom_killer_disabled == true because of OK, that would change the behavior for __GFP_NOFAIL|~__GFP_FS allocations. The patch from Johannes which reverts the GFP_NOFS failure mode should go to stable and that should be sufficient IMO. [...] > So, it is commit 9879de7373fc "mm: page_alloc: embed OOM killing naturally > into allocation slowpath" rather than commit c32b3cbe0d067a9c "oom, PM: make OOM > detection in the freezer path raceless" that introduced behavior changes? Yes. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: __GFP_NOFAIL and oom_killer_disabled? 2015-02-25 16:02 ` Michal Hocko @ 2015-02-25 21:48 ` Tetsuo Handa 2015-02-25 21:51 ` Andrew Morton 0 siblings, 1 reply; 276+ messages in thread From: Tetsuo Handa @ 2015-02-25 21:48 UTC (permalink / raw) To: mhocko, hannes Cc: akpm, tytso, david, dchinner, linux-mm, rientjes, oleg, mgorman, torvalds Michal Hocko wrote: > On Wed 25-02-15 20:22:22, Tetsuo Handa wrote: > > Michal Hocko wrote: > > > This commit hasn't introduced any behavior changes. GFP_NOFAIL > > > allocations fail when OOM killer is disabled since beginning > > > 7f33d49a2ed5 (mm, PM/Freezer: Disable OOM killer when tasks are frozen). > > > > I thought that > > > > - out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false); > > - *did_some_progress = 1; > > + if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)) > > + *did_some_progress = 1; > > > > in commit c32b3cbe0d067a9c "oom, PM: make OOM detection in the freezer > > path raceless" introduced a code path which fails to set > > *did_some_progress to non 0 value. > > But this commit had also the following hunk: > @@ -2317,9 +2315,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, > > *did_some_progress = 0; > > - if (oom_killer_disabled) > - return NULL; > - > > so we even wouldn't get down to out_of_memory and returned with > did_some_progress=0 right away. So the patch hasn't changed the logic. OK. > OK, that would change the bahavior for __GFP_NOFAIL|~__GFP_FS > allocations. The patch from Johannes which reverts GFP_NOFS failure mode > should go to stable and that should be sufficient IMO. > mm-page_alloc-revert-inadvertent-__gfp_fs-retry-behavior-change.patch fixes only ~__GFP_NOFAIL|~__GFP_FS case. I think we need David's version http://marc.info/?l=linux-mm&m=142489687015873&w=2 for 3.19-stable . -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: __GFP_NOFAIL and oom_killer_disabled? 2015-02-25 21:48 ` Tetsuo Handa @ 2015-02-25 21:51 ` Andrew Morton 0 siblings, 0 replies; 276+ messages in thread From: Andrew Morton @ 2015-02-25 21:51 UTC (permalink / raw) To: Tetsuo Handa Cc: mhocko, hannes, tytso, david, dchinner, linux-mm, rientjes, oleg, mgorman, torvalds On Thu, 26 Feb 2015 06:48:02 +0900 Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote: > > OK, that would change the behavior for __GFP_NOFAIL|~__GFP_FS > > allocations. The patch from Johannes which reverts the GFP_NOFS failure mode > > should go to stable and that should be sufficient IMO. > > > > mm-page_alloc-revert-inadvertent-__gfp_fs-retry-behavior-change.patch > fixes only the ~__GFP_NOFAIL|~__GFP_FS case. I think we need David's version > http://marc.info/?l=linux-mm&m=142489687015873&w=2 for 3.19-stable . afaict nobody has even tested that. If we want changes made to 3.19.x then they will need to be well tested, well changelogged and signed off. Please. ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-21 3:20 ` Theodore Ts'o (?) @ 2015-02-21 12:00 ` Tetsuo Handa -1 siblings, 0 replies; 276+ messages in thread From: Tetsuo Handa @ 2015-02-21 12:00 UTC (permalink / raw) To: tytso Cc: david, hannes, mhocko, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, torvalds, xfs, linux-ext4 Theodore Ts'o wrote: > So at this point, it seems we have two choices. We can either revert > 9879de7373fc, or I can add a whole lot more GFP_FAIL flags to ext4's > memory allocations and submit them as stable bug fixes. Can you absorb this side effect by simply adding GFP_NOFAIL to only ext4's memory allocations? Don't you also depend on lower layers which use GFP_NOIO? BTW, while you are using open-coded GFP_NOFAIL retry loop for GFP_NOFS allocation in jbd2, you are already using GFP_NOFAIL for GFP_NOFS allocation in jbd. Failure check being there for GFP_NOFAIL seems redundant. ---------- linux-3.19/fs/jbd2/transaction.c ---------- 257 static int start_this_handle(journal_t *journal, handle_t *handle, 258 gfp_t gfp_mask) 259 { 260 transaction_t *transaction, *new_transaction = NULL; 261 int blocks = handle->h_buffer_credits; 262 int rsv_blocks = 0; 263 unsigned long ts = jiffies; 264 265 /* 266 * 1/2 of transaction can be reserved so we can practically handle 267 * only 1/2 of maximum transaction size per operation 268 */ 269 if (WARN_ON(blocks > journal->j_max_transaction_buffers / 2)) { 270 printk(KERN_ERR "JBD2: %s wants too many credits (%d > %d)\n", 271 current->comm, blocks, 272 journal->j_max_transaction_buffers / 2); 273 return -ENOSPC; 274 } 275 276 if (handle->h_rsv_handle) 277 rsv_blocks = handle->h_rsv_handle->h_buffer_credits; 278 279 alloc_transaction: 280 if (!journal->j_running_transaction) { 281 new_transaction = kmem_cache_zalloc(transaction_cache, 282 gfp_mask); 283 if (!new_transaction) { 284 /* 285 * If __GFP_FS is not present, then we may be 286 * being called from inside the fs writeback 287 * layer, so we MUST NOT 
fail. Since 288 * __GFP_NOFAIL is going away, we will arrange 289 * to retry the allocation ourselves. 290 */ 291 if ((gfp_mask & __GFP_FS) == 0) { 292 congestion_wait(BLK_RW_ASYNC, HZ/50); 293 goto alloc_transaction; 294 } 295 return -ENOMEM; 296 } 297 } 298 299 jbd_debug(3, "New handle %p going live.\n", handle); ---------- linux-3.19/fs/jbd2/transaction.c ---------- ---------- linux-3.19/fs/jbd/transaction.c ---------- 84 static int start_this_handle(journal_t *journal, handle_t *handle) 85 { 86 transaction_t *transaction; 87 int needed; 88 int nblocks = handle->h_buffer_credits; 89 transaction_t *new_transaction = NULL; 90 int ret = 0; 91 92 if (nblocks > journal->j_max_transaction_buffers) { 93 printk(KERN_ERR "JBD: %s wants too many credits (%d > %d)\n", 94 current->comm, nblocks, 95 journal->j_max_transaction_buffers); 96 ret = -ENOSPC; 97 goto out; 98 } 99 100 alloc_transaction: 101 if (!journal->j_running_transaction) { 102 new_transaction = kzalloc(sizeof(*new_transaction), 103 GFP_NOFS|__GFP_NOFAIL); 104 if (!new_transaction) { 105 ret = -ENOMEM; 106 goto out; 107 } 108 } 109 110 jbd_debug(3, "New handle %p going live.\n", handle); ---------- linux-3.19/fs/jbd/transaction.c ---------- ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls?
  2015-02-21  3:20                   ` Theodore Ts'o
  (?)
@ 2015-02-23 10:26                   ` Michal Hocko
  -1 siblings, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2015-02-23 10:26 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Dave Chinner, Tetsuo Handa, hannes, dchinner, linux-mm, rientjes,
	oleg, akpm, mgorman, torvalds, xfs, linux-ext4

On Fri 20-02-15 22:20:00, Theodore Ts'o wrote:
[...]
> So based on akpm's sage advise and wisdom, I added back GFP_NOFAIL to
> ext4/jbd2.

I am currently going through open-coded GFP_NOFAIL allocations and have
this in my local branch at the moment. I assume you did the same, so I
will drop mine if you have already pushed yours.
---
>From dc49cef75dbd677d5542c9e5bd27bbfab9a7bc3a Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Fri, 20 Feb 2015 11:32:58 +0100
Subject: [PATCH] jbd2: revert must-not-fail allocation loops back to GFP_NOFAIL

This basically reverts 47def82672b3 (jbd2: Remove __GFP_NOFAIL from jbd2
layer). The deprecation of __GFP_NOFAIL was a bad choice because it led
to open-coding the endless loop around the allocator rather than
removing the dependency on the non-failing allocation. So the
deprecation was a clear failure, and reality tells us that __GFP_NOFAIL
is not even close to going away.

It is still true that __GFP_NOFAIL allocations are generally
discouraged, and that new uses should be evaluated with an alternative
(pre-allocations or reservations) considered, but it doesn't make any
sense to lie to the allocator about the requirements. The allocator can
take steps to help make progress if it knows the requirements.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 fs/jbd2/journal.c     | 11 +----------
 fs/jbd2/transaction.c | 20 +++++++-------------
 2 files changed, 8 insertions(+), 23 deletions(-)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 1df94fabe4eb..878ed3e761f0 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -371,16 +371,7 @@ int jbd2_journal_write_metadata_buffer(transaction_t *transaction,
 	 */
 	J_ASSERT_BH(bh_in, buffer_jbddirty(bh_in));
 
-retry_alloc:
-	new_bh = alloc_buffer_head(GFP_NOFS);
-	if (!new_bh) {
-		/*
-		 * Failure is not an option, but __GFP_NOFAIL is going
-		 * away; so we retry ourselves here.
-		 */
-		congestion_wait(BLK_RW_ASYNC, HZ/50);
-		goto retry_alloc;
-	}
+	new_bh = alloc_buffer_head(GFP_NOFS|__GFP_NOFAIL);
 
 	/* keep subsequent assertions sane */
 	atomic_set(&new_bh->b_count, 1);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 5f09370c90a8..dac4523fa142 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -278,22 +278,16 @@ static int start_this_handle(journal_t *journal, handle_t *handle,
 
 alloc_transaction:
 	if (!journal->j_running_transaction) {
+		/*
+		 * If __GFP_FS is not present, then we may be being called from
+		 * inside the fs writeback layer, so we MUST NOT fail.
+		 */
+		if ((gfp_mask & __GFP_FS) == 0)
+			gfp_mask |= __GFP_NOFAIL;
 		new_transaction = kmem_cache_zalloc(transaction_cache,
 						    gfp_mask);
-		if (!new_transaction) {
-			/*
-			 * If __GFP_FS is not present, then we may be
-			 * being called from inside the fs writeback
-			 * layer, so we MUST NOT fail. Since
-			 * __GFP_NOFAIL is going away, we will arrange
-			 * to retry the allocation ourselves.
-			 */
-			if ((gfp_mask & __GFP_FS) == 0) {
-				congestion_wait(BLK_RW_ASYNC, HZ/50);
-				goto alloc_transaction;
-			}
+		if (!new_transaction)
 			return -ENOMEM;
-		}
 	}
 
 	jbd_debug(3, "New handle %p going live.\n", handle);
-- 
2.1.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls?
  2015-02-20 23:15                         ` Dave Chinner
@ 2015-02-21 11:12                           ` Tetsuo Handa
  -1 siblings, 0 replies; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-21 11:12 UTC (permalink / raw)
  To: david
  Cc: hannes, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes,
	akpm, torvalds

My main issue is

  c) whether to oom-kill more processes when the OOM victim cannot be
     terminated presumably due to the OOM killer deadlock.

Dave Chinner wrote:
> On Fri, Feb 20, 2015 at 07:36:33PM +0900, Tetsuo Handa wrote:
> > Dave Chinner wrote:
> > > I really don't care about the OOM Killer corner cases - it's
> > > completely the wrong way line of development to be spending time on
> > > and you aren't going to convince me otherwise. The OOM killer a
> > > crutch used to justify having a memory allocation subsystem that
> > > can't provide forward progress guarantee mechanisms to callers that
> > > need it.
> >
> > I really care about the OOM Killer corner cases, for I'm
> >
> > (1) seeing trouble cases which occurred in enterprise systems
> >     under OOM conditions
>
> You reach OOM, then your SLAs are dead and buried. Reboot the
> box - its a much more reliable way of returning to a working system
> than playing Russian Roulette with the OOM killer.

What Service Level Agreements? Such troubles are occurring on RHEL systems
where users are not sitting in front of the console. Unless somebody is
sitting in front of the console in order to do SysRq-b when troubles
occur, the down time of system will become significantly longer.

What mechanisms are available for minimizing the down time of system
when troubles under OOM condition occur? Software/hardware watchdog?
Indeed they may help, but they may be triggered prematurely when the
system has not entered into the OOM condition. Only the OOM killer knows.
> > (2) trying to downgrade OOM "Deadlock or Genocide" attacks (which
> >     an unprivileged user with a login shell can trivially trigger
> >     since Linux 2.0) to OOM "Genocide" attacks in order to allow
> >     OOM-unkillable daemons to restart OOM-killed processes
> >
> > (3) waiting for a bandaid for (2) in order to propose changes for
> >     mitigating OOM "Genocide" attacks (as bad guys will find how to
> >     trigger OOM "Deadlock or Genocide" attacks from changes for
> >     mitigating OOM "Genocide" attacks)
>
> Which is yet another indication that the OOM killer is the wrong
> solution to the "lack of forward progress" problem. Any one can
> generate enough memory pressure to trigger the OOM killer; we can't
> prevent that from occurring when the OOM killer can be invoked by
> user processes.
>

We have memory cgroups to reduce the possibility of triggering the OOM
killer, though there will be several bugs remaining in RHEL kernels
which make administrators hesitate to use memory cgroups.

> > I started posting to linux-mm ML in order to make forward progress
> > about (1) and (2). I don't want the memory allocation subsystem to
> > lock up an entire system by indefinitely disabling memory releasing
> > mechanism provided by the OOM killer.
> >
> > > I've proposed a method of providing this forward progress guarantee
> > > for subsystems of arbitrary complexity, and this removes the
> > > dependency on the OOM killer for fowards allocation progress in such
> > > contexts (e.g. filesystems). We should be discussing how to
> > > implement that, not what bandaids we need to apply to the OOM
> > > killer. I want to fix the underlying problems, not push them under
> > > the OOM-killer bus...
> >
> > I'm fine with that direction for new kernels provided that a simple
> > bandaid which can be backported to distributor kernels for making
> > OOM "Deadlock" attacks impossible is implemented. Therefore, I'm
> > discussing what bandaids we need to apply to the OOM killer.
>
> The band-aids being proposed are worse than the problem they are
> intended to cover up. In which case, the band-aids should not be
> applied.
>

The problem is simple. The /proc/sys/vm/panic_on_oom == 0 setting does not
help if the OOM killer fails to determine the correct task to kill and to
allow it access to memory reserves. The OOM killer then waits forever under
the OOM deadlock condition rather than triggering a kernel panic.

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/2/html/Realtime_Tuning_Guide/sect-Realtime_Tuning_Guide-General_System_Tuning-Swapping_and_Out_Of_Memory_Tips.html
says that "Usually, oom_killer can kill rogue processes and the system will
survive." but says nothing about what to do when we hit the OOM killer
deadlock condition.

My band-aids allow the OOM killer to trigger a kernel panic (followed
optionally by kdump and an automatic reboot) for people who want to reboot
the box when the default /proc/sys/vm/panic_on_oom == 0 setting failed to
kill rogue processes, and let the system keep running for people who want
it to survive when the OOM killer failed to determine the correct task to
kill and allow it access to memory reserves.
Not only we cannot expect that the OOM killer messages being saved to
/var/log/messages under the OOM killer deadlock condition, but also
we do not emit the OOM killer messages if we hit

void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
		      unsigned int points, unsigned long totalpages,
		      struct mem_cgroup *memcg, nodemask_t *nodemask,
		      const char *message)
{
	struct task_struct *victim = p;
	struct task_struct *child;
	struct task_struct *t;
	struct mm_struct *mm;
	unsigned int victim_points = 0;
	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
					      DEFAULT_RATELIMIT_BURST);

	/*
	 * If the task is already exiting, don't alarm the sysadmin or kill
	 * its children or threads, just set TIF_MEMDIE so it can die quickly
	 */
	if (task_will_free_mem(p)) { /***** _THIS_ _CONDITION_ *****/
		set_tsk_thread_flag(p, TIF_MEMDIE);
		put_task_struct(p);
		return;
	}

	if (__ratelimit(&oom_rs))
		dump_header(p, gfp_mask, order, memcg, nodemask);

	task_lock(p);
	pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
		message, task_pid_nr(p), p->comm, points);
	task_unlock(p);

followed by entering into the OOM killer deadlock condition. This is
annoying for me because neither serial console nor netconsole helps
finding out that the system entered into the OOM condition.

If you want to stop people from playing Russian Roulette with the OOM
killer, please remove the OOM killer code entirely from RHEL kernels so
that people must use their systems with hardcoded
/proc/sys/vm/panic_on_oom == 1 setting. Can you do it?

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>

^ permalink raw reply	[flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls?
  2015-02-21 11:12                           ` Tetsuo Handa
@ 2015-02-21 21:48                             ` Dave Chinner
  -1 siblings, 0 replies; 276+ messages in thread
From: Dave Chinner @ 2015-02-21 21:48 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: hannes, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes,
	akpm, torvalds

On Sat, Feb 21, 2015 at 08:12:08PM +0900, Tetsuo Handa wrote:
> My main issue is
>
>   c) whether to oom-kill more processes when the OOM victim cannot be
>      terminated presumably due to the OOM killer deadlock.
>
> Dave Chinner wrote:
> > On Fri, Feb 20, 2015 at 07:36:33PM +0900, Tetsuo Handa wrote:
> > > Dave Chinner wrote:
> > > > I really don't care about the OOM Killer corner cases - it's
> > > > completely the wrong way line of development to be spending time on
> > > > and you aren't going to convince me otherwise. The OOM killer a
> > > > crutch used to justify having a memory allocation subsystem that
> > > > can't provide forward progress guarantee mechanisms to callers that
> > > > need it.
> > >
> > > I really care about the OOM Killer corner cases, for I'm
> > >
> > > (1) seeing trouble cases which occurred in enterprise systems
> > >     under OOM conditions
> >
> > You reach OOM, then your SLAs are dead and buried. Reboot the
> > box - its a much more reliable way of returning to a working system
> > than playing Russian Roulette with the OOM killer.
>
> What Service Level Agreements? Such troubles are occurring on RHEL systems
> where users are not sitting in front of the console. Unless somebody is
> sitting in front of the console in order to do SysRq-b when troubles
> occur, the down time of system will become significantly longer.
>
> What mechanisms are available for minimizing the down time of system
> when troubles under OOM condition occur? Software/hardware watchdog?
> Indeed they may help, but they may be triggered prematurely when the
> system has not entered into the OOM condition. Only the OOM killer knows.
# echo 1 > /proc/sys/vm/panic_on_oom

....

> We have memory cgroups to reduce the possibility of triggering the OOM
> killer, though there will be several bugs remaining in RHEL kernels
> which make administrators hesitate to use memory cgroups.

Fix upstream first, then worry about vendor kernels.

....

> Not only we cannot expect that the OOM killer messages being saved to
> /var/log/messages under the OOM killer deadlock condition, but also

CONFIG_PSTORE=y and configure appropriately from there.

> we do not emit the OOM killer messages if we hit

So add a warning.

> If you want to stop people from playing Russian Roulette with the OOM
> killer, please remove the OOM killer code entirely from RHEL kernels so that
> people must use their systems with hardcoded /proc/sys/vm/panic_on_oom == 1
> setting. Can you do it?

No. You need to go through vendor channels to get a vendor kernel config
change made.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 276+ messages in thread
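For reference, Dave's suggested behaviour can be made persistent across reboots with a sysctl configuration fragment rather than a one-shot echo. This is a sketch: the paths are the standard procfs sysctls, but the 10-second panic timeout is an arbitrary example value, not something from the thread.

```
# /etc/sysctl.conf fragment: panic instead of invoking the OOM killer,
# then reboot automatically 10 seconds after the panic (0 = hang forever).
vm.panic_on_oom = 1
kernel.panic = 10
```

Combined with kdump, this gives a crash dump plus an automatic reboot instead of an indefinite OOM-killer stall.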
* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19 22:52             ` Dave Chinner
@ 2015-02-21 23:52               ` Johannes Weiner
  -1 siblings, 0 replies; 276+ messages in thread
From: Johannes Weiner @ 2015-02-21 23:52 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, akpm, torvalds

On Fri, Feb 20, 2015 at 09:52:17AM +1100, Dave Chinner wrote:
> I will actively work around aanything that causes filesystem memory
> pressure to increase the chance of oom killer invocations. The OOM
> killer is not a solution - it is, by definition, a loose cannon and
> so we should be reducing dependencies on it.

Once we have a better-working alternative, sure.

> I really don't care about the OOM Killer corner cases - it's
> completely the wrong way line of development to be spending time on
> and you aren't going to convince me otherwise. The OOM killer a
> crutch used to justify having a memory allocation subsystem that
> can't provide forward progress guarantee mechanisms to callers that
> need it.

We can provide this. Are all these callers able to preallocate?
---
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 51bd1e72a917..af81b8a67651 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -380,6 +380,10 @@ extern void free_kmem_pages(unsigned long addr, unsigned int order);
 #define __free_page(page) __free_pages((page), 0)
 #define free_page(addr) free_pages((addr), 0)
 
+void register_private_page(struct page *page, unsigned int order);
+int alloc_private_pages(gfp_t gfp_mask, unsigned int order, unsigned int nr);
+void free_private_pages(void);
+
 void page_alloc_init(void);
 void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
 void drain_all_pages(struct zone *zone);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d77432e14ff..1fe390779f23 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1545,6 +1545,8 @@ struct task_struct {
 #endif
 
 	/* VM state */
+	struct list_head private_pages;
+
 	struct reclaim_state *reclaim_state;
 
 	struct backing_dev_info *backing_dev_info;
diff --git a/kernel/fork.c b/kernel/fork.c
index cf65139615a0..b6349b0e5da2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1308,6 +1308,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	memset(&p->rss_stat, 0, sizeof(p->rss_stat));
 #endif
 
+	INIT_LIST_HEAD(&p->private_pages);
+
 	p->default_timer_slack_ns = current->timer_slack_ns;
 
 	task_io_accounting_init(&p->ioac);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a47f0b229a1a..546db4e0da75 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -490,12 +490,10 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
 static inline void set_page_order(struct page *page, unsigned int order)
 {
 	set_page_private(page, order);
-	__SetPageBuddy(page);
 }
 
 static inline void rmv_page_order(struct page *page)
 {
-	__ClearPageBuddy(page);
 	set_page_private(page, 0);
 }
 
@@ -617,6 +615,7 @@ static inline void __free_one_page(struct page *page,
 			list_del(&buddy->lru);
 			zone->free_area[order].nr_free--;
 			rmv_page_order(buddy);
+			__ClearPageBuddy(buddy);
 		}
 		combined_idx = buddy_idx & page_idx;
 		page = page + (combined_idx - page_idx);
@@ -624,6 +623,7 @@ static inline void __free_one_page(struct page *page,
 		order++;
 	}
 	set_page_order(page, order);
+	__SetPageBuddy(page);
 
 	/*
 	 * If this is not the largest possible page, check if the buddy
@@ -924,6 +924,7 @@ static inline void expand(struct zone *zone, struct page *page,
 		list_add(&page[size].lru, &area->free_list[migratetype]);
 		area->nr_free++;
 		set_page_order(&page[size], high);
+		__SetPageBuddy(page);
 	}
 }
 
@@ -1015,6 +1016,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 							struct page, lru);
 		list_del(&page->lru);
 		rmv_page_order(page);
+		__ClearPageBuddy(page);
 		area->nr_free--;
 		expand(zone, page, order, current_order, area, migratetype);
 		set_freepage_migratetype(page, migratetype);
@@ -1212,6 +1214,7 @@ __rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
 		/* Remove the page from the freelists */
 		list_del(&page->lru);
 		rmv_page_order(page);
+		__ClearPageBuddy(page);
 
 		expand(zone, page, order, current_order, area, buddy_type);
 
@@ -1598,6 +1601,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
 	list_del(&page->lru);
 	zone->free_area[order].nr_free--;
 	rmv_page_order(page);
+	__ClearPageBuddy(page);
 
 	/* Set the pageblock if the isolated page is at least a pageblock */
 	if (order >= pageblock_order - 1) {
@@ -2504,6 +2508,40 @@ retry:
 	return page;
 }
 
+/* Try to allocate from the caller's private memory reserves */
+static inline struct page *
+__alloc_pages_private(gfp_t gfp_mask, unsigned int order,
+		      const struct alloc_context *ac)
+{
+	unsigned int uninitialized_var(alloc_order);
+	struct page *page = NULL;
+	struct page *p;
+
+	/* Dopy, but this is a slowpath right before OOM */
+	list_for_each_entry(p, &current->private_pages, lru) {
+		int o = page_order(p);
+
+		if (o >= order && (!page || o < alloc_order)) {
+			page = p;
+			alloc_order = o;
+		}
+	}
+	if (!page)
+		return NULL;
+
+	list_del(&page->lru);
+	rmv_page_order(page);
+
+	/* Give back the remainder */
+	while (alloc_order > order) {
+		alloc_order--;
+		set_page_order(&page[1 << alloc_order], alloc_order);
+		list_add(&page[1 << alloc_order].lru, &current->private_pages);
+	}
+
+	return page;
+}
+
 /*
  * This is called in the allocator slow-path if the allocation request is of
  * sufficient urgency to ignore watermarks and take other desperate measures
@@ -2753,9 +2791,13 @@ retry:
 	/*
 	 * If we fail to make progress by freeing individual
 	 * pages, but the allocation wants us to keep going,
-	 * start OOM killing tasks.
+	 * dip into private reserves, or start OOM killing.
 	 */
 	if (!did_some_progress) {
+		page = __alloc_pages_private(gfp_mask, order, ac);
+		if (page)
+			goto got_pg;
+
 		page = __alloc_pages_may_oom(gfp_mask, order, ac,
 					     &did_some_progress);
 		if (page)
@@ -3046,6 +3088,82 @@ void free_pages_exact(void *virt, size_t size)
 EXPORT_SYMBOL(free_pages_exact);
 
 /**
+ * alloc_private_pages - allocate private memory reserve pages
+ * @gfp_mask: gfp flags for the allocations
+ * @order: order of pages to allocate
+ * @nr: number of pages to allocate
+ *
+ * This allocates @nr pages of order @order as an emergency reserve of
+ * the calling task, to be used by the page allocator if an allocation
+ * would otherwise fail.
+ *
+ * The caller is responsible for calling free_private_pages() once the
+ * reserves are no longer required.
+ */
+int alloc_private_pages(gfp_t gfp_mask, unsigned int order, unsigned int nr)
+{
+	struct page *page, *page2;
+	LIST_HEAD(pages);
+	unsigned int i;
+
+	for (i = 0; i < nr; i++) {
+		page = alloc_pages(gfp_mask, order);
+		if (!page)
+			goto error;
+		set_page_order(page, order);
+		list_add(&page->lru, &pages);
+	}
+
+	list_splice(&pages, &current->private_pages);
+	return 0;
+
+error:
+	list_for_each_entry_safe(page, page2, &pages, lru) {
+		list_del(&page->lru);
+		rmv_page_order(page);
+		__free_pages(page, order);
+	}
+	return -ENOMEM;
+}
+
+/**
+ * register_private_page - register a private memory reserve page
+ * @page: pre-allocated page
+ * @order: @page's order
+ *
+ * This registers @page as an emergency reserve of the calling task,
+ * to be used by the page allocator if an allocation would otherwise
+ * fail.
+ *
+ * The caller is responsible for calling free_private_pages() once the
+ * reserves are no longer required.
+ */
+void register_private_page(struct page *page, unsigned int order)
+{
+	set_page_order(page, order);
+	list_add(&page->lru, &current->private_pages);
+}
+
+/**
+ * free_private_pages - free all private memory reserve pages
+ *
+ * Frees all (remaining) pages of the calling task's memory reserves
+ * established by alloc_private_pages() and register_private_page().
+ */ +void free_private_pages(void) +{ + struct page *page, *page2; + + list_for_each_entry_safe(page, page2, ¤t->private_pages, lru) { + int order = page_order(page); + + list_del(&page->lru); + rmv_page_order(page); + __free_pages(page, order); + } +} + +/** * nr_free_zone_pages - count number of pages beyond high watermark * @offset: The zone index of the highest zone * @@ -6551,6 +6669,7 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn) #endif list_del(&page->lru); rmv_page_order(page); + __ClearPageBuddy(page); zone->free_area[order].nr_free--; for (i = 0; i < (1 << order); i++) SetPageReserved((page+i)); _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 276+ messages in thread
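For readers following the algorithm rather than the kernel plumbing, the core of __alloc_pages_private() above - pick the smallest reserved block that still satisfies the request, then return the split-off buddy remainder to the reserve - can be sketched in plain userspace C. The struct reserve model and every name below are illustrative only, not kernel API:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace sketch of the best-fit + split logic in the patch above.
 * A reserve is modeled as an array of block orders instead of struct
 * page lists; allocation picks the smallest block whose order covers
 * the request and hands the remainder back as smaller blocks, like
 * the "give back the remainder" loop in __alloc_pages_private().
 */
#define RESERVE_MAX 32

struct reserve {
	int order[RESERVE_MAX];	/* orders of the reserved blocks */
	int nr;			/* number of blocks in the reserve */
};

/* Take one block of 2^@order pages; returns 0 on success, -1 if none fits. */
static int reserve_alloc(struct reserve *r, int order)
{
	int best = -1;

	/* Best fit: smallest block that is still large enough. */
	for (int i = 0; i < r->nr; i++)
		if (r->order[i] >= order &&
		    (best < 0 || r->order[i] < r->order[best]))
			best = i;
	if (best < 0)
		return -1;

	int found = r->order[best];

	/* Remove the chosen block (position in the array is irrelevant). */
	r->order[best] = r->order[--r->nr];

	/* Give back the remainder as progressively smaller buddies. */
	while (found > order)
		r->order[r->nr++] = --found;
	return 0;
}

/* Total pages held in the reserve. */
static long reserve_pages(const struct reserve *r)
{
	long total = 0;

	for (int i = 0; i < r->nr; i++)
		total += 1L << r->order[i];
	return total;
}
```

Allocating order-0 from a reserve holding a single order-3 block leaves order-2, order-1 and order-0 buddies behind (7 of the original 8 pages), mirroring how the patch splits a too-large private page.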
* Re: How to handle TIF_MEMDIE stalls?
@ 2015-02-21 23:52 ` Johannes Weiner
  0 siblings, 0 replies; 276+ messages in thread
From: Johannes Weiner @ 2015-02-21 23:52 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, mhocko, dchinner, linux-mm, rientjes, oleg, akpm,
	mgorman, torvalds, xfs

On Fri, Feb 20, 2015 at 09:52:17AM +1100, Dave Chinner wrote:
> I will actively work around anything that causes filesystem memory
> pressure to increase the chance of oom killer invocations. The OOM
> killer is not a solution - it is, by definition, a loose cannon and
> so we should be reducing dependencies on it.

Once we have a better-working alternative, sure.

> I really don't care about the OOM Killer corner cases - it's
> completely the wrong line of development to be spending time on
> and you aren't going to convince me otherwise. The OOM killer is a
> crutch used to justify having a memory allocation subsystem that
> can't provide forward progress guarantee mechanisms to callers that
> need it.

We can provide this. Are all these callers able to preallocate?
* Re: How to handle TIF_MEMDIE stalls?
  2015-02-21 23:52 ` Johannes Weiner
@ 2015-02-23  0:45 ` Dave Chinner
  -1 siblings, 0 replies; 276+ messages in thread
From: Dave Chinner @ 2015-02-23 0:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, akpm, torvalds

On Sat, Feb 21, 2015 at 06:52:27PM -0500, Johannes Weiner wrote:
> On Fri, Feb 20, 2015 at 09:52:17AM +1100, Dave Chinner wrote:
> > I will actively work around anything that causes filesystem memory
> > pressure to increase the chance of oom killer invocations. The OOM
> > killer is not a solution - it is, by definition, a loose cannon and
> > so we should be reducing dependencies on it.
>
> Once we have a better-working alternative, sure.

Great, but first a simple request: please stop writing code and
instead start architecting a solution to the problem. i.e. we need a
design, and we need it documented, before code gets written. If you
watched my recent LCA talk, then you'll understand what I mean when I
say: stop programming and start engineering.

> > I really don't care about the OOM Killer corner cases - it's
> > completely the wrong line of development to be spending time on
> > and you aren't going to convince me otherwise. The OOM killer is a
> > crutch used to justify having a memory allocation subsystem that
> > can't provide forward progress guarantee mechanisms to callers that
> > need it.
>
> We can provide this. Are all these callers able to preallocate?

Anything that allocates in transaction context (and therefore is
GFP_NOFS by definition) can preallocate at transaction reservation
time. However, preallocation is dumb, complex, CPU and memory
intensive, and will have a *massive* impact on performance.
Allocating 10-100 pages to a reserve which we will almost *never
use*, and then freeing them again *on every single transaction*, is a
lot of unnecessary additional fast path overhead. Hence a
"preallocate for every context" reserve pool is not a viable
solution.

And, really, "reservation" != "preallocation". Maybe it's my
filesystem background, but those two things are vastly different.
Reservations are simply an *accounting* of the maximum amount of a
reserve required by an operation to guarantee forwards progress. In
filesystems, we do this for log space (transactions) and some do it
for filesystem space (e.g. delayed allocation needs correct ENOSPC
detection so we don't overcommit disk space). The VM already has such
concepts (e.g. watermarks and things like min_free_kbytes) that it
uses to ensure that there are sufficient reserves for certain types
of allocations to succeed.

A reserve memory pool is no different - every time a memory reserve
occurs, a watermark is lifted to accommodate it, and the transaction
is not allowed to proceed until the amount of free memory exceeds
that watermark. The memory allocation subsystem then only allows
correctly marked *allocations* to allocate pages from the reserve
that watermark protects. e.g. only allocations using __GFP_RESERVE
are allowed to dip into the reserve pool.

By using watermarks, freeing of memory will automatically top up the
reserve pool, which means that we guarantee that reclaimable memory
allocated for demand paging during transactions doesn't deplete the
reserve pool permanently. As a result, when there is plenty of free
and/or reclaimable memory, the reserve pool watermarks will have
almost zero impact on performance and behaviour.

Further, because it's just accounting and behavioural thresholds,
this allows the mm subsystem to control how the reserve pool is
accounted internally. e.g. clean, reclaimable pages in the page cache
could serve as reserve pool pages as they can be immediately
reclaimed for allocation. This could be achieved by setting reclaim
targets first to the reserve pool watermark, with the second target
being enough pages to satisfy the current allocation.

And, FWIW, there's nothing stopping this mechanism from having
order-based reserve thresholds. e.g. IB could really do with a 64k
reserve pool threshold, and hence help solve the long-standing
problems they have with filling the receive ring in GFP_ATOMIC
context...

Sure, that's looking further down the track, but my point still
remains: we need a viable long term solution to this problem. Maybe
reservations are not the solution, but I don't see anyone else who is
thinking about how to address this architectural problem at a system
level right now. We need to design and document the model first, then
review it, then we can start working at the code level to implement
the solution we've designed.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 276+ messages in thread
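The watermark-based reservation accounting Dave describes above can be illustrated with a toy model: reserving memory merely lifts an effective watermark, and only allocations flagged as reserve-entitled may dip below the lifted mark. Everything here (GFP_RESERVE, struct zone_acct, all function names) is invented for illustration, and it deliberately ignores reclaim, multiple zones, and concurrency:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of reservation-as-accounting: no pages are set aside;
 * a reservation just raises the watermark that ordinary allocations
 * must stay above, so the reserved pages remain on the free list
 * until a reserve-entitled caller actually needs them.
 */
struct zone_acct {
	long free_pages;	/* currently free pages */
	long min_wmark;		/* ordinary minimum watermark */
	long reserved;		/* sum of outstanding reservations */
};

#define GFP_RESERVE 0x1		/* caller holds a reservation */

/* Reserve @nr pages: lift the effective watermark, taking no pages. */
static bool zone_reserve(struct zone_acct *z, long nr)
{
	if (z->free_pages < z->min_wmark + z->reserved + nr)
		return false;	/* would have to reclaim first */
	z->reserved += nr;
	return true;
}

static void zone_unreserve(struct zone_acct *z, long nr)
{
	z->reserved -= nr;
}

/* May an allocation of @nr pages proceed under @flags? */
static bool zone_can_alloc(const struct zone_acct *z, long nr, int flags)
{
	long wmark = z->min_wmark;

	/* Unflagged allocations must stay above the lifted watermark. */
	if (!(flags & GFP_RESERVE))
		wmark += z->reserved;
	return z->free_pages - nr >= wmark;
}
```

With 100 pages free and a minimum watermark of 10, a 50-page reservation blocks a plain 45-page allocation (it would dip below the lifted mark of 60) while the same allocation under GFP_RESERVE goes through, which is the asymmetry the design depends on.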
* Re: How to handle TIF_MEMDIE stalls?
  2015-02-23  0:45 ` Dave Chinner
@ 2015-02-23  1:29 ` Andrew Morton
  -1 siblings, 0 replies; 276+ messages in thread
From: Andrew Morton @ 2015-02-23 1:29 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Johannes Weiner, Tetsuo Handa, dchinner, oleg, xfs, mhocko,
	linux-mm, mgorman, rientjes, torvalds

On Mon, 23 Feb 2015 11:45:21 +1100 Dave Chinner <david@fromorbit.com> wrote:

> > > I really don't care about the OOM Killer corner cases - it's
> > > completely the wrong line of development to be spending time on
> > > and you aren't going to convince me otherwise. The OOM killer is a
> > > crutch used to justify having a memory allocation subsystem that
> > > can't provide forward progress guarantee mechanisms to callers that
> > > need it.
> >
> > We can provide this. Are all these callers able to preallocate?
>
> Anything that allocates in transaction context (and therefore is
> GFP_NOFS by definition) can preallocate at transaction reservation
> time. However, preallocation is dumb, complex, CPU and memory
> intensive and will have a *massive* impact on performance.
> Allocating 10-100 pages to a reserve which we will almost *never
> use* and then freeing them again *on every single transaction* is a lot
> of unnecessary additional fast path overhead. Hence a "preallocate
> for every context" reserve pool is not a viable solution.

Yup.

> Reservations are simply an *accounting* of the maximum amount of a
> reserve required by an operation to guarantee forwards progress. In
> filesystems, we do this for log space (transactions) and some do it
> for filesystem space (e.g. delayed allocation needs correct ENOSPC
> detection so we don't overcommit disk space). The VM already has
> such concepts (e.g. watermarks and things like min_free_kbytes) that
> it uses to ensure that there are sufficient reserves for certain
> types of allocations to succeed.

Yes, as we do for __GFP_HIGH and PF_MEMALLOC etc. Add a dynamic
reserve. So to reserve N pages we increase the page allocator dynamic
reserve by N, do some reclaim if necessary, then deposit N tokens into
the caller's task_struct (it'll be a set of zone/nr-pages tuples, I
suppose).

When allocating pages the caller should drain its reserves in
preference to dipping into the regular freelist. This guy has already
done his reclaim and shouldn't be penalised a second time. I guess
Johannes's preallocation code should switch to doing this for the same
reason, plus the fact that snipping a page off
task_struct.prealloc_pages is super-fast and needs to be done sometime
anyway, so why not do it by default.

Both reservation and preallocation are vulnerable to deadlocks -
10,000 tasks all trying to reserve/prealloc 100 pages, they all have
50 pages and we ran out of memory. Whoops. We can undeadlock by
returning ENOMEM, but I suspect there will still be problematic
situations where massive numbers of pages are temporarily AWOL.
Perhaps some form of queuing and throttling will be needed, to limit
the peak number of reserved pages. Per zone, I guess.

And it'll be a huge pain handling order>0 pages. I'd be inclined to
make it order-0 only, and tell the lamer callers that
vmap-is-thattaway. Alas, one lame caller is slub.

But the biggest issue is how the heck does a caller work out how many
pages to reserve/prealloc? Even a single sb_bread() - it's sitting on
loop on a sparse NTFS file on loop on a five-deep DM stack on a
six-deep MD stack on loop on NFS on an eleventy-deep networking
stack. And then there will be an unknown number of slab allocations
of unknown size with unknown slabs-per-page rules - how many pages
are needed for them? And to make it much worse, how many pages of
which orders? Bless its heart, slub will go and use a 1-order page
for allocations which should have been in 0-order pages...

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 276+ messages in thread
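Andrew's token scheme - raise a global dynamic reserve by N, deposit N tokens in the task, drain tokens in preference to the regular freelist, and fail the reservation instead of deadlocking - can be modeled minimally as follows. All names here are hypothetical; this is an accounting sketch, not kernel code, and it leaves out zones, reclaim, and the queuing/throttling he raises:

```c
#include <assert.h>

/*
 * Toy model of per-task reserve tokens: a reservation increases the
 * pool's dynamic reserve (pages promised to token holders) and hands
 * the task matching tokens; allocation spends a token, if available,
 * so the task is not charged against the unreserved pool twice.
 */
struct pool {
	long free_pages;	/* shared free list */
	long dyn_reserve;	/* pages promised to token holders */
};

struct task {
	long tokens;		/* pages this task may take unconditionally */
};

/* Returns 0 on success, -1 (an ENOMEM stand-in) to avoid deadlock. */
static int task_reserve(struct pool *p, struct task *t, long nr)
{
	/* Only unpromised pages may back a new reservation. */
	if (p->free_pages - p->dyn_reserve < nr)
		return -1;
	p->dyn_reserve += nr;
	t->tokens += nr;
	return 0;
}

/* Allocate one page, draining the task's tokens first. */
static int task_alloc_page(struct pool *p, struct task *t)
{
	if (p->free_pages == 0)
		return -1;
	p->free_pages--;
	if (t->tokens > 0) {	/* already paid for at reserve time */
		t->tokens--;
		p->dyn_reserve--;
	}
	return 0;
}
```

The failure mode Andrew flags shows up directly: with 10 free pages, a second 6-page reservation is refused once the first has promised 6, rather than both tasks stalling halfway.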
* Re: How to handle TIF_MEMDIE stalls? 2015-02-23 1:29 ` Andrew Morton @ 2015-02-23 7:32 ` Dave Chinner -1 siblings, 0 replies; 276+ messages in thread From: Dave Chinner @ 2015-02-23 7:32 UTC (permalink / raw) To: Andrew Morton Cc: Johannes Weiner, Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, torvalds On Sun, Feb 22, 2015 at 05:29:30PM -0800, Andrew Morton wrote: > On Mon, 23 Feb 2015 11:45:21 +1100 Dave Chinner <david@fromorbit.com> wrote: > > > > > I really don't care about the OOM Killer corner cases - it's > > > > completely the wrong way line of development to be spending time on > > > > and you aren't going to convince me otherwise. The OOM killer a > > > > crutch used to justify having a memory allocation subsystem that > > > > can't provide forward progress guarantee mechanisms to callers that > > > > need it. > > > > > > We can provide this. Are all these callers able to preallocate? > > > > Anything that allocates in transaction context (and therefor is > > GFP_NOFS by definition) can preallocate at transaction reservation > > time. However, preallocation is dumb, complex, CPU and memory > > intensive and will have a *massive* impact on performance. > > Allocating 10-100 pages to a reserve which we will almost *never > > use* and then free them again *on every single transaction* is a lot > > of unnecessary additional fast path overhead. Hence a "preallocate > > for every context" reserve pool is not a viable solution. > > Yup. > > > Reservations are simply an *accounting* of the maximum amount of a > > reserve required by an operation to guarantee forwards progress. In > > filesystems, we do this for log space (transactions) and some do it > > for filesystem space (e.g. delayed allocation needs correct ENOSPC > > detection so we don't overcommit disk space). The VM already has > > such concepts (e.g. 
watermarks and things like min_free_kbytes) that > > it uses to ensure that there are sufficient reserves for certain > > types of allocations to succeed. > > Yes, as we do for __GFP_HIGH and PF_MEMALLOC etc. Add a dynamic > reserve. So to reserve N pages we increase the page allocator dynamic > reserve by N, do some reclaim if necessary then deposit N tokens into > the caller's task_struct (it'll be a set of zone/nr-pages tuples I > suppose). > > When allocating pages the caller should drain its reserves in > preference to dipping into the regular freelist. This guy has already > done his reclaim and shouldn't be penalised a second time. I guess > Johannes's preallocation code should switch to doing this for the same > reason, plus the fact that snipping a page off > task_struct.prealloc_pages is super-fast and needs to be done sometime > anyway so why not do it by default. That is at odds with the requirements of demand paging, which allocates for objects that are reclaimable within the course of the transaction. The reserve is there to ensure forward progress for allocations for objects that aren't freed until after the transaction completes, but if we drain it for reclaimable objects we then have nothing left in the reserve pool when we actually need it. We do not know ahead of time if the object we are allocating is going to be modified and hence locked into the transaction. Hence we can't say "use the reserve for this *specific* allocation", and so the only guidance we can really give is "we will allocate and *permanently consume* this much memory", and the reserve pool needs to cover that consumption to guarantee forwards progress. Forwards progress for all other allocations is guaranteed because they are reclaimable objects - they are either freed directly back to their source (slab, heap, page lists) or they are freed by shrinkers once they have been released from the transaction. 
Hence we need allocations to come from the free list and trigger reclaim, regardless of the fact there is a reserve pool there. The reserve pool needs to be a last resort once there are no other avenues to allocate memory. i.e. it would be used to replace the OOM killer for GFP_NOFAIL allocations. > Both reservation and preallocation are vulnerable to deadlocks - 10,000 > tasks all trying to reserve/prealloc 100 pages, they all have 50 pages > and we ran out of memory. Whoops. Yes, that's the big problem with preallocation, as well as your proposed "deplete the reserved memory first" approach. They *require* up front "preallocation" of free memory, either directly by the application, or internally by the mm subsystem. Hence my comments about appropriate classification of "reserved memory". Reserved memory does not necessarily need to be on the free list. It could be "immediately reclaimable" memory, so that reserving memory doesn't need to immediately reclaim memory, but it can be pulled from the reclaimable memory reserves when memory pressure occurs. If there is no memory pressure, we do nothing because we have no need to do anything.... > We can undeadlock by returning ENOMEM but I suspect there will > still be problematic situations where massive numbers of pages are > temporarily AWOL. Perhaps some form of queuing and throttling > will be needed, Yes, I think that is necessary, but I don't see it as necessary in the MM subsystem. XFS already has a ticket-based queue mechanism for throttling concurrent access to ensure we don't overcommit log space and I'd want to tie the two together... > to limit the peak number of reserved pages. Per > zone, I guess. Internal implementation issue that I don't really care about. When it comes to guaranteeing memory allocation, global context is all I care about. Locality of allocation simply doesn't matter; we want that page we reserved, no matter where it is located. > And it'll be a huge pain handling order>0 pages. 
I'd be inclined > to make it order-0 only, and tell the lamer callers that > vmap-is-thattaway. Alas, one lame caller is slub. Sure, but vmap requires GFP_KERNEL memory allocation and we're talking about allocation in transactions, which are GFP_NOFS. I've lost count of the number of times we've asked for that problem to be fixed. Refusing to fix it has simply led to the growing use of ugly hacks around that problem (i.e. memalloc_noio_save() and friends). > But the biggest issue is how the heck does a caller work out how > many pages to reserve/prealloc? Even a single sb_bread() - it's > sitting on loop on a sparse NTFS file on loop on a five-deep DM > stack on a six-deep MD stack on loop on NFS on an eleventy-deep > networking stack. Each subsystem needs to take care of itself first, then we can worry about esoteric stacking requirements. Besides, stacking requirements through the IO layer are still pretty trivial - we only need to guarantee single IO progress from the highest layer as it can be recycled again and again for every IO that needs to be done. And, because mempools already give that guarantee to most block devices and drivers, we won't need to reserve memory for most block devices to make forwards progress. It's only crazy "recurse through filesystem" configurations where this will be an issue. > And then there will be an unknown number of > slab allocations of unknown size with unknown slabs-per-page rules > - how many pages needed for them? However many pages are needed to allocate the number of objects we'll consume from the slab. > And to make it much worse, how > many pages of which orders? Bless its heart, slub will go and use > a 1-order page for allocations which should have been in 0-order > pages.. 
The majority of allocations will be order-0, though if we know that there are going to be significant numbers of high-order allocations, then it should be simple enough to tell the mm subsystem "need a reserve of 32 order-0, 4 order-1 and 1 order-3 allocations" and have memory compaction just do its stuff. But, IMO, we should cross that bridge when somebody actually needs reservations to be that specific.... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs
* Re: How to handle TIF_MEMDIE stalls? 2015-02-23 7:32 ` Dave Chinner @ 2015-02-27 18:24 ` Vlastimil Babka -1 siblings, 0 replies; 276+ messages in thread From: Vlastimil Babka @ 2015-02-27 18:24 UTC (permalink / raw) To: Dave Chinner, Andrew Morton Cc: Johannes Weiner, Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, torvalds On 02/23/2015 08:32 AM, Dave Chinner wrote: >> > And then there will be an unknown number of >> > slab allocations of unknown size with unknown slabs-per-page rules >> > - how many pages needed for them? > However many pages needed to allocate the number of objects we'll > consume from the slab. I think the best way is if slab could also learn to provide reserves for individual objects. Either just mark internally how many of them are reserved, if a sufficient number is free, or translate this to the page allocator reserves, as slab knows which order it uses for the given objects. >> > And to make it much worse, how >> > many pages of which orders? Bless its heart, slub will go and use >> > a 1-order page for allocations which should have been in 0-order >> > pages.. > The majority of allocations will be order-0, though if we know that > they are going to be significant numbers of high order allocations, > then it should be simple enough to tell the mm subsystem "need a > reserve of 32 order-0, 4 order-1 and 1 order-3 allocations" and have > memory compaction just do its stuff. But, IMO, we should cross that > bridge when somebody actually needs reservations to be that > specific.... Note that watermark checking for higher-order allocations is somewhat fuzzy compared to order-0 checks, but I guess some kind of reservations could work there too.
* Re: How to handle TIF_MEMDIE stalls? 2015-02-27 18:24 ` Vlastimil Babka @ 2015-02-28 0:03 ` Dave Chinner -1 siblings, 0 replies; 276+ messages in thread From: Dave Chinner @ 2015-02-28 0:03 UTC (permalink / raw) To: Vlastimil Babka Cc: Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, rientjes, Andrew Morton, torvalds On Fri, Feb 27, 2015 at 07:24:34PM +0100, Vlastimil Babka wrote: > On 02/23/2015 08:32 AM, Dave Chinner wrote: > >> > And then there will be an unknown number of > >> > slab allocations of unknown size with unknown slabs-per-page rules > >> > - how many pages needed for them? > > However many pages needed to allocate the number of objects we'll > > consume from the slab. > > I think the best way is if slab could also learn to provide reserves for > individual objects. Either just mark internally how many of them are reserved, > if a sufficient number is free, or translate this to the page allocator reserves, > as slab knows which order it uses for the given objects. Which is effectively what a slab-based mempool is. Mempools don't guarantee a reserve is available once it's been resized, however, and we'd have to have mempools configured for every type of allocation we are going to do. So from that perspective it's not really a solution. Further, the kmalloc heap is backed by slab caches. We do *lots* of variable-sized kmalloc allocations in transactions, the size of which aren't known until allocation time. In that case, we have to assume it's going to be a page per object, because the allocations could actually be that size. AFAICT, the worst case is a slab-backing page allocation for every slab object that is allocated, so we may as well cater for that case from the start... Cheers, Dave. -- Dave Chinner david@fromorbit.com
* Re: How to handle TIF_MEMDIE stalls? 2015-02-28 0:03 ` Dave Chinner @ 2015-02-28 15:17 ` Theodore Ts'o -1 siblings, 0 replies; 276+ messages in thread From: Theodore Ts'o @ 2015-02-28 15:17 UTC (permalink / raw) To: Dave Chinner Cc: Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, rientjes, Andrew Morton, torvalds, Vlastimil Babka On Sat, Feb 28, 2015 at 11:03:59AM +1100, Dave Chinner wrote: > > I think the best way is if slab could also learn to provide reserves for > > individual objects. Either just mark internally how many of them are reserved, > > if a sufficient number is free, or translate this to the page allocator reserves, > > as slab knows which order it uses for the given objects. > > Which is effectively what a slab-based mempool is. Mempools don't > guarantee a reserve is available once it's been resized, however, > and we'd have to have mempools configured for every type of > allocation we are going to do. So from that perspective it's not > really a solution. The bigger problem is that the upper layer making the reservation before it starts taking locks won't necessarily know exactly which slab objects it and all of the lower layers might need. So it's much more flexible, and requires less accuracy, if we can just request that (a) the mm subsystem reserves at least N pages, and (b) tell it that at this point in time, it's safe for the requesting subsystem to block until N pages are available. Can this be guaranteed to be accurate? No, of course not. And in some cases, it may not even be possible, since it might depend on whether the iSCSI device needs to reconnect to the target, or some sort of exception handling, before it can complete its I/O request. 
But it's better than what we have now, which is that once we've taken certain locks, and/or started a complex transaction, we can't really back out, so we end up looping, either using GFP_NOFAIL or around the memory allocation request, if there are still mm developers who are delusional enough, like King Canute, to say, "You must always be able to handle memory allocation failure at any point in the kernel and GFP_NOFAIL is an indication of a subsystem bug!" I can imagine using some adjustment factors, where a particularly voracious device might require a hint to the file system to boost its memory allocation estimate by 30%, or 50%. So yes, it's a very, *very* rough estimate. And if we guess wrong, we might end up having to loop a la GFP_NOFAIL anyway. But it's better than not having such an estimate. I also grant that this doesn't work very well for emergency writeback, or background writeback, where we can't and shouldn't block waiting for enough memory to become free, since page cleaning is one of the ways that we might be able to make memory available. But if that's the only problem we have, we're in good shape, since that can be solved by either (a) doing a better job throttling memory allocations or memory reservation requests in the first place, and/or (b) starting the background writeback much more aggressively and earlier. - Ted
* Re: How to handle TIF_MEMDIE stalls? 2015-02-23 7:32 ` Dave Chinner @ 2015-03-02 9:39 ` Vlastimil Babka -1 siblings, 0 replies; 276+ messages in thread From: Vlastimil Babka @ 2015-03-02 9:39 UTC (permalink / raw) To: Dave Chinner, Andrew Morton Cc: Johannes Weiner, Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, torvalds On 02/23/2015 08:32 AM, Dave Chinner wrote: > On Sun, Feb 22, 2015 at 05:29:30PM -0800, Andrew Morton wrote: >> On Mon, 23 Feb 2015 11:45:21 +1100 Dave Chinner <david@fromorbit.com> wrote: >> >> Yes, as we do for __GFP_HIGH and PF_MEMALLOC etc. Add a dynamic >> reserve. So to reserve N pages we increase the page allocator dynamic >> reserve by N, do some reclaim if necessary then deposit N tokens into >> the caller's task_struct (it'll be a set of zone/nr-pages tuples I >> suppose). >> >> When allocating pages the caller should drain its reserves in >> preference to dipping into the regular freelist. This guy has already >> done his reclaim and shouldn't be penalised a second time. I guess >> Johannes's preallocation code should switch to doing this for the same >> reason, plus the fact that snipping a page off >> task_struct.prealloc_pages is super-fast and needs to be done sometime >> anyway so why not do it by default. > > That is at odds with the requirements of demand paging, which > allocates for objects that are reclaimable within the course of the > transaction. The reserve is there to ensure forward progress for > allocations for objects that aren't freed until after the > transaction completes, but if we drain it for reclaimable objects we > then have nothing left in the reserve pool when we actually need it. > > We do not know ahead of time if the object we are allocating is > going to be modified and hence locked into the transaction. 
Hence we > can't say "use the reserve for this *specific* allocation", and so > the only guidance we can really give is "we will allocate and > *permanently consume* this much memory", and the reserve pool needs > to cover that consumption to guarantee forwards progress. I'm not sure I understand properly. You don't know if a specific allocation is permanent or reclaimable, but you can tell in advance how much in total will be permanent? Is it because you are conservative and assume everything will be permanent, or how? Can you at least at some later point in the transaction recognize that "OK, this object was not permanent after all" and tell mm that it can lower your reserve? > Forwards progress for all other allocations is guaranteed because > they are reclaimable objects - they are either freed directly back to > their source (slab, heap, page lists) or they are freed by shrinkers > once they have been released from the transaction. Which are the "all other allocations"? Above you wrote that all allocations are treated as potentially permanent. Also, how does the fact that an object is later reclaimable affect forward progress during its allocation? Or are you talking about allocations from contexts that don't use reserves? > Hence we need allocations to come from the free list and trigger > reclaim, regardless of the fact there is a reserve pool there. The > reserve pool needs to be a last resort once there are no other > avenues to allocate memory. i.e. it would be used to replace the OOM > killer for GFP_NOFAIL allocations. That's probably going to result in a lot of wasted memory and I still don't understand why it's needed, if your reserve estimate is guaranteed to cover the worst case. >> Both reservation and preallocation are vulnerable to deadlocks - 10,000 >> tasks all trying to reserve/prealloc 100 pages, they all have 50 pages >> and we ran out of memory. Whoops. 
> > Yes, that's the big problem with preallocation, as well as your > proposed "deplete the reserved memory first" approach. They > *require* up front "preallocation" of free memory, either directly > by the application, or internally by the mm subsystem. I don't see why it would deadlock, if during reserve time the mm can return ENOMEM as the reserver should be able to back out at that point. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-03-02 9:39 ` Vlastimil Babka @ 2015-03-02 22:31 ` Dave Chinner -1 siblings, 0 replies; 276+ messages in thread From: Dave Chinner @ 2015-03-02 22:31 UTC (permalink / raw) To: Vlastimil Babka Cc: Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, rientjes, Andrew Morton, torvalds On Mon, Mar 02, 2015 at 10:39:54AM +0100, Vlastimil Babka wrote: > On 02/23/2015 08:32 AM, Dave Chinner wrote: > >On Sun, Feb 22, 2015 at 05:29:30PM -0800, Andrew Morton wrote: > >>On Mon, 23 Feb 2015 11:45:21 +1100 Dave Chinner <david@fromorbit.com> wrote: > >> > >>Yes, as we do for __GFP_HIGH and PF_MEMALLOC etc. Add a dynamic > >>reserve. So to reserve N pages we increase the page allocator dynamic > >>reserve by N, do some reclaim if necessary then deposit N tokens into > >>the caller's task_struct (it'll be a set of zone/nr-pages tuples I > >>suppose). > >> > >>When allocating pages the caller should drain its reserves in > >>preference to dipping into the regular freelist. This guy has already > >>done his reclaim and shouldn't be penalised a second time. I guess > >>Johannes's preallocation code should switch to doing this for the same > >>reason, plus the fact that snipping a page off > >>task_struct.prealloc_pages is super-fast and needs to be done sometime > >>anyway so why not do it by default. > > > >That is at odds with the requirements of demand paging, which > >allocates for objects that are reclaimable within the course of the > >transaction. The reserve is there to ensure forward progress for > >allocations for objects that aren't freed until after the > >transaction completes, but if we drain it for reclaimable objects we > >then have nothing left in the reserve pool when we actually need it. > > > >We do not know ahead of time if the object we are allocating is > >going to be modified and hence locked into the transaction. 
Hence we > >can't say "use the reserve for this *specific* allocation", and so > >the only guidance we can really give is "we will allocate and > >*permanently consume* this much memory", and the reserve pool needs > >to cover that consumption to guarantee forwards progress. > > I'm not sure I understand properly. You don't know if a specific > allocation is permanent or reclaimable, but you can tell in advance > how much in total will be permanent? Is it because you are > conservative and assume everything will be permanent, or how? Because we know the worst case object modification constraints *exactly* (e.g. see fs/xfs/libxfs/xfs_trans_resv.c), we know exactly what in memory objects we lock into the transaction and what memory is required to modify and track those objects. e.g.: for a data extent allocation, the log reservation is as such:

/*
 * In a write transaction we can allocate a maximum of 2
 * extents. This gives:
 *    the inode getting the new extents: inode size
 *    the inode's bmap btree: max depth * block size
 *    the agfs of the ags from which the extents are allocated: 2 * sector
 *    the superblock free block counter: sector size
 *    the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size
 * And the bmap_finish transaction can free bmap blocks in a join:
 *    the agfs of the ags containing the blocks: 2 * sector size
 *    the agfls of the ags containing the blocks: 2 * sector size
 *    the super block free block counter: sector size
 *    the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size
 */
STATIC uint
xfs_calc_write_reservation(
	struct xfs_mount	*mp)
{
	return XFS_DQUOT_LOGRES(mp) +
		MAX((xfs_calc_inode_res(mp, 1) +
		     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK),
				      XFS_FSB_TO_B(mp, 1)) +
		     xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
		     xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 2),
				      XFS_FSB_TO_B(mp, 1))),
		    (xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) +
		     xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 2),
				      XFS_FSB_TO_B(mp, 1))));
}

It's trivial to extend this logic to memory allocation requirements, because the above is an exact encoding of all the objects we "permanently consume" memory for within the transaction. What we don't know is how many objects we might need to scan to find the objects we will eventually modify. Here's an (admittedly extreme) example to demonstrate a worst case scenario: allocate a 64k data extent. Because it is an exact size allocation, we look it up in the by-size free space btree. Free space is fragmented, so there are about a million 64k free space extents in the tree. Once we find the first 64k extent, we search them to find the best locality target match. The btree records are 16 bytes each, so we fit roughly 500 to a 4k block. Say we search half the extents to find the best match - i.e. we walk a thousand leaf blocks before finding the match we want, and modify that leaf block. Now, the modification removed an entry from the leaf and that triggers leaf merge thresholds, so a merge with the 1002nd block occurs. That block now demand pages in and we then modify and join it to the transaction. Now we walk back up the btree to update indexes, merging blocks all the way back up to the root. We have a worst case size btree (5 levels) and we merge at every level meaning we demand page another 8 btree blocks and modify them. In this case, we've demand paged ~1010 btree blocks, but only modified 10 of them. i.e. the memory we consumed permanently was only 10 4k buffers (approx. 10 slab and 10 page allocations), but the allocation demand was 2 orders of magnitude more than the unreclaimable memory consumption of the btree modification. I hope you start to see the scope of the problem now... > Can you at least at some later point in transaction recognize that > "OK, this object was not permanent after all" and tell mm that it > can lower your reserve? I'm not including any memory used by objects we know won't be locked into the transaction in the reserve. 
Demand paged object memory is essentially unbound but is easily reclaimable. That reclaim will give us forward progress guarantees on the memory required here. > >Yes, that's the big problem with preallocation, as well as your > >proposed "deplete the reserved memory first" approach. They > >*require* up front "preallocation" of free memory, either directly > >by the application, or internally by the mm subsystem. > > I don't see why it would deadlock, if during reserve time the mm can > return ENOMEM as the reserver should be able to back out at that > point. Preallocated reserves do not allow for unbound demand paging of reclaimable objects within reserved allocation contexts. Cheers Dave. -- Dave Chinner david@fromorbit.com
* Re: How to handle TIF_MEMDIE stalls? 2015-03-02 22:31 ` Dave Chinner @ 2015-03-03 9:13 ` Vlastimil Babka -1 siblings, 0 replies; 276+ messages in thread From: Vlastimil Babka @ 2015-03-03 9:13 UTC (permalink / raw) To: Dave Chinner Cc: Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, rientjes, Andrew Morton, torvalds On 03/02/2015 11:31 PM, Dave Chinner wrote: > On Mon, Mar 02, 2015 at 10:39:54AM +0100, Vlastimil Babka wrote: >> On 02/23/2015 08:32 AM, Dave Chinner wrote: >> >On Sun, Feb 22, 2015 at 05:29:30PM -0800, Andrew Morton wrote: >> >>On Mon, 23 Feb 2015 11:45:21 +1100 Dave Chinner <david@fromorbit.com> wrote: >> >We do not know ahead of time if the object we are allocating is >> >going to be modified and hence locked into the transaction. Hence we >> >can't say "use the reserve for this *specific* allocation", and so >> >the only guidance we can really give is "we will allocate and >> >*permanently consume* this much memory", and the reserve pool needs >> >to cover that consumption to guarantee forwards progress. >> >> I'm not sure I understand properly. You don't know if a specific >> allocation is permanent or reclaimable, but you can tell in advance >> how much in total will be permanent? Is it because you are >> conservative and assume everything will be permanent, or how? > > Because we know the worst case object modification constraints > *exactly* (e.g. see fs/xfs/libxfs/xfs_trans_resv.c), we know > exactly what in memory objects we lock into the transaction and what > memory is required to modify and track those objects. e.g.: for a > data extent allocation, the log reservation is as such:
>
> /*
>  * In a write transaction we can allocate a maximum of 2
>  * extents. This gives:
>  *    the inode getting the new extents: inode size
>  *    the inode's bmap btree: max depth * block size
>  *    the agfs of the ags from which the extents are allocated: 2 * sector
>  *    the superblock free block counter: sector size
>  *    the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size
>  * And the bmap_finish transaction can free bmap blocks in a join:
>  *    the agfs of the ags containing the blocks: 2 * sector size
>  *    the agfls of the ags containing the blocks: 2 * sector size
>  *    the super block free block counter: sector size
>  *    the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size
>  */
> STATIC uint
> xfs_calc_write_reservation(
> 	struct xfs_mount	*mp)
> {
> 	return XFS_DQUOT_LOGRES(mp) +
> 		MAX((xfs_calc_inode_res(mp, 1) +
> 		     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK),
> 				      XFS_FSB_TO_B(mp, 1)) +
> 		     xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
> 		     xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 2),
> 				      XFS_FSB_TO_B(mp, 1))),
> 		    (xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) +
> 		     xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 2),
> 				      XFS_FSB_TO_B(mp, 1))));
> }
>
> It's trivial to extend this logic to memory allocation > requirements, because the above is an exact encoding of all the > objects we "permanently consume" memory for within the transaction. > > What we don't know is how many objects we might need to scan to find > the objects we will eventually modify. Here's an (admittedly > extreme) example to demonstrate a worst case scenario: allocate a > 64k data extent. Because it is an exact size allocation, we look it > up in the by-size free space btree. Free space is fragmented, so > there are about a million 64k free space extents in the tree. > > Once we find the first 64k extent, we search them to find the best > locality target match. The btree records are 16 bytes each, so we > fit roughly 500 to a 4k block. Say we search half the extents to > find the best match - i.e. 
we walk a thousand leaf blocks before > finding the match we want, and modify that leaf block. > > Now, the modification removed an entry from the leaf and that > triggers leaf merge thresholds, so a merge with the 1002nd block > occurs. That block now demand pages in and we then modify and join > it to the transaction. Now we walk back up the btree to update > indexes, merging blocks all the way back up to the root. We have a > worst case size btree (5 levels) and we merge at every level meaning > we demand page another 8 btree blocks and modify them. > > In this case, we've demand paged ~1010 btree blocks, but only > modified 10 of them. i.e. the memory we consumed permanently was > only 10 4k buffers (approx. 10 slab and 10 page allocations), but > the allocation demand was 2 orders of magnitude more than the > unreclaimable memory consumption of the btree modification. > > I hope you start to see the scope of the problem now... Thanks, that example did help me understand your position much better. So you would need to reserve for a worst case number of the objects you modify, plus some slack for the demand-paged objects that you need to temporarily access, before you can drop and reclaim them (I suppose that in some of the tree operations, you need to be holding references to e.g. two nodes at a time, or maybe the full depth). Or maybe since all these temporary objects are potentially modifiable, it's already accounted for in the "might be modified" part. >> Can you at least at some later point in transaction recognize that >> "OK, this object was not permanent after all" and tell mm that it >> can lower your reserve? > > I'm not including any memory used by objects we know won't be locked > into the transaction in the reserve. Demand paged object memory is > essentially unbound but is easily reclaimable. That reclaim will > give us forward progress guarantees on the memory required here. 
> >> >Yes, that's the big problem with preallocation, as well as your >> >proposed "deplete the reserved memory first" approach. They >> >*require* up front "preallocation" of free memory, either directly >> >by the application, or internally by the mm subsystem. >> >> I don't see why it would deadlock, if during reserve time the mm can >> return ENOMEM as the reserver should be able to back out at that >> point. > > Preallocated reserves do not allow for unbound demand paging of > reclaimable objects within reserved allocation contexts. OK I think I get the point now. So, lots of the concerns by me and others were about the wasted memory due to reservations, and increased pressure on the rest of the system. I was thinking, are you able, at the beginning of the transaction (for this purpose, I think of a transaction as the work that starts with the memory reservation, then it cannot roll back and relies on the reserves, until it commits and frees the memory), determine whether the transaction cannot be blocked in its progress by any other transaction, and the only thing that would block it would be inability to allocate memory during its course? If that was the case, we could "share" the reserved memory for all ongoing transactions of a single class (i.e. xfs transactions). If a transaction knows it cannot be blocked by anything else, only then it passes the GFP_CAN_USE_RESERVE flag to the allocator. Once the allocator gives part of the reserve to one such transaction, it will deny the reserves to other such transactions, until the first one finishes. In practice it would be more complex of course, but it should guarantee forward progress without lots of wasted memory (maybe we wouldn't have to rely on treating clean reclaimable pages as reserve in that case, which was also pointed out to be problematic). Of course it all depends on whether you are able to determine the "guaranteed to not block". I can however easily imagine it's not possible... > Cheers > > Dave. 
* Re: How to handle TIF_MEMDIE stalls? @ 2015-03-03 9:13 ` Vlastimil Babka 0 siblings, 0 replies; 276+ messages in thread From: Vlastimil Babka @ 2015-03-03 9:13 UTC (permalink / raw) To: Dave Chinner Cc: Andrew Morton, Johannes Weiner, Tetsuo Handa, mhocko, dchinner, linux-mm, rientjes, oleg, mgorman, torvalds, xfs On 03/02/2015 11:31 PM, Dave Chinner wrote: > On Mon, Mar 02, 2015 at 10:39:54AM +0100, Vlastimil Babka wrote: >> On 02/23/2015 08:32 AM, Dave Chinner wrote: >> >On Sun, Feb 22, 2015 at 05:29:30PM -0800, Andrew Morton wrote: >> >>On Mon, 23 Feb 2015 11:45:21 +1100 Dave Chinner <david@fromorbit.com> wrote: >> >We do not know ahead of time if the object we are allocating is >> >going to modified and hence locked into the transaction. Hence we >> >can't say "use the reserve for this *specific* allocation", and so >> >the only guidance we can really give is "we will to allocate and >> >*permanently consume* this much memory", and the reserve pool needs >> >to cover that consumption to guarantee forwards progress. >> >> I'm not sure I understand properly. You don't know if a specific >> allocation is permanent or reclaimable, but you can tell in advance >> how much in total will be permanent? Is it because you are >> conservative and assume everything will be permanent, or how? > > Because we know the worst case object modification constraints > *exactly* (e.g. see fs/xfs/libxfs/xfs_trans_resv.c), we know > exactly what in memory objects we lock into the transaction and what > memory is required to modify and track those objects. e.g: for a > data extent allocation, the log reservation is as such: > > /* > * In a write transaction we can allocate a maximum of 2 > * extents. 
This gives: > * the inode getting the new extents: inode size > * the inode's bmap btree: max depth * block size > * the agfs of the ags from which the extents are allocated: 2 * sector > * the superblock free block counter: sector size > * the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size > * And the bmap_finish transaction can free bmap blocks in a join: > * the agfs of the ags containing the blocks: 2 * sector size > * the agfls of the ags containing the blocks: 2 * sector size > * the super block free block counter: sector size > * the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size > */ > STATIC uint > xfs_calc_write_reservation( > struct xfs_mount *mp) > { > return XFS_DQUOT_LOGRES(mp) + > MAX((xfs_calc_inode_res(mp, 1) + > xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK), > XFS_FSB_TO_B(mp, 1)) + > xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) + > xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 2), > XFS_FSB_TO_B(mp, 1))), > (xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) + > xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 2), > XFS_FSB_TO_B(mp, 1)))); > } > > It's trivial to extend this logic to to memory allocation > requirements, because the above is an exact encoding of all the > objects we "permanently consume" memory for within the transaction. > > What we don't know is how many objects we might need to scan to find > the objects we will eventually modify. Here's an (admittedly > extreme) example to demonstrate a worst case scenario: allocate a > 64k data extent. Because it is an exact size allocation, we look it > up in the by-size free space btree. Free space is fragmented, so > there are about a million 64k free space extents in the tree. > > Once we find the first 64k extent, we search them to find the best > locality target match. The btree records are 16 bytes each, so we > fit roughly 500 to a 4k block. Say we search half the extents to > find the best match - i.e. 
we walk a thousand leaf blocks before > finding the match we want, and modify that leaf block. > > Now, the modification removed an entry from the leaf and tht > triggers leaf merge thresholds, so a merge with the 1002nd block > occurs. That block now demand pages in and we then modify and join > it to the transaction. Now we walk back up the btree to update > indexes, merging blocks all the way back up to the root. We have a > worst case size btree (5 levels) and we merge at every level meaning > we demand page another 8 btree blocks and modify them. > > In this case, we've demand paged ~1010 btree blocks, but only > modified 10 of them. i.e. the memory we consumed permanently was > only 10 4k buffers (approx. 10 slab and 10 page allocations), but > the allocation demand was 2 orders of magnitude more than the > unreclaimable memory consumption of the btree modification. > > I hope you start to see the scope of the problem now... Thanks, that example did help me understand your position much better. So you would need to reserve for a worst case number of the objects you modify, plus some slack for the demand-paged objects that you need to temporarily access, before you can drop and reclaim them (I suppose that in some of the tree operations, you need to be holding references to e.g. two nodes at a time, or maybe the full depth). Or maybe since all these temporary objects are potentially modifiable, it's already accounted for in the "might be modified" part. >> Can you at least at some later point in transaction recognize that >> "OK, this object was not permanent after all" and tell mm that it >> can lower your reserve? > > I'm not including any memory used by objects we know won't be locked > into the transaction in the reserve. Demand paged object memory is > essentially unbound but is easily reclaimable. That reclaim will > give us forward progress guarantees on the memory required here. 
> >> >Yes, that's the big problem with preallocation, as well as your >> >proposed "deplete the reserved memory first" approach. They >> >*require* up front "preallocation" of free memory, either directly >> >by the application, or internally by the mm subsystem. >> >> I don't see why it would deadlock, if during reserve time the mm can >> return ENOMEM as the reserver should be able to back out at that >> point. > > Preallocated reserves do not allow for unbound demand paging of > reclaimable objects within reserved allocation contexts. OK I think I get the point now. So, lots of the concerns by me and others were about the wasted memory due to reservations, and increased pressure on the rest of the system. I was thinking, are you able, at the beginning of the transaction (for this purpose, I think of transaction as the work that starts with the memory reservation, then it cannot roll back and relies on the reserves, until it commits and frees the memory), determine whether the transaction cannot be blocked in its progress by any other transaction, and the only thing that would block it would be inability to allocate memory during its course? If that was the case, we could "share" the reserved memory for all ongoing transactions of a single class (i.e. xfs transactions). If a transaction knows it cannot be blocked by anything else, only then it passes the GFP_CAN_USE_RESERVE flag to the allocator. Once the allocator gives part of the reserve to one such transaction, it will deny the reserves to other such transactions, until the first one finishes. In practice it would be more complex of course, but it should guarantee forward progress without lots of wasted memory (maybe we wouldn't have to rely on treating clean reclaimable pages as reserve in that case, which was also pointed out to be problematic). Of course it all depends on whether you are able to determine the "guaranteed to not block". I can however easily imagine it's not possible... > Cheers > > Dave. 
> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-03-03 9:13 ` Vlastimil Babka @ 2015-03-04 1:33 ` Dave Chinner -1 siblings, 0 replies; 276+ messages in thread From: Dave Chinner @ 2015-03-04 1:33 UTC (permalink / raw) To: Vlastimil Babka Cc: Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, rientjes, Andrew Morton, torvalds On Tue, Mar 03, 2015 at 10:13:04AM +0100, Vlastimil Babka wrote: > On 03/02/2015 11:31 PM, Dave Chinner wrote: > > On Mon, Mar 02, 2015 at 10:39:54AM +0100, Vlastimil Babka wrote: > > > > /* > > * In a write transaction we can allocate a maximum of 2 > > * extents. This gives: > > * the inode getting the new extents: inode size > > * the inode's bmap btree: max depth * block size > > * the agfs of the ags from which the extents are allocated: 2 * sector > > * the superblock free block counter: sector size > > * the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ..... > Thanks, that example did help me understand your position much better. > So you would need to reserve for a worst case number of the objects you modify, > plus some slack for the demand-paged objects that you need to temporarily > access, before you can drop and reclaim them (I suppose that in some of the tree > operations, you need to be holding references to e.g. two nodes at a time, or > maybe the full depth). Or maybe since all these temporary objects are > potentially modifiable, it's already accounted for in the "might be modified" part. Already accounted for in the "might be modified" part. > >> Can you at least at some later point in transaction recognize that > >> "OK, this object was not permanent after all" and tell mm that it > >> can lower your reserve? > > > > I'm not including any memory used by objects we know won't be locked > > into the transaction in the reserve. 
Demand paged object memory is > > essentially unbound but is easily reclaimable. That reclaim will > > give us forward progress guarantees on the memory required here. > > > >> >Yes, that's the big problem with preallocation, as well as your > >> >proposed "deplete the reserved memory first" approach. They > >> >*require* up front "preallocation" of free memory, either directly > >> >by the application, or internally by the mm subsystem. > >> > >> I don't see why it would deadlock, if during reserve time the mm can > >> return ENOMEM as the reserver should be able to back out at that > >> point. > > > > Preallocated reserves do not allow for unbound demand paging of > > reclaimable objects within reserved allocation contexts. > > OK I think I get the point now. > > So, lots of the concerns by me and others were about the wasted memory due to > reservations, and increased pressure on the rest of the system. I was thinking, > are you able, at the beginning of the transaction (for this purpose, I think of > transaction as the work that starts with the memory reservation, then it cannot > roll back and relies on the reserves, until it commits and frees the memory), > determine whether the transaction cannot be blocked in its progress by any other > transaction, and the only thing that would block it would be inability to > allocate memory during its course? No. e.g. any transaction that requires allocation or freeing of an inode or extent can get stuck behind any other transaction that is allocating/freeing an inode/extent. And this will happen when holding inode locks, which means other transactions on that inode will then get stuck on the inode lock, and so on. Blocking dependencies within transactions are everywhere and cannot be avoided. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-03-04 1:33 ` Dave Chinner @ 2015-03-04 8:50 ` Vlastimil Babka -1 siblings, 0 replies; 276+ messages in thread From: Vlastimil Babka @ 2015-03-04 8:50 UTC (permalink / raw) To: Dave Chinner Cc: Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, rientjes, Andrew Morton, torvalds On 03/04/2015 02:33 AM, Dave Chinner wrote: > On Tue, Mar 03, 2015 at 10:13:04AM +0100, Vlastimil Babka wrote: >>> >>> Preallocated reserves do not allow for unbound demand paging of >>> reclaimable objects within reserved allocation contexts. >> >> OK I think I get the point now. >> >> So, lots of the concerns by me and others were about the wasted memory due to >> reservations, and increased pressure on the rest of the system. I was thinking, >> are you able, at the beginning of the transaction (for this purposes, I think of >> transaction as the work that starts with the memory reservation, then it cannot >> rollback and relies on the reserves, until it commits and frees the memory), >> determine whether the transaction cannot be blocked in its progress by any other >> transaction, and the only thing that would block it would be inability to >> allocate memory during its course? > > No. e.g. any transaction that requires allocation or freeing of an > inode or extent can get stuck behind any other transaction that is > allocating/freeing and inode/extent. And this will happen when > holding inode locks, which means other transactions on that inode > will then get stuck on the inode lock, and so on. Blocking > dependencies within transactions are everywhere and cannot be > avoided. Hm, I see. I thought that perhaps to avoid deadlocks between transactions (which you already have to do somehow), either the dependencies have to be structured in a way that there's always some transaction that can't block on others. Or you have a way to detect potential deadlocks before they happen, and stall somebody who tries to lock. 
Both should (at least theoretically) mean that you would be able to point to such a transaction, although I can imagine the cost of being able to do that could be prohibitive. 
* Re: How to handle TIF_MEMDIE stalls? 2015-03-04 8:50 ` Vlastimil Babka @ 2015-03-04 11:03 ` Dave Chinner -1 siblings, 0 replies; 276+ messages in thread From: Dave Chinner @ 2015-03-04 11:03 UTC (permalink / raw) To: Vlastimil Babka Cc: Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, rientjes, Andrew Morton, torvalds On Wed, Mar 04, 2015 at 09:50:58AM +0100, Vlastimil Babka wrote: > On 03/04/2015 02:33 AM, Dave Chinner wrote: > >On Tue, Mar 03, 2015 at 10:13:04AM +0100, Vlastimil Babka wrote: > >>> > >>>Preallocated reserves do not allow for unbound demand paging of > >>>reclaimable objects within reserved allocation contexts. > >> > >>OK I think I get the point now. > >> > >>So, lots of the concerns by me and others were about the wasted memory due to > >>reservations, and increased pressure on the rest of the system. I was thinking, > >>are you able, at the beginning of the transaction (for this purposes, I think of > >>transaction as the work that starts with the memory reservation, then it cannot > >>rollback and relies on the reserves, until it commits and frees the memory), > >>determine whether the transaction cannot be blocked in its progress by any other > >>transaction, and the only thing that would block it would be inability to > >>allocate memory during its course? > > > >No. e.g. any transaction that requires allocation or freeing of an > >inode or extent can get stuck behind any other transaction that is > >allocating/freeing and inode/extent. And this will happen when > >holding inode locks, which means other transactions on that inode > >will then get stuck on the inode lock, and so on. Blocking > >dependencies within transactions are everywhere and cannot be > >avoided. > > Hm, I see. 
I thought that perhaps to avoid deadlocks between > transactions (which you already have to do somehow), Of course, by following lock ordering rules, rules about holding locks over transaction reservations, allowing bulk reservations for rolling transactions that don't unlock objects between transaction commits, having allocation group ordering rules, block allocation ordering rules, transactional lock recursion support to prevent transaction deadlocking walking over objects already locked into the transaction, etc. By following those rules, we guarantee forwards progress in the transaction subsystem. If we can also guarantee forwards progress in memory allocation inside transaction context (like Irix did all those years ago :P), then we can guarantee that transactions will always complete unless there is a bug or corruption is detected during an operation... > either the > dependencies have to be structured in a way that there's always some > transaction that can't block on others. Or you have a way to detect > potential deadlocks before they happen, and stall somebody who tries > to lock. $ git grep ASSERT fs/xfs |wc -l 1716 About 3% of the code in XFS is ASSERT statements used to verify context specific state is correct in CONFIG_XFS_DEBUG=y builds. FYI, from cloc: Subsystem files blank comment code ------------------------------------------------------------------------------- fs/xfs 157 10841 25339 69140 mm/ 97 13923 25534 67870 fs/btrfs 86 14443 15097 85065 Cheers, Dave. PS: XFS userspace has another 110,000 lines of code in xfsprogs and 60,000 lines of code in xfsdump, and there's also 80,000 lines of test code in xfstests. -- Dave Chinner david@fromorbit.com 
* Re: How to handle TIF_MEMDIE stalls? 2015-03-02 22:31 ` Dave Chinner @ 2015-03-07 0:20 ` Johannes Weiner -1 siblings, 0 replies; 276+ messages in thread From: Johannes Weiner @ 2015-03-07 0:20 UTC (permalink / raw) To: Dave Chinner Cc: Tetsuo Handa, rientjes, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, Andrew Morton, torvalds, Vlastimil Babka On Tue, Mar 03, 2015 at 09:31:54AM +1100, Dave Chinner wrote: > What we don't know is how many objects we might need to scan to find > the objects we will eventually modify. Here's an (admittedly > extreme) example to demonstrate a worst case scenario: allocate a > 64k data extent. Because it is an exact size allocation, we look it > up in the by-size free space btree. Free space is fragmented, so > there are about a million 64k free space extents in the tree. > > Once we find the first 64k extent, we search them to find the best > locality target match. The btree records are 16 bytes each, so we > fit roughly 500 to a 4k block. Say we search half the extents to > find the best match - i.e. we walk a thousand leaf blocks before > finding the match we want, and modify that leaf block. > > Now, the modification removed an entry from the leaf and tht > triggers leaf merge thresholds, so a merge with the 1002nd block > occurs. That block now demand pages in and we then modify and join > it to the transaction. Now we walk back up the btree to update > indexes, merging blocks all the way back up to the root. We have a > worst case size btree (5 levels) and we merge at every level meaning > we demand page another 8 btree blocks and modify them. > > In this case, we've demand paged ~1010 btree blocks, but only > modified 10 of them. i.e. the memory we consumed permanently was > only 10 4k buffers (approx. 10 slab and 10 page allocations), but > the allocation demand was 2 orders of magnitude more than the > unreclaimable memory consumption of the btree modification. > > I hope you start to see the scope of the problem now... 
Isn't this bounded one way or another? Sure, the inaccuracy itself is high, but when you put the absolute numbers in perspective it really doesn't seem to matter: with your extreme case of 3MB per transaction, you can still run 5k+ of them in parallel on a small 16G machine. Occupy a generous 75% of RAM with anonymous pages, and you can STILL run over a thousand transactions concurrently. That would seem like a decent pipeline to keep the storage device occupied. The level of precision that you are asking for comes with complexity and fragility that I'm not convinced is necessary, or justified. 
* Re: How to handle TIF_MEMDIE stalls? 2015-03-07 0:20 ` Johannes Weiner @ 2015-03-07 3:43 ` Dave Chinner -1 siblings, 0 replies; 276+ messages in thread From: Dave Chinner @ 2015-03-07 3:43 UTC (permalink / raw) To: Johannes Weiner Cc: Tetsuo Handa, rientjes, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, Andrew Morton, torvalds, Vlastimil Babka On Fri, Mar 06, 2015 at 07:20:55PM -0500, Johannes Weiner wrote: > On Tue, Mar 03, 2015 at 09:31:54AM +1100, Dave Chinner wrote: > > What we don't know is how many objects we might need to scan to find > > the objects we will eventually modify. Here's an (admittedly > > extreme) example to demonstrate a worst case scenario: allocate a > > 64k data extent. Because it is an exact size allocation, we look it > > up in the by-size free space btree. Free space is fragmented, so > > there are about a million 64k free space extents in the tree. > > > > Once we find the first 64k extent, we search them to find the best > > locality target match. The btree records are 16 bytes each, so we > > fit roughly 500 to a 4k block. Say we search half the extents to > > find the best match - i.e. we walk a thousand leaf blocks before > > finding the match we want, and modify that leaf block. > > > > Now, the modification removed an entry from the leaf and tht > > triggers leaf merge thresholds, so a merge with the 1002nd block > > occurs. That block now demand pages in and we then modify and join > > it to the transaction. Now we walk back up the btree to update > > indexes, merging blocks all the way back up to the root. We have a > > worst case size btree (5 levels) and we merge at every level meaning > > we demand page another 8 btree blocks and modify them. > > > > In this case, we've demand paged ~1010 btree blocks, but only > > modified 10 of them. i.e. the memory we consumed permanently was > > only 10 4k buffers (approx. 
10 slab and 10 page allocations), but > > the allocation demand was 2 orders of magnitude more than the > > unreclaimable memory consumption of the btree modification. > > > > I hope you start to see the scope of the problem now... > > Isn't this bounded one way or another? For a single transaction? No. > Sure, the inaccuracy itself is > high, but when you put the absolute numbers in perspective it really > doesn't seem to matter: with your extreme case of 3MB per transaction, > you can still run 5k+ of them in parallel on a small 16G machine. No you can't. The number of concurrent transactions is bounded by the size of the log and the amount of unused space available for reservation in the log. Under heavy modification loads, that's usually somewhere between 15-25% of the log, so worst case is a few hundred megabytes. The memory reservation demand is in the same order of magnitude as the log space reservation demand..... > Occupy a generous 75% of RAM with anonymous pages, and you can STILL > run over a thousand transactions concurrently. That would seem like a > decent pipeline to keep the storage device occupied. Typical systems won't ever get to that - they don't do more than a handful of concurrent transactions at a time - the "thousands of transactions" occur on dedicated storage servers like petabyte scale NFS servers that have hundreds of gigabytes of RAM and hundreds-to-thousands of processing threads to keep the request pipeline full. The memory in those machines is entirely dedicated to the filesystem, so keeping a usable pool of a few gigabytes for transaction reservations isn't a big deal. The point here is that you're taking what I'm describing as the requirements of a reservation pool and then applying the worst case to situations where it is completely inappropriate. That's what I mean when I told Michal to stop building silly strawman situations; large amounts of concurrency are required for huge machines, not your desktop workstation. 
And, realistically, sizing that reservation pool appropriately is my problem to solve - it will depend on many factors, one of which is the actual geometry of the filesystem itself. You need to stop thinking like you can control how applications use the memory allocation and reclaim subsystem and start to trust that we will manage our memory usage appropriately to maintain maximum system throughput. After all, we already do that for all the filesystem caches the mm subsystem doesn't control - why do you think I have had such an interest in shrinker scalability? For XFS, the only cache we actually don't control reclaim from is user data in the page cache - we control everything else directly from custom shrinkers..... > The level of precision that you are asking for comes with complexity > and fragility that I'm not convinced is necessary, or justified. Look, if you don't think reservations will work, then how about you suggest something that will. I don't really care what you implement, as long as it meets the needs of demand paging, I have direct control over memory usage and concurrency policy and the allocation mechanism guarantees forward progress without needing the OOM killer. Cheers, Dave. -- Dave Chinner david@fromorbit.com 
* Re: How to handle TIF_MEMDIE stalls? @ 2015-03-07 3:43 ` Dave Chinner 0 siblings, 0 replies; 276+ messages in thread From: Dave Chinner @ 2015-03-07 3:43 UTC (permalink / raw) To: Johannes Weiner Cc: Vlastimil Babka, Andrew Morton, Tetsuo Handa, mhocko, dchinner, linux-mm, rientjes, oleg, mgorman, torvalds, xfs On Fri, Mar 06, 2015 at 07:20:55PM -0500, Johannes Weiner wrote: > On Tue, Mar 03, 2015 at 09:31:54AM +1100, Dave Chinner wrote: > > What we don't know is how many objects we might need to scan to find > > the objects we will eventually modify. Here's an (admittedly > > extreme) example to demonstrate a worst case scenario: allocate a > > 64k data extent. Because it is an exact size allocation, we look it > > up in the by-size free space btree. Free space is fragmented, so > > there are about a million 64k free space extents in the tree. > > > > Once we find the first 64k extent, we search them to find the best > > locality target match. The btree records are 16 bytes each, so we > > fit roughly 500 to a 4k block. Say we search half the extents to > > find the best match - i.e. we walk a thousand leaf blocks before > > finding the match we want, and modify that leaf block. > > > > Now, the modification removed an entry from the leaf and tht > > triggers leaf merge thresholds, so a merge with the 1002nd block > > occurs. That block now demand pages in and we then modify and join > > it to the transaction. Now we walk back up the btree to update > > indexes, merging blocks all the way back up to the root. We have a > > worst case size btree (5 levels) and we merge at every level meaning > > we demand page another 8 btree blocks and modify them. > > > > In this case, we've demand paged ~1010 btree blocks, but only > > modified 10 of them. i.e. the memory we consumed permanently was > > only 10 4k buffers (approx. 
10 slab and 10 page allocations), but > > the allocation demand was 2 orders of magnitude more than the > > unreclaimable memory consumption of the btree modification. > > > > I hope you start to see the scope of the problem now... > > Isn't this bounded one way or another? For a single transaction? No. > Sure, the inaccuracy itself is > high, but when you put the absolute numbers in perspective it really > doesn't seem to matter: with your extreme case of 3MB per transaction, > you can still run 5k+ of them in parallel on a small 16G machine. No you can't. The number of concurrent transactions is bounded by the size of the log and the amount of unused space available for reservation in the log. Under heavy modification loads, that's usually somewhere between 15-25% of the log, so worst case is a few hundred megabytes. The memory reservation demand is in the same order of magnitude as the log space reservation demand..... > Occupy a generous 75% of RAM with anonymous pages, and you can STILL > run over a thousand transactions concurrently. That would seem like a > decent pipeline to keep the storage device occupied. Typical systems won't ever get to that - they don't do more than a handful of concurrent transactions at a time - the "thousands of transactions" occur on dedicated storage servers like petabyte scale NFS servers that have hundreds of gigabytes of RAM and hundreds-to-thousands of processing threads to keep the request pipeline full. The memory in those machines is entirely dedicated to the filesystem, so keeping a usable pool of a few gigabytes for transaction reservations isn't a big deal. The point here is that you're taking what I'm describing as the requirements of a reservation pool and then applying the worst case to situations where it is completely inappropriate. That's what I mean when I told Michal to stop building silly strawman situations; large amounts of concurrency are required for huge machines, not your desktop workstation. 
And, realistically, sizing that reservation pool appropriately is my problem to solve - it will depend on many factors, one of which is the actual geometry of the filesystem itself. You need to stop thinking like you can control how applications use the memory allocation and reclaim subsystem and start to trust us to manage our memory usage appropriately to maintain maximum system throughput. After all, we already do that for all the filesystem caches the mm subsystem doesn't control - why do you think I have had such an interest in shrinker scalability? For XFS, the only cache we actually don't control reclaim from is user data in the page cache - we control everything else directly from custom shrinkers..... > The level of precision that you are asking for comes with complexity > and fragility that I'm not convinced is necessary, or justified. Look, if you don't think reservations will work, then how about you suggest something that will. I don't really care what you implement, as long as it meets the needs of demand paging, I have direct control over memory usage and concurrency policy, and the allocation mechanism guarantees forward progress without needing the OOM killer. Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-03-07 3:43 ` Dave Chinner @ 2015-03-07 15:08 ` Johannes Weiner -1 siblings, 0 replies; 276+ messages in thread From: Johannes Weiner @ 2015-03-07 15:08 UTC (permalink / raw) To: Dave Chinner Cc: Tetsuo Handa, rientjes, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, Andrew Morton, torvalds, Vlastimil Babka On Sat, Mar 07, 2015 at 02:43:47PM +1100, Dave Chinner wrote: > On Fri, Mar 06, 2015 at 07:20:55PM -0500, Johannes Weiner wrote: > > On Tue, Mar 03, 2015 at 09:31:54AM +1100, Dave Chinner wrote: > > > What we don't know is how many objects we might need to scan to find > > > the objects we will eventually modify. Here's an (admittedly > > > extreme) example to demonstrate a worst case scenario: allocate a > > > 64k data extent. Because it is an exact size allocation, we look it > > > up in the by-size free space btree. Free space is fragmented, so > > > there are about a million 64k free space extents in the tree. > > > > > > Once we find the first 64k extent, we search them to find the best > > > locality target match. The btree records are 16 bytes each, so we > > > fit roughly 500 to a 4k block. Say we search half the extents to > > > find the best match - i.e. we walk a thousand leaf blocks before > > > finding the match we want, and modify that leaf block. > > > > > > Now, the modification removed an entry from the leaf and tht > > > triggers leaf merge thresholds, so a merge with the 1002nd block > > > occurs. That block now demand pages in and we then modify and join > > > it to the transaction. Now we walk back up the btree to update > > > indexes, merging blocks all the way back up to the root. We have a > > > worst case size btree (5 levels) and we merge at every level meaning > > > we demand page another 8 btree blocks and modify them. > > > > > > In this case, we've demand paged ~1010 btree blocks, but only > > > modified 10 of them. i.e. 
the memory we consumed permanently was > > > only 10 4k buffers (approx. 10 slab and 10 page allocations), but > > > the allocation demand was 2 orders of magnitude more than the > > > unreclaimable memory consumption of the btree modification. > > > > > > I hope you start to see the scope of the problem now... > > > > Isn't this bounded one way or another? > > Fo a single transaction? No. So you can have an infinite number of allocations in the context of a transaction, and only the objects that are going to be locked in are bounded? > > Sure, the inaccuracy itself is > > high, but when you put the absolute numbers in perspective it really > > doesn't seem to matter: with your extreme case of 3MB per transaction, > > you can still run 5k+ of them in parallel on a small 16G machine. > > No you can't. The number of concurrent transactions is bounded by > the size of the log and the amount of unused space available for > reservation in the log. Under heavy modification loads, that's > usually somewhere between 15-25% of the log, so worst case is a few > hundred megabytes. The memory reservation demand is in the same > order of magnitude as the log space reservation demand..... > > > Occupy a generous 75% of RAM with anonymous pages, and you can STILL > > run over a thousand transactions concurrently. That would seem like a > > decent pipeline to keep the storage device occupied. > > Typical systems won't ever get to that - they don't do more than a > handful of current transactions at a time - the "thousands of > transactions" occur on dedicated storage servers like petabyte scale > NFS servers that have hundreds of gigabytes of RAM and > hundreds-to-thousands of processing threads to keep the request > pipeline full. The memory in those machines is entirely dedicated to > the filesystem, so keeping a usuable pool of a few gigabytes for > transaction reservations isn't a big deal. 
> > The point here is that you're taking what I'm describing as the > requirements of a reservation pool and then applying the worst case > to situations where completely inappropriate. That's what I mean > when I told Michal to stop building silly strawman situations; large > amounts of concurrency are required for huge machines, not your > desktop workstation. Why do you have to take everything I say in bad faith and choose to be smug instead of constructive? This is unnecessary. OF COURSE you know your constraints better than we do. Now explain how they matter in practice, because that's what dictates the design in engineering. I'm trying to figure out your requirements to find the simplest model, and yes I'm obviously going to follow up when you give me incomplete information. I'm responding to this: : What we don't know is how many objects we might need to scan to find : the objects we will eventually modify. Here's an (admittedly : extreme) example to demonstrate a worst case scenario: You gave us numbers that you called "worst case", so I took them and put them in a scenario where it looks like memory wouldn't be the bottleneck in real life, even if we just had simple pre-allocation semantics. If it was a silly example, why not provide a better one? I'm fine with reservations and I'm fine with adding more complexity when you demonstrate that it's needed. Your argument seems to have been that worst-case estimates are way off, but can you please just demonstrate why it matters in practice? Instead of having me do it and calling my attempts strawman arguments? I can only guess your constraints; it's up to you to make a case for your requirements. Here is another example where you responded to akpm: --- > When allocating pages the caller should drain its reserves in > preference to dipping into the regular freelist. This guy has already > done his reclaim and shouldn't be penalised a second time. 
I guess > Johannes's preallocation code should switch to doing this for the same > reason, plus the fact that snipping a page off > task_struct.prealloc_pages is super-fast and needs to be done sometime > anyway so why not do it by default. That is at odds with the requirements of demand paging, which allocate for objects that are reclaimable within the course of the transaction. The reserve is there to ensure forward progress for allocations for objects that aren't freed until after the transaction completes, but if we drain it for reclaimable objects we then have nothing left in the reserve pool when we actually need it. We do not know ahead of time if the object we are allocating is going to modified and hence locked into the transaction. Hence we can't say "use the reserve for this *specific* allocation", and so the only guidance we can really give is "we will to allocate and *permanently consume* this much memory", and the reserve pool needs to cover that consumption to guarantee forwards progress. Forwards progress for all other allocations is guaranteed because they are reclaimable objects - they either freed directly back to their source (slab, heap, page lists) or they are freed by shrinkers once they have been released from the transaction. Hence we need allocations to come from the free list and trigger reclaim, regardless of the fact there is a reserve pool there. The reserve pool needs to be a last resort once there are no other avenues to allocate memory. i.e. it would be used to replace the OOM killer for GFP_NOFAIL allocations. --- Andrew makes a proposal and backs it up with real life benefits: simpler, faster. You on the other hand follow up with a list of unfounded claims and your only counter-argument really seems to be that Andrew's proposal differs from what you've had in mind. What you had in mind was obviously driven by constraints known to you, but it's not an argument until you actually include them. 
We're not taking your claims at face value, that's not how this ever works. Just explain why and how your requirements, demand paging reserves in this case, matter in real life. Then we can take them seriously. > And, realistically, sizing that reservation pool appropriately is my > problem to solve - it will depend on many factors, one of which is > the actual geometry of the filesystem itself. You need to stop > thinking like you can control how application use the memory > allocation and reclaim subsystem and start to trust we will our > memory usage appropriately to maintain maximum system throughput. You've been working on the kernel long enough to know that this is not how it goes. I don't care about getting a list of things you claim you need and implementing them blindly, trusting that you know what you're doing when it comes to memory. If you want us to expose an interface, which puts constraints on our implementation, then you better provide justification for every single requirement. > After all, we already do that for all the filesystem caches the mm > subsystem doesn't control - why do you think I have had such an > interest in shrinker scalability? For XFS, the only cache we > actually don't control reclaim from is user data in the page cache - > we control everything else directly from custom shrinkers..... You mean those global object pools that are aged through unrelated and independent per-zone pressure values? Look, we are specialized in different subsystems, which means we know the details in front of us better than the details in the surrounding areas. You are quick to dismiss constraints and scalability concerns in the memory subsystem, and I do the same for memory users. We are having this discussion in order to explore where our problem spaces intersect, and we could be making more progress if you stopped assuming that everybody else is an idiot and you already found the perfect solution. 
We need data on your parameters in order to make a basic cost-benefit analysis of any proposed solutions. Don't just propose something and talk down to us when we ask for clarifications on your constraints. It's not getting us anywhere. Explore the problem space with us, explain your constraints and exact requirements based on real life data, and then we can look for potential solutions. That is how we evaluate every single proposal for the kernel, and it's how it's going to work in this case. It's not that complicated. > > The level of precision that you are asking for comes with complexity > > and fragility that I'm not convinced is necessary, or justified. > > Look, if you dont think reservations will work, then how about you > suggest something that will. I don't really care what you implement, > as long as it meets the needs of demand paging, I have direct > control over memory usage and concurrency policy and the allocation > mechanism guarantees forward progress without needing the OOM > killer. Reservations are fine and I also want them to replace the OOM killer, we agree on that. The only thing my email was about was that, in light of the worst-case numbers you quoted, it didn't look like the demand paging requirement is strictly necessary to make the system work in practice, which is why I'm questioning that particular requirement and prompting you to clarify your position. You have yet to address this. Until then, the simplest semantics are preallocation semantics, where you in advance establish private reserve pools (which can be backed by clean cache) from which you allocate directly using __GFP_RESERVE. If the pool is empty it's immediately detectable and attributable to the culprit, and the other reserves are not impacted by it. A globally shared demand-paged pool is much more fragile because you trust other participants in the system to keep their promise and not pin more objects than they reserved for. 
Otherwise, they deadlock your transaction and corrupt your userdata. How does "XFS filesystem corrupted because it shares its emergency memory pool to ensure data integrity with some buggy driver" sound to you? It's also harder to verify. If one of the participants misbehaves and pins more objects than they initially reserved for, how do we identify the culprit when the system locks up? Make an actual case why preallocation semantics are unworkable on real systems with real memory and real filesystems and real data on them, then we can consider making the model more complex and fragile. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-23 7:32 ` Dave Chinner @ 2015-03-02 20:22 ` Johannes Weiner -1 siblings, 0 replies; 276+ messages in thread From: Johannes Weiner @ 2015-03-02 20:22 UTC (permalink / raw) To: Dave Chinner Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, Andrew Morton, torvalds On Mon, Feb 23, 2015 at 06:32:35PM +1100, Dave Chinner wrote: > On Sun, Feb 22, 2015 at 05:29:30PM -0800, Andrew Morton wrote: > > When allocating pages the caller should drain its reserves in > > preference to dipping into the regular freelist. This guy has already > > done his reclaim and shouldn't be penalised a second time. I guess > > Johannes's preallocation code should switch to doing this for the same > > reason, plus the fact that snipping a page off > > task_struct.prealloc_pages is super-fast and needs to be done sometime > > anyway so why not do it by default. > > That is at odds with the requirements of demand paging, which > allocate for objects that are reclaimable within the course of the > transaction. The reserve is there to ensure forward progress for > allocations for objects that aren't freed until after the > transaction completes, but if we drain it for reclaimable objects we > then have nothing left in the reserve pool when we actually need it. > > We do not know ahead of time if the object we are allocating is > going to modified and hence locked into the transaction. Hence we > can't say "use the reserve for this *specific* allocation", and so > the only guidance we can really give is "we will to allocate and > *permanently consume* this much memory", and the reserve pool needs > to cover that consumption to guarantee forwards progress. > > Forwards progress for all other allocations is guaranteed because > they are reclaimable objects - they either freed directly back to > their source (slab, heap, page lists) or they are freed by shrinkers > once they have been released from the transaction. 
> > Hence we need allocations to come from the free list and trigger > reclaim, regardless of the fact there is a reserve pool there. The > reserve pool needs to be a last resort once there are no other > avenues to allocate memory. i.e. it would be used to replace the OOM > killer for GFP_NOFAIL allocations. That won't work. Clean cache can be temporarily unavailable and off-LRU for several reasons - compaction, migration, pending page promotion, other reclaimers. How often are we trying before we dip into the reserve pool? As you have noticed, the OOM killer goes off seemingly prematurely at times, and the reason for that is that we simply don't KNOW the exact point when we ran out of reclaimable memory. We cannot take an atomic snapshot of all zones, of all nodes, of all tasks running in order to determine this reliably, we have to approximate it. That's why OOM is defined as "we have scanned a great many pages and couldn't free any of them." So unless you tell us which allocations should come from previously declared reserves, and which ones should rely on reclaim and may fail, the reserves can deplete prematurely and we're back to square one. > > And to make it much worse, how > > many pages of which orders? Bless its heart, slub will go and use > > a 1-order page for allocations which should have been in 0-order > > pages.. It can always fall back to the minimum order. > The majority of allocations will be order-0, though if we know that > they are going to be significant numbers of high order allocations, > then it should be simple enough to tell the mm subsystem "need a > reserve of 32 order-0, 4 order-1 and 1 order-3 allocations" and have > memory compaction just do it's stuff. But, IMO, we should cross that > bridge when somebody actually needs reservations to be that > specific.... Compaction can be at an impasse for the same reasons mentioned above. 
It cannot just stop_machine() to guarantee it can assemble a higher order page from a bunch of in-use order-0 cache pages. If you need higher-order allocations in a transaction, you have to pre-allocate. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs
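The two allocation orderings argued over above — Andrew's "drain your reserves first" and Dave's "reserve as last resort" — can be contrasted with a toy userspace model. This is purely illustrative: the struct and function names are invented, and the integer counters stand in for the page freelist and a per-transaction reserve; it is not kernel code.

```c
/* Toy model of the two orderings (all names invented for illustration). */
struct resv_pool {
    int freelist;   /* pages obtainable via the normal freelist/reclaim */
    int reserve;    /* pages set aside when the transaction started */
};

/* Dave's ordering: the reserve is a last resort, so reclaimable
 * allocations never touch it and it is still intact when an allocation
 * that is permanently consumed by the transaction finally needs it. */
static int alloc_last_resort(struct resv_pool *p)
{
    if (p->freelist > 0) { p->freelist--; return 0; }
    if (p->reserve > 0)  { p->reserve--;  return 0; }
    return -1;  /* nothing left: this is where the OOM killer would run */
}

/* Andrew's ordering: drain the reserve first; reclaimable allocations
 * consume it, so it can be empty when it is actually needed. */
static int alloc_drain_first(struct resv_pool *p)
{
    if (p->reserve > 0)  { p->reserve--;  return 0; }
    if (p->freelist > 0) { p->freelist--; return 0; }
    return -1;
}
```

With the last-resort ordering, two reclaimable allocations leave the reserve untouched; with the drain-first ordering, the very first allocation empties it — which is exactly Dave's objection.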
* Re: How to handle TIF_MEMDIE stalls? 2015-03-02 20:22 ` Johannes Weiner @ 2015-03-02 23:12 ` Dave Chinner -1 siblings, 0 replies; 276+ messages in thread From: Dave Chinner @ 2015-03-02 23:12 UTC (permalink / raw) To: Johannes Weiner Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, Andrew Morton, torvalds On Mon, Mar 02, 2015 at 03:22:28PM -0500, Johannes Weiner wrote: > On Mon, Feb 23, 2015 at 06:32:35PM +1100, Dave Chinner wrote: > > On Sun, Feb 22, 2015 at 05:29:30PM -0800, Andrew Morton wrote: > > > When allocating pages the caller should drain its reserves in > > > preference to dipping into the regular freelist. This guy has already > > > done his reclaim and shouldn't be penalised a second time. I guess > > > Johannes's preallocation code should switch to doing this for the same > > > reason, plus the fact that snipping a page off > > > task_struct.prealloc_pages is super-fast and needs to be done sometime > > > anyway so why not do it by default. > > > > That is at odds with the requirements of demand paging, which > > allocate for objects that are reclaimable within the course of the > > transaction. The reserve is there to ensure forward progress for > > allocations for objects that aren't freed until after the > > transaction completes, but if we drain it for reclaimable objects we > > then have nothing left in the reserve pool when we actually need it. > > > > We do not know ahead of time if the object we are allocating is > > going to modified and hence locked into the transaction. Hence we > > can't say "use the reserve for this *specific* allocation", and so > > the only guidance we can really give is "we will to allocate and > > *permanently consume* this much memory", and the reserve pool needs > > to cover that consumption to guarantee forwards progress. 
> > > > Forwards progress for all other allocations is guaranteed because > > they are reclaimable objects - they either freed directly back to > > their source (slab, heap, page lists) or they are freed by shrinkers > > once they have been released from the transaction. > > > > Hence we need allocations to come from the free list and trigger > > reclaim, regardless of the fact there is a reserve pool there. The > > reserve pool needs to be a last resort once there are no other > > avenues to allocate memory. i.e. it would be used to replace the OOM > > killer for GFP_NOFAIL allocations. > > That won't work. I don't see why not... > Clean cache can be temporarily unavailable and > off-LRU for several reasons - compaction, migration, pending page > promotion, other reclaimers. How often are we trying before we dip > into the reserve pool? As you have noticed, the OOM killer goes off > seemingly prematurely at times, and the reason for that is that we > simply don't KNOW the exact point when we ran out of reclaimable > memory. Sure, but that's irrelevant to the problem at hand. At some point, the MM subsystem is going to decide "we're at OOM" - it's *what happens next* that matters. > We cannot take an atomic snapshot of all zones, of all nodes, > of all tasks running in order to determine this reliably, we have to > approximate it. That's why OOM is defined as "we have scanned a great > many pages and couldn't free any of them." Yes, and reserve pools *do not change* the logic that leads to that decision. What changes is that we don't "kick the OOM killer", instead we "allocate from the reserve pool." The reserve pool *replaces* the OOM killer as a method of guaranteeing forwards allocation progress for those subsystems that can use reservations. If there is no reserve pool for the current task, then you can still kick the OOM killer.... 
> So unless you tell us which allocations should come from previously > declared reserves, and which ones should rely on reclaim and may fail, > the reserves can deplete prematurely and we're back to square one. Like the OOM killer, filesystems are not omnipotent and are not perfect. Requiring us to be so is entirely unreasonable, and is *entirely unnecessary* from the POV of the mm subsystem. Reservations give the mm subsystem a *strong model* for guaranteeing forwards allocation progress, and it can be independently verified and tested without having to care about how some subsystem uses it. The mm subsystem supplies the *mechanism*, and mm developers are entirely focussed around ensuring the mechanism works and is verifiable. i.e. you could write some debug kernel module to exercise, verify and regression test the model behaviour, which is something that simply cannot be done with the OOM killer. Reservation sizes required by a subsystem are *policy*. They are not a problem the mm subsystem needs to be concerned with as the subsystem has to get the reservations right for the mechanism to work. i.e. Managing reservation sizes is my responsibility as a subsystem maintainer, just like it's currently my responsibility for ensuring that transient ENOMEM conditions don't result in a filesystem shutdown.... > Compaction can be at an impasse for the same reasons mentioned above. > It can not just stop_machine() to guarantee it can assemble a higher > order page from a bunch of in-use order-0 cache pages. If you need > higher-order allocations in a transaction, you have to pre-allocate. It's much simpler just to use order-0 reservations and vmalloc if we can't get high order allocations. We already do this in most places where high order allocations are required, so there's really no change needed here. ;) Cheers, Dave. 
-- Dave Chinner david@fromorbit.com
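Dave's closing suggestion — fall back from high-order allocations to order-0 reservations plus vmalloc — is the shape the kernel later formalised as kvmalloc(). Here is a minimal userspace analogue of that fallback pattern; malloc() stands in for both allocation paths, and the exhaustion flag is an invented stand-in for "compaction could not assemble a high-order page".

```c
#include <stdlib.h>

/* Invented flag simulating high-order page exhaustion (fragmentation). */
static int highorder_exhausted;

/* Simulated physically-contiguous (high-order) fast path: fails when
 * no high-order pages can be assembled. */
static void *try_highorder_alloc(size_t size)
{
    if (highorder_exhausted)
        return NULL;
    return malloc(size);
}

/* The fallback shape: prefer the contiguous fast path, then fall back
 * to an order-0-backed path (malloc() standing in for vmalloc()), so
 * the caller never depends on high-order pages being available. */
static void *alloc_with_fallback(size_t size)
{
    void *p = try_highorder_alloc(size);
    if (p)
        return p;
    return malloc(size);
}
```

The point of the pattern is that the caller's forward progress only ever requires order-0 pages, which is what makes order-0-only reserves sufficient.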
* Re: How to handle TIF_MEMDIE stalls? 2015-03-02 23:12 ` Dave Chinner @ 2015-03-03 2:50 ` Johannes Weiner -1 siblings, 0 replies; 276+ messages in thread From: Johannes Weiner @ 2015-03-03 2:50 UTC (permalink / raw) To: Dave Chinner Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, Andrew Morton, torvalds On Tue, Mar 03, 2015 at 10:12:06AM +1100, Dave Chinner wrote: > On Mon, Mar 02, 2015 at 03:22:28PM -0500, Johannes Weiner wrote: > > On Mon, Feb 23, 2015 at 06:32:35PM +1100, Dave Chinner wrote: > > > On Sun, Feb 22, 2015 at 05:29:30PM -0800, Andrew Morton wrote: > > > > When allocating pages the caller should drain its reserves in > > > > preference to dipping into the regular freelist. This guy has already > > > > done his reclaim and shouldn't be penalised a second time. I guess > > > > Johannes's preallocation code should switch to doing this for the same > > > > reason, plus the fact that snipping a page off > > > > task_struct.prealloc_pages is super-fast and needs to be done sometime > > > > anyway so why not do it by default. > > > > > > That is at odds with the requirements of demand paging, which > > > allocate for objects that are reclaimable within the course of the > > > transaction. The reserve is there to ensure forward progress for > > > allocations for objects that aren't freed until after the > > > transaction completes, but if we drain it for reclaimable objects we > > > then have nothing left in the reserve pool when we actually need it. > > > > > > We do not know ahead of time if the object we are allocating is > > > going to modified and hence locked into the transaction. Hence we > > > can't say "use the reserve for this *specific* allocation", and so > > > the only guidance we can really give is "we will to allocate and > > > *permanently consume* this much memory", and the reserve pool needs > > > to cover that consumption to guarantee forwards progress. 
> > > > > > Forwards progress for all other allocations is guaranteed because > > > they are reclaimable objects - they either freed directly back to > > > their source (slab, heap, page lists) or they are freed by shrinkers > > > once they have been released from the transaction. > > > > > > Hence we need allocations to come from the free list and trigger > > > reclaim, regardless of the fact there is a reserve pool there. The > > > reserve pool needs to be a last resort once there are no other > > > avenues to allocate memory. i.e. it would be used to replace the OOM > > > killer for GFP_NOFAIL allocations. > > > > That won't work. > > I don't see why not... > > > Clean cache can be temporarily unavailable and > > off-LRU for several reasons - compaction, migration, pending page > > promotion, other reclaimers. How often are we trying before we dip > > into the reserve pool? As you have noticed, the OOM killer goes off > > seemingly prematurely at times, and the reason for that is that we > > simply don't KNOW the exact point when we ran out of reclaimable > > memory. > > Sure, but that's irrelevant to the problem at hand. At some point, > the Mm subsystem is going to decide "we're at OOM" - it's *what > happens next* that matters. It's not irrelevant at all. That point is an arbitrary magic number that is a byproduct of many implementation details and concurrency in the memory management layer. It's completely fine to tie allocations which can fail to this point, but you can't reasonably calibrate your emergency reserves, which are supposed to guarantee progress, to such an unpredictable variable. When you reserve based on the share of allocations that you know will be unreclaimable, you are assuming that all other allocations will be reclaimable, and that is simply flawed. There is so much concurrency in the MM subsystem that you can't reasonably expect a single scanner instance to recover the majority of theoretically reclaimable memory. 
> > We cannot take an atomic snapshot of all zones, of all nodes, > > of all tasks running in order to determine this reliably, we have to > > approximate it. That's why OOM is defined as "we have scanned a great > > many pages and couldn't free any of them." > > Yes, and reserve pools *do not change* the logic that leads to that > decision. What changes is that we don't "kick the OOM killer", > instead we "allocate from the reserve pool." The reserve pool > *replaces* the OOM killer as a method of guaranteeing forwards > allocation progress for those subsystems that can use reservations. In order to replace the OOM killer in its role as progress guarantee, the reserves can't run dry during the transaction. Because what are we going to do in that case? > If there is no reserve pool for the current task, then you can still > kick the OOM killer.... ... so we are not actually replacing the OOM killer, we just defer it with reserves that were calibrated to an anecdotal snapshot of a fuzzy quantity of reclaim activity? Is the idea here to just pile sh*tty, mostly-working mechanisms on top of each other in the hope that one of them will kick things along just enough to avoid locking up? > > So unless you tell us which allocations should come from previously > > declared reserves, and which ones should rely on reclaim and may fail, > > the reserves can deplete prematurely and we're back to square one. > > Like the OOM killer, filesystems are not omnipotent and are not > perfect. Requiring us to be so is entirely unreasonable, and is > *entirely unnecessary* from the POV of the mm subsystem. > > Reservations give the mm subsystem a *strong model* for guaranteeing > forwards allocation progress, and it can be independently verified > and tested without having to care about how some subsystem uses it. > The mm subsystem supplies the *mechanism*, and mm developers are > entirely focussed around ensuring the mechanism works and is > verifiable. i.e. 
you could write some debug kernel module to > exercise, verify and regression test the model behaviour, which is > something that simply cannot be done with the OOM killer. > > Reservation sizes required by a subsystem are *policy*. They are not > a problem the mm subsystem needs to be concerned with as the > subsystem has to get the reservations right for the mechanism to > work. i.e. Managing reservation sizes is my responsibility as a > subsystem maintainer, just like it's currently my responsibility for > ensuring that transient ENOMEM conditions don't result in a > filesystem shutdown.... Anything that depends on the point at which the memory management system gives up reclaiming pages is not verifiable in the slightest. It will vary from kernel to kernel, from workload to workload, from run to run. It will regress in the blink of an eye.
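Johannes's definition of OOM — "we have scanned a great many pages and couldn't free any of them" — is a progress heuristic rather than a precise condition, which is the crux of his objection to calibrating reserves against it. A toy model of such a heuristic (the scan budget is an invented constant, not the kernel's actual retry logic):

```c
/* Toy model: OOM is declared not at a knowable "no reclaimable memory
 * left" instant, but after a scan budget elapses with no progress.
 * OOM_SCAN_BUDGET is invented for illustration. */
#define OOM_SCAN_BUDGET 6

struct reclaim_state {
    int scans_without_progress;
};

/* Called after each scan batch; 'freed' is how many pages the batch
 * recovered. Any progress resets the budget, so the point at which OOM
 * is declared depends on timing and concurrency between reclaimers,
 * not on a fixed amount of remaining reclaimable memory. */
static int should_declare_oom(struct reclaim_state *rs, int freed)
{
    if (freed > 0) {
        rs->scans_without_progress = 0;
        return 0;
    }
    return ++rs->scans_without_progress >= OOM_SCAN_BUDGET;
}
```

Because a single freed page resets the budget, the declaration point moves with whatever else the MM subsystem happens to be doing — which is why Johannes argues reserves sized against it cannot be verified.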
* Re: How to handle TIF_MEMDIE stalls? 2015-03-03 2:50 ` Johannes Weiner @ 2015-03-04 6:52 ` Dave Chinner -1 siblings, 0 replies; 276+ messages in thread From: Dave Chinner @ 2015-03-04 6:52 UTC (permalink / raw) To: Johannes Weiner Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, Andrew Morton, torvalds On Mon, Mar 02, 2015 at 09:50:23PM -0500, Johannes Weiner wrote: > On Tue, Mar 03, 2015 at 10:12:06AM +1100, Dave Chinner wrote: > > On Mon, Mar 02, 2015 at 03:22:28PM -0500, Johannes Weiner wrote: > > > On Mon, Feb 23, 2015 at 06:32:35PM +1100, Dave Chinner wrote: > > > > On Sun, Feb 22, 2015 at 05:29:30PM -0800, Andrew Morton wrote: > > > > > When allocating pages the caller should drain its reserves in > > > > > preference to dipping into the regular freelist. This guy has already > > > > > done his reclaim and shouldn't be penalised a second time. I guess > > > > > Johannes's preallocation code should switch to doing this for the same > > > > > reason, plus the fact that snipping a page off > > > > > task_struct.prealloc_pages is super-fast and needs to be done sometime > > > > > anyway so why not do it by default. > > > > > > > > That is at odds with the requirements of demand paging, which > > > > allocates for objects that are reclaimable within the course of the > > > > transaction. The reserve is there to ensure forward progress for > > > > allocations for objects that aren't freed until after the > > > > transaction completes, but if we drain it for reclaimable objects we > > > > then have nothing left in the reserve pool when we actually need it. > > > > > > > > We do not know ahead of time if the object we are allocating is > > > > going to be modified and hence locked into the transaction. 
Hence we > > > > can't say "use the reserve for this *specific* allocation", and so > > > > the only guidance we can really give is "we will allocate and > > > > *permanently consume* this much memory", and the reserve pool needs > > > > to cover that consumption to guarantee forwards progress. > > > > > > > > Forwards progress for all other allocations is guaranteed because > > > > they are reclaimable objects - they are either freed directly back to > > > > their source (slab, heap, page lists) or they are freed by shrinkers > > > > once they have been released from the transaction. > > > > > > > > Hence we need allocations to come from the free list and trigger > > > > reclaim, regardless of the fact there is a reserve pool there. The > > > > reserve pool needs to be a last resort once there are no other > > > > avenues to allocate memory. i.e. it would be used to replace the OOM > > > > killer for GFP_NOFAIL allocations. > > > > > > That won't work. > > > > I don't see why not... > > > > > Clean cache can be temporarily unavailable and > > > off-LRU for several reasons - compaction, migration, pending page > > > promotion, other reclaimers. How often are we trying before we dip > > > into the reserve pool? As you have noticed, the OOM killer goes off > > > seemingly prematurely at times, and the reason for that is that we > > > simply don't KNOW the exact point when we ran out of reclaimable > > > memory. > > > > Sure, but that's irrelevant to the problem at hand. At some point, > > the MM subsystem is going to decide "we're at OOM" - it's *what > > happens next* that matters. > > It's not irrelevant at all. That point is an arbitrary magic number > that is a byproduct of many implementation details and concurrency in > the memory management layer. It's completely fine to tie allocations > which can fail to this point, but you can't reasonably calibrate your > emergency reserves, which are supposed to guarantee progress, to such > an unpredictable variable. 
> > When you reserve based on the share of allocations that you know will > be unreclaimable, you are assuming that all other allocations will be > reclaimable, and that is simply flawed. There is so much concurrency > in the MM subsystem that you can't reasonably expect a single scanner > instance to recover the majority of theoretically reclaimable memory. On one hand you say "memory accounting is unreliable, so detecting OOM is unreliable, and so we have an unreliable trigger point." On the other hand you say "single scanner instance can't reclaim all memory", again stating we have an unreliable trigger point. On the gripping hand, that unreliable trigger point is what kicks the OOM killer. Yet you consider that point to be reliable enough to kick the OOM killer, but too unreliable to trigger allocation from a reserve pool? Say what? I suspect you've completely misunderstood what I've been suggesting. By definition, we have the pages we reserved in the reserve pool, and unless we've exhausted that reservation with permanent allocations we should always be able to allocate from it. If the pool got emptied by demand page allocations, then we back off and retry reclaim until the reclaimable objects are released back into the reserve pool. i.e. reclaim fills reserve pools first, then when they are full pages can go back on free lists for normal allocations. This provides the mechanism for forwards progress, and it's essentially the same mechanism that mempools use to guarantee forwards progress. The only difference is that reserve pool refilling comes through reclaim via shrinker invocation... In reality, though, I don't really care how the mm subsystem implements that pool as long as it handles the cases I've described (e.g. http://oss.sgi.com/archives/xfs/2015-03/msg00039.html). I don't think we're making progress here, anyway, so unless you come up with some other solution this thread is going to die here.... -Dave. 
-- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-03-04 6:52 ` Dave Chinner @ 2015-03-04 15:04 ` Johannes Weiner -1 siblings, 0 replies; 276+ messages in thread From: Johannes Weiner @ 2015-03-04 15:04 UTC (permalink / raw) To: Dave Chinner Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, Andrew Morton, torvalds On Wed, Mar 04, 2015 at 05:52:42PM +1100, Dave Chinner wrote: > I suspect you've completely misunderstood what I've been suggesting. > > By definition, we have the pages we reserved in the reserve pool, > and unless we've exhausted that reservation with permanent > allocations we should always be able to allocate from it. If the > pool got emptied by demand page allocations, then we back off and > retry reclaim until the reclaimable objects are released back into > the reserve pool. i.e. reclaim fills reserve pools first, then when > they are full pages can go back on free lists for normal > allocations. This provides the mechanism for forwards progress, and > it's essentially the same mechanism that mempools use to guarantee > forwards progess. the only difference is that reserve pool refilling > comes through reclaim via shrinker invocation... Yes, I had something else in mind. In order to rely on replenishing through reclaim, you have to make sure that all allocations taken out of the pool are guaranteed to come back in a reasonable time frame. So once Ted said that the filesystem will not be able to declare which allocations of a task are allowed to dip into its reserves, and thus allocations of indefinite lifetime can enter the picture, my mind went to a one-off reserve pool that doesn't rely on replenishing in order to make forward progress. You declare the worst-case, finish the transaction, and return what is left of the reserves. This obviously conflicts with the estimation model that you are proposing, I hope it's now clear where our misunderstanding lies. 
Yes, we can make this work if you can tell us which allocations have limited/controllable lifetime. ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-03-04 15:04 ` Johannes Weiner @ 2015-03-04 17:38 ` Theodore Ts'o -1 siblings, 0 replies; 276+ messages in thread From: Theodore Ts'o @ 2015-03-04 17:38 UTC (permalink / raw) To: Johannes Weiner Cc: Tetsuo Handa, rientjes, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, Andrew Morton, torvalds On Wed, Mar 04, 2015 at 10:04:36AM -0500, Johannes Weiner wrote: > Yes, we can make this work if you can tell us which allocations have > limited/controllable lifetime. It may be helpful to be a bit precise about definitions here. There are a number of different object lifetimes: a) will be released before the kernel thread returns control to userspace b) will be released once the current I/O operation finishes. (In the case of nbd where the remote server has unexpectedly gone away, this might be quite a while, but I'm not sure how much we care about that scenario) c) can be trivially released if the mm subsystem asks via calling a shrinker d) can be released only after doing some amount of bounded work (e.g., cleaning a dirty page) e) impossible to predict when it can be released (e.g., dcache, inodes attached to open file descriptors, buffer heads that won't be freed until the file system is umounted, etc.) I'm guessing that what you mean is (b), but what about cases such as (c)? Would the mm subsystem find it helpful if it had more information about object lifetime? For example, the CMA folks seem to really care about knowing whether memory allocations fall in category (e) or not. - Ted ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-03-04 17:38 ` Theodore Ts'o @ 2015-03-04 23:17 ` Dave Chinner -1 siblings, 0 replies; 276+ messages in thread From: Dave Chinner @ 2015-03-04 23:17 UTC (permalink / raw) To: Theodore Ts'o Cc: Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, rientjes, Andrew Morton, torvalds On Wed, Mar 04, 2015 at 12:38:41PM -0500, Theodore Ts'o wrote: > On Wed, Mar 04, 2015 at 10:04:36AM -0500, Johannes Weiner wrote: > > Yes, we can make this work if you can tell us which allocations have > > limited/controllable lifetime. > > It may be helpful to be a bit precise about definitions here. There > are a number of different object lifetimes: > > a) will be released before the kernel thread returns control to > userspace > > b) will be released once the current I/O operation finishes. (In the > case of nbd where the remote server has unexpectedly gone away, this might be > quite a while, but I'm not sure how much we care about that scenario) > > c) can be trivially released if the mm subsystem asks via calling a > shrinker > > d) can be released only after doing some amount of bounded work (e.g., > cleaning a dirty page) > > e) impossible to predict when it can be released (e.g., dcache, inodes > attached to open file descriptors, buffer heads that won't be freed > until the file system is umounted, etc.) > > > I'm guessing that what you mean is (b), but what about cases such as > (c)? The thing is, in the XFS transaction case we are hitting e) for every allocation, and only after IO and/or some processing do we know whether it will fall into c), d) or whether it will be permanently consumed. > Would the mm subsystem find it helpful if it had more information > about object lifetime? For example, the CMA folks seem to really care > about knowing whether memory allocations fall in category (e) or not. The problem is that most filesystem allocations fall into category (e). 
Worse is that the state of an object can change without allocations having taken place, e.g. an object on a reclaimable LRU can be found via a cache lookup, then joined to and modified in a transaction. Hence objects can change state from "reclaimable" to "permanently consumed" without actually going through memory reclaim and allocation. IOWs, what is really required is the ability to say "this amount of allocation reserve is now consumed" /some time after/ we've done the allocation. i.e. when we join the object to the transaction and modify it, that's when we need to be able to reduce the reservation limit as that memory is now permanently consumed by the transaction context. Objects that fall into c) and d) don't need to have anything special done, because reclaim will eventually free the memory they hold once the allocating context releases them. Indeed, this model works even when we find those c) and d) objects in cache rather than allocating them. They would get correctly accounted as "consumed reserve" because we no longer need to allocate that memory in transaction context and so that reserve can be released back to the free pool.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-23 0:45 ` Dave Chinner @ 2015-02-28 16:29 ` Johannes Weiner -1 siblings, 0 replies; 276+ messages in thread From: Johannes Weiner @ 2015-02-28 16:29 UTC (permalink / raw) To: Dave Chinner Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds On Mon, Feb 23, 2015 at 11:45:21AM +1100, Dave Chinner wrote: > On Sat, Feb 21, 2015 at 06:52:27PM -0500, Johannes Weiner wrote: > > On Fri, Feb 20, 2015 at 09:52:17AM +1100, Dave Chinner wrote: > > > I will actively work around anything that causes filesystem memory > > > pressure to increase the chance of oom killer invocations. The OOM > > > killer is not a solution - it is, by definition, a loose cannon and > > > so we should be reducing dependencies on it. > > > > Once we have a better-working alternative, sure. > > Great, but first a simple request: please stop writing code and > instead start architecting a solution to the problem. i.e. we need a > design and have that documented before code gets written. If you > watched my recent LCA talk, then you'll understand what I mean > when I say: stop programming and start engineering. This code was for the sake of argument, see below. > > > I really don't care about the OOM Killer corner cases - it's > > > completely the wrong line of development to be spending time on > > > and you aren't going to convince me otherwise. The OOM killer is a > > > crutch used to justify having a memory allocation subsystem that > > > can't provide forward progress guarantee mechanisms to callers that > > > need it. > > > > We can provide this. Are all these callers able to preallocate? > > Anything that allocates in transaction context (and therefore is > GFP_NOFS by definition) can preallocate at transaction reservation > time. However, preallocation is dumb, complex, CPU and memory > intensive and will have a *massive* impact on performance. 
> Allocating 10-100 pages to a reserve which we will almost *never > use* and then free them again *on every single transaction* is a lot > of unnecessary additional fast path overhead. Hence a "preallocate > for every context" reserve pool is not a viable solution. You are missing the point of my question. Whether we allocate right away or make sure the memory is allocatable later on is a matter of cost, but the logical outcome is the same. That is not my concern right now. An OOM killer allows transactional allocation sites to get away without planning ahead. You are arguing that the OOM killer is a cop-out on the MM side, but I see it as the opposite: it puts a lot of complexity in the allocator so that callsites can maneuver themselves into situations where they absolutely need to get memory - or corrupt user data - without actually making sure their needs will be covered. If we replace __GFP_NOFAIL + OOM killer with a reserve system, we are putting the full responsibility on the user. Are you sure this is going to reduce our kernel-wide error rate? > And, really, "reservation" != "preallocation". That's an implementation detail. Yes, the example implementation was dumb and heavy-handed, but a reservation system that works based on watermarks, and considers clean cache readily allocatable, is not much more complex than that. I'm trying to figure out if the current nofail allocators can get their memory needs figured out beforehand. And reliably so - what good are estimates that are right 90% of the time, when failing the allocation means corrupting user data? What is the contingency plan? ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls?
  2015-02-28 16:29 ` Johannes Weiner
@ 2015-02-28 16:41 ` Theodore Ts'o
  -1 siblings, 0 replies; 276+ messages in thread
From: Theodore Ts'o @ 2015-02-28 16:41 UTC (permalink / raw)
To: Johannes Weiner
Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds

On Sat, Feb 28, 2015 at 11:29:43AM -0500, Johannes Weiner wrote:
>
> I'm trying to figure out if the current nofail allocators can get
> their memory needs figured out beforehand. And reliably so - what
> good are estimates that are right 90% of the time, when failing the
> allocation means corrupting user data? What is the contingency plan?

In the ideal world, we can figure out the exact memory needs beforehand. But we live in an imperfect world, and given that block devices *also* need memory, the answer is "of course not". We can't be perfect. But we can at least give some kind of hint, and we can offer to wait before we get into a situation where we need to loop in GFP_NOWAIT --- which is the contingency/fallback plan.

I'm sure that's not very satisfying, but it's better than what we have now.

- Ted
* Re: How to handle TIF_MEMDIE stalls?
  2015-02-28 16:41 ` Theodore Ts'o
@ 2015-02-28 22:15 ` Johannes Weiner
  -1 siblings, 0 replies; 276+ messages in thread
From: Johannes Weiner @ 2015-02-28 22:15 UTC (permalink / raw)
To: Theodore Ts'o
Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds

On Sat, Feb 28, 2015 at 11:41:58AM -0500, Theodore Ts'o wrote:
> On Sat, Feb 28, 2015 at 11:29:43AM -0500, Johannes Weiner wrote:
> >
> > I'm trying to figure out if the current nofail allocators can get
> > their memory needs figured out beforehand. And reliably so - what
> > good are estimates that are right 90% of the time, when failing the
> > allocation means corrupting user data? What is the contingency plan?
>
> In the ideal world, we can figure out the exact memory needs
> beforehand. But we live in an imperfect world, and given that block
> devices *also* need memory, the answer is "of course not". We can't
> be perfect. But we can at least give some kind of hint, and we can offer
> to wait before we get into a situation where we need to loop in
> GFP_NOWAIT --- which is the contingency/fallback plan.

Overestimating should be fine, the result would be a bit of false memory pressure. But underestimating and looping can't be an option or the original lockups will still be there. We need to guarantee forward progress or the problem is somewhat mitigated at best - only now with quite a bit more complexity in the allocator and the filesystems.

The block code would have to be looked at separately, but doesn't it already use mempools etc. to guarantee progress?
* Re: How to handle TIF_MEMDIE stalls?
  2015-02-28 22:15 ` Johannes Weiner
@ 2015-03-01 11:17 ` Tetsuo Handa
  -1 siblings, 0 replies; 276+ messages in thread
From: Tetsuo Handa @ 2015-03-01 11:17 UTC (permalink / raw)
To: hannes, tytso
Cc: dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, fernando_b1, torvalds

Johannes Weiner wrote:
> On Sat, Feb 28, 2015 at 11:41:58AM -0500, Theodore Ts'o wrote:
> > On Sat, Feb 28, 2015 at 11:29:43AM -0500, Johannes Weiner wrote:
> > >
> > > I'm trying to figure out if the current nofail allocators can get
> > > their memory needs figured out beforehand. And reliably so - what
> > > good are estimates that are right 90% of the time, when failing the
> > > allocation means corrupting user data? What is the contingency plan?
> >
> > In the ideal world, we can figure out the exact memory needs
> > beforehand. But we live in an imperfect world, and given that block
> > devices *also* need memory, the answer is "of course not". We can't
> > be perfect. But we can at least give some kind of hint, and we can offer
> > to wait before we get into a situation where we need to loop in
> > GFP_NOWAIT --- which is the contingency/fallback plan.
>
> Overestimating should be fine, the result would be a bit of false memory
> pressure. But underestimating and looping can't be an option or the
> original lockups will still be there. We need to guarantee forward
> progress or the problem is somewhat mitigated at best - only now with
> quite a bit more complexity in the allocator and the filesystems.
>
> The block code would have to be looked at separately, but doesn't it
> already use mempools etc. to guarantee progress?
>

If underestimating is tolerable, can we simply set different watermark levels for GFP_ATOMIC / GFP_NOIO / GFP_NOFS / GFP_KERNEL allocations? For example,

  GFP_KERNEL (or above) can fail if memory usage exceeds 95%
  GFP_NOFS can fail if memory usage exceeds 97%
  GFP_NOIO can fail if memory usage exceeds 98%
  GFP_ATOMIC can fail if memory usage exceeds 99%

It sounds strange to me that the order-0 GFP_NOIO allocation below enters a retry-forever loop as soon as GFP_KERNEL (or above) allocations start waiting for reclaim. Use of the same watermark prevents kernel worker threads from processing the workqueue. While it is legal to do blocking operations from a workqueue, being blocked forever monopolizes the workqueue; other jobs in the workqueue get stuck.

[  907.302050] kworker/1:0     R  running task        0 10832      2 0x00000080
[  907.303961] Workqueue: events_freezable_power_ disk_events_workfn
[  907.305706]  ffff88007c8ab7d8 0000000000000046 ffff88007c8ab8a0 ffff88007c894190
[  907.307761]  0000000000012500 ffff88007c8abfd8 0000000000012500 ffff88007c894190
[  907.309894]  0000000000000020 ffff88007c8ab8b0 0000000000000002 ffffffff81848408
[  907.311949] Call Trace:
[  907.312989]  [<ffffffff8159f814>] _cond_resched+0x24/0x40
[  907.314578]  [<ffffffff81122119>] shrink_slab+0x139/0x150
[  907.316182]  [<ffffffff811252bf>] do_try_to_free_pages+0x35f/0x4d0
[  907.317889]  [<ffffffff811254c4>] try_to_free_pages+0x94/0xc0
[  907.319535]  [<ffffffff8111a793>] __alloc_pages_nodemask+0x4e3/0xa40
[  907.321259]  [<ffffffff8115a8ce>] alloc_pages_current+0x8e/0x100
[  907.322945]  [<ffffffff8125bed6>] bio_copy_user_iov+0x1d6/0x380
[  907.324606]  [<ffffffff8125e4cd>] ? blk_rq_init+0xed/0x160
[  907.326196]  [<ffffffff8125c119>] bio_copy_kern+0x49/0x100
[  907.327788]  [<ffffffff810a14a0>] ? prepare_to_wait_event+0x100/0x100
[  907.329549]  [<ffffffff81265e6f>] blk_rq_map_kern+0x6f/0x130
[  907.331184]  [<ffffffff8116393e>] ? kmem_cache_alloc+0x48e/0x4b0
[  907.332877]  [<ffffffff813a66cf>] scsi_execute+0x12f/0x160
[  907.334452]  [<ffffffff813a7f14>] scsi_execute_req_flags+0x84/0xf0
[  907.336156]  [<ffffffffa01e29cc>] sr_check_events+0xbc/0x2e0 [sr_mod]
[  907.337893]  [<ffffffff8109834c>] ? put_prev_entity+0x2c/0x3b0
[  907.339539]  [<ffffffffa01d6177>] cdrom_check_events+0x17/0x30 [cdrom]
[  907.341289]  [<ffffffffa01e2e5d>] sr_block_check_events+0x2d/0x30 [sr_mod]
[  907.343115]  [<ffffffff812701c6>] disk_check_events+0x56/0x1b0
[  907.344771]  [<ffffffff81270331>] disk_events_workfn+0x11/0x20
[  907.346421]  [<ffffffff8107ceaf>] process_one_work+0x13f/0x370
[  907.348057]  [<ffffffff8107de99>] worker_thread+0x119/0x500
[  907.349650]  [<ffffffff8107dd80>] ? rescuer_thread+0x350/0x350
[  907.351295]  [<ffffffff81082f7c>] kthread+0xdc/0x100
[  907.352765]  [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0
[  907.354520]  [<ffffffff815a383c>] ret_from_fork+0x7c/0xb0
[  907.356097]  [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0

When I changed GFP_NOIO in scsi_execute() to GFP_ATOMIC, the above trace went away. If we reserve some amount of memory for the block / filesystem layer rather than letting non-critical allocations consume it, the above trace will likely go away as well. Or, instead, maybe we can change GFP_NOIO to do the steps

  (1) try allocation using GFP_ATOMIC|GFP_NOWARN
  (2) try allocating from a freelist for GFP_NOIO
  (3) fail the allocation with a warning message

if we can implement a freelist for GFP_NOIO. Ditto for GFP_NOFS.
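[Editor's note: the tiered-watermark idea above is easy to model in isolation - each GFP class gets its own failure threshold, so GFP_NOIO still has headroom after GFP_KERNEL has started failing. The sketch below is a userspace toy, not kernel code; the enum names are invented and the percentages are the ones proposed in the mail.]

```c
/* Toy model of per-GFP-class failure thresholds.  A more restricted
 * allocation context is allowed to push usage closer to 100% before
 * its allocations may fail. */
enum gfp_class {
	GFP_CLASS_KERNEL,	/* __GFP_FS | __GFP_IO: fail earliest */
	GFP_CLASS_NOFS,
	GFP_CLASS_NOIO,
	GFP_CLASS_ATOMIC,	/* most restricted: fail last */
};

static int alloc_may_fail(enum gfp_class c, unsigned int pct_used)
{
	static const unsigned int limit[] = {
		[GFP_CLASS_KERNEL] = 95,
		[GFP_CLASS_NOFS]   = 97,
		[GFP_CLASS_NOIO]   = 98,
		[GFP_CLASS_ATOMIC] = 99,
	};
	return pct_used > limit[c];
}
```

At 96% usage a GFP_KERNEL request may already fail while GFP_NOFS, GFP_NOIO, and GFP_ATOMIC still succeed, which is exactly the headroom the disk_events_workfn trace above was missing.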
* Re: How to handle TIF_MEMDIE stalls?
  2015-03-01 11:17 ` Tetsuo Handa
@ 2015-03-06 11:53 ` Tetsuo Handa
  -1 siblings, 0 replies; 276+ messages in thread
From: Tetsuo Handa @ 2015-03-06 11:53 UTC (permalink / raw)
To: david
Cc: tytso, hannes, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, fernando_b1, torvalds

Tetsuo Handa wrote:
> If underestimating is tolerable, can we simply set different watermark
> levels for GFP_ATOMIC / GFP_NOIO / GFP_NOFS / GFP_KERNEL allocations?
> For example,
>
> GFP_KERNEL (or above) can fail if memory usage exceeds 95%
> GFP_NOFS can fail if memory usage exceeds 97%
> GFP_NOIO can fail if memory usage exceeds 98%
> GFP_ATOMIC can fail if memory usage exceeds 99%
>
> I think that below order-0 GFP_NOIO allocation enters into retry-forever loop
> when GFP_KERNEL (or above) allocation starts waiting for reclaim sounds
> strange. Use of same watermark is preventing kernel worker threads from
> processing workqueue. While it is legal to do blocking operation from
> workqueue, being blocked forever is an exclusive occupation for workqueue;
> other jobs in the workqueue get stuck.
>

The experimental patch below, which raises the zone watermark for allocations that carry __GFP_FS / __GFP_IO, works for me.

----------
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d77432..92233e1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1710,6 +1710,7 @@ struct task_struct {
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 	unsigned long task_state_change;
 #endif
+	gfp_t gfp_mask;
 };

 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7abfa70..1a6b830 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1810,6 +1810,12 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
 		min -= min / 2;
 	if (alloc_flags & ALLOC_HARDER)
 		min -= min / 4;
+	if (min == mark) {
+		if (current->gfp_mask & __GFP_FS)
+			min <<= 1;
+		if (current->gfp_mask & __GFP_IO)
+			min <<= 1;
+	}
 #ifdef CONFIG_CMA
 	/* If allocation can't use CMA areas don't use free CMA pages */
 	if (!(alloc_flags & ALLOC_CMA))
@@ -2810,6 +2816,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		.nodemask = nodemask,
 		.migratetype = gfpflags_to_migratetype(gfp_mask),
 	};
+	gfp_t orig_gfp_mask;

 	gfp_mask &= gfp_allowed_mask;

@@ -2831,6 +2838,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	if (IS_ENABLED(CONFIG_CMA) && ac.migratetype == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;

+	orig_gfp_mask = current->gfp_mask;
+	current->gfp_mask = gfp_mask;
 retry_cpuset:
 	cpuset_mems_cookie = read_mems_allowed_begin();

@@ -2873,6 +2882,7 @@ out:
 	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
 		goto retry_cpuset;

+	current->gfp_mask = orig_gfp_mask;
 	return page;
 }
 EXPORT_SYMBOL(__alloc_pages_nodemask);
----------

Thanks again to Jonathan Corbet for writing https://lwn.net/Articles/635354/ .

Is Dave Chinner's "reservations" suggestion conceptually doing what the patch above does? Dave's suggestion is to ask each GFP_NOFS and GFP_NOIO user to estimate how many pages they need for their transaction, like

	if (min == mark) {
		if (current->gfp_mask & __GFP_FS)
			min += atomic_read(&reservation_for_gfp_fs);
		if (current->gfp_mask & __GFP_IO)
			min += atomic_read(&reservation_for_gfp_io);
	}

rather than asking the administrator to specify a static amount, like

	if (min == mark) {
		if (current->gfp_mask & __GFP_FS)
			min += sysctl_reservation_for_gfp_fs;
		if (current->gfp_mask & __GFP_IO)
			min += sysctl_reservation_for_gfp_io;
	}

?

The retry-forever loop will still happen if we underestimate, won't it? Then, how do we handle it when the OOM killer missed the target (due to __GFP_FS) or the OOM killer cannot be invoked (due to !__GFP_FS)?
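[Editor's note: the two watermark adjustments compared above can be reduced to pure functions for side-by-side inspection. This is a userspace toy, not the patch itself: the flag bits are invented stand-ins for __GFP_FS / __GFP_IO, and the reservation amounts are arbitrary. Note that raising `min` for callers that *have* FS/IO means unrestricted GFP_KERNEL allocations hit the wall first, leaving headroom for GFP_NOFS and GFP_NOIO.]

```c
/* Toy stand-ins for __GFP_FS / __GFP_IO. */
#define TOY_GFP_FS (1u << 0)
#define TOY_GFP_IO (1u << 1)

/* The experimental patch: double the watermark once per FS/IO bit
 * present, so a GFP_KERNEL caller (both bits) sees 4x the floor and a
 * GFP_NOIO caller (neither bit) sees the unmodified floor. */
static unsigned long min_doubling(unsigned long min, unsigned int gfp)
{
	if (gfp & TOY_GFP_FS)
		min <<= 1;
	if (gfp & TOY_GFP_IO)
		min <<= 1;
	return min;
}

/* The "reservations" variant: add an explicit page count per class,
 * accumulated from per-transaction estimates (or a sysctl). */
static unsigned long min_reservation(unsigned long min, unsigned int gfp,
				     unsigned long resv_fs,
				     unsigned long resv_io)
{
	if (gfp & TOY_GFP_FS)
		min += resv_fs;
	if (gfp & TOY_GFP_IO)
		min += resv_io;
	return min;
}
```

The doubling version needs no estimates but scales the headroom with the watermark rather than with actual demand; the reservation version tracks demand but inherits the underestimation problem the mail ends on.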
* Re: How to handle TIF_MEMDIE stalls?
  2015-02-28 22:15 ` Johannes Weiner
@ 2015-03-01 13:43 ` Theodore Ts'o
  -1 siblings, 0 replies; 276+ messages in thread
From: Theodore Ts'o @ 2015-03-01 13:43 UTC (permalink / raw)
To: Johannes Weiner
Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds

On Sat, Feb 28, 2015 at 05:15:58PM -0500, Johannes Weiner wrote:
> Overestimating should be fine, the result would be a bit of false memory
> pressure. But underestimating and looping can't be an option or the
> original lockups will still be there. We need to guarantee forward
> progress or the problem is somewhat mitigated at best - only now with
> quite a bit more complexity in the allocator and the filesystems.

We've lived with looping as it is, and in practice it has actually worked well. I can only speak for ext4, but I do a lot of testing under very high memory pressure situations, and it is used in *production* under very high stress situations --- and the only time we've run into trouble is when the looping behaviour somehow got accidentally *removed*.

There have been MM experts who have been worrying about this situation for a very long time, but honestly, it seems to be much more of a theoretical than an actual concern.

So if you don't want to get hints/estimates about how much memory the file system is about to use, when the file system is willing to wait or even potentially return ENOMEM (although I suspect starting to return ENOMEM where most user space applications don't expect it will cause more problems), I'm personally happy to just use GFP_NOFAIL everywhere --- or to hard code my own infinite loops if the MM developers want to take GFP_NOFAIL away. Because in my experience, looping simply hasn't been as awful as some folks on this thread have made it out to be.

So if you don't like the complexity because the perfect is the enemy of the good, we can just drop this and the file systems can simply continue to loop around their memory allocation calls... or if that fails we can start adding subsystem-specific mempools, which would be even more wasteful of memory and probably at least as complicated.

- Ted
* Re: How to handle TIF_MEMDIE stalls? 2015-03-01 13:43 ` Theodore Ts'o @ 2015-03-01 16:15 ` Johannes Weiner -1 siblings, 0 replies; 276+ messages in thread From: Johannes Weiner @ 2015-03-01 16:15 UTC (permalink / raw) To: Theodore Ts'o Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds On Sun, Mar 01, 2015 at 08:43:22AM -0500, Theodore Ts'o wrote: > On Sat, Feb 28, 2015 at 05:15:58PM -0500, Johannes Weiner wrote: > > Overestimating should be fine, the result would a bit of false memory > > pressure. But underestimating and looping can't be an option or the > > original lockups will still be there. We need to guarantee forward > > progress or the problem is somewhat mitigated at best - only now with > > quite a bit more complexity in the allocator and the filesystems. > > We've lived with looping as it is and in practice it's actually worked > well. I can only speak for ext4, but I do a lot of testing under very > high memory pressure situations, and it is used in *production* under > very high stress situations --- and the only time we'e run into > trouble is when the looping behaviour somehow got accidentally > *removed*. > > There have been MM experts who have been worrying about this situation > for a very long time, but honestly, it seems to be much more of a > theoretical than actual concern. Well, looping is a valid thing to do in most situations because on a loaded system there is a decent chance that an unrelated thread will volunteer some unreclaimable memory, or exit altogether. Right now, we rely on this happening, and it works most of the time. Maybe all the time, depending on how your machine is used. But when it does't, machines do lock up in practice. We had these lockups in cgroups with just a handful of threads, which all got stuck in the allocator and there was nobody left to volunteer unreclaimable memory. 
When this was being addressed, we knew that the same can theoretically happen on the system-level but weren't aware of any reports. Well now, here we are. It's been argued in this thread that systems shouldn't be pushed to such extremes in real life and that we simply expect failure at some point. If that's the consensus, then yes, we can stop this and tell users that they should scale back. But I'm not convinced just yet that this is the best we can do. > So if you don't want to get hints/estimates about how much memory > the file system is about to use, when the file system is willing to > wait or even potentially return ENOMEM (although I suspect starting > to return ENOMEM where most user space application don't expect it > will cause more problems), I'm personally happy to just use > GFP_NOFAIL everywhere --- or to hard code my own infinite loops if > the MM developers want to take GFP_NOFAIL away. Because in my > experience, looping simply hasn't been as awful as some folks on > this thread have made it out to be. As I've said before, I'd be happy to get estimates from the filesystem so that we can adjust our reserves, instead of simply running against the wall at some point and hoping that the OOM killer heuristics will save the day. Until then, I'd much prefer __GFP_NOFAIL over open-coded loops. If the OOM killer is too aggressive, we can tone it down, but as it stands that mechanism is the last attempt at forward progress if looping doesn't work out. In addition, when we finally transition to private memory reserves, we can easily find the callsites that need to be annotated with __GFP_MAY_DIP_INTO_PRIVATE_RESERVES. > So if you don't like the complexity because the perfect is the enemy > of the good, we can just drop this and the file systems can simply > continue to loop around their memory allocation calls... 
or if that > fails we can start adding subsystem specific mempools, which would be > even more wasteful of memory and probably at least as complicated. It really depends on what the goal here is. You don't have to be perfectly accurate, but if you can give us a worst-case estimate we can actually guarantee forward progress and eliminate these lockups entirely, like in the block layer. Sure, there will be bugs and the estimates won't be right from the start, but we can converge towards the right answer. If the allocations which are allowed to dip into the reserves - the current nofail sites? - can be annotated with a gfp flag, we can easily verify the estimates by serving those sites exclusively from the private reserve pool and emit warnings when that runs dry. We wouldn't even have to stress the system for that. But there are legitimate concerns that this might never work. For example, the requirements could be so unpredictable, or assessing them with reasonable accuracy could be so expensive, that the margin of error would make the worst case estimate too big to be useful. Big enough that the reserves would harm well-behaved systems. And if useful worst-case estimates are unattainable, I don't think we need to bother with reserves. We can just stick with looping and OOM killing, that works most of the time, too. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 276+ messages in thread
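The verification scheme Johannes sketches (serve the annotated nofail sites exclusively from a private reserve and warn when it runs dry) could look roughly like this in userspace C; the flag name `MY_GFP_RESERVE` and the byte accounting are invented for illustration and are not kernel internals:

```c
#include <stdio.h>
#include <stdlib.h>

/* Illustrative sketch: only annotated callers may draw on the private
 * reserve.  Serving them exclusively from it checks the subsystem's
 * worst-case estimate: if the reserve runs dry, the estimate was too
 * low, and we warn instead of deadlocking silently. */
#define MY_GFP_RESERVE 0x1   /* stand-in for a real gfp flag */

static size_t reserve_bytes;     /* remaining private reserve */

static void reserve_init(size_t worst_case_estimate)
{
    reserve_bytes = worst_case_estimate;
}

static void *reserve_alloc(size_t size, unsigned flags)
{
    if (!(flags & MY_GFP_RESERVE))
        return malloc(size);         /* ordinary allocation path */

    if (size > reserve_bytes) {
        /* An annotated site needed more than the subsystem declared. */
        fprintf(stderr, "reserve exhausted: estimate too low\n");
        return NULL;
    }
    reserve_bytes -= size;
    return malloc(size);
}
```

As the message notes, this kind of accounting lets the estimates be validated without having to stress the system at all.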
* Re: How to handle TIF_MEMDIE stalls? 2015-03-01 16:15 ` Johannes Weiner @ 2015-03-01 19:36 ` Theodore Ts'o -1 siblings, 0 replies; 276+ messages in thread From: Theodore Ts'o @ 2015-03-01 19:36 UTC (permalink / raw) To: Johannes Weiner Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds On Sun, Mar 01, 2015 at 11:15:06AM -0500, Johannes Weiner wrote: > > We had these lockups in cgroups with just a handful of threads, which > all got stuck in the allocator and there was nobody left to volunteer > unreclaimable memory. When this was being addressed, we knew that the > same can theoretically happen on the system-level but weren't aware of > any reports. Well now, here we are. I think the "few threads in a small cgroup" problem is a little different, because in those cases very often the global system has enough memory, and there is always the possibility that we might relax the memory cgroup guarantees a little in order to allow forward progress. In fact, arguably this *is* the right thing to do, because we have situations where (a) the VFS takes the directory mutex, (b) the directory blocks have been pushed out of memory, and so (c) a system call running in a container with a small amount of memory and/or a small amount of disk bandwidth allowed via its prop I/O settings ends up taking a very long time for the directory blocks to be read into memory. If a high priority process, like say a cluster management daemon, also tries to read the same directory, it can end up stalled for long enough for the software watchdog to take out the entire machine from the cluster. The hard problem here is that the lock is taken by the VFS, *before* it calls into the file system specific layer, and so the VFS has no idea (a) how much memory or disk bandwidth it needs, and (b) whether it needs any memory or disk bandwidth in the first place in order to service a directory lookup operation (most of the time, it doesn't). 
So there may be situations where in the restricted cgroup, it would be useful for the file system to be able to say, "you know, we're holding onto a lock, and given that the disk controller is going to force this low-priority cgroup to wait over a minute for the I/O to even be queued out to the disk, maybe we should make an exception and bust the disk controller cgroup cap". (There is a related problem where a cgroup with a low disk bandwidth quota is slowing down writeback, and we are desperately short on global memory, and where relaxing the disk bandwidth limit via some kind of priority inheritance scheme would prevent "innocent" high-priority cgroups from having some of their processes get OOM-killed. I suppose one could claim that the high priority cgroups tend to belong to the sysadmin, who set the stupid disk bandwidth caps in the first place, so there is a certain justice in having the high priority processes getting OOM killed, but still, it would be nice if we could do the right thing automatically.) But in any case, some of these workarounds, where we relax a particularly tightly constrained cgroup limit, are obviously not going to help when the entire system is low on memory. > It really depends on what the goal here is. You don't have to be > perfectly accurate, but if you can give us a worst-case estimate we > can actually guarantee forward progress and eliminate these lockups > entirely, like in the block layer. Sure, there will be bugs and the > estimates won't be right from the start, but we can converge towards > the right answer. If the allocations which are allowed to dip into > the reserves - the current nofail sites? - can be annotated with a gfp > flag, we can easily verify the estimates by serving those sites > exclusively from the private reserve pool and emit warnings when that > runs dry. We wouldn't even have to stress the system for that. > > But there are legitimate concerns that this might never work. 
For > example, the requirements could be so unpredictable, or assessing them > with reasonable accuracy could be so expensive, that the margin of > error would make the worst case estimate too big to be useful. Big > enough that the reserves would harm well-behaved systems. And if > useful worst-case estimates are unattainable, I don't think we need to > bother with reserves. We can just stick with looping and OOM killing, > that works most of the time, too. I'm not sure that you want to reserve for the worst-case. What might work is if subsystems (probably primarily file systems) give you estimates for the usual case and the worst case, and you reserve for something in between these two bounds. In practice there will be a huge number of file system operations taking place in your typical super-busy system, and if you reserve for the worst case, it probably will be too much. We need to make sure there is enough memory available for some forward progress, and if we need to stall a few operations with some sleeping loops, so be it. So the "heads up" amounts don't have to be strict reservations in the sense that the memory will be available instantly without any sleeping or looping. I would also suggest that "reservations" be tied to a task struct and not to some magic __GFP_* flag, since it's not just allocations done by the file system, but also by the block device drivers, and if certain write operations fail, the results will be catastrophic. The block device can't tell the difference between an I/O operation that must succeed (or else we declare the file system as needing manual recovery and potentially reboot the entire system) and an I/O operation whose failure could be handled by reflecting ENOMEM back up to userspace. 
The difference is a property of the call stack, so the simplest way of handling this is to store the reservation in the task struct, and let the reservation get automatically returned to the system when a particular process makes a transition from kernel space to user space. The bottom line is that I agree that looping and OOM-killing works most of the time, and so I'm happy with something that makes life a little bit better and a little bit more predictable for the VM, if that makes the system behave a bit more smoothly under high memory pressures. But at the same time, we don't want to make things too complicated; whether that means we don't try to achieve perfection, or that we simply don't worry about the global memory pressure situation and instead think about other solutions to the "small number of threads in a container" case: OOM kill a bit less frequently, force the container to loop/sleep for a bit, and then allow a random foreground kernel thread in the container to "borrow" a small amount of memory to hopefully let it make forward progress, especially if it is holding locks or is in the process of exiting, etc. - Ted
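Ted's suggestion of tying reservations to the task struct rather than to a gfp flag, with the reservation handed back on the kernel-to-user transition, can be sketched like this (userspace C simulation; all names, sizes, and the accounting are illustrative assumptions):

```c
#include <stddef.h>

/* Sketch: the reservation lives in the task structure, so allocations
 * made on the task's behalf anywhere down the call stack (file system,
 * block driver) can draw on it.  Unused reservation is returned when
 * the task transitions back to user space. */
struct task_ctx {
    size_t reserved;     /* bytes reserved for this kernel entry */
};

static size_t global_reserve = 1024;   /* illustrative system-wide pool */

/* Called on entry to a kernel operation that must not fail mid-way. */
static int task_reserve(struct task_ctx *t, size_t bytes)
{
    if (bytes > global_reserve)
        return -1;                     /* would-be -ENOMEM, up front */
    global_reserve -= bytes;
    t->reserved += bytes;
    return 0;
}

/* Any allocation on the task's behalf draws from its reservation. */
static int task_charge(struct task_ctx *t, size_t bytes)
{
    if (bytes > t->reserved)
        return -1;
    t->reserved -= bytes;
    return 0;
}

/* Called on the kernel->user transition: unused reservation goes back. */
static void task_unreserve(struct task_ctx *t)
{
    global_reserve += t->reserved;
    t->reserved = 0;
}
```

The point of the task-struct placement is visible in `task_charge`: it needs no gfp flag plumbed through every intermediate layer, only the current task context.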
* Re: How to handle TIF_MEMDIE stalls? 2015-03-01 19:36 ` Theodore Ts'o @ 2015-03-01 20:44 ` Johannes Weiner -1 siblings, 0 replies; 276+ messages in thread From: Johannes Weiner @ 2015-03-01 20:44 UTC (permalink / raw) To: Theodore Ts'o Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds On Sun, Mar 01, 2015 at 02:36:35PM -0500, Theodore Ts'o wrote: > On Sun, Mar 01, 2015 at 11:15:06AM -0500, Johannes Weiner wrote: > > > > We had these lockups in cgroups with just a handful of threads, which > > all got stuck in the allocator and there was nobody left to volunteer > > unreclaimable memory. When this was being addressed, we knew that the > > same can theoretically happen on the system-level but weren't aware of > > any reports. Well now, here we are. > > I think the "few threads in a small cgroup" problem is a little > different, because in those cases very often the global system has > enough memory, and there is always the possibility that we might relax > the memory cgroup guarantees a little in order to allow forward > progress. That's exactly how we fixed it. __GFP_NOFAIL allocations are allowed to simply bypass the cgroup memory limits when reclaim within the group fails to make room for the allocation. I'm just mentioning that because the global case doesn't have the same out, but is susceptible to the same deadlock situation when there are no other threads volunteering pages. If your machines are loaded with hundreds or thousands of threads, it is likely that a thread stuck in the allocator will be bailed out by the other threads in the system (or that you run into CPU limits first), but if you have only a handful of memory-intensive tasks, this might not be the case. The cgroup problem was closer to that second scenario, where few threads split all available memory between them. 
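The memcg fix Johannes describes (letting nofail allocations bypass the cgroup limit once reclaim within the group has failed) has roughly this shape; the flag name and accounting below are illustrative, not the actual memcg internals:

```c
#include <stddef.h>

/* Sketch: ordinary allocations fail when the cgroup limit is hit, but
 * nofail allocations bypass the limit rather than loop forever inside
 * the group with every thread stuck in reclaim. */
#define MY_GFP_NOFAIL 0x1   /* stand-in for the real gfp flag */

struct memcg {
    size_t usage;
    size_t limit;
};

static int memcg_charge(struct memcg *cg, size_t bytes, unsigned flags)
{
    if (cg->usage + bytes > cg->limit) {
        if (!(flags & MY_GFP_NOFAIL))
            return -1;      /* reclaim failed, ordinary charge fails */
        /* nofail: exceed the limit so the task can finish and release
         * memory, instead of deadlocking inside the group */
    }
    cg->usage += bytes;
    return 0;
}
```

This is exactly the "out" the message says the global case lacks: a cgroup can borrow from the rest of the machine, but the system as a whole has nowhere to borrow from.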
* Re: How to handle TIF_MEMDIE stalls? 2015-03-01 13:43 ` Theodore Ts'o @ 2015-03-01 20:17 ` Johannes Weiner -1 siblings, 0 replies; 276+ messages in thread From: Johannes Weiner @ 2015-03-01 20:17 UTC (permalink / raw) To: Theodore Ts'o Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds On Sun, Mar 01, 2015 at 08:43:22AM -0500, Theodore Ts'o wrote: > On Sat, Feb 28, 2015 at 05:15:58PM -0500, Johannes Weiner wrote: > > Overestimating should be fine, the result would be a bit of false memory > > pressure. But underestimating and looping can't be an option or the > > original lockups will still be there. We need to guarantee forward > > progress or the problem is somewhat mitigated at best - only now with > > quite a bit more complexity in the allocator and the filesystems. > > We've lived with looping as it is and in practice it's actually worked > well. I can only speak for ext4, but I do a lot of testing under very > high memory pressure situations, and it is used in *production* under > very high stress situations --- and the only time we've run into > trouble is when the looping behaviour somehow got accidentally > *removed*. Memory is a finite resource and there are (unlimited) consumers that do not allow their share to be reclaimed/recycled. Mainly this is the kernel itself, but it also includes anon memory once swap space runs out, as well as mlocked and dirty memory. It's not a question of whether there exists a true point of OOM (where not enough memory is recyclable to satisfy new allocations). That point inevitably exists. It's a policy question of how to inform userspace once it is reached. We agree that we can't unconditionally fail allocations, because we might be in the middle of a transaction, where an allocation failure can potentially corrupt userdata. However, endlessly looping for progress that can not happen at this point has the exact same effect: the transaction won't finish. 
Only the machine locks up in addition. It's great that your setups don't ever truly go out of memory, but that doesn't mean it can't happen in practice. One answer to users at this point could certainly be to stay away from the true point of OOM, and if you don't then that's your problem. But the issue I take with this answer is that, for the sake of memory utilization, users kind of do want to get fairly close to this point, and at the same time it's hard to reliably predict the memory consumption of a workload in advance. It can depend on the timing between threads, it can depend on user/network-supplied input, and it can simply be a bug in the application. And if that OOM situation is accidentally entered, I'd prefer we had a better answer than locking up the machine and blame the user. So one attempt to make progress in this situation is to kill userspace applications that are pinning unreclaimable memory. This is what we are doing now, but there are several problems with it. For one, we are doing a terrible job and might still get stuck sometimes, which deteriorates the situation back to failing the allocation and corrupting the filesystem. Secondly, killing tasks is disruptive, and because it's driven by heuristics we're never going to kill the "right" one in all situations. Reserves would allow us to look ahead and avoid starting transactions that can not be finished given the available resources. So we are at least avoiding filesystem corruption. The tasks could probably be put to sleep for some time in the hope that ongoing transactions complete and release memory, but there might not be any, and eventually the OOM situation has to be communicated to userspace. Arguably, an -ENOMEM from a syscall at this point might be easier to handle than a SIGKILL from the OOM killer in an unrelated task. So if we could pull off reserves, they look like the most attractive solution to me. If not, the OOM killer needs to be fixed to always make forward progress instead. 
I proposed a patch for that already. But infinite loops that force the user to reboot the machine at the point of OOM seem like a terrible policy.
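The look-ahead idea in this message (reserve a transaction's declared worst-case need up front, so that -ENOMEM surfaces before any state is dirtied rather than as a SIGKILL to an unrelated task mid-transaction) can be sketched like so; the pool size, function names, and estimates are illustrative assumptions:

```c
#include <errno.h>
#include <stddef.h>

/* Sketch: before entering a transaction, reserve its declared
 * worst-case memory need.  If the reservation fails, the syscall
 * returns -ENOMEM before anything is modified, instead of starting a
 * transaction that cannot be finished. */
static size_t fs_reserve = 512;        /* illustrative reserve pool */

static int trans_start(size_t worst_case)
{
    if (worst_case > fs_reserve)
        return -ENOMEM;                /* refuse up front, nothing dirtied */
    fs_reserve -= worst_case;
    return 0;
}

static void trans_end(size_t worst_case, size_t actually_used)
{
    /* Return the unused slack of the estimate.  Overestimating only
     * creates false pressure, as noted earlier in the thread; the
     * actually-used bytes go back whenever they are freed normally. */
    fs_reserve += worst_case - actually_used;
}
```

A caller that sees -ENOMEM here can sleep and retry, or report the failure to userspace, but it never holds half-finished filesystem state while it waits.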
* Re: How to handle TIF_MEMDIE stalls? 2015-02-28 22:15 ` Johannes Weiner @ 2015-03-01 21:48 ` Dave Chinner -1 siblings, 0 replies; 276+ messages in thread From: Dave Chinner @ 2015-03-01 21:48 UTC (permalink / raw) To: Johannes Weiner Cc: Theodore Ts'o, Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds On Sat, Feb 28, 2015 at 05:15:58PM -0500, Johannes Weiner wrote: > On Sat, Feb 28, 2015 at 11:41:58AM -0500, Theodore Ts'o wrote: > > On Sat, Feb 28, 2015 at 11:29:43AM -0500, Johannes Weiner wrote: > > > > > > I'm trying to figure out if the current nofail allocators can get > > > their memory needs figured out beforehand. And reliably so - what > > > good are estimates that are right 90% of the time, when failing the > > > allocation means corrupting user data? What is the contingency plan? > > > > In the ideal world, we can figure out the exact memory needs > > beforehand. But we live in an imperfect world, and given that block > > devices *also* need memory, the answer is "of course not". We can't > > be perfect. But we can least give some kind of hint, and we can offer > > to wait before we get into a situation where we need to loop in > > GFP_NOWAIT --- which is the contingency/fallback plan. > > Overestimating should be fine, the result would a bit of false memory > pressure. But underestimating and looping can't be an option or the > original lockups will still be there. We need to guarantee forward > progress or the problem is somewhat mitigated at best - only now with > quite a bit more complexity in the allocator and the filesystems. The additional complexity in XFS is actually quite minor, and initial "rough worst case" memory usage estimates are not that hard to measure.... > The block code would have to be looked at separately, but doesn't it > already use mempools etc. to guarantee progress? Yes, it does. I'm not concerned about the block layer. Cheers, Dave. 
-- Dave Chinner david@fromorbit.com
* Re: How to handle TIF_MEMDIE stalls? 2015-03-01 21:48 ` Dave Chinner @ 2015-03-02 0:17 ` Dave Chinner -1 siblings, 0 replies; 276+ messages in thread From: Dave Chinner @ 2015-03-02 0:17 UTC (permalink / raw) To: Johannes Weiner Cc: Theodore Ts'o, Tetsuo Handa, rientjes, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, akpm, torvalds On Mon, Mar 02, 2015 at 08:48:05AM +1100, Dave Chinner wrote: > On Sat, Feb 28, 2015 at 05:15:58PM -0500, Johannes Weiner wrote: > > On Sat, Feb 28, 2015 at 11:41:58AM -0500, Theodore Ts'o wrote: > > > On Sat, Feb 28, 2015 at 11:29:43AM -0500, Johannes Weiner wrote: > > > > > > > > I'm trying to figure out if the current nofail allocators can get > > > > their memory needs figured out beforehand. And reliably so - what > > > > good are estimates that are right 90% of the time, when failing the > > > > allocation means corrupting user data? What is the contingency plan? > > > > > > In the ideal world, we can figure out the exact memory needs > > > beforehand. But we live in an imperfect world, and given that block > > > devices *also* need memory, the answer is "of course not". We can't > > > be perfect. But we can least give some kind of hint, and we can offer > > > to wait before we get into a situation where we need to loop in > > > GFP_NOWAIT --- which is the contingency/fallback plan. > > > > Overestimating should be fine, the result would a bit of false memory > > pressure. But underestimating and looping can't be an option or the > > original lockups will still be there. We need to guarantee forward > > progress or the problem is somewhat mitigated at best - only now with > > quite a bit more complexity in the allocator and the filesystems. > > The additional complexity in XFS is actually quite minor, and > initial "rough worst case" memory usage estimates are not that hard > to measure.... 
And, just to point out that the OOM killer can be invoked without a single transaction-based filesystem ENOMEM failure, here's what xfs/084 does on 4.0-rc1: [ 148.820369] resvtest invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0 [ 148.822113] resvtest cpuset=/ mems_allowed=0 [ 148.823124] CPU: 0 PID: 4342 Comm: resvtest Not tainted 4.0.0-rc1-dgc+ #825 [ 148.824648] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 [ 148.826471] 0000000000000000 ffff88003ba2b988 ffffffff81dcb570 000000000000000c [ 148.828220] ffff88003bb06380 ffff88003ba2ba08 ffffffff81dc5c2f 0000000000000000 [ 148.829958] 0000000000000000 ffff88003ba2b9a8 0000000000000206 ffff88003ba2b9d8 [ 148.831734] Call Trace: [ 148.832325] [<ffffffff81dcb570>] dump_stack+0x4c/0x65 [ 148.833493] [<ffffffff81dc5c2f>] dump_header.isra.12+0x79/0x1cb [ 148.834855] [<ffffffff8117db69>] oom_kill_process+0x1c9/0x3b0 [ 148.836195] [<ffffffff810a7105>] ? has_capability_noaudit+0x25/0x40 [ 148.837633] [<ffffffff8117e0c5>] __out_of_memory+0x315/0x500 [ 148.838925] [<ffffffff8117e44b>] out_of_memory+0x5b/0x80 [ 148.840162] [<ffffffff811830d9>] __alloc_pages_nodemask+0x7d9/0x810 [ 148.841592] [<ffffffff811c0531>] alloc_pages_current+0x91/0x100 [ 148.842950] [<ffffffff8117a427>] __page_cache_alloc+0xa7/0xc0 [ 148.844286] [<ffffffff8117c688>] filemap_fault+0x1b8/0x420 [ 148.845545] [<ffffffff811a05ed>] __do_fault+0x3d/0x70 [ 148.846706] [<ffffffff811a4478>] handle_mm_fault+0x988/0x1230 [ 148.848042] [<ffffffff81090305>] __do_page_fault+0x1a5/0x460 [ 148.849333] [<ffffffff81090675>] trace_do_page_fault+0x45/0x130 [ 148.850681] [<ffffffff8108b8ce>] do_async_page_fault+0x1e/0xd0 [ 148.852025] [<ffffffff81dd1567>] ? 
schedule+0x37/0x90 [ 148.853187] [<ffffffff81dd8b88>] async_page_fault+0x28/0x30 [ 148.854456] Mem-Info: [ 148.854986] Node 0 DMA per-cpu: [ 148.855727] CPU 0: hi: 0, btch: 1 usd: 0 [ 148.856820] Node 0 DMA32 per-cpu: [ 148.857600] CPU 0: hi: 186, btch: 31 usd: 0 [ 148.858688] active_anon:119251 inactive_anon:119329 isolated_anon:0 [ 148.858688] active_file:19 inactive_file:2 isolated_file:0 [ 148.858688] unevictable:0 dirty:0 writeback:0 unstable:0 [ 148.858688] free:1965 slab_reclaimable:2816 slab_unreclaimable:2184 [ 148.858688] mapped:3 shmem:2 pagetables:1259 bounce:0 [ 148.858688] free_cma:0 [ 148.865606] Node 0 DMA free:3916kB min:60kB low:72kB high:88kB active_anon:5100kB inactive_anon:5324kB active_file:0kB inactive_file:8kB unevictable:0kB isolated(as [ 148.874431] lowmem_reserve[]: 0 966 966 966 [ 148.875504] Node 0 DMA32 free:3944kB min:3944kB low:4928kB high:5916kB active_anon:471904kB inactive_anon:471992kB active_file:76kB inactive_file:0kB unevictable:0s [ 148.884817] lowmem_reserve[]: 0 0 0 0 [ 148.885770] Node 0 DMA: 1*4kB (M) 1*8kB (U) 2*16kB (UM) 3*32kB (UM) 1*64kB (M) 1*128kB (M) 0*256kB 1*512kB (M) 1*1024kB (M) 1*2048kB (R) 0*4096kB = 3916kB [ 148.889385] Node 0 DMA32: 8*4kB (UEM) 2*8kB (UR) 3*16kB (M) 1*32kB (M) 2*64kB (MR) 1*128kB (R) 0*256kB 1*512kB (R) 1*1024kB (R) 1*2048kB (R) 0*4096kB = 3968kB [ 148.893068] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 148.894949] 47361 total pagecache pages [ 148.895816] 47334 pages in swap cache [ 148.896657] Swap cache stats: add 124669, delete 77335, find 83/169 [ 148.898057] Free swap = 0kB [ 148.898714] Total swap = 497976kB [ 148.899470] 262044 pages RAM [ 148.900145] 0 pages HighMem/MovableOnly [ 148.901006] 10253 pages reserved [ 148.901735] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name [ 148.903637] [ 1204] 0 1204 6039 1 15 3 163 -1000 udevd [ 148.905571] [ 1323] 0 1323 6038 1 14 3 165 -1000 udevd [ 148.907499] [ 1324] 0 1324 6038 
1 14 3 164 -1000 udevd [ 148.909439] [ 2176] 0 2176 2524 0 6 2 571 0 dhclient [ 148.911427] [ 2227] 0 2227 9267 0 22 3 95 0 rpcbind [ 148.913392] [ 2632] 0 2632 64981 30 29 3 136 0 rsyslogd [ 148.915391] [ 2686] 0 2686 1062 1 6 3 36 0 acpid [ 148.917325] [ 2826] 0 2826 4753 0 12 2 44 0 atd [ 148.919209] [ 2877] 0 2877 6473 0 17 3 66 0 cron [ 148.921120] [ 2911] 104 2911 7078 1 17 3 81 0 dbus-daemon [ 148.923150] [ 3591] 0 3591 13731 0 28 2 165 -1000 sshd [ 148.925073] [ 3603] 0 3603 22024 0 43 2 215 0 winbindd [ 148.927066] [ 3612] 0 3612 22024 0 42 2 216 0 winbindd [ 148.929062] [ 3636] 0 3636 3722 1 11 3 41 0 getty [ 148.930981] [ 3637] 0 3637 3722 1 11 3 40 0 getty [ 148.932915] [ 3638] 0 3638 3722 1 11 3 39 0 getty [ 148.934835] [ 3639] 0 3639 3722 1 11 3 40 0 getty [ 148.936789] [ 3640] 0 3640 3722 1 11 3 40 0 getty [ 148.938704] [ 3641] 0 3641 3722 1 10 3 38 0 getty [ 148.940635] [ 3642] 0 3642 3677 1 11 3 40 0 getty [ 148.942550] [ 3643] 0 3643 25894 2 52 2 248 0 sshd [ 148.944469] [ 3649] 0 3649 146652 1 35 4 320 0 console-kit-dae [ 148.946578] [ 3716] 0 3716 48287 1 31 4 171 0 polkitd [ 148.948552] [ 3722] 1000 3722 25894 0 51 2 250 0 sshd [ 148.950457] [ 3723] 1000 3723 5435 3 15 3 495 0 bash [ 148.952375] [ 3742] 0 3742 17157 1 37 2 160 0 sudo [ 148.954275] [ 3743] 0 3743 3365 1 11 3 516 0 check [ 148.956229] [ 4130] 0 4130 3334 1 11 3 484 0 084 [ 148.958108] [ 4342] 0 4342 314556 191159 619 4 119808 0 resvtest [ 148.960104] [ 4343] 0 4343 3334 0 11 3 485 0 084 [ 148.961990] [ 4344] 0 4344 3334 0 11 3 485 0 084 [ 148.963876] [ 4345] 0 4345 3305 0 11 3 36 0 sed [ 148.965766] [ 4346] 0 4346 3305 0 11 3 37 0 sed [ 148.967652] Out of memory: Kill process 4342 (resvtest) score 803 or sacrifice child [ 148.969390] Killed process 4342 (resvtest) total-vm:1258224kB, anon-rss:764636kB, file-rss:0kB [ 149.415288] XFS (vda): Unmounting Filesystem [ 150.211229] XFS (vda): Mounting V5 Filesystem [ 150.292092] XFS (vda): Ending clean mount [ 150.342307] XFS (vda): 
Unmounting Filesystem [ 150.346522] XFS (vdb): Unmounting Filesystem [ 151.264135] XFS: kmalloc allocations by trans type [ 151.265195] XFS: 3: count 7, bytes 3992, fails 0, max_size 1024 [ 151.266479] XFS: 4: count 3, bytes 400, fails 0, max_size 144 [ 151.267735] XFS: 7: count 9, bytes 2784, fails 0, max_size 536 [ 151.269022] XFS: 16: count 1, bytes 696, fails 0, max_size 696 [ 151.270286] XFS: 26: count 1, bytes 384, fails 0, max_size 384 [ 151.271550] XFS: 35: count 1, bytes 696, fails 0, max_size 696 [ 151.272833] XFS: slab allocations by trans type [ 151.273818] XFS: 3: count 22, bytes 0, fails 0, max_size 0 [ 151.275010] XFS: 4: count 13, bytes 0, fails 0, max_size 0 [ 151.276212] XFS: 7: count 12, bytes 0, fails 0, max_size 0 [ 151.277406] XFS: 15: count 2, bytes 0, fails 0, max_size 0 [ 151.278595] XFS: 16: count 10, bytes 0, fails 0, max_size 0 [ 151.279854] XFS: 18: count 2, bytes 0, fails 0, max_size 0 [ 151.281080] XFS: 26: count 3, bytes 0, fails 0, max_size 0 [ 151.282275] XFS: 35: count 2, bytes 0, fails 0, max_size 0 [ 151.283476] XFS: vmalloc allocations by trans type [ 151.284535] XFS: page allocations by trans type Those XFS allocation stats are the largest measured allocations done under transaction context, broken down by allocation and transaction type. No failures that would result in looping, even though the system invoked the OOM killer on a filesystem workload.... I need to break the slab allocations down further by cache (other workloads are generating over 50 slab allocations per transaction), but another hour's work and a few days of observation of the stats in my normal day-to-day work will get me all the information I need to do a decent first pass at memory reservation requirements for XFS. Cheers, Dave. -- Dave Chinner david@fromorbit.com
* Re: How to handle TIF_MEMDIE stalls? 2015-03-02 0:17 ` Dave Chinner @ 2015-03-02 12:46 ` Brian Foster -1 siblings, 0 replies; 276+ messages in thread From: Brian Foster @ 2015-03-02 12:46 UTC (permalink / raw) To: Dave Chinner Cc: Theodore Ts'o, Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, rientjes, akpm, torvalds On Mon, Mar 02, 2015 at 11:17:23AM +1100, Dave Chinner wrote: > On Mon, Mar 02, 2015 at 08:48:05AM +1100, Dave Chinner wrote: > > On Sat, Feb 28, 2015 at 05:15:58PM -0500, Johannes Weiner wrote: > > > On Sat, Feb 28, 2015 at 11:41:58AM -0500, Theodore Ts'o wrote: > > > > On Sat, Feb 28, 2015 at 11:29:43AM -0500, Johannes Weiner wrote: > > > > > > > > > > I'm trying to figure out if the current nofail allocators can get > > > > > their memory needs figured out beforehand. And reliably so - what > > > > > good are estimates that are right 90% of the time, when failing the > > > > > allocation means corrupting user data? What is the contingency plan? > > > > > > > > In the ideal world, we can figure out the exact memory needs > > > > beforehand. But we live in an imperfect world, and given that block > > > > devices *also* need memory, the answer is "of course not". We can't > > > > be perfect. But we can least give some kind of hint, and we can offer > > > > to wait before we get into a situation where we need to loop in > > > > GFP_NOWAIT --- which is the contingency/fallback plan. > > > > > > Overestimating should be fine, the result would a bit of false memory > > > pressure. But underestimating and looping can't be an option or the > > > original lockups will still be there. We need to guarantee forward > > > progress or the problem is somewhat mitigated at best - only now with > > > quite a bit more complexity in the allocator and the filesystems. > > > > The additional complexity in XFS is actually quite minor, and > > initial "rough worst case" memory usage estimates are not that hard > > to measure.... 
> > And, just to point out that the OOM killer can be invoked without a > single transaction-based filesystem ENOMEM failure, here's what > xfs/084 does on 4.0-rc1: > > [ 148.820369] resvtest invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0 > [ 148.822113] resvtest cpuset=/ mems_allowed=0 > [ 148.823124] CPU: 0 PID: 4342 Comm: resvtest Not tainted 4.0.0-rc1-dgc+ #825 > [ 148.824648] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 > [ 148.826471] 0000000000000000 ffff88003ba2b988 ffffffff81dcb570 000000000000000c > [ 148.828220] ffff88003bb06380 ffff88003ba2ba08 ffffffff81dc5c2f 0000000000000000 > [ 148.829958] 0000000000000000 ffff88003ba2b9a8 0000000000000206 ffff88003ba2b9d8 > [ 148.831734] Call Trace: > [ 148.832325] [<ffffffff81dcb570>] dump_stack+0x4c/0x65 > [ 148.833493] [<ffffffff81dc5c2f>] dump_header.isra.12+0x79/0x1cb > [ 148.834855] [<ffffffff8117db69>] oom_kill_process+0x1c9/0x3b0 > [ 148.836195] [<ffffffff810a7105>] ? has_capability_noaudit+0x25/0x40 > [ 148.837633] [<ffffffff8117e0c5>] __out_of_memory+0x315/0x500 > [ 148.838925] [<ffffffff8117e44b>] out_of_memory+0x5b/0x80 > [ 148.840162] [<ffffffff811830d9>] __alloc_pages_nodemask+0x7d9/0x810 > [ 148.841592] [<ffffffff811c0531>] alloc_pages_current+0x91/0x100 > [ 148.842950] [<ffffffff8117a427>] __page_cache_alloc+0xa7/0xc0 > [ 148.844286] [<ffffffff8117c688>] filemap_fault+0x1b8/0x420 > [ 148.845545] [<ffffffff811a05ed>] __do_fault+0x3d/0x70 > [ 148.846706] [<ffffffff811a4478>] handle_mm_fault+0x988/0x1230 > [ 148.848042] [<ffffffff81090305>] __do_page_fault+0x1a5/0x460 > [ 148.849333] [<ffffffff81090675>] trace_do_page_fault+0x45/0x130 > [ 148.850681] [<ffffffff8108b8ce>] do_async_page_fault+0x1e/0xd0 > [ 148.852025] [<ffffffff81dd1567>] ? 
* Re: How to handle TIF_MEMDIE stalls? @ 2015-03-02 12:46 ` Brian Foster 0 siblings, 0 replies; 276+ messages in thread From: Brian Foster @ 2015-03-02 12:46 UTC (permalink / raw) To: Dave Chinner Cc: Johannes Weiner, Theodore Ts'o, Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds On Mon, Mar 02, 2015 at 11:17:23AM +1100, Dave Chinner wrote: > On Mon, Mar 02, 2015 at 08:48:05AM +1100, Dave Chinner wrote: > > On Sat, Feb 28, 2015 at 05:15:58PM -0500, Johannes Weiner wrote: > > > On Sat, Feb 28, 2015 at 11:41:58AM -0500, Theodore Ts'o wrote: > > > > On Sat, Feb 28, 2015 at 11:29:43AM -0500, Johannes Weiner wrote: > > > > > > > > > > I'm trying to figure out if the current nofail allocators can get > > > > > their memory needs figured out beforehand. And reliably so - what > > > > > good are estimates that are right 90% of the time, when failing the > > > > > allocation means corrupting user data? What is the contingency plan? > > > > > > > > In the ideal world, we can figure out the exact memory needs > > > > beforehand. But we live in an imperfect world, and given that block > > > > devices *also* need memory, the answer is "of course not". We can't > > > > be perfect. But we can at least give some kind of hint, and we can offer > > > > to wait before we get into a situation where we need to loop in > > > > GFP_NOWAIT --- which is the contingency/fallback plan. > > > > > > Overestimating should be fine, the result would be a bit of false memory > > > pressure. But underestimating and looping can't be an option or the > > > original lockups will still be there. We need to guarantee forward > > > progress or the problem is somewhat mitigated at best - only now with > > > quite a bit more complexity in the allocator and the filesystems. > > > > The additional complexity in XFS is actually quite minor, and > > initial "rough worst case" memory usage estimates are not that hard > > to measure.... 
> > And, just to point out that the OOM killer can be invoked without a > single transaction-based filesystem ENOMEM failure, here's what > xfs/084 does on 4.0-rc1: > > [ 148.820369] resvtest invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0 > [ 148.822113] resvtest cpuset=/ mems_allowed=0 > [ 148.823124] CPU: 0 PID: 4342 Comm: resvtest Not tainted 4.0.0-rc1-dgc+ #825 > [ 148.824648] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 > [ 148.826471] 0000000000000000 ffff88003ba2b988 ffffffff81dcb570 000000000000000c > [ 148.828220] ffff88003bb06380 ffff88003ba2ba08 ffffffff81dc5c2f 0000000000000000 > [ 148.829958] 0000000000000000 ffff88003ba2b9a8 0000000000000206 ffff88003ba2b9d8 > [ 148.831734] Call Trace: > [ 148.832325] [<ffffffff81dcb570>] dump_stack+0x4c/0x65 > [ 148.833493] [<ffffffff81dc5c2f>] dump_header.isra.12+0x79/0x1cb > [ 148.834855] [<ffffffff8117db69>] oom_kill_process+0x1c9/0x3b0 > [ 148.836195] [<ffffffff810a7105>] ? has_capability_noaudit+0x25/0x40 > [ 148.837633] [<ffffffff8117e0c5>] __out_of_memory+0x315/0x500 > [ 148.838925] [<ffffffff8117e44b>] out_of_memory+0x5b/0x80 > [ 148.840162] [<ffffffff811830d9>] __alloc_pages_nodemask+0x7d9/0x810 > [ 148.841592] [<ffffffff811c0531>] alloc_pages_current+0x91/0x100 > [ 148.842950] [<ffffffff8117a427>] __page_cache_alloc+0xa7/0xc0 > [ 148.844286] [<ffffffff8117c688>] filemap_fault+0x1b8/0x420 > [ 148.845545] [<ffffffff811a05ed>] __do_fault+0x3d/0x70 > [ 148.846706] [<ffffffff811a4478>] handle_mm_fault+0x988/0x1230 > [ 148.848042] [<ffffffff81090305>] __do_page_fault+0x1a5/0x460 > [ 148.849333] [<ffffffff81090675>] trace_do_page_fault+0x45/0x130 > [ 148.850681] [<ffffffff8108b8ce>] do_async_page_fault+0x1e/0xd0 > [ 148.852025] [<ffffffff81dd1567>] ? 
schedule+0x37/0x90 > [ 148.853187] [<ffffffff81dd8b88>] async_page_fault+0x28/0x30 > [ 148.854456] Mem-Info: > [ 148.854986] Node 0 DMA per-cpu: > [ 148.855727] CPU 0: hi: 0, btch: 1 usd: 0 > [ 148.856820] Node 0 DMA32 per-cpu: > [ 148.857600] CPU 0: hi: 186, btch: 31 usd: 0 > [ 148.858688] active_anon:119251 inactive_anon:119329 isolated_anon:0 > [ 148.858688] active_file:19 inactive_file:2 isolated_file:0 > [ 148.858688] unevictable:0 dirty:0 writeback:0 unstable:0 > [ 148.858688] free:1965 slab_reclaimable:2816 slab_unreclaimable:2184 > [ 148.858688] mapped:3 shmem:2 pagetables:1259 bounce:0 > [ 148.858688] free_cma:0 > [ 148.865606] Node 0 DMA free:3916kB min:60kB low:72kB high:88kB active_anon:5100kB inactive_anon:5324kB active_file:0kB inactive_file:8kB unevictable:0kB isolated(as > [ 148.874431] lowmem_reserve[]: 0 966 966 966 > [ 148.875504] Node 0 DMA32 free:3944kB min:3944kB low:4928kB high:5916kB active_anon:471904kB inactive_anon:471992kB active_file:76kB inactive_file:0kB unevictable:0s > [ 148.884817] lowmem_reserve[]: 0 0 0 0 > [ 148.885770] Node 0 DMA: 1*4kB (M) 1*8kB (U) 2*16kB (UM) 3*32kB (UM) 1*64kB (M) 1*128kB (M) 0*256kB 1*512kB (M) 1*1024kB (M) 1*2048kB (R) 0*4096kB = 3916kB > [ 148.889385] Node 0 DMA32: 8*4kB (UEM) 2*8kB (UR) 3*16kB (M) 1*32kB (M) 2*64kB (MR) 1*128kB (R) 0*256kB 1*512kB (R) 1*1024kB (R) 1*2048kB (R) 0*4096kB = 3968kB > [ 148.893068] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB > [ 148.894949] 47361 total pagecache pages > [ 148.895816] 47334 pages in swap cache > [ 148.896657] Swap cache stats: add 124669, delete 77335, find 83/169 > [ 148.898057] Free swap = 0kB > [ 148.898714] Total swap = 497976kB > [ 148.899470] 262044 pages RAM > [ 148.900145] 0 pages HighMem/MovableOnly > [ 148.901006] 10253 pages reserved > [ 148.901735] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name > [ 148.903637] [ 1204] 0 1204 6039 1 15 3 163 -1000 udevd > [ 148.905571] [ 1323] 0 1323 
6038 1 14 3 165 -1000 udevd > [ 148.907499] [ 1324] 0 1324 6038 1 14 3 164 -1000 udevd > [ 148.909439] [ 2176] 0 2176 2524 0 6 2 571 0 dhclient > [ 148.911427] [ 2227] 0 2227 9267 0 22 3 95 0 rpcbind > [ 148.913392] [ 2632] 0 2632 64981 30 29 3 136 0 rsyslogd > [ 148.915391] [ 2686] 0 2686 1062 1 6 3 36 0 acpid > [ 148.917325] [ 2826] 0 2826 4753 0 12 2 44 0 atd > [ 148.919209] [ 2877] 0 2877 6473 0 17 3 66 0 cron > [ 148.921120] [ 2911] 104 2911 7078 1 17 3 81 0 dbus-daemon > [ 148.923150] [ 3591] 0 3591 13731 0 28 2 165 -1000 sshd > [ 148.925073] [ 3603] 0 3603 22024 0 43 2 215 0 winbindd > [ 148.927066] [ 3612] 0 3612 22024 0 42 2 216 0 winbindd > [ 148.929062] [ 3636] 0 3636 3722 1 11 3 41 0 getty > [ 148.930981] [ 3637] 0 3637 3722 1 11 3 40 0 getty > [ 148.932915] [ 3638] 0 3638 3722 1 11 3 39 0 getty > [ 148.934835] [ 3639] 0 3639 3722 1 11 3 40 0 getty > [ 148.936789] [ 3640] 0 3640 3722 1 11 3 40 0 getty > [ 148.938704] [ 3641] 0 3641 3722 1 10 3 38 0 getty > [ 148.940635] [ 3642] 0 3642 3677 1 11 3 40 0 getty > [ 148.942550] [ 3643] 0 3643 25894 2 52 2 248 0 sshd > [ 148.944469] [ 3649] 0 3649 146652 1 35 4 320 0 console-kit-dae > [ 148.946578] [ 3716] 0 3716 48287 1 31 4 171 0 polkitd > [ 148.948552] [ 3722] 1000 3722 25894 0 51 2 250 0 sshd > [ 148.950457] [ 3723] 1000 3723 5435 3 15 3 495 0 bash > [ 148.952375] [ 3742] 0 3742 17157 1 37 2 160 0 sudo > [ 148.954275] [ 3743] 0 3743 3365 1 11 3 516 0 check > [ 148.956229] [ 4130] 0 4130 3334 1 11 3 484 0 084 > [ 148.958108] [ 4342] 0 4342 314556 191159 619 4 119808 0 resvtest > [ 148.960104] [ 4343] 0 4343 3334 0 11 3 485 0 084 > [ 148.961990] [ 4344] 0 4344 3334 0 11 3 485 0 084 > [ 148.963876] [ 4345] 0 4345 3305 0 11 3 36 0 sed > [ 148.965766] [ 4346] 0 4346 3305 0 11 3 37 0 sed > [ 148.967652] Out of memory: Kill process 4342 (resvtest) score 803 or sacrifice child > [ 148.969390] Killed process 4342 (resvtest) total-vm:1258224kB, anon-rss:764636kB, file-rss:0kB > [ 149.415288] XFS (vda): Unmounting 
Filesystem > [ 150.211229] XFS (vda): Mounting V5 Filesystem > [ 150.292092] XFS (vda): Ending clean mount > [ 150.342307] XFS (vda): Unmounting Filesystem > [ 150.346522] XFS (vdb): Unmounting Filesystem > [ 151.264135] XFS: kmalloc allocations by trans type > [ 151.265195] XFS: 3: count 7, bytes 3992, fails 0, max_size 1024 > [ 151.266479] XFS: 4: count 3, bytes 400, fails 0, max_size 144 > [ 151.267735] XFS: 7: count 9, bytes 2784, fails 0, max_size 536 > [ 151.269022] XFS: 16: count 1, bytes 696, fails 0, max_size 696 > [ 151.270286] XFS: 26: count 1, bytes 384, fails 0, max_size 384 > [ 151.271550] XFS: 35: count 1, bytes 696, fails 0, max_size 696 > [ 151.272833] XFS: slab allocations by trans type > [ 151.273818] XFS: 3: count 22, bytes 0, fails 0, max_size 0 > [ 151.275010] XFS: 4: count 13, bytes 0, fails 0, max_size 0 > [ 151.276212] XFS: 7: count 12, bytes 0, fails 0, max_size 0 > [ 151.277406] XFS: 15: count 2, bytes 0, fails 0, max_size 0 > [ 151.278595] XFS: 16: count 10, bytes 0, fails 0, max_size 0 > [ 151.279854] XFS: 18: count 2, bytes 0, fails 0, max_size 0 > [ 151.281080] XFS: 26: count 3, bytes 0, fails 0, max_size 0 > [ 151.282275] XFS: 35: count 2, bytes 0, fails 0, max_size 0 > [ 151.283476] XFS: vmalloc allocations by trans type > [ 151.284535] XFS: page allocations by trans type > > Those XFS allocation stats are the largest measured allocations done > under transaction context, broken down by allocation and transaction > type. No failures that would result in looping, even though the > system invoked the OOM killer on a filesystem workload.... > > I need to break the slab allocations down further by cache (other > workloads are generating over 50 slab allocations per transaction), > but another hour's work and a few days of observation of the stats > in my normal day-to-day work will get me all the information I need > to do a decent first pass at memory reservation requirements for > XFS. 
> This sounds like something that would serve us well under sysfs, particularly if we do adopt the kind of reservation model being discussed in this thread. I wouldn't expect these values to change drastically or that often, but they could certainly adjust over time to the point of being out of line with a reservation. A tool like this combined with Johannes' idea of a warning or something along those lines for a reservation overrun should give us all we need to identify that something is wrong and have the ability to fix it. Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-23 0:45 ` Dave Chinner @ 2015-02-28 18:36 ` Vlastimil Babka 1 sibling, 0 replies; 276+ messages in thread From: Vlastimil Babka @ 2015-02-28 18:36 UTC (permalink / raw) To: Dave Chinner, Johannes Weiner Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds On 23.2.2015 1:45, Dave Chinner wrote: > On Sat, Feb 21, 2015 at 06:52:27PM -0500, Johannes Weiner wrote: >> On Fri, Feb 20, 2015 at 09:52:17AM +1100, Dave Chinner wrote: >>> I will actively work around anything that causes filesystem memory >>> pressure to increase the chance of oom killer invocations. The OOM >>> killer is not a solution - it is, by definition, a loose cannon and >>> so we should be reducing dependencies on it. >> >> Once we have a better-working alternative, sure. > > Great, but first a simple request: please stop writing code and > instead start architecting a solution to the problem. i.e. we need a > design and have that documented before code gets written. If you > watched my recent LCA talk, then you'll understand what I mean > when I say: stop programming and start engineering. About that... I guess good engineering also means looking at past solutions to the same problem. I expect there would be a lot of academic work on this, which might tell us what's (not) possible. And maybe even actual implementations with real-life experience to learn from? >>> I really don't care about the OOM Killer corner cases - it's >>> completely the wrong line of development to be spending time on >>> and you aren't going to convince me otherwise. The OOM killer is a >>> crutch used to justify having a memory allocation subsystem that >>> can't provide forward progress guarantee mechanisms to callers that >>> need it. >> >> We can provide this. Are all these callers able to preallocate? 
> > Anything that allocates in transaction context (and therefore is > GFP_NOFS by definition) can preallocate at transaction reservation > time. However, preallocation is dumb, complex, CPU and memory > intensive and will have a *massive* impact on performance. > Allocating 10-100 pages to a reserve which we will almost *never > use* and then free them again *on every single transaction* is a lot > of unnecessary additional fast path overhead. Hence a "preallocate > for every context" reserve pool is not a viable solution. But won't even the reservation have a potentially large impact on performance, if as you later suggest (IIUC), we don't actually dip into our reserves until regular reclaim starts failing? Doesn't that mean a potentially large amount of wasted memory? Right, it doesn't have to be if we allow clean reclaimable pages to be part of the reserve, but still... > And, really, "reservation" != "preallocation". > > Maybe it's my filesystem background, but those two things are vastly > different things. > > Reservations are simply an *accounting* of the maximum amount of a > reserve required by an operation to guarantee forwards progress. In > filesystems, we do this for log space (transactions) and some do it > for filesystem space (e.g. delayed allocation needs correct ENOSPC > detection so we don't overcommit disk space). The VM already has > such concepts (e.g. watermarks and things like min_free_kbytes) that > it uses to ensure that there are sufficient reserves for certain > types of allocations to succeed. > > A reserve memory pool is no different - every time a memory reserve > occurs, a watermark is lifted to accommodate it, and the transaction > is not allowed to proceed until the amount of free memory exceeds > that watermark. The memory allocation subsystem then only allows > *allocations* marked correctly to allocate pages from the > reserve that watermark protects. e.g. only allocations using > __GFP_RESERVE are allowed to dip into the reserve pool. 
> > By using watermarks, freeing of memory will automatically top > up the reserve pool which means that we guarantee that reclaimable > memory allocated for demand paging during transactions doesn't > deplete the reserve pool permanently. As a result, when there is > plenty of free and/or reclaimable memory, the reserve pool > watermarks will have almost zero impact on performance and > behaviour. > > Further, because it's just accounting and behavioural thresholds, > this allows the mm subsystem to control how the reserve pool is > accounted internally. e.g. clean, reclaimable pages in the page > cache could serve as reserve pool pages as they can be immediately > reclaimed for allocation. This could be achieved by setting reclaim > targets first to the reserve pool watermark, then the second target > is enough pages to satisfy the current allocation. Hmm, but what if the clean pages need us to take some locks to unmap and some process holding them is blocked... Also we would need to potentially block a process that wants to dirty a page; is that being done now? > And, FWIW, there's nothing stopping this mechanism from having order > based reserve thresholds. e.g. IB could really do with a 64k reserve > pool threshold and hence help solve the long standing problems they > have with filling the receive ring in GFP_ATOMIC context... I don't know the details here, but if the allocation is done for incoming packets, i.e. something you can't predict, then how would you set the reserve for that? If they could predict, they would be able to preallocate the necessary buffers already. > Sure, that's looking further down the track, but my point still > remains: we need a viable long term solution to this problem. Maybe > reservations are not the solution, but I don't see anyone else who > is thinking of how to address this architectural problem at a system > level right now. 
> We need to design and document the model first, > then review it, then we can start working at the code level to > implement the solution we've designed. Right. A conference to discuss this at could come in handy :) > Cheers, > > Dave. > _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-23 0:45 ` Dave Chinner @ 2015-03-02 15:18 ` Michal Hocko 1 sibling, 0 replies; 276+ messages in thread From: Michal Hocko @ 2015-03-02 15:18 UTC (permalink / raw) To: Dave Chinner Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm, mgorman, rientjes, akpm, torvalds On Mon 23-02-15 11:45:21, Dave Chinner wrote: [...] > A reserve memory pool is no different - every time a memory reserve > occurs, a watermark is lifted to accommodate it, and the transaction > is not allowed to proceed until the amount of free memory exceeds > that watermark. The memory allocation subsystem then only allows > *allocations* marked correctly to allocate pages from the > reserve that watermark protects. e.g. only allocations using > __GFP_RESERVE are allowed to dip into the reserve pool. The idea is sound. But I am pretty sure we will find many corner cases. E.g. what if the mere reservation attempt causes the system to go OOM and trigger the OOM killer? Sure, that wouldn't be too much different from an OOM triggered during the allocation, but there is one major difference. Reservations need to be estimated and I expect the estimation would be on the more conservative side, so the OOM might not happen without them. > By using watermarks, freeing of memory will automatically top > up the reserve pool which means that we guarantee that reclaimable > memory allocated for demand paging during transactions doesn't > deplete the reserve pool permanently. As a result, when there is > plenty of free and/or reclaimable memory, the reserve pool > watermarks will have almost zero impact on performance and > behaviour. A typical busy system won't be very far away from the high watermark, so there would be reclaim performed during increased watermarks (aka reservation) and that might lead to visible performance degradation. 
This might be acceptable, but it also adds a certain level of unpredictability where performance characteristics might change suddenly. > Further, because it's just accounting and behavioural thresholds, > this allows the mm subsystem to control how the reserve pool is > accounted internally. e.g. clean, reclaimable pages in the page > cache could serve as reserve pool pages as they can be immediately > reclaimed for allocation. But they can also become hard or impossible to reclaim. Clean pages might get dirty and e.g. swap backed pages run out of their backing storage. So I guess we cannot count on those pages without reclaiming them first and hiding them in the reserve. Which is what you suggest below, probably, but I wasn't really sure... > This could be achieved by setting reclaim targets first to the reserve > pool watermark, then the second target is enough pages to satisfy the > current allocation. > > And, FWIW, there's nothing stopping this mechanism from having order > based reserve thresholds. e.g. IB could really do with a 64k reserve > pool threshold and hence help solve the long standing problems they > have with filling the receive ring in GFP_ATOMIC context... > > Sure, that's looking further down the track, but my point still > remains: we need a viable long term solution to this problem. Maybe > reservations are not the solution, but I don't see anyone else who > is thinking of how to address this architectural problem at a system > level right now. I think the idea is good! It will just be quite tricky to get there without causing more problems than those being solved. The biggest question mark so far seems to be the reservation size estimation. 
If it is hard for any caller to know the size beforehand (one that would be really close to the actually used size), then the whole complexity in the code sounds like overkill, and asking the administrator to tune min_free_kbytes seems a better fit (we would still have to teach the allocator to access reserves when really necessary) because the system would behave more predictably (although some memory would be wasted). > We need to design and document the model first, then review it, then > we can start working at the code level to implement the solution we've > designed. I have already asked James to add this to the LSF agenda but nothing has materialized on the schedule yet. I will poke him again. -- Michal Hocko SUSE Labs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 276+ messages in thread
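The administrator-tuned fallback Michal mentions is the existing `vm.min_free_kbytes` sysctl, which sizes the global watermarks. A minimal config sketch (the value is illustrative; raising it only grows the shared global reserve and does not add the per-caller accounting discussed in this thread):

```shell
# Inspect the current global memory reserve (in kB).
cat /proc/sys/vm/min_free_kbytes

# Raise it, e.g. to 64 MiB, so more memory stays free for allocations
# allowed below the watermarks; persist via /etc/sysctl.conf.
sysctl -w vm.min_free_kbytes=65536
```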
* Re: How to handle TIF_MEMDIE stalls?
From: Johannes Weiner @ 2015-03-02 16:05 UTC
To: Michal Hocko
Cc: Tetsuo Handa, dchinner, oleg, xfs, linux-mm, mgorman, rientjes, akpm, torvalds

On Mon, Mar 02, 2015 at 04:18:32PM +0100, Michal Hocko wrote:
> On Mon 23-02-15 11:45:21, Dave Chinner wrote:
> [...]
> > A reserve memory pool is no different - every time a memory reserve
> > occurs, a watermark is lifted to accommodate it, and the transaction
> > is not allowed to proceed until the amount of free memory exceeds
> > that watermark. The memory allocation subsystem then only allows
> > *allocations* marked correctly to allocate pages from the reserve
> > that watermark protects. e.g. only allocations using
> > __GFP_RESERVE are allowed to dip into the reserve pool.
>
> The idea is sound. But I am pretty sure we will find many corner
> cases. E.g. what if the mere reservation attempt causes the system
> to go OOM and trigger the OOM killer? Sure that wouldn't be too much
> different from the OOM triggered during the allocation but there is one
> major difference. Reservations need to be estimated and I expect the
> estimation would be on the more conservative side and so the OOM might
> not happen without them.

The whole idea is that filesystems request the reserves while they can still sleep for progress, or fail the macro-operation with -ENOMEM. And the estimate wouldn't just be on the conservative side, it would have to be the worst-case scenario. If we run out of reserves in an allocation that can not fail, that would be a bug that can lock up the machine. We can then fall back to the OOM killer in a last-ditch effort to make forward progress, but as the victim tasks can get stuck behind state/locks held by the allocation side, the machine might lock up after all.

> > By using watermarks, freeing of memory will automatically top
> > up the reserve pool which means that we guarantee that reclaimable
> > memory allocated for demand paging during transactions doesn't
> > deplete the reserve pool permanently. As a result, when there is
> > plenty of free and/or reclaimable memory, the reserve pool
> > watermarks will have almost zero impact on performance and
> > behaviour.
>
> Typical busy system won't be very far away from the high watermark
> so there would be a reclaim performed during increased watermarks
> (aka reservation) and that might lead to visible performance
> degradation. This might be acceptable but it also adds a certain level
> of unpredictability when performance characteristics might change
> suddenly.

There is usually a good deal of clean cache. As Dave pointed out before, clean cache can be considered re-allocatable from NOFS contexts, and so we'd only have to maintain this invariant:

	min_wmark + private_reserves < free_pages + clean_cache

> > Further, because it's just accounting and behavioural thresholds,
> > this allows the mm subsystem to control how the reserve pool is
> > accounted internally. e.g. clean, reclaimable pages in the page
> > cache could serve as reserve pool pages as they can be immediately
> > reclaimed for allocation.
>
> But they also can turn into hard/impossible to reclaim as well. Clean
> pages might get dirty and e.g. swap backed pages run out of their
> backing storage. So I guess we cannot count on those pages without
> reclaiming them first and hiding them into the reserve. Which is what
> you suggest below probably, but I wasn't really sure...

Pages reserved for use by the page cleaning path can't be considered dirtyable. They have to be included in the dirty_balance_reserve.
* Re: How to handle TIF_MEMDIE stalls?
From: Michal Hocko @ 2015-03-02 17:10 UTC
To: Johannes Weiner
Cc: Tetsuo Handa, dchinner, oleg, xfs, linux-mm, mgorman, rientjes, akpm, torvalds

On Mon 02-03-15 11:05:37, Johannes Weiner wrote:
> On Mon, Mar 02, 2015 at 04:18:32PM +0100, Michal Hocko wrote:
[...]
> > Typical busy system won't be very far away from the high watermark
> > so there would be a reclaim performed during increased watermarks
> > (aka reservation) and that might lead to visible performance
> > degradation. This might be acceptable but it also adds a certain level
> > of unpredictability when performance characteristics might change
> > suddenly.
>
> There is usually a good deal of clean cache. As Dave pointed out
> before, clean cache can be considered re-allocatable from NOFS
> contexts, and so we'd only have to maintain this invariant:
>
>	min_wmark + private_reserves < free_pages + clean_cache

Do I understand you correctly that we do not have to reclaim clean pages as per the above invariant?

If yes, how do you reflect overcommit on the clean_cache from multiple requestors (who are doing reservations)? My point was that if we keep clean pages on the LRU rather than forcing them to be reclaimed via increased watermarks, then it might happen that different callers with access to reserves wouldn't get the promised amount of reserved memory, because clean_cache is basically a shared resource.

[...]

-- 
Michal Hocko
SUSE Labs
* Re: How to handle TIF_MEMDIE stalls?
From: Johannes Weiner @ 2015-03-02 17:27 UTC
To: Michal Hocko
Cc: Tetsuo Handa, dchinner, oleg, xfs, linux-mm, mgorman, rientjes, akpm, torvalds

On Mon, Mar 02, 2015 at 06:10:58PM +0100, Michal Hocko wrote:
> On Mon 02-03-15 11:05:37, Johannes Weiner wrote:
> > On Mon, Mar 02, 2015 at 04:18:32PM +0100, Michal Hocko wrote:
> [...]
> > > Typical busy system won't be very far away from the high watermark
> > > so there would be a reclaim performed during increased watermarks
> > > (aka reservation) and that might lead to visible performance
> > > degradation. This might be acceptable but it also adds a certain level
> > > of unpredictability when performance characteristics might change
> > > suddenly.
> >
> > There is usually a good deal of clean cache. As Dave pointed out
> > before, clean cache can be considered re-allocatable from NOFS
> > contexts, and so we'd only have to maintain this invariant:
> >
> >	min_wmark + private_reserves < free_pages + clean_cache
>
> Do I understand you correctly that we do not have to reclaim clean pages
> as per the above invariant?
>
> If yes, how do you reflect overcommit on the clean_cache from multiple
> requestors (who are doing reservations)?
> My point was that if we keep clean pages on the LRU rather than forcing
> them to be reclaimed via increased watermarks, then it might happen that
> different callers with access to reserves wouldn't get the promised amount
> of reserved memory, because clean_cache is basically a shared resource.

The sum of all private reservations has to be accounted globally; we obviously can't overcommit the available resources in order to solve problems stemming from overcommitting the available resources. The page allocator can't hand out free pages and page reclaim cannot reclaim clean cache unless that invariant is met. They both have to consider them consumed.

It's the same as pre-allocation; the only thing we save is having to actually reclaim the pages and take them off the freelist at reservation time - which is a good optimization, since the filesystem might not actually need them all.
* Re: How to handle TIF_MEMDIE stalls?
From: Theodore Ts'o @ 2015-03-02 16:39 UTC
To: Michal Hocko
Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm, mgorman, rientjes, akpm, torvalds

On Mon, Mar 02, 2015 at 04:18:32PM +0100, Michal Hocko wrote:
> The idea is sound. But I am pretty sure we will find many corner
> cases. E.g. what if the mere reservation attempt causes the system
> to go OOM and trigger the OOM killer?

Doctor, doctor, it hurts when I do that....

So don't trigger the OOM killer. We can let the caller decide whether the reservation request should block or return ENOMEM, but the whole point of the reservation request idea is that this happens *before* we've taken any mutexes, so blocking won't prevent forward progress.

The file system could send down a different flag if we are doing writebacks for page cleaning purposes, in which case the reservation request would be a "just a heads up, we *will* be needing this much memory, but this is not something where we can block or return ENOMEM, so please give us the highest priority for using the free reserves".

> I think the idea is good! It will just be quite tricky to get there
> without causing more problems than those being solved. The biggest
> question mark so far seems to be the reservation size estimation. If
> it is hard for any caller to know the size beforehand (which would
> be really close to the actually used size) then the whole complexity
> in the code sounds like an overkill and asking administrator to tune
> min_free_kbytes seems a better fit (we would still have to teach the
> allocator to access reserves when really necessary) because the system
> would behave more predictably (although some memory would be wasted).

If we do need to teach the allocator to access reserves when really necessary, don't we have that already via GFP_NOIO/GFP_NOFS and GFP_NOFAIL?

If the goal is to do something more fine-grained, unfortunately at least for the short term we'll need to preserve the existing behaviour and issue warnings until the file system starts adding GFP_NOFAIL to those memory allocations where previously GFP_NOFS was effectively guaranteeing that failures would almost never happen. I know at least one place discovered with the recent change (and revert) where I'll be fixing ext4, but I suspect it won't be the only one, especially in the block device drivers.

- Ted
* Re: How to handle TIF_MEMDIE stalls?
From: Michal Hocko @ 2015-03-02 16:58 UTC
To: Theodore Ts'o
Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm, mgorman, rientjes, akpm, torvalds

On Mon 02-03-15 11:39:13, Theodore Ts'o wrote:
> On Mon, Mar 02, 2015 at 04:18:32PM +0100, Michal Hocko wrote:
> > The idea is sound. But I am pretty sure we will find many corner
> > cases. E.g. what if the mere reservation attempt causes the system
> > to go OOM and trigger the OOM killer?
>
> Doctor, doctor, it hurts when I do that....
>
> So don't trigger the OOM killer. We can let the caller decide whether
> the reservation request should block or return ENOMEM, but the whole
> point of the reservation request idea is that this happens *before*
> we've taken any mutexes, so blocking won't prevent forward progress.

Maybe I wasn't clear. I wasn't concerned about the context which is doing the reservation. I was more concerned about all the other allocation requests which might fail now (because they do not have access to the reserves). So you think that we should simply disable the OOM killer while there is any reservation active? Wouldn't that be even more fragile when something goes terribly wrong?

> The file system could send down a different flag if we are doing
> writebacks for page cleaning purposes, in which case the reservation
> request would be a "just a heads up, we *will* be needing this much
> memory, but this is not something where we can block or return ENOMEM,
> so please give us the highest priority for using the free reserves".

Sure, that thing is clear.

> > I think the idea is good! It will just be quite tricky to get there
> > without causing more problems than those being solved. The biggest
> > question mark so far seems to be the reservation size estimation. If
> > it is hard for any caller to know the size beforehand (which would
> > be really close to the actually used size) then the whole complexity
> > in the code sounds like an overkill and asking administrator to tune
> > min_free_kbytes seems a better fit (we would still have to teach the
> > allocator to access reserves when really necessary) because the system
> > would behave more predictably (although some memory would be wasted).
>
> If we do need to teach the allocator to access reserves when really
> necessary, don't we have that already via GFP_NOIO/GFP_NOFS and
> GFP_NOFAIL?

GFP_NOFAIL doesn't sound like the best fit. Not all NOFAIL callers need to access reserves - e.g. if they are not blocking anybody from making progress.

> If the goal is to do something more fine-grained,
> unfortunately at least for the short term we'll need to preserve the
> existing behaviour and issue warnings until the file system starts
> adding GFP_NOFAIL to those memory allocations where previously
> GFP_NOFS was effectively guaranteeing that failures would almost
> never happen.

GFP_NOFS not failing is even worse than GFP_KERNEL not failing, because the first one has only very limited ways to perform reclaim. It basically relies on somebody else to make progress, and that is definitely a bad model.

> I know at least one place discovered with the recent change (and revert)
> where I'll be fixing ext4, but I suspect it won't be the only one,
> especially in the block device drivers.
>
> - Ted

-- 
Michal Hocko
SUSE Labs
* Re: How to handle TIF_MEMDIE stalls? 2015-03-02 16:58 ` Michal Hocko @ 2015-03-04 12:52 ` Dave Chinner -1 siblings, 0 replies; 276+ messages in thread From: Dave Chinner @ 2015-03-04 12:52 UTC (permalink / raw) To: Michal Hocko Cc: Theodore Ts'o, Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm, mgorman, rientjes, akpm, torvalds On Mon, Mar 02, 2015 at 05:58:23PM +0100, Michal Hocko wrote: > On Mon 02-03-15 11:39:13, Theodore Ts'o wrote: > > On Mon, Mar 02, 2015 at 04:18:32PM +0100, Michal Hocko wrote: > > > The idea is sound. But I am pretty sure we will find many corner > > > cases. E.g. what if the mere reservation attempt causes the system > > > to go OOM and trigger the OOM killer? > > > > Doctor, doctor, it hurts when I do that.... > > > > So don't trigger the OOM killer. We can let the caller decide whether > > the reservation request should block or return ENOMEM, but the whole > > point of the reservation request idea is that this happens *before* > > we've taken any mutexes, so blocking won't prevent forward progress. > > Maybe I wasn't clear. I wasn't concerned about the context which > is doing to reservation. I was more concerned about all the other > allocation requests which might fail now (becasuse they do not have > access to the reserves). So you think that we should simply disable OOM > killer while there is any reservation active? Wouldn't that be even more > fragile when something goes terribly wrong? That's a silly strawman. Why wouldn't you simply block them until the reserves are released when the transaction completes and the unused memory goes back to the free pool? Let me try another tack. My qualifications are as a distributed control system engineer, not a computer scientist. I see everything as a system of interconnected feedback loops: an operating system is nothing but a set of very complex, tightly interconnected control systems. Precedence? 
IO-less dirty throttling - that came about after I'd been advocating a control theory based algorithm for several years to solve the breakdown problems of dirty page throttling. We look at the code Fengguang Wu wrote as one of the major success stories of Linux - the writeback code just works and nobody ever has to tune it anymore.

I see the problem of direct memory reclaim as being very similar to the problems the old IO based write throttling had: it has unbounded concurrency, severe unfairness and breaks down badly when heavily loaded. As a control system, it has the same terrible design as the IO-based write throttling had.

There are many other similarities, too. Allocation can only take place at the rate at which reclaim occurs, and we only have a limited budget of allocatable pages. This is the same as the dirty page throttling - dirtying pages is limited to the rate we can clean pages, and there is a limited budget of dirty pages in the system.

Reclaiming pages is also done most efficiently by a single thread per zone where lots of internal context can be kept (kswapd). This is similar to how optimal writeback of dirty pages requires a single thread with internal context per block device.

Waiting for free pages to arrive can be done by an ordered queuing system, and we can account for the number of pages each allocation requires in the queuing system and hence only need wake the number of waiters that will consume the memory just freed. Just like we do with the dirty page throttling queue.

As such, the same solutions could be applied. As the allocation demand exceeds the supply of free pages, we throttle allocation by sleeping on an ordered queue and only waking waiters at the rate at which kswapd reclaim can free pages. It's trivial to account accurately, and the feedback loop is relatively simple, too. We can also easily maintain a reserve of free pages this way, usable only by allocations marked with special flags.
The reserve threshold can be dynamic, and tasks that request it to change can be blocked until the reserve has been built up to meet caller requirements. Allocations that are allowed to dip into the reserve may do so rather than being added to the queue that waits for reclaim. Reclaim would always fill the reserve back up to its limits first, and tasks that have reservations can release them gradually as they mark them as consumed by the reservation context (e.g. when a filesystem joins an object to a transaction and modifies it), thereby reducing the reserve that task has available as it progresses.

So, there's yet another possible solution to the allocation reservation problem, and one that solves several other problems that are being described as making reservation pools difficult or even impossible to implement.

Seriously, I'm not expecting this problem to be solved tomorrow; what I want is reliable, deterministic memory allocation behaviour from the mm subsystem. I want people to be thinking about how to achieve that rather than limiting their solutions by what we have now and can hack into the current code, because otherwise we'll never end up with a reliable memory allocation reservation system....

Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-17 12:23 ` Tetsuo Handa 2015-02-17 12:53 ` Johannes Weiner @ 2015-02-17 14:59 ` Michal Hocko 1 sibling, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2015-02-17 14:59 UTC (permalink / raw)
To: Tetsuo Handa
Cc: hannes, david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, torvalds

On Tue 17-02-15 21:23:26, Tetsuo Handa wrote:
[...]
> > Why do you omit out_of_memory() call for GFP_NOIO / GFP_NOFS allocations?

Because they cannot perform any IO/FS transactions and that would lead to premature OOM conditions way too easily. The OOM killer is a _last resort_ reclaim opportunity, not something that would happen just because you happen to be unable to flush dirty pages.

> I can see "possible memory allocation deadlock in %s (mode:0x%x)" warnings
> at kmem_alloc() in fs/xfs/kmem.c .
> I think commit 9879de7373fcfb46 "mm:
> page_alloc: embed OOM killing naturally into allocation slowpath" introduced
> a regression and below one is the fix.
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2381,9 +2381,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	/* The OOM killer does not needlessly kill tasks for lowmem */
>  	if (high_zoneidx < ZONE_NORMAL)
>  		goto out;
> -	/* The OOM killer does not compensate for light reclaim */
> -	if (!(gfp_mask & __GFP_FS))
> -		goto out;
>  	/*
>  	 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
>  	 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.

So NAK to this.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-10 15:19 ` Johannes Weiner 2015-02-11 2:23 ` Tetsuo Handa @ 2015-02-17 14:50 ` Michal Hocko 1 sibling, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2015-02-17 14:50 UTC (permalink / raw)
To: Johannes Weiner
Cc: Tetsuo Handa, david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, torvalds

On Tue 10-02-15 10:19:34, Johannes Weiner wrote:
[...]
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8e20f9c2fa5a..f77c58ebbcfa 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	if (high_zoneidx < ZONE_NORMAL)
>  		goto out;
>  	/* The OOM killer does not compensate for light reclaim */
> -	if (!(gfp_mask & __GFP_FS))
> +	if (!(gfp_mask & __GFP_FS)) {
> +		/*
> +		 * XXX: Page reclaim didn't yield anything,
> +		 * and the OOM killer can't be invoked, but
> +		 * keep looping as per should_alloc_retry().
> +		 */
> +		*did_some_progress = 1;
>  		goto out;
> +	}
>  	/*
>  	 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
>  	 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.

Although the side effect of 9879de7373fc (mm: page_alloc: embed OOM killing naturally into allocation slowpath) is subtle and it would be much better if it was documented in the changelog (I have missed that too during review, otherwise I would have asked for it), I do not think this is a change in a good direction. Hopelessly retrying at the time when the reclaim didn't help and OOM is not available is simply a bad(tm) choice. Besides that, __GFP_WAIT callers should be prepared for the allocation failure and should better cope with it. So no, I really hate something like the above.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-09 11:44 ` Tetsuo Handa 2015-02-10 13:58 ` Tetsuo Handa @ 2015-02-17 14:37 ` Michal Hocko 2015-02-17 14:44 ` Michal Hocko 1 sibling, 1 reply; 276+ messages in thread From: Michal Hocko @ 2015-02-17 14:37 UTC (permalink / raw) To: Tetsuo Handa Cc: david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, hannes, torvalds On Mon 09-02-15 20:44:16, Tetsuo Handa wrote: > Hello. > > Today I tested Linux 3.19 and noticed unexpected behavior (A) (B) > shown below. > > (A) The order-0 __GFP_WAIT allocation fails immediately upon OOM condition > despite we didn't remove the > > /* > * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER > * means __GFP_NOFAIL, but that may not be true in other > * implementations. > */ > if (order <= PAGE_ALLOC_COSTLY_ORDER) > return 1; > > check in should_alloc_retry(). Is this what you expected? The code before 9879de7373fc (mm: page_alloc: embed OOM killing naturally into allocation slowpath) was looping on this kind of allocation even though GFP_NOFS didn't trigger OOM killer. This change was not intentional I guess but it makes sense on its own. We shouldn't simply loop in a hope that something happens and we finally make a progress. Failing __GFP_WAIT allocation is perfectly fine IMO. Why do you think this is a problem? Btw. this has nothing to do with TIF_MEMDIE and it would be much better to discuss it in a separate thread... > (B) When coredump to pipe is configured, the system stalls under OOM > condition due to memory allocation by coredump's reader side. > How should we handle this "expected to terminate shortly but unable > to terminate due to invisible dependency" case? What approaches > other than applying timeout on coredump's writer side are possible? > (Running inside memory cgroup is not an answer which I want.) This is really nasty and we have discussed that with Oleg some time ago. We have SIGNAL_GROUP_COREDUMP which prevents the OOM killer from selecting the task. 
The issue seems to be that the OOM killer might inherently race with setting the flag. I have no idea what to do about this, unfortunately. Oleg?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-17 14:37 ` Michal Hocko @ 2015-02-17 14:44 ` Michal Hocko 0 siblings, 0 replies; 276+ messages in thread From: Michal Hocko @ 2015-02-17 14:44 UTC (permalink / raw) To: Tetsuo Handa Cc: david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, hannes, torvalds Ups, sorry I have missed the follow up emails in this thread. My filters got crazy and the rest got sorted into a different mailbox. Reading the rest now... On Tue 17-02-15 15:37:20, Michal Hocko wrote: > On Mon 09-02-15 20:44:16, Tetsuo Handa wrote: > > Hello. > > > > Today I tested Linux 3.19 and noticed unexpected behavior (A) (B) > > shown below. > > > > (A) The order-0 __GFP_WAIT allocation fails immediately upon OOM condition > > despite we didn't remove the > > > > /* > > * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER > > * means __GFP_NOFAIL, but that may not be true in other > > * implementations. > > */ > > if (order <= PAGE_ALLOC_COSTLY_ORDER) > > return 1; > > > > check in should_alloc_retry(). Is this what you expected? > > The code before 9879de7373fc (mm: page_alloc: embed OOM killing > naturally into allocation slowpath) was looping on this kind of > allocation even though GFP_NOFS didn't trigger OOM killer. This change > was not intentional I guess but it makes sense on its own. We shouldn't > simply loop in a hope that something happens and we finally make a > progress. > > Failing __GFP_WAIT allocation is perfectly fine IMO. Why do you think > this is a problem? > > Btw. this has nothing to do with TIF_MEMDIE and it would be much better > to discuss it in a separate thread... > > > (B) When coredump to pipe is configured, the system stalls under OOM > > condition due to memory allocation by coredump's reader side. > > How should we handle this "expected to terminate shortly but unable > > to terminate due to invisible dependency" case? What approaches > > other than applying timeout on coredump's writer side are possible? 
> > (Running inside memory cgroup is not an answer which I want.)
>
> This is really nasty and we have discussed that with Oleg some time
> ago. We have SIGNAL_GROUP_COREDUMP which prevents the OOM killer
> from selecting the task. The issue seems to be that OOM killer might
> inherently race with setting the flag. I have no idea what to do about
> this, unfortunately.
> Oleg?
> --
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2014-12-30 11:21 ` Michal Hocko 2014-12-30 13:33 ` Tetsuo Handa 2015-02-09 11:44 ` Tetsuo Handa @ 2015-02-16 11:23 ` Tetsuo Handa 2015-02-16 15:42 ` Johannes Weiner 2015-02-17 16:33 ` Michal Hocko 2 siblings, 2 replies; 276+ messages in thread From: Tetsuo Handa @ 2015-02-16 11:23 UTC (permalink / raw) To: mhocko Cc: david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, hannes, torvalds Michal Hocko wrote: > > but I think we need to be prepared for cases where sending SIGKILL to > > all threads sharing the same memory does not help. > > Sure, unkillable tasks are a problem which we have to handle. Having > GFP_KERNEL allocations looping without way out contributes to this which > is sad but your current data just show that sometimes it might take ages > to finish even without that going on. Hello. Can we resume TIF_MEMDIE stall discussion? I'd like to propose (1) Make several locks killable. (2) Implement TIF_MEMDIE timeout. (3) Replace kmalloc() with kmalloc_nofail() and kmalloc_noretry(). for handling TIF_MEMDIE stall problems. (1) Make several locks killable. On Linux 3.19, running below command line as an unprivileged user on a system with 4 CPUs / 2GB RAM / no swap can make the system unusable. $ for i in `seq 1 100`; do dd if=/dev/zero of=/tmp/file bs=104857600 count=100 & done ---------- An example with ext4 partition ---------- (...snipped...) [ 369.902616] dd D ffff88007fc12d00 0 9113 6418 0x00000080 [ 369.904867] ffff88007b460890 0000000000012d00 ffff88007b28ffd8 0000000000012d00 [ 369.907254] ffff88007b460890 ffff88007fc12d80 ffff88007a6eb360 0000000000000001 [ 369.909855] ffffffff810946cb 00000000000025f6 ffffffff8108ef1d 0000000000000000 [ 369.912054] Call Trace: [ 369.913175] [<ffffffff810946cb>] ? put_prev_entity+0x5b/0x2c0 [ 369.914960] [<ffffffff8108ef1d>] ? pick_next_entity+0x9d/0x170 [ 369.916778] [<ffffffff8109157e>] ? set_next_entity+0x4e/0x60 [ 369.918634] [<ffffffff81097953>] ? 
pick_next_task_fair+0x453/0x520 [ 369.920530] [<ffffffff8100c6e0>] ? __switch_to+0x240/0x570 [ 369.922263] [<ffffffff815799f9>] ? schedule_preempt_disabled+0x9/0x10 [ 369.924161] [<ffffffff8157af25>] ? __mutex_lock_slowpath+0xb5/0x120 [ 369.926106] [<ffffffff8157afa6>] ? mutex_lock+0x16/0x25 [ 369.927800] [<ffffffffa01f3acc>] ? ext4_file_write_iter+0x7c/0x3a0 [ext4] [ 369.929778] [<ffffffff81280fbc>] ? __clear_user+0x1c/0x40 [ 369.931491] [<ffffffff8112c876>] ? iov_iter_zero+0x66/0x2d0 [ 369.933235] [<ffffffff811732a3>] ? new_sync_write+0x83/0xd0 [ 369.934977] [<ffffffff8117397d>] ? vfs_write+0xad/0x1f0 [ 369.936703] [<ffffffff8101b57b>] ? syscall_trace_enter_phase1+0x19b/0x1b0 [ 369.938674] [<ffffffff8117459d>] ? SyS_write+0x4d/0xc0 [ 369.940336] [<ffffffff8157d329>] ? system_call_fastpath+0x12/0x17 (...snipped...) [ 498.421741] SysRq : Manual OOM execution [ 498.423627] kworker/3:3 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0 (...snipped...) [ 498.952807] Out of memory: Kill process 9113 (dd) score 57 or sacrifice child [ 498.954450] Killed process 9113 (dd) total-vm:210340kB, anon-rss:102500kB, file-rss:0kB (...snipped...) [ 502.068921] SysRq : Manual OOM execution [ 502.070825] kworker/3:3 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0 (...snipped...) [ 502.618222] Out of memory: Kill process 9113 (dd) score 57 or sacrifice child [ 502.620016] Killed process 9113 (dd) total-vm:210340kB, anon-rss:102500kB, file-rss:0kB (...snipped...) [ 503.900554] SysRq : Manual OOM execution [ 503.902387] kworker/3:3 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0 (...snipped...) [ 504.410444] Out of memory: Kill process 9113 (dd) score 57 or sacrifice child [ 504.412221] Killed process 9113 (dd) total-vm:210340kB, anon-rss:102500kB, file-rss:0kB (...snipped...) ---------- An example with ext4 partition ---------- ---------- An example with xfs partition ---------- (...snipped...) 
[ 127.135041] Out of memory: Kill process 2505 (dd) score 59 or sacrifice child [ 127.136460] Killed process 2505 (dd) total-vm:210340kB, anon-rss:102464kB, file-rss:1728kB (...snipped...) [ 243.672302] dd D ffff88005bd27cb8 12776 2505 2386 0x00100084 [ 243.674066] ffff88005bd27cb8 ffff88005bd27c98 ffff88007850c740 0000000000014080 [ 243.676005] 0000000000000000 ffff88005bd27fd8 0000000000014080 ffff88005835d740 [ 243.677916] ffff88007850c740 0000000000000014 ffff8800669bee50 ffff8800669bee54 [ 243.679823] Call Trace: [ 243.680478] [<ffffffff816b2799>] schedule_preempt_disabled+0x29/0x70 [ 243.682047] [<ffffffff816b43d5>] __mutex_lock_slowpath+0x95/0x100 [ 243.683548] [<ffffffff816b83e8>] ? page_fault+0x28/0x30 [ 243.684875] [<ffffffff816b4463>] mutex_lock+0x23/0x37 [ 243.686146] [<ffffffff8129df6c>] xfs_file_buffered_aio_write+0x6c/0x240 [ 243.687791] [<ffffffff813497b5>] ? __clear_user+0x25/0x50 [ 243.689121] [<ffffffff8117294d>] ? iov_iter_zero+0x6d/0x2e0 [ 243.690511] [<ffffffff8129e1b8>] xfs_file_write_iter+0x78/0x110 [ 243.691990] [<ffffffff811beb31>] new_sync_write+0x81/0xb0 [ 243.693329] [<ffffffff811bf2a7>] vfs_write+0xb7/0x1f0 [ 243.694581] [<ffffffff811bfeb6>] SyS_write+0x46/0xb0 [ 243.695834] [<ffffffff81109196>] ? __audit_syscall_exit+0x236/0x2e0 [ 243.697376] [<ffffffff816b64a9>] system_call_fastpath+0x12/0x17 (...snipped...) [ 291.433296] dd D ffff88005bd27cb8 12776 2505 2386 0x00100084 [ 291.433297] ffff88005bd27cb8 ffff88005bd27c98 ffff88007850c740 0000000000014080 [ 291.433298] 0000000000000000 ffff88005bd27fd8 0000000000014080 ffff88005835d740 [ 291.433298] ffff88007850c740 0000000000000014 ffff8800669bee50 ffff8800669bee54 [ 291.433299] Call Trace: [ 291.433300] [<ffffffff816b2799>] schedule_preempt_disabled+0x29/0x70 [ 291.433301] [<ffffffff816b43d5>] __mutex_lock_slowpath+0x95/0x100 [ 291.433302] [<ffffffff816b83e8>] ? 
page_fault+0x28/0x30
[ 291.433303] [<ffffffff816b4463>] mutex_lock+0x23/0x37
[ 291.433304] [<ffffffff8129df6c>] xfs_file_buffered_aio_write+0x6c/0x240
[ 291.433306] [<ffffffff813497b5>] ? __clear_user+0x25/0x50
[ 291.433307] [<ffffffff8117294d>] ? iov_iter_zero+0x6d/0x2e0
[ 291.433308] [<ffffffff8129e1b8>] xfs_file_write_iter+0x78/0x110
[ 291.433309] [<ffffffff811beb31>] new_sync_write+0x81/0xb0
[ 291.433311] [<ffffffff811bf2a7>] vfs_write+0xb7/0x1f0
[ 291.433312] [<ffffffff811bfeb6>] SyS_write+0x46/0xb0
[ 291.433313] [<ffffffff81109196>] ? __audit_syscall_exit+0x236/0x2e0
[ 291.433314] [<ffffffff816b64a9>] system_call_fastpath+0x12/0x17
(...snipped...)
---------- An example with xfs partition ----------

This is because the OOM killer happily tries to kill a process which is blocked at an unkillable mutex_lock(). If the locks shown above were killable, we could reduce the possibility of getting stuck. I didn't check whether it had truly livelocked or not, but waiting this long is not acceptable either. And why does every thread trying to allocate memory have to repeat the loop, which might defer somebody who could make progress if given CPU time? I wish only somebody like kswapd repeated the loop on behalf of all threads waiting at the memory allocation slowpath...

(2) Implement TIF_MEMDIE timeout.

While the command line shown above is an artificial stresstest, I'm seeing troubles on real KVM systems where the guests hang entirely with many processes being blocked at jbd2_journal_commit_transaction() or jbd2_journal_get_write_access(). The root cause of the guest's stall is not yet identified but is at least independent of TIF_MEMDIE. However, cron jobs which are blocked at those functions after the I/O stall begins exhaust all of the system's memory and make the situation worse (e.g. the load average exceeded 7000 on a guest with 2 CPUs as of the occurrence of the OOM killer livelock). Unkillable locks in non-critical paths can be replaced with killable locks.
But there are critical paths where failing on SIGKILL can lead to unwanted results (e.g. the filesystem's error action, such as remounting read-only or calling panic(), being taken), there are locks (e.g. the rw_semaphore used by mmap_sem) where no killable version exists, and there are wait_for_completion() calls where a killable version is not worth the code complication. If a TIF_MEMDIE timeout were implemented, we could cope with the OOM killer livelock problem by choosing more OOM victims (for a survival strategy) or calling panic() (for a debug-and-reboot strategy).

(3) Replace kmalloc() with kmalloc_nofail() and kmalloc_noretry().

Currently small allocations are implicitly treated like __GFP_NOFAIL unless TIF_MEMDIE is set. But silently changing small allocations to behave like __GFP_NORETRY would cause obscure bugs. If the TIF_MEMDIE timeout is implemented, we will no longer need to worry about unkillable tasks retrying forever at memory allocation; instead we kill more OOM victims and satisfy the request. Therefore, we could introduce kmalloc_nofail(size, gfp) which does kmalloc(size, gfp | __GFP_NOFAIL) (i.e. invokes the OOM killer) and kmalloc_noretry(size, gfp) which does kmalloc(size, gfp | __GFP_NORETRY) (i.e. does not invoke the OOM killer), and switch from kmalloc() to either kmalloc_noretry() or kmalloc_nofail(). Those doing allocations smaller than PAGE_SIZE would wish to switch from kmalloc() to kmalloc_nofail() and eliminate untested memory allocation failure paths. Those who are well prepared for memory allocation failures would wish to switch from kmalloc() to kmalloc_noretry(). Eventually, kmalloc(), which implicitly treats small allocations like __GFP_NOFAIL and invokes the OOM killer, would be abolished.

^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-16 11:23 ` Tetsuo Handa @ 2015-02-16 15:42 ` Johannes Weiner 2015-02-17 11:57 ` Tetsuo Handa 2015-02-17 16:33 ` Michal Hocko 1 sibling, 1 reply; 276+ messages in thread From: Johannes Weiner @ 2015-02-16 15:42 UTC (permalink / raw) To: Tetsuo Handa Cc: mhocko, david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, torvalds On Mon, Feb 16, 2015 at 08:23:16PM +0900, Tetsuo Handa wrote: > (2) Implement TIF_MEMDIE timeout. How about something like this? This should solve the deadlock problem in the page allocator, but it would also simplify the memcg OOM killer and allow its use by in-kernel faults again. -- ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-16 15:42 ` Johannes Weiner @ 2015-02-17 11:57 ` Tetsuo Handa 2015-02-17 13:16 ` Johannes Weiner 2015-02-23 22:08 ` David Rientjes 0 siblings, 2 replies; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-17 11:57 UTC (permalink / raw)
To: hannes
Cc: mhocko, david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, torvalds

Johannes Weiner wrote:
> On Mon, Feb 16, 2015 at 08:23:16PM +0900, Tetsuo Handa wrote:
> > (2) Implement TIF_MEMDIE timeout.
>
> How about something like this? This should solve the deadlock problem
> in the page allocator, but it would also simplify the memcg OOM killer
> and allow its use by in-kernel faults again.

Yes, the basic idea would be the same as http://marc.info/?l=linux-mm&m=142002495532320&w=2 .

But Michal and David do not like the timeout approach.
http://marc.info/?l=linux-mm&m=141684783713564&w=2
http://marc.info/?l=linux-mm&m=141686814824684&w=2

Unless they change their opinion in response to the discovery explained at http://lwn.net/Articles/627419/ , timeout patches will not be accepted.

^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-17 11:57 ` Tetsuo Handa @ 2015-02-17 13:16 ` Johannes Weiner 2015-02-17 16:50 ` Michal Hocko 2015-02-23 22:08 ` David Rientjes 1 sibling, 1 reply; 276+ messages in thread From: Johannes Weiner @ 2015-02-17 13:16 UTC (permalink / raw) To: Tetsuo Handa Cc: mhocko, david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, torvalds On Tue, Feb 17, 2015 at 08:57:05PM +0900, Tetsuo Handa wrote: > Johannes Weiner wrote: > > On Mon, Feb 16, 2015 at 08:23:16PM +0900, Tetsuo Handa wrote: > > > (2) Implement TIF_MEMDIE timeout. > > > > How about something like this? This should solve the deadlock problem > > in the page allocator, but it would also simplify the memcg OOM killer > > and allow its use by in-kernel faults again. > > Yes, basic idea would be same with > http://marc.info/?l=linux-mm&m=142002495532320&w=2 . > > But Michal and David do not like the timeout approach. > http://marc.info/?l=linux-mm&m=141684783713564&w=2 > http://marc.info/?l=linux-mm&m=141686814824684&w=2 I'm open to suggestions, but we can't just stick our heads in the sand and pretend that these are just unrelated bugs. They're not. As long as it's legal to enter the allocator with *anything* that can prevent another random task in the system from making progress, we have this deadlock potential. One side has to give up, and it can't be the page allocator because it has to support __GFP_NOFAIL allocations, which are usually exactly the allocations that are buried in hard-to-unwind state that is likely to trip up exiting OOM victims. The alternative would be lock dependency tracking, but I'm not sure it can be realistically done for production environments. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-17 13:16 ` Johannes Weiner @ 2015-02-17 16:50 ` Michal Hocko 2015-02-17 23:25 ` Dave Chinner 0 siblings, 1 reply; 276+ messages in thread From: Michal Hocko @ 2015-02-17 16:50 UTC (permalink / raw) To: Johannes Weiner Cc: Tetsuo Handa, david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, torvalds On Tue 17-02-15 08:16:18, Johannes Weiner wrote: > On Tue, Feb 17, 2015 at 08:57:05PM +0900, Tetsuo Handa wrote: > > Johannes Weiner wrote: > > > On Mon, Feb 16, 2015 at 08:23:16PM +0900, Tetsuo Handa wrote: > > > > (2) Implement TIF_MEMDIE timeout. > > > > > > How about something like this? This should solve the deadlock problem > > > in the page allocator, but it would also simplify the memcg OOM killer > > > and allow its use by in-kernel faults again. > > > > Yes, basic idea would be same with > > http://marc.info/?l=linux-mm&m=142002495532320&w=2 . > > > > But Michal and David do not like the timeout approach. > > http://marc.info/?l=linux-mm&m=141684783713564&w=2 > > http://marc.info/?l=linux-mm&m=141686814824684&w=2 Yes I really hate time based solutions for reasons already explained in the referenced links. > I'm open to suggestions, but we can't just stick our heads in the sand > and pretend that these are just unrelated bugs. They're not. Requesting GFP_NOFAIL allocation with locks held is IMHO a bug and should be fixed. Hopelessly looping in the page allocator without GFP_NOFAIL is too risky as well and we should get rid of this. Why should we still try to loop when previous 1000 attempts failed with OOM killer invocation? Can we simply fail after a configurable number of attempts? This is prone to reveal unchecked allocation failures but those are bugs as well and we shouldn't pretend otherwise. > As long > as it's legal to enter the allocator with *anything* that can prevent > another random task in the system from making progress, we have this > deadlock potential. 
One side has to give up, and it can't be the page > allocator because it has to support __GFP_NOFAIL allocations, which > are usually exactly the allocations that are buried in hard-to-unwind > state that is likely to trip up exiting OOM victims. I am not convinced that GFP_NOFAIL is the biggest problem. Most of the OOM livelocks I have seen were either due to GFP_KERNEL treated as GFP_NOFAIL or an incorrect gfp mask (e.g. GFP_FS added where not appropriate). I think we should focus on this part before we start adding heuristics into the OOM killer. > The alternative would be lock dependency tracking, but I'm not sure it > can be realistically done for production environments. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-17 16:50 ` Michal Hocko @ 2015-02-17 23:25 ` Dave Chinner 2015-02-18 8:48 ` Michal Hocko 0 siblings, 1 reply; 276+ messages in thread From: Dave Chinner @ 2015-02-17 23:25 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, Tetsuo Handa, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, torvalds On Tue, Feb 17, 2015 at 05:50:24PM +0100, Michal Hocko wrote: > On Tue 17-02-15 08:16:18, Johannes Weiner wrote: > > On Tue, Feb 17, 2015 at 08:57:05PM +0900, Tetsuo Handa wrote: > > > Johannes Weiner wrote: > > > > On Mon, Feb 16, 2015 at 08:23:16PM +0900, Tetsuo Handa wrote: > > > > > (2) Implement TIF_MEMDIE timeout. > > > > > > > > How about something like this? This should solve the deadlock problem > > > > in the page allocator, but it would also simplify the memcg OOM killer > > > > and allow its use by in-kernel faults again. > > > > > > Yes, basic idea would be same with > > > http://marc.info/?l=linux-mm&m=142002495532320&w=2 . > > > > > > But Michal and David do not like the timeout approach. > > > http://marc.info/?l=linux-mm&m=141684783713564&w=2 > > > http://marc.info/?l=linux-mm&m=141686814824684&w=2 > > Yes I really hate time based solutions for reasons already explained in > the referenced links. > > > I'm open to suggestions, but we can't just stick our heads in the sand > > and pretend that these are just unrelated bugs. They're not. > > Requesting GFP_NOFAIL allocation with locks held is IMHO a bug and > should be fixed. That's rather naive. Filesystems do demand paging of metadata within transactions, which means we are guaranteed to be holding locks when doing memory allocation. Indeed, this is what the GFP_NOFS allocation context is supposed to convey - we currently *hold locks* and so reclaim needs to be careful about recursion. I'll also argue that it means the OOM killer cannot kill the process attempting memory allocation for the same reason. 
We are also guaranteed to be in a state where memory allocation failure *cannot be tolerated* because failure to complete the modification leaves the filesystem in a "corrupt in memory" state. We don't use GFP_NOFAIL because it's deprecated, but the reality is that we need to ensure memory allocation eventually succeeds because we *cannot go backwards*. The choice is simple: memory allocation fails, we shut down the filesystem and guarantee that we DOS the entire machine because the filesystems have gone AWOL; or we keep trying memory allocation until it succeeds. So, memory allocation generally succeeds eventually, so we have these loops around kmalloc(), kmem_cache_alloc() and alloc_page() that ensure allocation succeeds. Those loops also guarantee we get warnings when allocation is repeatedly failing and we might have actually hit a OOM deadlock situation. > Hopelessly looping in the page allocator without GFP_NOFAIL is too risky > as well and we should get rid of this. Yet the exact situation we need GFP_NOFAIL is the situation that you are calling a bug. > Why should we still try to loop > when previous 1000 attempts failed with OOM killer invocation? Can we > simply fail after a configurable number of attempts? OTOH, why should the memory allocator care what failure policy the callers have? > This is prone to > reveal unchecked allocation failures but those are bugs as well and we > shouldn't pretend otherwise. > > > As long > > as it's legal to enter the allocator with *anything* that can prevent > > another random task in the system from making progress, we have this > > deadlock potential. One side has to give up, and it can't be the page > > allocator because it has to support __GFP_NOFAIL allocations, which > > are usually exactly the allocations that are buried in hard-to-unwind > > state that is likely to trip up exiting OOM victims. > > I am not convinced that GFP_NOFAIL is the biggest problem. 
Most of the > OOM livelocks I have seen were either due to GFP_KERNEL treated as > GFP_NOFAIL or an incorrect gfp mask (e.g. GFP_FS added where not > appropriate). I think we should focus on this part before we start > adding heuristics into the OOM killer. Having the OOM killer being able to kill the process that triggered it would be a good start. More often than not, that is the process that needs killing, and the oom killer implementation currently cannot do anything about that process. Make the OOM killer only be invoked by kswapd or some other independent kernel thread so that it is independent of the allocation context that needs to invoke it, and have the invoker wait to be told what to do. That way it can kill the invoking process if that's the one that needs to be killed, and then all "can't kill processes because the invoker holds locks they depend on" go away. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-17 23:25 ` Dave Chinner @ 2015-02-18 8:48 ` Michal Hocko 2015-02-18 11:23 ` Tetsuo Handa 0 siblings, 1 reply; 276+ messages in thread From: Michal Hocko @ 2015-02-18 8:48 UTC (permalink / raw) To: Dave Chinner Cc: Johannes Weiner, Tetsuo Handa, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, torvalds On Wed 18-02-15 10:25:52, Dave Chinner wrote: > On Tue, Feb 17, 2015 at 05:50:24PM +0100, Michal Hocko wrote: > > On Tue 17-02-15 08:16:18, Johannes Weiner wrote: > > > On Tue, Feb 17, 2015 at 08:57:05PM +0900, Tetsuo Handa wrote: > > > > Johannes Weiner wrote: > > > > > On Mon, Feb 16, 2015 at 08:23:16PM +0900, Tetsuo Handa wrote: > > > > > > (2) Implement TIF_MEMDIE timeout. > > > > > > > > > > How about something like this? This should solve the deadlock problem > > > > > in the page allocator, but it would also simplify the memcg OOM killer > > > > > and allow its use by in-kernel faults again. > > > > > > > > Yes, basic idea would be same with > > > > http://marc.info/?l=linux-mm&m=142002495532320&w=2 . > > > > > > > > But Michal and David do not like the timeout approach. > > > > http://marc.info/?l=linux-mm&m=141684783713564&w=2 > > > > http://marc.info/?l=linux-mm&m=141686814824684&w=2 > > > > Yes I really hate time based solutions for reasons already explained in > > the referenced links. > > > > > I'm open to suggestions, but we can't just stick our heads in the sand > > > and pretend that these are just unrelated bugs. They're not. > > > > Requesting GFP_NOFAIL allocation with locks held is IMHO a bug and > > should be fixed. > > That's rather naive. > > Filesystems do demand paging of metadata within transactions, which > means we are guaranteed to be holding locks when doing memory > allocation. Indeed, this is what the GFP_NOFS allocation context is > supposed to convey - we currently *hold locks* and so reclaim needs > to be careful about recursion. 
I'll also argue that it means the OOM > killer cannot kill the process attempting memory allocation for the > same reason. I am not sure I understand. Do you mean that the OOM killer should attempt to select a victim which is doing GFP_NOFS allocation, or an allocation in general? > We are also guaranteed to be in a state where memory allocation > failure *cannot be tolerated* because failure to complete the > modification leaves the filesystem in a "corrupt in memory" state. > We don't use GFP_NOFAIL because it's deprecated, but the reality is > that we need to ensure memory allocation eventually succeeds because > we *cannot go backwards*. > > The choice is simple: memory allocation fails, we shut down the > filesystem and guarantee that we DOS the entire machine because the > filesystems have gone AWOL; or we keep trying memory allocation > until it succeeds. Would it be possible to drop the locks and retry the allocations? Is the context which is doing this transaction a killable context? > So, memory allocation generally succeeds eventually, so we have > these loops around kmalloc(), kmem_cache_alloc() and alloc_page() > that ensure allocation succeeds. Those loops also guarantee we get > warnings when allocation is repeatedly failing and we might have > actually hit a OOM deadlock situation. As pointed out in another email, this should be done in the page allocator IMO.
A good allocator tries hard but not too much if the caller is able to handle the failure because it is the caller who defines the fallback policy. > > This is prone to > > reveal unchecked allocation failures but those are bugs as well and we > > shouldn't pretend otherwise. > > > > > As long > > > as it's legal to enter the allocator with *anything* that can prevent > > > another random task in the system from making progress, we have this > > > deadlock potential. One side has to give up, and it can't be the page > > > allocator because it has to support __GFP_NOFAIL allocations, which > > > are usually exactly the allocations that are buried in hard-to-unwind > > > state that is likely to trip up exiting OOM victims. > > > > I am not convinced that GFP_NOFAIL is the biggest problem. Most of the > > OOM livelocks I have seen were either due to GFP_KERNEL treated as > > GFP_NOFAIL or an incorrect gfp mask (e.g. GFP_FS added where not > > appropriate). I think we should focus on this part before we start > > adding heuristics into the OOM killer. > > Having the OOM killer being able to kill the process that triggered > it would be a good start. Not sure I understand. Do you mean sysctl_oom_kill_allocating_task? > More often than not, that is the process > that needs killing, and the oom killer implementation currently > cannot do anything about that process. Can you elaborate? AFAICS the process which has triggered the OOM is the easiest victim to kill. It is not blocked on any locks so it just needs to get outside of the kernel. > Make the OOM killer only be > invoked by kswapd or some other independent kernel thread so that it > is independent of the allocation context that needs to invoke it, > and have the invoker wait to be told what to do. Again, I am not sure I understand. The OOM killer doesn't block the context which has triggered the OOM condition.
Allocation is retried after OOM killer invocation, and if the current context is the victim, the allocation failure is expedited. > That way it can kill the invoking process if that's the one that > needs to be killed, and then all "can't kill processes because the > invoker holds locks they depend on" go away. Except that killing the messenger is not the best strategy... -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-18 8:48 ` Michal Hocko @ 2015-02-18 11:23 ` Tetsuo Handa 0 siblings, 0 replies; 276+ messages in thread From: Tetsuo Handa @ 2015-02-18 11:23 UTC (permalink / raw) To: mhocko Cc: david, hannes, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, torvalds, linux-fsdevel, fernando_b1 [ cc fsdevel list - watch out for side effect of 9879de7373fc (mm: page_alloc: embed OOM killing naturally into allocation slowpath) which was merged between 3.19-rc6 and 3.19-rc7 , started from http://marc.info/?l=linux-mm&m=142348457310066&w=2 ] Replying in this post picked up from several posts in this thread. Michal Hocko wrote: > Besides that __GFP_WAIT callers should be prepared for the allocation > failure and should better cope with it. So no, I really hate something > like the above. Those who do not want to retry with invoking the OOM killer are using __GFP_WAIT + __GFP_NORETRY allocations. Those who want to retry with invoking the OOM killer are using __GFP_WAIT allocations. Those who must retry forever with invoking the OOM killer, no matter how many processes the OOM killer kills, are using __GFP_WAIT + __GFP_NOFAIL allocations. However, since use of __GFP_NOFAIL is prohibited, I think many of __GFP_WAIT users are expecting that the allocation fails only when "the OOM killer set TIF_MEMDIE flag to the caller but the caller failed to allocate from memory reserves". Also, the implementation before 9879de7373fc (mm: page_alloc: embed OOM killing naturally into allocation slowpath) effectively supported __GFP_WAIT users with such expectation. Michal Hocko wrote: > Because they cannot perform any IO/FS transactions and that would lead > to a premature OOM conditions way too easily. OOM killer is a _last > resort_ reclaim opportunity not something that would happen just because > you happen to be not able to flush dirty pages. 
But you should not have applied such change without making necessary changes to GFP_NOFS / GFP_NOIO users with such expectation and testing at linux-next.git . Applying such change after 3.19-rc6 is a sucker punch. Michal Hocko wrote: > Well, you are beating your machine to death so you can hardly get any > time guarantee. It would be nice to have a better feedback mechanism to > know when to back off and fail the allocation attempt which might be > blocking OOM victim to pass away. This is extremely tricky because we > shouldn't be too eager to fail just because of a sudden memory pressure. Michal Hocko wrote: > > I wish only somebody like kswapd repeats the loop on behalf of all > > threads waiting at memory allocation slowpath... > > This is the case when the kswapd is _able_ to cope with the memory > pressure. It looks wasteful for me that so many threads (greater than number of available CPUs) are sleeping at cond_resched() in shrink_slab() when checking SysRq-t. Imagine 1000 threads sleeping at cond_resched() in shrink_slab() on a machine with only 1 CPU. Each thread gets a chance to try calling reclaim function only when all other threads gave that thread a chance at cond_resched(). Such situation is almost mutually preventing from making progress. I wish the following mechanism. Prepare a kernel thread (for avoiding being OOM-killed) and let __GFP_WAIT and __GFP_WAIT + __GFP_NOFAIL users to wake up the kernel thread when they failed to allocate from free list. The kernel thread calls shrink_slab() etc. (and also out_of_memory() as needed) and wakes them sleeping at wait_for_event() up. Failing to allocate from free list is a rare case. Therefore, context switches for asking somebody else for reclaiming memory would be an acceptable overhead. If such mechanism are implemented, 1000 threads except the somebody can save CPU time by sleeping. 
Avoiding "almost mutually preventing from making progress" situation will drastically shorten the time guarantee even if I beat my machine to death. Such mechanism might be similar to Dave Chinner's Make the OOM killer only be invoked by kswapd or some other independent kernel thread so that it is independent of the allocation context that needs to invoke it, and have the invoker wait to be told what to do. suggestion. Dave Chinner wrote: > Filesystems do demand paging of metadata within transactions, which > means we are guaranteed to be holding locks when doing memory > allocation. Indeed, this is what the GFP_NOFS allocation context is > supposed to convey - we currently *hold locks* and so reclaim needs > to be careful about recursion. I'll also argue that it means the OOM > killer cannot kill the process attempting memory allocation for the > same reason. I agree with Dave Chinner about this. I tested on ext4 filesystem, one is stock Linux 3.19 and the other is Linux 3.19 with - /* The OOM killer does not compensate for light reclaim */ - if (!(gfp_mask & __GFP_FS)) - goto out; applied. Running a Java-like stressing program (which is multi threaded and likely be chosen by the OOM killer due to huge memory usage) shown below with ext4 filesystem set to remount read-only upon filesystem error. 
# mount -o remount,errors=remount-ro /

---------- Testing program start ----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>

static int file_writer(void *unused)
{
	char buffer[128] = { };
	int fd;

	snprintf(buffer, sizeof(buffer) - 1, "/tmp/file.%u", getpid());
	fd = open(buffer, O_WRONLY | O_CREAT, 0600);
	unlink(buffer);
	while (write(fd, buffer, 1) == 1 && fsync(fd) == 0);
	return 0;
}

static void memory_consumer(void)
{
	const int fd = open("/dev/zero", O_RDONLY);
	unsigned long size;
	char *buf = NULL;

	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	read(fd, buf, size); /* Will cause OOM due to overcommit */
}

int main(int argc, char *argv[])
{
	int i;

	for (i = 0; i < 100; i++) {
		char *cp = malloc(4 * 1024);
		if (!cp || clone(file_writer, cp + 4 * 1024,
				 CLONE_SIGHAND | CLONE_VM, NULL) == -1)
			break;
	}
	memory_consumer();
	while (1)
		pause();
	return 0;
}
---------- Testing program end ----------

The former showed that the ext4 filesystem is remounted read-only due to filesystem errors with 50%+ reproducibility.

----------
[   72.440013] do_get_write_access: OOM for frozen_buffer
[   72.440014] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[   72.440015] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4735: Out of memory
(...snipped....)
[   72.495559] do_get_write_access: OOM for frozen_buffer
[   72.495560] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[   72.496839] do_get_write_access: OOM for frozen_buffer
[   72.496841] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[   72.505766] Aborting journal on device sda1-8.
[   72.505851] EXT4-fs (sda1): Remounting filesystem read-only
[   72.505853] EXT4-fs warning (device sda1): ext4_evict_inode:260: couldn't mark inode dirty (err -12)
[   72.507995] EXT4-fs warning (device sda1): ext4_evict_inode:260: couldn't mark inode dirty (err -12)
[   72.508773] EXT4-fs (sda1): Remounting filesystem read-only
[   72.508775] EXT4-fs warning (device sda1): ext4_evict_inode:260: couldn't mark inode dirty (err -12)
[   72.547799] do_get_write_access: OOM for frozen_buffer
[   72.706692] EXT4-fs warning (device sda1): ext4_evict_inode:260: couldn't mark inode dirty (err -12)
[   73.035416] EXT4-fs warning (device sda1): ext4_evict_inode:260: couldn't mark inode dirty (err -12)
[   73.291732] EXT4-fs warning (device sda1): ext4_evict_inode:260: couldn't mark inode dirty (err -12)
[   73.422171] EXT4-fs warning (device sda1): ext4_evict_inode:260: couldn't mark inode dirty (err -12)
[   73.511862] EXT4-fs warning (device sda1): ext4_evict_inode:260: couldn't mark inode dirty (err -12)
[   73.589174] EXT4-fs warning (device sda1): ext4_evict_inode:260: couldn't mark inode dirty (err -12)
[   73.665302] EXT4-fs warning (device sda1): ext4_evict_inode:260: couldn't mark inode dirty (err -12)
----------

On the other hand, the latter showed that the ext4 filesystem was never remounted read-only because filesystem errors did not occur, though several TIF_MEMDIE stalls which the timeout patch would handle were observed as with the former. As this is an ext4 filesystem, this would use GFP_NOFS. But does using GFP_NOFS + __GFP_NOFAIL in the ext4 filesystem solve the problem? I don't think so. The underlying block layer which the ext4 filesystem calls would use GFP_NOIO. And memory allocation failures at the block layer will result in an I/O error which is observed by users as a filesystem error. Does passing __GFP_NOFAIL down to the block layer solve the problem? I don't think so.
There is no means to teach the block layer that the filesystem layer is doing critical operations where failure results in serious problems. Then, does using GFP_NOIO + __GFP_NOFAIL at the block layer solve the problem? I don't think so. It is nothing but bypassing the

	/* The OOM killer does not compensate for light reclaim */
	if (!(gfp_mask & __GFP_FS))
		goto out;

check by passing the __GFP_NOFAIL flag. Michal Hocko wrote: > Failing __GFP_WAIT allocation is perfectly fine IMO. Why do you think > this is a problem? Killing a user space process or taking filesystem error actions (e.g. remount-ro or kernel panic), which choice is less painful for users? I believe that the !(gfp_mask & __GFP_FS) check is a bug and should be removed. Rather, shouldn't allocations without __GFP_FS get more chance to succeed than allocations with __GFP_FS? If I were the author, I might have added the below check instead.

	/* This is not a critical allocation. Don't invoke the OOM killer. */
	if (gfp_mask & __GFP_FS)
		goto out;

Falling into a retry loop with the same watermark might prevent rescuer threads from doing the memory allocation which is needed for making free memory. Maybe we should use a lower watermark for GFP_NOIO and below, a middle watermark for GFP_NOFS, and a high watermark for GFP_KERNEL and above. ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls?
  2015-02-18 11:23 ` Tetsuo Handa
@ 2015-02-18 12:29   ` Michal Hocko
  -1 siblings, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2015-02-18 12:29 UTC (permalink / raw)
To: Tetsuo Handa
Cc: david, hannes, dchinner, linux-mm, rientjes, oleg, akpm, mgorman,
	torvalds, linux-fsdevel, fernando_b1

On Wed 18-02-15 20:23:19, Tetsuo Handa wrote:
> [ cc fsdevel list - watch out for side effect of 9879de7373fc (mm: page_alloc:
> embed OOM killing naturally into allocation slowpath) which was merged between
> 3.19-rc6 and 3.19-rc7 , started from
> http://marc.info/?l=linux-mm&m=142348457310066&w=2 ]
>
> Replying in this post picked up from several posts in this thread.
>
> Michal Hocko wrote:
> > Besides that __GFP_WAIT callers should be prepared for the allocation
> > failure and should better cope with it. So no, I really hate something
> > like the above.
>
> Those who do not want to retry with invoking the OOM killer are using
> __GFP_WAIT + __GFP_NORETRY allocations.
>
> Those who want to retry with invoking the OOM killer are using
> __GFP_WAIT allocations.
>
> Those who must retry forever with invoking the OOM killer, no matter how
> many processes the OOM killer kills, are using __GFP_WAIT + __GFP_NOFAIL
> allocations.
>
> However, since use of __GFP_NOFAIL is prohibited,

IT IS NOT PROHIBITED. It is highly discouraged because GFP_NOFAIL is a
strong requirement and the caller should be really aware of the
consequences, especially when the allocation is done under a locked context.

> I think many of
> __GFP_WAIT users are expecting that the allocation fails only when
> "the OOM killer set TIF_MEMDIE flag to the caller but the caller
> failed to allocate from memory reserves".

This is not what __GFP_WAIT is defined for. It says that the allocator
might sleep.

> Also, the implementation
> before 9879de7373fc (mm: page_alloc: embed OOM killing naturally
> into allocation slowpath) effectively supported __GFP_WAIT users
> with such expectation.

The same as GFP_KERNEL == GFP_NOFAIL for small allocations currently,
which causes a lot of troubles which were not anticipated at the time this
was introduced. And we _should_ move away from that model, because
GFP_NOFAIL should be really explicit rather than implicit.

> Michal Hocko wrote:
> > Because they cannot perform any IO/FS transactions and that would lead
> > to a premature OOM conditions way too easily. OOM killer is a _last
> > resort_ reclaim opportunity not something that would happen just because
> > you happen to be not able to flush dirty pages.
>
> But you should not have applied such change without making necessary
> changes to GFP_NOFS / GFP_NOIO users with such expectation and testing
> at linux-next.git . Applying such change after 3.19-rc6 is a sucker punch.

This is nonsense. OOM was disabled for !__GFP_FS for ages (since before
the git era).

> Michal Hocko wrote:
> > Well, you are beating your machine to death so you can hardly get any
> > time guarantee. It would be nice to have a better feedback mechanism to
> > know when to back off and fail the allocation attempt which might be
> > blocking OOM victim to pass away. This is extremely tricky because we
> > shouldn't be too eager to fail just because of a sudden memory pressure.
>
> Michal Hocko wrote:
> > > I wish only somebody like kswapd repeats the loop on behalf of all
> > > threads waiting at memory allocation slowpath...
> >
> > This is the case when the kswapd is _able_ to cope with the memory
> > pressure.
>
> It looks wasteful for me that so many threads (greater than number of
> available CPUs) are sleeping at cond_resched() in shrink_slab() when
> checking SysRq-t. Imagine 1000 threads sleeping at cond_resched() in
> shrink_slab() on a machine with only 1 CPU. Each thread gets a chance
> to try calling reclaim function only when all other threads gave that
> thread a chance at cond_resched(). Such situation is almost mutually
> preventing from making progress. I wish the following mechanism.

Feel free to send patches which are not breaking other loads...

[...]

> Michal Hocko wrote:
> > Failing __GFP_WAIT allocation is perfectly fine IMO. Why do you think
> > this is a problem?
>
> Killing a user space process or taking filesystem error actions (e.g.
> remount-ro or kernel panic), which choice is less painful for users?
> I believe that !(gfp_mask & __GFP_FS) check is a bug and should be removed.

A premature OOM killer just because the current allocator context doesn't
allow for real reclaim is even worse.

> Rather, shouldn't allocations without __GFP_FS get more chance to succeed
> than allocations with __GFP_FS? If I were the author, I might have added
> below check instead.
>
> /* This is not a critical allocation. Don't invoke the OOM killer. */
> if (gfp_mask & __GFP_FS)
>         goto out;

This doesn't make any sense whatsoever. So regular GFP_KERNEL|USER
allocations wouldn't invoke the oom killer. This includes page faults and
basically most allocations.

> Falling into retry loop with same watermark might prevent rescuer threads from
> doing memory allocation which is needed for making free memory. Maybe we should
> use lower watermark for GFP_NOIO and below, middle watermark for GFP_NOFS, high
> watermark for GFP_KERNEL and above.
--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls?
  2015-02-18 12:29 ` Michal Hocko
@ 2015-02-18 14:06   ` Tetsuo Handa
  -1 siblings, 0 replies; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-18 14:06 UTC (permalink / raw)
To: mhocko
Cc: david, hannes, dchinner, linux-mm, rientjes, oleg, akpm, mgorman,
	torvalds, linux-fsdevel, fernando_b1

Michal Hocko wrote:
> Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > Because they cannot perform any IO/FS transactions and that would lead
> > > to a premature OOM conditions way too easily. OOM killer is a _last
> > > resort_ reclaim opportunity not something that would happen just because
> > > you happen to be not able to flush dirty pages.
> >
> > But you should not have applied such change without making necessary
> > changes to GFP_NOFS / GFP_NOIO users with such expectation and testing
> > at linux-next.git . Applying such change after 3.19-rc6 is a sucker punch.
>
> This is nonsense. OOM was disabled for !__GFP_FS for ages (since
> before the git era).
>
Then, at least I expect that filesystem error actions will not be taken so
trivially. Can we apply http://marc.info/?l=linux-mm&m=142418465615672&w=2
for Linux 3.19-stable?

^ permalink raw reply	[flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls?
  2015-02-18 14:06 ` Tetsuo Handa (?)
@ 2015-02-18 14:25   ` Michal Hocko
  2015-02-19 10:48     ` Tetsuo Handa
  -1 siblings, 1 reply; 276+ messages in thread
From: Michal Hocko @ 2015-02-18 14:25 UTC (permalink / raw)
To: Tetsuo Handa
Cc: david, hannes, dchinner, linux-mm, rientjes, oleg, akpm, mgorman,
	torvalds, linux-fsdevel, fernando_b1

On Wed 18-02-15 23:06:17, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > Because they cannot perform any IO/FS transactions and that would lead
> > > > to a premature OOM conditions way too easily. OOM killer is a _last
> > > > resort_ reclaim opportunity not something that would happen just because
> > > > you happen to be not able to flush dirty pages.
> > >
> > > But you should not have applied such change without making necessary
> > > changes to GFP_NOFS / GFP_NOIO users with such expectation and testing
> > > at linux-next.git . Applying such change after 3.19-rc6 is a sucker punch.
> >
> > This is nonsense. OOM was disabled for !__GFP_FS for ages (since
> > before the git era).
> >
> Then, at least I expect that filesystem error actions will not be taken so
> trivially. Can we apply http://marc.info/?l=linux-mm&m=142418465615672&w=2 for
> Linux 3.19-stable?

I do not understand. What kind of bug would be fixed by that change?
--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls?
  2015-02-18 14:25 ` Michal Hocko
@ 2015-02-19 10:48   ` Tetsuo Handa
  0 siblings, 0 replies; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-19 10:48 UTC (permalink / raw)
To: mhocko
Cc: david, hannes, dchinner, linux-mm, rientjes, oleg, akpm, mgorman,
	torvalds, linux-fsdevel, fernando_b1

Michal Hocko wrote:
> Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > Because they cannot perform any IO/FS transactions and that would lead
> > > > > to a premature OOM conditions way too easily. OOM killer is a _last
> > > > > resort_ reclaim opportunity not something that would happen just because
> > > > > you happen to be not able to flush dirty pages.
> > > >
> > > > But you should not have applied such change without making necessary
> > > > changes to GFP_NOFS / GFP_NOIO users with such expectation and testing
> > > > at linux-next.git . Applying such change after 3.19-rc6 is a sucker punch.
> > >
> > > This is nonsense. OOM was disabled for !__GFP_FS for ages (since
> > > before the git era).
> > >
> > Then, at least I expect that filesystem error actions will not be taken so
> > trivially. Can we apply http://marc.info/?l=linux-mm&m=142418465615672&w=2 for
> > Linux 3.19-stable?
>
> I do not understand. What kind of bug would be fixed by that change?

That change fixes a significant loss of file I/O reliability under extreme
memory pressure.

Today I tested how frequently filesystem errors occur using a scripted
environment.
( Source code of a.out is http://marc.info/?l=linux-fsdevel&m=142425860904849&w=2 )

----------
#!/bin/sh
: > ~/trial.log
for i in `seq 1 100`
do
    mkfs.ext4 -q /dev/sdb1 || exit 1
    mount -o errors=remount-ro /dev/sdb1 /tmp || exit 2
    chmod 1777 /tmp
    su - demo -c ~demo/a.out
    if [ -w /tmp/ ]
    then
        echo -n "S" >> ~/trial.log
    else
        echo -n "F" >> ~/trial.log
    fi
    umount /tmp
done
----------

We can see that filesystem errors occur frequently if GFP_NOFS / GFP_NOIO
allocations give up without retrying. On the other hand, as far as these
trials go, a TIF_MEMDIE stall was not observed if GFP_NOFS / GFP_NOIO
allocations give up without retrying. Maybe giving up without retrying
keeps this test case away from hitting the stalls?

Linux 3.19-rc6
(Console log is http://I-love.SAKURA.ne.jp/tmp/serial-20150219-3.19-rc6.txt.xz )
0 filesystem errors out of 100 trials. 2 stalls.
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS

Linux 3.19
(Console log is http://I-love.SAKURA.ne.jp/tmp/serial-20150219-3.19.txt.xz )
44 filesystem errors out of 100 trials. 0 stalls.
SSFFSSSFSSSFSFFFFSSFSSFSSSSSSFFFSFSFFSSSSSSFFFFSFSSFFFSSSSFSSFFFFFSSSSSFSSFSFSSFSFFFSFFFFFFFSSSSSSSS

Linux 3.19 with http://marc.info/?l=linux-mm&m=142418465615672&w=2 applied.
(Console log is http://I-love.SAKURA.ne.jp/tmp/serial-20150219-3.19-patched.txt.xz )
0 filesystem errors out of 100 trials. 2 stalls.
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS

If the result of Linux 3.19 is what you wanted, we should call on fs
developers for immediate action. (But the __GFP_NOFAIL discussion between
you and Dave is in progress. I don't know whether ext4 and the underlying
subsystems should start using __GFP_NOFAIL.)

P.S. Just for experimental purposes, Linux 3.19 with the change below
applied gave a better result than retrying GFP_NOFS / GFP_NOIO allocations
without invoking the OOM killer. Can short-lived small GFP_NOFS / GFP_NOIO
allocations use GFP_ATOMIC instead? How many bytes does blk_rq_map_kern()
want?

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2867,6 +2867,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	int classzone_idx;
 
 	gfp_mask &= gfp_allowed_mask;
+	if (gfp_mask == GFP_NOFS || gfp_mask == GFP_NOIO)
+		gfp_mask = GFP_ATOMIC;
 	lockdep_trace_alloc(gfp_mask);

0 filesystem errors out of 100 trials. 0 stalls.
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS

^ permalink raw reply	[flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19 10:48 ` Tetsuo Handa
@ 2015-02-20  8:26   ` Michal Hocko
  -1 siblings, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2015-02-20 8:26 UTC (permalink / raw)
To: Tetsuo Handa
Cc: david, hannes, dchinner, linux-mm, rientjes, oleg, akpm, mgorman,
	torvalds, linux-fsdevel, fernando_b1

On Thu 19-02-15 19:48:16, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > I do not understand. What kind of bug would be fixed by that change?
>
> That change fixes a significant loss of file I/O reliability under extreme
> memory pressure.
>
> Today I tested how frequently filesystem errors occur using a scripted
> environment.
> ( Source code of a.out is http://marc.info/?l=linux-fsdevel&m=142425860904849&w=2 )
>
> ----------
> #!/bin/sh
> : > ~/trial.log
> for i in `seq 1 100`
> do
>     mkfs.ext4 -q /dev/sdb1 || exit 1
>     mount -o errors=remount-ro /dev/sdb1 /tmp || exit 2
>     chmod 1777 /tmp
>     su - demo -c ~demo/a.out
>     if [ -w /tmp/ ]
>     then
>         echo -n "S" >> ~/trial.log
>     else
>         echo -n "F" >> ~/trial.log
>     fi
>     umount /tmp
> done
> ----------
>
> We can see that filesystem errors occur frequently if GFP_NOFS / GFP_NOIO
> allocations give up without retrying.

I would suggest reporting this to ext people (in a separate thread please)
and see what is the proper fix.

> On the other hand, as far as these trials go,
> a TIF_MEMDIE stall was not observed if GFP_NOFS / GFP_NOIO allocations give up
> without retrying. Maybe giving up without retrying keeps this test case
> away from hitting the stalls?

This is expected because those allocations are done with locks held and so
the chances to release the lock are higher.

[...]
--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls?
  2015-02-17 11:57 ` Tetsuo Handa
  2015-02-17 13:16   ` Johannes Weiner
@ 2015-02-23 22:08   ` David Rientjes
  2015-02-24 11:20     ` Tetsuo Handa
  1 sibling, 1 reply; 276+ messages in thread
From: David Rientjes @ 2015-02-23 22:08 UTC (permalink / raw)
To: Tetsuo Handa
Cc: hannes, mhocko, david, dchinner, linux-mm, oleg, akpm, mgorman, torvalds

On Tue, 17 Feb 2015, Tetsuo Handa wrote:

> Yes, basic idea would be same with
> http://marc.info/?l=linux-mm&m=142002495532320&w=2 .
>
> But Michal and David do not like the timeout approach.
> http://marc.info/?l=linux-mm&m=141684783713564&w=2
> http://marc.info/?l=linux-mm&m=141686814824684&w=2
>
> Unless they change their opinion in response to the discovery explained at
> http://lwn.net/Articles/627419/ , timeout patches will not be accepted.
>

Unfortunately, timeout based solutions aren't guaranteed to provide
anything more helpful. The problem you're referring to is when the oom
kill victim is waiting on a mutex and cannot make forward progress even
though it has access to memory reserves. Threads that are holding the
mutex and allocate in a blockable context will cause the oom killer to
defer forever because it sees the presence of a victim waiting to exit.

	TaskA				TaskB
	=====				=====
	mutex_lock(i_mutex)
					allocate memory
					  oom kill TaskB
					mutex_lock(i_mutex)

In this scenario, nothing on the system will be able to allocate memory
without some type of memory reserve, since at least one thread is holding
the mutex that the victim needs and is looping forever, unless memory is
freed by something else on the system which allows TaskA to allocate and
drop the mutex.

In a timeout based solution, this would be detected and another thread
would be chosen for oom kill. There's currently no way for the oom killer
to select a process that isn't waiting for that same mutex, however. If
it does, then the process has been killed needlessly since it cannot make
forward progress itself without grabbing the mutex.

Certainly, it would be better to eventually kill something else in the
hope that it does not need the mutex and will free some memory, which
would allow TaskA, the thread that had originally been deferring forever
in the oom killer waiting for the original victim, TaskB, to exit. If
that's the solution, then TaskA had been killed unnecessarily itself.

Perhaps we should consider an alternative: allow threads, such as TaskA,
that are deferring for a long amount of time to simply allocate with
ALLOC_NO_WATERMARKS itself in that scenario in the hope that the
allocation succeeding will eventually allow it to drop the mutex. Two
problems: (1) there's no guarantee that the simple allocation is all TaskA
needs before it will drop the lock and (2) another thread could
immediately grab the same mutex and allocate, in which case the same
series of events repeats.

^ permalink raw reply	[flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-23 22:08 ` David Rientjes @ 2015-02-24 11:20 ` Tetsuo Handa 2015-02-24 15:20 ` Theodore Ts'o 0 siblings, 1 reply; 276+ messages in thread From: Tetsuo Handa @ 2015-02-24 11:20 UTC (permalink / raw) To: rientjes Cc: hannes, mhocko, david, dchinner, linux-mm, oleg, akpm, mgorman, torvalds, fernando_b1 David Rientjes wrote: > Perhaps we should consider an alternative: allow threads, such as TaskA, > that are deferring for a long amount of time to simply allocate with > ALLOC_NO_WATERMARKS itself in that scenario in the hope that the > allocation succeeding will eventually allow it to drop the mutex. Two > problems: (1) there's no guarantee that the simple allocation is all TaskA > needs before it will drop the lock and (2) another thread could > immediately grab the same mutex and allocate, in which the same series of > events repeats. We can see that effectively GFP_NOFAIL allocations with a lock held (e.g. filesystem transaction) exist, can't we? ---------------------------------------- TaskA TaskB TaskC TaskD TaskE call mutex_lock() call mutex_lock() call mutex_lock() call mutex_lock() call mutex_lock() do GFP_NOFAIL allocation oom kill TaskA waiting for TaskA to die will do something with allocated memory will call mutex_unlock() will do GFP_NOFAIL allocation will wait for TaskA to die will do something with allocated memory will call mutex_unlock() will do GFP_NOFAIL allocation will wait for TaskA to die will do something with allocated memory will call mutex_unlock() will do GFP_NOFAIL allocation will wait for TaskA to die will do something with allocated memory will call mutex_unlock() will do GFP_NOFAIL allocation ---------------------------------------- Allowing ALLOC_NO_WATERMARKS to TaskB helps nothing. We don't want to allow ALLOC_NO_WATERMARKS to TaskC, TaskD, TaskE and TaskA when they do the same sequence TaskB did, or we will deplete memory reserves. 
> In a timeout based solution, this would be detected and another thread > would be chosen for oom kill. There's currently no way for the oom killer > to select a process that isn't waiting for that same mutex, however. If > it does, then the process has been killed needlessly since it cannot make > forward progress itself without grabbing the mutex. Right. The OOM killer cannot understand that there is such a lock dependency. And do you think a way will become available for the OOM killer to select a process that isn't waiting for that same mutex in the near future? (Recording in "struct task_struct" the address of the mutex a task is currently waiting for would do, but will not be accepted due to the performance penalty. A simplified form would be to check "struct task_struct"->state, but that will not be perfect.) > Certainly, it would be better to eventually kill something else in the > hope that it does not need the mutex and will free some memory which would > allow the thread that had originally been deferring forever, TaskA, in the > oom killer waiting for the original victim, TaskB, to exit. If that's the > solution, then TaskA had been killed unnecessarily itself. Complaining about unnecessarily killed processes is preventing us from making forward progress. The memory reserves are something like a balloon. To guarantee forward progress, the balloon must not become empty. All memory managing techniques except the OOM killer are trying to control "deflator of the balloon" via various throttling heuristics. On the other hand, the OOM killer is the only memory managing technique which is trying to control "inflator of the balloon" via several throttling heuristics. The OOM killer is invoked when all memory managing techniques except the OOM killer failed to make forward progress. Therefore, the OOM killer is responsible for making forward progress for "deflator of the balloon" and is granted the prerogative to send SIGKILL to any process.
Given the fact that the OOM killer cannot understand lock dependencies and there are effectively GFP_NOFAIL allocations, it is inevitable that the OOM killer sometimes fails to choose a correct process that will make forward progress. Currently the OOM killer is invoked in one-shot mode. This mode helps us to reduce the possibility of depleting the memory reserves and killing processes unnecessarily. But this mode is bothering people with the "silently stalling forever" problem when the bullet from the OOM killer misses the target. This mode is also bothering people with the "complete system crash" problem when the bullet from SysRq-f misses the target, for they have to use SysRq-i or SysRq-c or SysRq-b, which kills far more processes unnecessarily, in order to resolve the OOM condition. My proposal is to allow the OOM killer to be invoked in consecutive-shots mode. Although consecutive-shots mode may increase the possibility of killing processes unnecessarily, trying to kill an unkillable process in one-shot mode is, after all, an unnecessary kill as well. The root cause is the same (i.e. the OOM killer cannot understand the dependency). My patch can stop bothering people with the "silently stalling forever" / "complete system crash" problems by retrying the oom kill rather than waiting forever.
* Re: How to handle TIF_MEMDIE stalls? 2015-02-24 11:20 ` Tetsuo Handa @ 2015-02-24 15:20 ` Theodore Ts'o 2015-02-24 21:02 ` Dave Chinner 0 siblings, 1 reply; 276+ messages in thread From: Theodore Ts'o @ 2015-02-24 15:20 UTC (permalink / raw) To: Tetsuo Handa Cc: rientjes, hannes, mhocko, david, dchinner, linux-mm, oleg, akpm, mgorman, torvalds, fernando_b1 On Tue, Feb 24, 2015 at 08:20:11PM +0900, Tetsuo Handa wrote: > > In a timeout based solution, this would be detected and another thread > > would be chosen for oom kill. There's currently no way for the oom killer > > to select a process that isn't waiting for that same mutex, however. If > > it does, then the process has been killed needlessly since it cannot make > > forward progress itself without grabbing the mutex. > > Right. The OOM killer cannot understand that there is such lock dependency.... > The memory reserves are something like a balloon. To guarantee forward > progress, the balloon must not become empty. All memory managing techniques > except the OOM killer are trying to control "deflator of the balloon" via > various throttling heuristics. On the other hand, the OOM killer is the only > memory managing technique which is trying to control "inflator of the balloon" > via several throttling heuristics..... The mm developers have asked in the past whether we could solve problems by preallocating memory in advance. Sometimes this is very hard to do, because we don't know exactly how much memory we will need, or whether we will need it at all; or because, in order to do this, we would need to completely restructure the code, since the memory allocation is happening deep in the call stack, potentially in some other subsystem. So I wonder if we can solve the problem by having a subsystem reserve memory in advance of taking the mutexes.
We do something like this in ext3/ext4 --- when we allocate a (sub-)transaction handle, we give a worst case estimate of how many blocks we might need to dirty under that handle, and if there isn't enough space in the journal, we block in the start_handle() call while the current transaction is closed, and the transaction handle will be attached to the next transaction. In the memory allocation scenario, it's a bit more complicated, since the memory might be allocated in a slab that requires a higher-order page allocation, but would it be sufficient if we do something rough where the foreground kernel thread "reserves" a few pages before it starts doing something that requires mutexes? The reservation would be tracked on an accounting basis, and a kernel codepath which has reserved pages would get priority over kernel threads running under a task_struct which has not reserved pages. If the system doesn't have enough pages available, then the reservation request would block the process until more memory is available. This wouldn't necessarily help in cases where the memory is required for cleaning dirty pages (although in those cases you really *do* want to let the memory allocation succeed --- so maybe there should be a way to hint to the mm subsystem that a memory allocation should be given higher priority since it might help get the system out of the jam that it is in). However, for "normal" operations it would be good if we could block a process that is about to execute, say, a read(2) or an open(2) system call early, *before* it takes some mutex, to provide a certain amount of admission control when memory pressure is especially high. Would this be a viable strategy?
Even if this was a hint that wasn't perfect (i.e., in some cases a kernel thread might end up requiring more pages than it had hinted, which would not be considered fatal, although the excess requested pages would be treated the same way as if no reservation was made at all, meaning the memory allocation would be more likely to fail and a GFP_NOFAIL allocation would loop for longer), I would think this could only help us do a better job of "keeping the balloon from getting completely deflated". Cheers, - Ted
* Re: How to handle TIF_MEMDIE stalls? 2015-02-24 15:20 ` Theodore Ts'o @ 2015-02-24 21:02 ` Dave Chinner 2015-02-25 14:31 ` Tetsuo Handa 0 siblings, 1 reply; 276+ messages in thread From: Dave Chinner @ 2015-02-24 21:02 UTC (permalink / raw) To: Theodore Ts'o Cc: Tetsuo Handa, rientjes, hannes, mhocko, dchinner, linux-mm, oleg, akpm, mgorman, torvalds, fernando_b1 On Tue, Feb 24, 2015 at 10:20:33AM -0500, Theodore Ts'o wrote: > On Tue, Feb 24, 2015 at 08:20:11PM +0900, Tetsuo Handa wrote: > > > In a timeout based solution, this would be detected and another thread > > > would be chosen for oom kill. There's currently no way for the oom killer > > > to select a process that isn't waiting for that same mutex, however. If > > > it does, then the process has been killed needlessly since it cannot make > > > forward progress itself without grabbing the mutex. > > > > Right. The OOM killer cannot understand that there is such lock dependency.... > > > The memory reserves are something like a balloon. To guarantee forward > > progress, the balloon must not become empty. All memory managing techniques > > except the OOM killer are trying to control "deflator of the balloon" via > > various throttling heuristics. On the other hand, the OOM killer is the only > > memory managing technique which is trying to control "inflator of the balloon" > > via several throttling heuristics..... > > The mm developers have suggested in the past whether we could solve > problems by preallocating memory in advance. Sometimes this is very > hard to do because we don't know exactly how much or if we need > memory, or in order to do this, we would need to completely > restructure the code because the memory allocation is happening deep > in the call stack, potentially in some other subsystem. > > So I wonder if we can solve the problem by having a subsystem > reserving memory in advance of taking the mutexes. 
We do something > like this in ext3/ext4 --- when we allocate a (sub-)transaction > handle, we give a worst case estimate of how many blocks we might need > to dirty under that handle, and if there isn't enough space in the > journal, we block in the start_handle() call while the current > transaction is closed, and the transaction handle will be attached to > the next transaction. This exact discussion is already underway. My initial proposal: http://oss.sgi.com/archives/xfs/2015-02/msg00314.html Why mempools don't work but transaction based reservations will: http://oss.sgi.com/archives/xfs/2015-02/msg00339.html Reservation needs to be an accounting mechanism, not preallocation: http://oss.sgi.com/archives/xfs/2015-02/msg00456.html http://oss.sgi.com/archives/xfs/2015-02/msg00457.html http://oss.sgi.com/archives/xfs/2015-02/msg00458.html And that's where the discussion currently sits. Cheers, Dave. -- Dave Chinner david@fromorbit.com
* Re: How to handle TIF_MEMDIE stalls? 2015-02-24 21:02 ` Dave Chinner @ 2015-02-25 14:31 ` Tetsuo Handa 2015-02-27 7:39 ` Dave Chinner 0 siblings, 1 reply; 276+ messages in thread From: Tetsuo Handa @ 2015-02-25 14:31 UTC (permalink / raw) To: david Cc: tytso, rientjes, hannes, mhocko, dchinner, linux-mm, oleg, akpm, mgorman, torvalds, fernando_b1 Dave Chinner wrote: > This exact discussion is already underway. > > My initial proposal: > > http://oss.sgi.com/archives/xfs/2015-02/msg00314.html > > Why mempools don't work but transaction based reservations will: > > http://oss.sgi.com/archives/xfs/2015-02/msg00339.html > > Reservation needs to be an accounting mechanisms, not preallocation: > > http://oss.sgi.com/archives/xfs/2015-02/msg00456.html > http://oss.sgi.com/archives/xfs/2015-02/msg00457.html > http://oss.sgi.com/archives/xfs/2015-02/msg00458.html > > And that's where the discussion currently sits. I got two problems (one is stall at io_schedule(), the other is kernel panic due to xfs's assertion failure) using Linux 3.19. I guess those problems are caused by not retrying !GFP_FS allocations under OOM. Will those problems go away by using transaction based reservations? And if yes, are they simple enough to backport to vendor's kernels? (From http://I-love.SAKURA.ne.jp/tmp/serial-20150225-1.txt.xz ) ---------- [ 1225.773411] kworker/3:0H D ffff88007cadb4f8 11632 27 2 0x00000000 [ 1225.776911] ffff88007cadb4f8 ffff88007cadb508 ffff88007cac6740 0000000000014080 [ 1225.780670] ffffffff8101cd19 ffff88007cadbfd8 0000000000014080 ffff88007c28b740 [ 1225.784431] ffff88007cac6740 ffff88007cadb540 ffff88007f8d4998 ffff88007cadb540 [ 1225.788766] Call Trace: [ 1225.789988] [<ffffffff8101cd19>] ? read_tsc+0x9/0x10 [ 1225.792444] [<ffffffff812acbd9>] ? xfs_iunpin_wait+0x19/0x20 [ 1225.795228] [<ffffffff816b2590>] io_schedule+0xa0/0x130 [ 1225.797802] [<ffffffff812a9569>] __xfs_iunpin_wait+0xe9/0x140 [ 1225.800621] [<ffffffff810af3b0>] ? 
autoremove_wake_function+0x40/0x40 [ 1225.803770] [<ffffffff812acbd9>] xfs_iunpin_wait+0x19/0x20 [ 1225.806471] [<ffffffff812a209c>] xfs_reclaim_inode+0x7c/0x360 [ 1225.809283] [<ffffffff812a25d7>] xfs_reclaim_inodes_ag+0x257/0x370 [ 1225.812308] [<ffffffff81340839>] ? radix_tree_gang_lookup_tag+0x89/0xd0 [ 1225.815532] [<ffffffff8116fe58>] ? list_lru_walk_node+0x148/0x190 [ 1225.817951] [<ffffffff812a2783>] xfs_reclaim_inodes_nr+0x33/0x40 [ 1225.819373] [<ffffffff812b3545>] xfs_fs_free_cached_objects+0x15/0x20 [ 1225.820898] [<ffffffff811c29e9>] super_cache_scan+0x169/0x170 [ 1225.822245] [<ffffffff8115aed6>] shrink_node_slabs+0x1d6/0x370 [ 1225.823588] [<ffffffff8115dd2a>] shrink_zone+0x20a/0x240 [ 1225.824830] [<ffffffff8115e0dc>] do_try_to_free_pages+0x16c/0x460 [ 1225.826230] [<ffffffff8115e48a>] try_to_free_pages+0xba/0x150 [ 1225.827570] [<ffffffff81151542>] __alloc_pages_nodemask+0x5b2/0x9d0 [ 1225.829030] [<ffffffff8119ecbc>] kmem_getpages+0x8c/0x200 [ 1225.830277] [<ffffffff811a122b>] fallback_alloc+0x17b/0x230 [ 1225.831561] [<ffffffff811a107b>] ____cache_alloc_node+0x18b/0x1c0 [ 1225.833061] [<ffffffff811a3b00>] kmem_cache_alloc+0x330/0x5c0 [ 1225.834435] [<ffffffff8133c9d9>] ? ida_pre_get+0x69/0x100 [ 1225.835719] [<ffffffff8133c9d9>] ida_pre_get+0x69/0x100 [ 1225.836963] [<ffffffff8133d312>] ida_simple_get+0x42/0xf0 [ 1225.838248] [<ffffffff81086211>] create_worker+0x31/0x1c0 [ 1225.839519] [<ffffffff81087831>] worker_thread+0x3d1/0x4d0 [ 1225.840800] [<ffffffff81087460>] ? rescuer_thread+0x3a0/0x3a0 [ 1225.842123] [<ffffffff8108c5e2>] kthread+0xd2/0xf0 [ 1225.843234] [<ffffffff81010000>] ? perf_trace_xen_mmu_ptep_modify_prot+0x90/0xf0 [ 1225.844978] [<ffffffff8108c510>] ? kthread_create_on_node+0x180/0x180 [ 1225.846481] [<ffffffff816b63fc>] ret_from_fork+0x7c/0xb0 [ 1225.847718] [<ffffffff8108c510>] ? 
kthread_create_on_node+0x180/0x180 [ 1225.849279] kswapd0 D ffff88007708f998 11552 45 2 0x00000000 [ 1225.850977] ffff88007708f998 0000000000000000 ffff88007c28b740 0000000000014080 [ 1225.852798] 0000000000000003 ffff88007708ffd8 0000000000014080 ffff880077ff2740 [ 1225.854575] ffff88007c28b740 0000000000000000 ffff88007948e3a8 ffff88007948e3ac [ 1225.856358] Call Trace: [ 1225.856928] [<ffffffff816b2799>] schedule_preempt_disabled+0x29/0x70 [ 1225.858384] [<ffffffff816b43d5>] __mutex_lock_slowpath+0x95/0x100 [ 1225.859799] [<ffffffff816b4463>] mutex_lock+0x23/0x37 [ 1225.860983] [<ffffffff812a264c>] xfs_reclaim_inodes_ag+0x2cc/0x370 [ 1225.862403] [<ffffffff8109eb48>] ? __enqueue_entity+0x78/0x80 [ 1225.863742] [<ffffffff810a5f37>] ? enqueue_entity+0x237/0x8f0 [ 1225.865100] [<ffffffff81340839>] ? radix_tree_gang_lookup_tag+0x89/0xd0 [ 1225.866659] [<ffffffff8116fe58>] ? list_lru_walk_node+0x148/0x190 [ 1225.868106] [<ffffffff812a2783>] xfs_reclaim_inodes_nr+0x33/0x40 [ 1225.869522] [<ffffffff812b3545>] xfs_fs_free_cached_objects+0x15/0x20 [ 1225.871015] [<ffffffff811c29e9>] super_cache_scan+0x169/0x170 [ 1225.872338] [<ffffffff8115aed6>] shrink_node_slabs+0x1d6/0x370 [ 1225.873679] [<ffffffff8115dd2a>] shrink_zone+0x20a/0x240 [ 1225.874920] [<ffffffff8115ed2d>] kswapd+0x4fd/0x9c0 [ 1225.876049] [<ffffffff8115e830>] ? mem_cgroup_shrink_node_zone+0x140/0x140 [ 1225.877654] [<ffffffff8108c5e2>] kthread+0xd2/0xf0 [ 1225.878762] [<ffffffff81010000>] ? perf_trace_xen_mmu_ptep_modify_prot+0x90/0xf0 [ 1225.880495] [<ffffffff8108c510>] ? kthread_create_on_node+0x180/0x180 [ 1225.881996] [<ffffffff816b63fc>] ret_from_fork+0x7c/0xb0 [ 1225.883336] [<ffffffff8108c510>] ? 
kthread_create_on_node+0x180/0x180 ---------- (From http://I-love.SAKURA.ne.jp/tmp/serial-20150225-2.txt.xz + http://I-love.SAKURA.ne.jp/tmp/crash-20150225-2.log.xz ) ---------- [ 189.586204] Out of memory: Kill process 3701 (a.out) score 834 or sacrifice child [ 189.586205] Killed process 3701 (a.out) total-vm:2167392kB, anon-rss:1465820kB, file-rss:4kB [ 189.586210] Kill process 3702 (a.out) sharing same memory [ 189.586211] Kill process 3714 (a.out) sharing same memory [ 189.586212] Kill process 3748 (a.out) sharing same memory [ 189.586213] Kill process 3755 (a.out) sharing same memory [ 189.593470] XFS: Assertion failed: XFS_FORCED_SHUTDOWN(mp), file: fs/xfs/xfs_inode.c, line: 1701 [ 189.593491] ------------[ cut here ]------------ [ 189.593492] kernel BUG at fs/xfs/xfs_message.c:106! [ 189.593493] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC [ 189.593511] Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_raw iptable_filter ip_tables coretemp crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel glue_helper lrw gf128mul ablk_helper cryptd dm_mirror dm_region_hash dm_log microcode dm_mod ppdev parport_pc pcspkr vmw_balloon serio_raw vmw_vmci parport shpchp i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc uinput ata_generic pata_acpi sd_mod ata_piix mptspi libata scsi_transport_spi e1000 mptscsih mptbase floppy [ 189.593512] CPU: 1 PID: 3755 Comm: a.out Not tainted 3.19.0 #42 [ 189.593512] Hardware name: VMware, Inc. 
VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013 [ 189.593513] task: ffff88007a848740 ti: ffff88005c064000 task.ti: ffff88005c064000 [ 189.593517] RIP: 0010:[<ffffffff812af992>] [<ffffffff812af992>] assfail+0x22/0x30 [ 189.593517] RSP: 0000:ffff88005c067af8 EFLAGS: 00010292 [ 189.593518] RAX: 0000000000000054 RBX: ffff880079349c00 RCX: 0000000000000050 [ 189.593518] RDX: 0000000000005050 RSI: 0000000000000282 RDI: 0000000000000282 [ 189.593519] RBP: ffff88005c067af8 R08: 0000000000000282 R09: 0000000000000000 [ 189.593519] R10: ffffffff81ec95c8 R11: 656c696166206e6f R12: ffff88005ee92800 [ 189.593519] R13: 00000000fffffff4 R14: ffffffff81838140 R15: ffff880064505390 [ 189.593520] FS: 00007f62d93e0740(0000) GS:ffff88007f840000(0000) knlGS:0000000000000000 [ 189.593521] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 189.593521] CR2: 00007fb901282763 CR3: 0000000077b00000 CR4: 00000000000407e0 [ 189.593562] Stack: [ 189.593564] ffff88005c067b38 ffffffff812ab2d7 ffff880079349e48 ffff88007a6feef0 [ 189.593564] ffff88005c067b38 ffff880079349c00 0000000000000001 ffff880079349db8 [ 189.593565] ffff88005c067b58 ffffffff812acb98 ffff880079349db8 ffff880079349c00 [ 189.593565] Call Trace: [ 189.593568] [<ffffffff812ab2d7>] xfs_inactive_truncate+0x67/0x150 [ 189.593569] [<ffffffff812acb98>] xfs_inactive+0x1c8/0x1f0 [ 189.593570] [<ffffffff812b3216>] xfs_fs_evict_inode+0x86/0xd0 [ 189.593572] [<ffffffff811da0f8>] evict+0xb8/0x190 [ 189.593574] [<ffffffff811daa15>] iput+0xf5/0x180 [ 189.593575] [<ffffffff811d5b58>] __dentry_kill+0x188/0x1f0 [ 189.593576] [<ffffffff811d5c65>] dput+0xa5/0x170 [ 189.593577] [<ffffffff811c0dbd>] __fput+0x16d/0x1e0 [ 189.593578] [<ffffffff811c0e7e>] ____fput+0xe/0x10 [ 189.593580] [<ffffffff8108ac9f>] task_work_run+0xaf/0xf0 [ 189.593582] [<ffffffff81071638>] do_exit+0x2d8/0xbe0 [ 189.593583] [<ffffffff8107a5df>] ? 
recalc_sigpending+0x1f/0x60 [ 189.593584] [<ffffffff81071fcf>] do_group_exit+0x3f/0xa0 [ 189.593585] [<ffffffff8107d322>] get_signal+0x1d2/0x6f0 [ 189.593588] [<ffffffff810134e8>] do_signal+0x28/0x720 [ 189.593589] [<ffffffff811c1825>] ? __sb_end_write+0x35/0x70 [ 189.593591] [<ffffffff811bf362>] ? vfs_write+0x172/0x1f0 [ 189.593592] [<ffffffff81013c2c>] do_notify_resume+0x4c/0x90 [ 189.593594] [<ffffffff816b6747>] int_signal+0x12/0x17 [ 189.593602] Code: 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 f1 41 89 d0 48 c7 c6 48 8b 97 81 48 89 fa 31 c0 48 89 e5 31 ff e8 de fb ff ff <0f> 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 [ 189.593603] RIP [<ffffffff812af992>] assfail+0x22/0x30 [ 189.593604] RSP <ffff88005c067af8> ---------- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 276+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-25 14:31 ` Tetsuo Handa @ 2015-02-27 7:39 ` Dave Chinner 2015-02-27 12:42 ` Tetsuo Handa 0 siblings, 1 reply; 276+ messages in thread From: Dave Chinner @ 2015-02-27 7:39 UTC (permalink / raw) To: Tetsuo Handa Cc: tytso, rientjes, hannes, mhocko, dchinner, linux-mm, oleg, akpm, mgorman, torvalds, fernando_b1 On Wed, Feb 25, 2015 at 11:31:17PM +0900, Tetsuo Handa wrote: > Dave Chinner wrote: > > This exact discussion is already underway. > > > > My initial proposal: > > > > http://oss.sgi.com/archives/xfs/2015-02/msg00314.html > > > > Why mempools don't work but transaction based reservations will: > > > > http://oss.sgi.com/archives/xfs/2015-02/msg00339.html > > > > Reservation needs to be an accounting mechanisms, not preallocation: > > > > http://oss.sgi.com/archives/xfs/2015-02/msg00456.html > > http://oss.sgi.com/archives/xfs/2015-02/msg00457.html > > http://oss.sgi.com/archives/xfs/2015-02/msg00458.html > > > > And that's where the discussion currently sits. > > I got two problems (one is stall at io_schedule() This is a typical "blame the messenger" bug report. XFS is stuck in inode reclaim waiting for log IO completion to occur, along with all the other processes in xfs_log_force also stuck waiting for the same IO completion. You need to find where that IO completion that everything is waiting on has got stuck, or show that it's not a lost IO and actually an XFS problem. e.g. has the IO stack got stuck on a mempool somewhere? > , the other is kernel panic > due to xfs's assertion failure) using Linux 3.19.
> http://I-love.SAKURA.ne.jp/tmp/crash-20150225-2.log.xz ) > ---------- > [ 189.586204] Out of memory: Kill process 3701 (a.out) score 834 or sacrifice child > [ 189.586205] Killed process 3701 (a.out) total-vm:2167392kB, anon-rss:1465820kB, file-rss:4kB > [ 189.586210] Kill process 3702 (a.out) sharing same memory > [ 189.586211] Kill process 3714 (a.out) sharing same memory > [ 189.586212] Kill process 3748 (a.out) sharing same memory > [ 189.586213] Kill process 3755 (a.out) sharing same memory > [ 189.593470] XFS: Assertion failed: XFS_FORCED_SHUTDOWN(mp), file: fs/xfs/xfs_inode.c, line: 1701 Which is a failure of xfs_trans_reserve(), and through the calling context and parameters can only be from xfs_log_reserve(). That's got a pretty clear cause: tic = xlog_ticket_alloc(log, unit_bytes, cnt, client, permanent, KM_SLEEP | KM_MAYFAIL); if (!tic) return -ENOMEM; And the reason for the ASSERT is pretty clear: we put it there because we need to know - as developers - what failures (if any) ever come through that path. This is called from evict(): > [ 189.593565] Call Trace: > [ 189.593568] [<ffffffff812ab2d7>] xfs_inactive_truncate+0x67/0x150 > [ 189.593569] [<ffffffff812acb98>] xfs_inactive+0x1c8/0x1f0 > [ 189.593570] [<ffffffff812b3216>] xfs_fs_evict_inode+0x86/0xd0 > [ 189.593572] [<ffffffff811da0f8>] evict+0xb8/0x190 > [ 189.593574] [<ffffffff811daa15>] iput+0xf5/0x180 And as such there is no mechanism for actually reporting the error to userspace and in failing here we are about to leak an inode. When an XFS developer is testing new code, having a failure like that get trapped is immensely useful. However, on production systems, we can just keep going because it's not a fatal error and, even more importantly, the leaked inode will get cleaned up by log recovery next time the filesystem is mounted. IOWs, when you run CONFIG_XFS_DEBUG=y, you'll often get failures that are valuable to XFS developers but have no runtime effect on production systems. Cheers, Dave. 
-- Dave Chinner david@fromorbit.com
* Re: How to handle TIF_MEMDIE stalls? 2015-02-27 7:39 ` Dave Chinner @ 2015-02-27 12:42 ` Tetsuo Handa 2015-02-27 13:12 ` Dave Chinner 0 siblings, 1 reply; 276+ messages in thread From: Tetsuo Handa @ 2015-02-27 12:42 UTC (permalink / raw) To: david Cc: tytso, rientjes, hannes, mhocko, dchinner, linux-mm, oleg, akpm, mgorman, torvalds, fernando_b1 Dave Chinner wrote: > On Wed, Feb 25, 2015 at 11:31:17PM +0900, Tetsuo Handa wrote: > > I got two problems (one is stall at io_schedule() > > This is a typical "blame the messenger" bug report. XFS is stuck in > inode reclaim waiting for log IO completion to occur, along with all > the other processes iin xfs_log_force also stuck waiting for the > same Io completion. I wanted to know whether transaction based reservations can solve these problems. Inside filesystem layer, I guess you can calculate how much memory is needed for your filesystem transaction. But I'm wondering whether we can calculate how much memory is needed inside block layer. If block layer failed to reserve memory, won't file I/O fail under extreme memory pressure? And if __GFP_NOFAIL were used inside block layer, won't the OOM killer deadlock problem arise? > > You need to find where that IO completion that everything is waiting > on has got stuck or show that it's not a lost IO and actually an > XFS problem. e.g has the IO stack got stuck on a mempool somewhere? > I didn't get a vmcore for this stall. 
But it seemed to me that kworker/3:0H is doing xfs_fs_free_cached_objects() => xfs_reclaim_inodes_nr() => xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan) => xfs_reclaim_inode() because mutex_trylock(&pag->pag_ici_reclaim_lock) was successful => xfs_iunpin_wait(ip) because xfs_ipincount(ip) was non-zero => __xfs_iunpin_wait() => waiting inside io_schedule() for somebody to unpin. kswapd0 is doing xfs_fs_free_cached_objects() => xfs_reclaim_inodes_nr() => xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan) => not calling xfs_reclaim_inode() because mutex_trylock(&pag->pag_ici_reclaim_lock) failed due to kworker/3:0H => SYNC_TRYLOCK is dropped for the retry loop due to if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0) { trylock = 0; goto restart; } => calling mutex_lock(&pag->pag_ici_reclaim_lock) and getting blocked due to kworker/3:0H. kworker/3:0H is trying to free memory but somebody needs memory to make forward progress. kswapd0 is also trying to free memory but is blocked by kworker/3:0H already holding the lock. Since kswapd0 cannot make forward progress, somebody can't allocate memory. Finally the system started stalling. Is this decoding correct? ---------- [ 1225.773411] kworker/3:0H D ffff88007cadb4f8 11632 27 2 0x00000000 [ 1225.776911] ffff88007cadb4f8 ffff88007cadb508 ffff88007cac6740 0000000000014080 [ 1225.780670] ffffffff8101cd19 ffff88007cadbfd8 0000000000014080 ffff88007c28b740 [ 1225.784431] ffff88007cac6740 ffff88007cadb540 ffff88007f8d4998 ffff88007cadb540 [ 1225.788766] Call Trace: [ 1225.789988] [<ffffffff8101cd19>] ? read_tsc+0x9/0x10 [ 1225.792444] [<ffffffff812acbd9>] ? xfs_iunpin_wait+0x19/0x20 [ 1225.795228] [<ffffffff816b2590>] io_schedule+0xa0/0x130 [ 1225.797802] [<ffffffff812a9569>] __xfs_iunpin_wait+0xe9/0x140 arch/x86/include/asm/atomic.h:27 fs/xfs/xfs_inode.c:2433 [ 1225.800621] [<ffffffff810af3b0>] ?
autoremove_wake_function+0x40/0x40 [ 1225.803770] [<ffffffff812acbd9>] xfs_iunpin_wait+0x19/0x20 fs/xfs/xfs_inode.c:2443 [ 1225.806471] [<ffffffff812a209c>] xfs_reclaim_inode+0x7c/0x360 include/linux/spinlock.h:309 fs/xfs/xfs_inode.h:144 fs/xfs/xfs_icache.c:920 [ 1225.809283] [<ffffffff812a25d7>] xfs_reclaim_inodes_ag+0x257/0x370 fs/xfs/xfs_icache.c:1105 [ 1225.812308] [<ffffffff81340839>] ? radix_tree_gang_lookup_tag+0x89/0xd0 [ 1225.815532] [<ffffffff8116fe58>] ? list_lru_walk_node+0x148/0x190 [ 1225.817951] [<ffffffff812a2783>] xfs_reclaim_inodes_nr+0x33/0x40 fs/xfs/xfs_icache.c:1166 [ 1225.819373] [<ffffffff812b3545>] xfs_fs_free_cached_objects+0x15/0x20 [ 1225.820898] [<ffffffff811c29e9>] super_cache_scan+0x169/0x170 [ 1225.822245] [<ffffffff8115aed6>] shrink_node_slabs+0x1d6/0x370 [ 1225.823588] [<ffffffff8115dd2a>] shrink_zone+0x20a/0x240 [ 1225.824830] [<ffffffff8115e0dc>] do_try_to_free_pages+0x16c/0x460 [ 1225.826230] [<ffffffff8115e48a>] try_to_free_pages+0xba/0x150 [ 1225.827570] [<ffffffff81151542>] __alloc_pages_nodemask+0x5b2/0x9d0 [ 1225.829030] [<ffffffff8119ecbc>] kmem_getpages+0x8c/0x200 [ 1225.830277] [<ffffffff811a122b>] fallback_alloc+0x17b/0x230 [ 1225.831561] [<ffffffff811a107b>] ____cache_alloc_node+0x18b/0x1c0 [ 1225.833061] [<ffffffff811a3b00>] kmem_cache_alloc+0x330/0x5c0 [ 1225.834435] [<ffffffff8133c9d9>] ? ida_pre_get+0x69/0x100 [ 1225.835719] [<ffffffff8133c9d9>] ida_pre_get+0x69/0x100 [ 1225.836963] [<ffffffff8133d312>] ida_simple_get+0x42/0xf0 [ 1225.838248] [<ffffffff81086211>] create_worker+0x31/0x1c0 [ 1225.839519] [<ffffffff81087831>] worker_thread+0x3d1/0x4d0 [ 1225.840800] [<ffffffff81087460>] ? rescuer_thread+0x3a0/0x3a0 [ 1225.842123] [<ffffffff8108c5e2>] kthread+0xd2/0xf0 [ 1225.843234] [<ffffffff81010000>] ? perf_trace_xen_mmu_ptep_modify_prot+0x90/0xf0 [ 1225.844978] [<ffffffff8108c510>] ? kthread_create_on_node+0x180/0x180 [ 1225.846481] [<ffffffff816b63fc>] ret_from_fork+0x7c/0xb0 [ 1225.847718] [<ffffffff8108c510>] ? 
kthread_create_on_node+0x180/0x180 [ 1225.849279] kswapd0 D ffff88007708f998 11552 45 2 0x00000000 [ 1225.850977] ffff88007708f998 0000000000000000 ffff88007c28b740 0000000000014080 [ 1225.852798] 0000000000000003 ffff88007708ffd8 0000000000014080 ffff880077ff2740 [ 1225.854575] ffff88007c28b740 0000000000000000 ffff88007948e3a8 ffff88007948e3ac [ 1225.856358] Call Trace: [ 1225.856928] [<ffffffff816b2799>] schedule_preempt_disabled+0x29/0x70 [ 1225.858384] [<ffffffff816b43d5>] __mutex_lock_slowpath+0x95/0x100 [ 1225.859799] [<ffffffff816b4463>] mutex_lock+0x23/0x37 arch/x86/include/asm/current.h:14 kernel/locking/mutex.h:22 kernel/locking/mutex.c:103 [ 1225.860983] [<ffffffff812a264c>] xfs_reclaim_inodes_ag+0x2cc/0x370 fs/xfs/xfs_icache.c:1034 [ 1225.862403] [<ffffffff8109eb48>] ? __enqueue_entity+0x78/0x80 [ 1225.863742] [<ffffffff810a5f37>] ? enqueue_entity+0x237/0x8f0 [ 1225.865100] [<ffffffff81340839>] ? radix_tree_gang_lookup_tag+0x89/0xd0 [ 1225.866659] [<ffffffff8116fe58>] ? list_lru_walk_node+0x148/0x190 [ 1225.868106] [<ffffffff812a2783>] xfs_reclaim_inodes_nr+0x33/0x40 fs/xfs/xfs_icache.c:1166 [ 1225.869522] [<ffffffff812b3545>] xfs_fs_free_cached_objects+0x15/0x20 [ 1225.871015] [<ffffffff811c29e9>] super_cache_scan+0x169/0x170 [ 1225.872338] [<ffffffff8115aed6>] shrink_node_slabs+0x1d6/0x370 [ 1225.873679] [<ffffffff8115dd2a>] shrink_zone+0x20a/0x240 [ 1225.874920] [<ffffffff8115ed2d>] kswapd+0x4fd/0x9c0 [ 1225.876049] [<ffffffff8115e830>] ? mem_cgroup_shrink_node_zone+0x140/0x140 [ 1225.877654] [<ffffffff8108c5e2>] kthread+0xd2/0xf0 [ 1225.878762] [<ffffffff81010000>] ? perf_trace_xen_mmu_ptep_modify_prot+0x90/0xf0 [ 1225.880495] [<ffffffff8108c510>] ? kthread_create_on_node+0x180/0x180 [ 1225.881996] [<ffffffff816b63fc>] ret_from_fork+0x7c/0xb0 [ 1225.883336] [<ffffffff8108c510>] ? 
kthread_create_on_node+0x180/0x180 ---------- I killed mutex_lock() and memory allocation from shrinker functions in drivers/gpu/drm/ttm/ttm_page_alloc[_dma].c because I observed that kswapd0 was blocked for so long at mutex_lock(). If kswapd0 is blocked forever at e.g. mutex_lock() inside shrinker functions, who else can make forward progress? Shouldn't we avoid calling functions which could potentially block for unpredictable duration (e.g. unkillable locks and/or completion) from shrinker functions? > IOWs, when you run CONFIG_XFS_DEBUG=y, you'll often get failures > that are valuable to XFS developers but have no runtime effect on > production systems. Oh, I didn't know this failure is specific to CONFIG_XFS_DEBUG=y ... -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 276+ messages in thread
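Tetsuo's suggestion above — that shrinker callbacks should back off rather than sleep in an unkillable lock — is essentially the trylock pattern. A minimal userspace model (all names hypothetical; an `int` flag stands in for the kernel mutex, and `SHRINK_STOP` mirrors the kernel's "stop scanning this cache" return value):

```c
#include <stddef.h>

#define SHRINK_STOP (-1L)   /* analogue of the kernel's "stop scanning" */

static int pool_locked;     /* models a pool mutex held by another reclaimer */

static int trylock_pool(void) { if (pool_locked) return 0; pool_locked = 1; return 1; }
static void unlock_pool(void) { pool_locked = 0; }

/* Scan callback in the style Tetsuo argues for: if the lock is
 * contended, give up on this cache instead of sleeping in mutex_lock(),
 * so kswapd can move on to other shrinkers and keep making progress. */
long scan_objects_nonblocking(long nr_to_scan)
{
    long freed = 0;

    if (!trylock_pool())
        return SHRINK_STOP;      /* back off; never block reclaim */

    while (nr_to_scan-- > 0)
        freed++;                 /* stands in for freeing pool objects */

    unlock_pool();
    return freed;
}
```

The trade-off Dave raises later in the thread is that backing off like this removes the throttling effect of blocking in the shrinker.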
* Re: How to handle TIF_MEMDIE stalls? 2015-02-27 12:42 ` Tetsuo Handa @ 2015-02-27 13:12 ` Dave Chinner 2015-03-04 12:41 ` Tetsuo Handa 0 siblings, 1 reply; 276+ messages in thread From: Dave Chinner @ 2015-02-27 13:12 UTC (permalink / raw) To: Tetsuo Handa Cc: tytso, rientjes, hannes, mhocko, dchinner, linux-mm, oleg, akpm, mgorman, torvalds, fernando_b1 On Fri, Feb 27, 2015 at 09:42:55PM +0900, Tetsuo Handa wrote: > Dave Chinner wrote: > > On Wed, Feb 25, 2015 at 11:31:17PM +0900, Tetsuo Handa wrote: > > > I got two problems (one is stall at io_schedule() > > > > This is a typical "blame the messenger" bug report. XFS is stuck in > > inode reclaim waiting for log IO completion to occur, along with all > > the other processes in xfs_log_force also stuck waiting for the > > same IO completion. > > I wanted to know whether transaction based reservations can solve these > problems. Inside filesystem layer, I guess you can calculate how much > memory is needed for your filesystem transaction. But I'm wondering > whether we can calculate how much memory is needed inside block layer. > If block layer failed to reserve memory, won't file I/O fail under > extreme memory pressure? And if __GFP_NOFAIL were used inside block > layer, won't the OOM killer deadlock problem arise? > > > > > You need to find where that IO completion that everything is waiting > > on has got stuck or show that it's not a lost IO and actually an > > XFS problem. e.g. has the IO stack got stuck on a mempool somewhere? > > > > I didn't get a vmcore for this stall.
But it seemed to me that > > kworker/3:0H is doing > > xfs_fs_free_cached_objects() > => xfs_reclaim_inodes_nr() > => xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan) > => xfs_reclaim_inode() because mutex_trylock(&pag->pag_ici_reclaim_lock) > was successful > => xfs_iunpin_wait(ip) because xfs_ipincount(ip) was non-zero > => __xfs_iunpin_wait() > => waiting inside io_schedule() for somebody to unpin > > kswapd0 is doing > > xfs_fs_free_cached_objects() > => xfs_reclaim_inodes_nr() > => xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan) > => not calling xfs_reclaim_inode() because > mutex_trylock(&pag->pag_ici_reclaim_lock) failed due to kworker/3:0H > => SYNC_TRYLOCK is dropped for retry loop due to > > if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0) { > trylock = 0; > goto restart; > } > > => calling mutex_lock(&pag->pag_ici_reclaim_lock) and gets blocked > due to kworker/3:0H > > kworker/3:0H is trying to free memory but somebody needs memory to make > forward progress. kswapd0 is also trying to free memory but is blocked by > kworker/3:0H already holding the lock. Since kswapd0 cannot make forward > progress, somebody can't allocate memory. Finally the system started > stalling. Is this decoding correct? Yes. The per-ag lock is a key throttling point for reclaim when there are many more direct reclaimers than there are allocation groups. System performance drops badly in low memory conditions if we have more than one reclaimer operating on an allocation group at a time as they interfere and contend with each other. Effectively multiple reclaimers within the one AG turn ascending offset order inode writeback into random IO, which is orders of magnitude slower than a single thread can clean and reclaim those same inodes. Quite simply: if one thread can't make progress due to being stuck waiting for IO, then another hundred threads trying to do the same operations are unlikely to make progress, either.
Thing is, the IO layer below XFS that appears to be stuck does GFP_NOIO allocations, and therefore direct reclaim for mempool allocation in the block layer cannot get stuck on GFP_FS level reclaim operations.... > I killed mutex_lock() and memory allocation from shrinker functions > in drivers/gpu/drm/ttm/ttm_page_alloc[_dma].c because I observed that > kswapd0 was blocked for so long at mutex_lock(). Which, to me, is fixing a symptom rather than understanding the root cause of why lower layers are not making progress as they are supposed to. > If kswapd0 is blocked forever at e.g. mutex_lock() inside shrinker > functions, who else can make forward progress? You can't get into these filesystem shrinkers when you do GFP_NOIO allocations, as the IO path does. > Shouldn't we avoid calling functions which could potentially block for > unpredictable duration (e.g. unkillable locks and/or completion) from > shrinker functions? No, because otherwise we can't throttle allocation and reclaim to the rate at which IO can clean dirty objects. i.e. we do this for the same reason we throttle page cache dirtying to the rate at which we can clean dirty pages.... Cheers, Dave. -- Dave Chinner david@fromorbit.com
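The escalation Tetsuo decoded above — a trylock pass that restarts as a blocking pass — can be modeled in a few lines of userspace C. This is only a sketch of the control flow, not the real xfs_reclaim_inodes_ag(); the callback and function names are hypothetical:

```c
#include <stdbool.h>

/* Toy model of the restart loop: the first pass uses trylock; if any
 * AG was skipped under SYNC_WAIT, the loop restarts with trylock
 * disabled and can then block behind whoever holds the per-AG lock.
 * "walk" stands in for the per-AG inode walk; it returns true when an
 * AG had to be skipped. */
typedef bool (*walk_fn)(bool trylock);

int passes_until_blocking(walk_fn walk, bool sync_wait)
{
    bool trylock = true;
    int passes = 0;

restart:
    passes++;
    if (walk(trylock) && sync_wait && trylock) {
        trylock = false;   /* second pass takes the mutex unconditionally */
        goto restart;
    }
    return passes;
}

/* An AG whose lock is always held by someone else: the trylock pass
 * skips it, the blocking pass would wait on the holder. */
static bool always_contended(bool trylock)
{
    return trylock;
}
```

With SYNC_WAIT set the loop always reaches the blocking second pass, which is exactly where kswapd0 parked behind kworker/3:0H in the trace.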
* Re: How to handle TIF_MEMDIE stalls? 2015-02-27 13:12 ` Dave Chinner @ 2015-03-04 12:41 ` Tetsuo Handa 2015-03-04 13:25 ` Dave Chinner 0 siblings, 1 reply; 276+ messages in thread From: Tetsuo Handa @ 2015-03-04 12:41 UTC (permalink / raw) To: david Cc: tytso, rientjes, hannes, mhocko, dchinner, linux-mm, oleg, akpm, mgorman, torvalds, fernando_b1 Dave Chinner wrote: > On Fri, Feb 27, 2015 at 09:42:55PM +0900, Tetsuo Handa wrote: > > If kswapd0 is blocked forever at e.g. mutex_lock() inside shrinker > > functions, who else can make forward progress? > > You can't get into these filesystem shrinkers when you do GFP_NOIO > allocations, as the IO path does. > > > Shouldn't we avoid calling functions which could potentially block for > > unpredictable duration (e.g. unkillable locks and/or completion) from > > shrinker functions? > > No, because otherwise we can't throttle allocation and reclaim to > the rate at which IO can clean dirty objects. i.e. we do this for > the same reason we throttle page cache dirtying to the rate at which > we can clean dirty pages.... I'm misunderstanding something. The description for kswapd() function in mm/vmscan.c says "This basically trickles out pages so that we have _some_ free memory available even if there is no other activity that frees anything up". Forever blocking kswapd0 somewhere inside filesystem shrinker functions is equivalent with removing kswapd() function because it also prevents non filesystem shrinker functions from being called by kswapd0, doesn't it? Then, the description will become "We won't have _some_ free memory available if there is no other activity that frees anything up", won't it? Does kswapd0 exist only for reducing the delay caused by reclaiming synchronously? Disabling kswapd0 affects nothing about functionality? The system can make forward progress even if nobody can call non filesystem shrinkers, can't it? 
If yes, then why do we need to add special handling to exclude kswapd0 at the while (unlikely(too_many_isolated(zone, file, sc))) { congestion_wait(BLK_RW_ASYNC, HZ/10); /* We are about to die and free our memory. Return now. */ if (fatal_signal_pending(current)) return SWAP_CLUSTER_MAX; } loop inside shrink_inactive_list()? I can't understand the difference between "kswapd0 sleeping forever at too_many_isolated() loop inside shrink_inactive_list()" and "kswapd0 sleeping forever at mutex_lock() inside xfs_reclaim_inodes_ag()".
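The loop Tetsuo quotes can be modeled as a bounded, re-checking wait: each sleep slice is finite, after which the predicate is evaluated again. A userspace sketch (stub names hypothetical; each loop iteration stands in for one congestion_wait(BLK_RW_ASYNC, HZ/10) timeout):

```c
/* Pretend isolated pages are put back after a few wait slices. */
static int isolated_budget = 3;

static int too_many_isolated_stub(void)
{
    return isolated_budget-- > 0;   /* predicate clears after 3 "timeouts" */
}

/* Model of the shrink_inactive_list() throttle: sleep in bounded
 * slices and re-check, rather than blocking on a lock owned by one
 * specific task. Returns how many wait slices were needed. */
int throttle_direct_reclaim(int (*too_many)(void))
{
    int waits = 0;

    while (too_many())
        waits++;    /* kernel: congestion_wait(BLK_RW_ASYNC, HZ/10) */

    return waits;
}
```

Structurally this differs from mutex_lock() only in that the wakeup condition is re-evaluated periodically instead of depending on one lock holder — which is precisely why Tetsuo finds the distinction unconvincing when the predicate never clears.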
* Re: How to handle TIF_MEMDIE stalls? 2015-03-04 12:41 ` Tetsuo Handa @ 2015-03-04 13:25 ` Dave Chinner 2015-03-04 14:11 ` Tetsuo Handa 0 siblings, 1 reply; 276+ messages in thread From: Dave Chinner @ 2015-03-04 13:25 UTC (permalink / raw) To: Tetsuo Handa Cc: tytso, rientjes, hannes, mhocko, dchinner, linux-mm, oleg, akpm, mgorman, torvalds, fernando_b1 On Wed, Mar 04, 2015 at 09:41:01PM +0900, Tetsuo Handa wrote: > Dave Chinner wrote: > > On Fri, Feb 27, 2015 at 09:42:55PM +0900, Tetsuo Handa wrote: > > > If kswapd0 is blocked forever at e.g. mutex_lock() inside shrinker > > > functions, who else can make forward progress? > > > > You can't get into these filesystem shrinkers when you do GFP_NOIO > > allocations, as the IO path does. > > > > > Shouldn't we avoid calling functions which could potentially block for > > > unpredictable duration (e.g. unkillable locks and/or completion) from > > > shrinker functions? > > > > No, because otherwise we can't throttle allocation and reclaim to > > the rate at which IO can clean dirty objects. i.e. we do this for > > the same reason we throttle page cache dirtying to the rate at which > > we can clean dirty pages.... > > I'm misunderstanding something. The description for kswapd() function > in mm/vmscan.c says "This basically trickles out pages so that we have > _some_ free memory available even if there is no other activity that frees > anything up". Sure. > Forever blocking kswapd0 somewhere inside filesystem shrinker functions is > equivalent with removing kswapd() function because it also prevents non > filesystem shrinker functions from being called by kswapd0, doesn't it? Yes, but that's not intentional. Remember, we keep talking about the filesystem not being able to guarantee forwards progress if allocations block forever? Well... > Then, the description will become "We won't have _some_ free memory available > if there is no other activity that frees anything up", won't it? ... 
we've ended up blocking kswapd because it's waiting on a journal commit to complete, and that journal commit is blocked waiting for forwards progress in memory allocation... Yes, it's another one of those nasty dependencies I keep pointing out that filesystems have, and that can only be solved by guaranteeing we can always make forwards allocation progress from transaction reserve to transaction commit. > Does kswapd0 exist only for reducing the delay caused by reclaiming > synchronously? Disabling kswapd0 affects nothing about functionality? > The system can make forward progress even if nobody can call non filesystem > shrinkers, can't it? The throttling is required to control the unbound parallelism of direct reclaim. If we don't do this, inode cache reclaim causes random inode writeback and thrashes the disks with random IO, causing severe degradation in performance under heavy memory pressure. So we throttle inode reclaim to a single thread per AG so we get nice sequential IO patterns from inode cache reclaim - the difference is that we can reclaim several hundred thousand dirty inodes per second versus a few hundred... And because memory allocation is bound by reclaim speed, we throttle the direct reclaimers to prevent IO breakdown conditions from occurring and hence keep performance under memory pressure relatively high and mostly predictable. It's rare that kswapd actually gets stuck like this - I've only ever seen it once, and I've never had anyone running a production system report deadlocks like this... > I can't understand the difference between "kswapd0 sleeping forever at > too_many_isolated() loop inside shrink_inactive_list()" and "kswapd0 > sleeping forever at mutex_lock() inside xfs_reclaim_inodes_ag()". I don't really care. The direct reclaim behaviour is a much bigger problem, and the risk of occasionally having problems with kswapd is miniscule in comparison. 
Sure, you can provoke it, but unless you are intentionally doing nasty things to production systems, it will never be a problem that you trip over. We can't solve every problem with the current memory allocation/reclaim design - we've chosen the lesser evil here, and we're going to have to live with it until we get a more robust memory allocation subsystem implementation. Cheers, Dave. -- Dave Chinner david@fromorbit.com
* Re: How to handle TIF_MEMDIE stalls? 2015-03-04 13:25 ` Dave Chinner @ 2015-03-04 14:11 ` Tetsuo Handa 2015-03-05 1:36 ` Dave Chinner 0 siblings, 1 reply; 276+ messages in thread From: Tetsuo Handa @ 2015-03-04 14:11 UTC (permalink / raw) To: david Cc: tytso, rientjes, hannes, mhocko, dchinner, linux-mm, oleg, akpm, mgorman, torvalds, fernando_b1 Dave Chinner wrote: > > Forever blocking kswapd0 somewhere inside filesystem shrinker functions is > > equivalent with removing kswapd() function because it also prevents non > > filesystem shrinker functions from being called by kswapd0, doesn't it? > > Yes, but that's not intentional. Remember, we keep talking about the > filesystem not being able to guarantee forwards progress if > allocations block forever? Well... > > > Then, the description will become "We won't have _some_ free memory available > > if there is no other activity that frees anything up", won't it? > > ... we've ended up blocking kswapd because it's waiting on a journal > commit to complete, and that journal commit is blocked waiting for > forwards progress in memory allocation... > > Yes, it's another one of those nasty dependencies I keep pointing > out that filesystems have, and that can only be solved by > guaranteeing we can always make forwards allocation progress from > transaction reserve to transaction commit. If this is an unexpected deadlock, don't we want below change for xfs_reclaim_inodes_ag() ? - if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0) { + if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0 && !current_is_kswapd()) { trylock = 0; goto restart; } > It's rare that kswapd actually gets stuck like this - I've only ever > seen it once, and I've never had anyone running a production system > report deadlocks like this... I guess we will unlikely see this again, for so far this is observed with only Linux 3.19 which lacks commit cc87317726f8 ("mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change"). 
* Re: How to handle TIF_MEMDIE stalls? 2015-03-04 14:11 ` Tetsuo Handa @ 2015-03-05 1:36 ` Dave Chinner 0 siblings, 0 replies; 276+ messages in thread From: Dave Chinner @ 2015-03-05 1:36 UTC (permalink / raw) To: Tetsuo Handa Cc: tytso, rientjes, hannes, mhocko, dchinner, linux-mm, oleg, akpm, mgorman, torvalds, fernando_b1 On Wed, Mar 04, 2015 at 11:11:48PM +0900, Tetsuo Handa wrote: > Dave Chinner wrote: > > > Forever blocking kswapd0 somewhere inside filesystem shrinker functions is > > > equivalent with removing kswapd() function because it also prevents non > > > filesystem shrinker functions from being called by kswapd0, doesn't it? > > > > Yes, but that's not intentional. Remember, we keep talking about the > > filesystem not being able to guarantee forwards progress if > > allocations block forever? Well... > > > > > Then, the description will become "We won't have _some_ free memory available > > > if there is no other activity that frees anything up", won't it? > > > > ... we've ended up blocking kswapd because it's waiting on a journal > > commit to complete, and that journal commit is blocked waiting for > > forwards progress in memory allocation... > > > > Yes, it's another one of those nasty dependencies I keep pointing > > out that filesystems have, and that can only be solved by > > guaranteeing we can always make forwards allocation progress from > > transaction reserve to transaction commit. > > If this is an unexpected deadlock, don't we want below change for > xfs_reclaim_inodes_ag() ? > > - if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0) { > + if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0 && !current_is_kswapd()) { > trylock = 0; > goto restart; > } What, so when direct reclaim has choked up all inode reclaim slots completely kswapd just burns CPU spinning while it fails to make progress? Besides, that does not address the actual issue that caused kswapd to block on a log force. 
That's caused by the SYNC_WAIT flag telling reclaim to wait for IO completion - this is the reclaim throttling mechanism we need to prevent reclaim from degrading to random IO patterns and completely trashing reclaim rates. Hence reclaiming an inode waits in xfs_iunpin_wait() for the log to be flushed before reclaiming an inode that is pinned by an unflushed transaction. This works because there is also a background reclaim worker running doing fast, highly efficient, sequential order, non-blocking asynchronous inode writeback. Hence, more often than not, reclaim does not block on more than one dirty inode per scan because the rest of the inodes it walks have already been cleaned and are ready for immediate reclaim. We have multiple layers of reclaim work going on in XFS even within each cache/shrinker infrastructure. Indeed, if I start having to explain how this inode shrinker algorithm ties back into journal tail pushing to optimise async metadata flushing so that the XFS buffer cache shrinker hits clean inode buffers and hence can reclaim the memory the inode shrinker consumes doing inode writeback as quickly as possible, then I think heads might start to explode. Cheers, Dave. -- Dave Chinner david@fromorbit.com
* Re: How to handle TIF_MEMDIE stalls? 2015-02-16 11:23 ` Tetsuo Handa 2015-02-16 15:42 ` Johannes Weiner @ 2015-02-17 16:33 ` Michal Hocko 1 sibling, 0 replies; 276+ messages in thread From: Michal Hocko @ 2015-02-17 16:33 UTC (permalink / raw) To: Tetsuo Handa Cc: david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, hannes, torvalds On Mon 16-02-15 20:23:16, Tetsuo Handa wrote: [...] > (1) Make several locks killable. > > On Linux 3.19, running below command line as an unprivileged user > on a system with 4 CPUs / 2GB RAM / no swap can make the system unusable. > > $ for i in `seq 1 100`; do dd if=/dev/zero of=/tmp/file bs=104857600 count=100 & done > [...] > This is because the OOM killer happily tries to kill a process which is > blocked at unkillable mutex_lock(). If locks shown above were killable, > we can reduce the possibility of getting stuck. > > I didn't check whether it has livelocked or not. But too slow to wait is > not acceptable. Well, you are beating your machine to death so you can hardly get any time guarantee. It would be nice to have a better feedback mechanism to know when to back off and fail the allocation attempt which might be blocking an OOM victim from passing away. This is extremely tricky because we shouldn't be too eager to fail just because of a sudden memory pressure. > Oh, why every thread trying to allocate memory has to repeat > the loop that might defer somebody who can make progress if CPU time was > given? I guess you are talking about direct reclaim and the whole priority loop? Well, this is what I was talking about above. Sometimes we really have to go down to low priorities and basically scan the world in order to find something reclaimable. If we bail out too early we might see premature allocation failures, which could lead to reduced QoS. > I wish only somebody like kswapd repeats the loop on behalf of all > threads waiting at memory allocation slowpath...
This is the case when the kswapd is _able_ to cope with the memory pressure. [...] > (3) Replace kmalloc() with kmalloc_nofail() and kmalloc_noretry(). > > Currently small allocations are implicitly treated like __GFP_NOFAIL > unless TIF_MEMDIE is set. But silently changing small allocations like > __GFP_NORETRY will cause obscure bugs. If TIF_MEMDIE timeout is implemented, > we will no longer worry about unkillable tasks which is retrying forever at > memory allocation; instead we kill more OOM victims and satisfy the request. I think this is a bad approach. GFP_KERNEL != __GFP_NORETRY and we should treat it like that. Killing more victims is a bad solution because it doesn't guarantee any progress (just look at your example of hundreds processes with large RSS hammering the same file - you would have to kill all of them at once). Besides that any timeout solution is prone to unexpected delays due to reasons which are not related to the allocation latency. > Therefore, we could introduce kmalloc_nofail(size, gfp) which does > kmalloc(size, gfp | __GFP_NOFAIL) (i.e. invoke the OOM killer) and > kmalloc_noretry(size, gfp) which does kmalloc(size, gfp | __GFP_NORETRY) > (i.e. do not invoke the OOM killer), and switch from kmalloc() to either > kmalloc_noretry() or kmalloc_nofail(). This sounds like a major overkill. We already have gfp flags for that. What would this buy us? > Those who are doing smaller than > PAGE_SIZE bytes allocations would wish to switch from kmalloc() to > kmalloc_nofail() and eliminate untested memory allocation failure paths. nofail allocations should be discouraged and used only if any other measure would fail. > Those who are well prepared for memory allocation failures would wish to > switch from kmalloc() to kmalloc_noretry(). Eventually, kmalloc() which is > implicitly treating small allocations like __GFP_NOFAIL and invoking the > OOM killer will be abolished. 
-- Michal Hocko SUSE Labs
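The helpers Tetsuo proposes (and Michal pushes back on) would presumably be thin wrappers that make the retry policy explicit at each call site. A userspace model — kmalloc is stubbed out, the flag values are illustrative, and none of these names exist in mainline:

```c
#include <stdlib.h>

typedef unsigned int gfp_t;
#define __GFP_NORETRY 0x1000u   /* illustrative values, not the kernel's */
#define __GFP_NOFAIL  0x2000u

static gfp_t last_flags;        /* test hook standing in for the allocator */

static void *kmalloc(size_t size, gfp_t gfp)
{
    last_flags = gfp;
    return malloc(size);
}

/* Proposed wrappers: spell out the retry policy instead of relying on
 * the implicit "small allocations never fail" behaviour of kmalloc(). */
static void *kmalloc_nofail(size_t size, gfp_t gfp)
{
    return kmalloc(size, gfp | __GFP_NOFAIL);    /* may invoke OOM killer */
}

static void *kmalloc_noretry(size_t size, gfp_t gfp)
{
    return kmalloc(size, gfp | __GFP_NORETRY);   /* fails instead of OOM */
}
```

Michal's objection stands regardless of the spelling: the gfp flags already express both policies, so the wrappers would add API surface without adding capability.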
* [PATCH] mm: get rid of radix tree gfp mask for pagecache_get_page (was: Re: How to handle TIF_MEMDIE stalls?) 2014-12-20 22:35 ` Dave Chinner 2014-12-21 8:45 ` Tetsuo Handa @ 2014-12-29 17:40 ` Michal Hocko 2014-12-29 18:45 ` Linus Torvalds 1 sibling, 1 reply; 276+ messages in thread From: Michal Hocko @ 2014-12-29 17:40 UTC (permalink / raw) To: Dave Chinner Cc: Tetsuo Handa, dchinner, linux-mm, rientjes, oleg, Andrew Morton, Mel Gorman, Johannes Weiner, Linus Torvalds On Sun 21-12-14 09:35:04, Dave Chinner wrote: [...] > Oh, boy. > > struct page *grab_cache_page_write_begin(struct address_space *mapping, > pgoff_t index, unsigned flags) > { > struct page *page; > int fgp_flags = FGP_LOCK|FGP_ACCESSED|FGP_WRITE|FGP_CREAT; > > if (flags & AOP_FLAG_NOFS) > fgp_flags |= FGP_NOFS; > > page = pagecache_get_page(mapping, index, fgp_flags, > mapping_gfp_mask(mapping), > GFP_KERNEL); > if (page) > wait_for_stable_page(page); > > return page; > } > > There are *3* different memory allocation controls passed to > pagecache_get_page. The first is via AOP_FLAG_NOFS, where the caller > explicitly says this allocation is in filesystem context with locks > held, and so all allocations need to be done in GFP_NOFS context. > This is used to override the second and third gfp parameters. > > The second is mapping_gfp_mask(mapping), which is the *default > allocation context* the filesystem wants the page cache to use for > allocating pages to the mapping. > > The third is a hard coded GFP_KERNEL, which is used for radix tree > node allocation. > > Why are there separate allocation contexts for the radix tree nodes > and the page cache pages when they are done under *exactly the same > caller context*? Either we are allowed to recurse into the > filesystem or we aren't, and the inode mapping mask defines that > context for all page cache allocations, not just the pages > themselves. 
> > And to point out how many filesystems this affects, > the loop device, btrfs, f2fs, gfs2, jfs, logfs, nil2fs, reiserfs > and XFS all use this mapping default to clear __GFP_FS from > page cache allocations. Only ext4 and gfs2 use AOP_FLAG_NOFS in > their ->write_begin callouts to prevent recursion. > > IOWs, grab_cache_page_write_begin/pagecache_get_page multiple > allocation contexts are just wrong. It does not match the way > filesystems are informing the page cache of allocation context to > avoid recursion (for avoiding stack overflow and/or deadlock). > AOP_FLAG_NOFS should go away, and all filesystems should modify the > mapping gfp mask to set their allocation context. It should be used > *everywhere* pages are allocated into the page cache, and for all > allocations related to tracking those allocated pages. I guess the following would be a first simple step to remove the bug you are mentioning above. It would be simple enough to put into stable as well. What do you think? ---
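The convention Dave prefers — the filesystem encoding its allocation context once in the mapping, instead of per-call flags — reduces to a one-time mask update at inode setup. A userspace sketch with illustrative mask values (the real kernel helpers are mapping_set_gfp_mask()/mapping_gfp_mask(); the fs_setup_inode() name here is hypothetical):

```c
typedef unsigned int gfp_t;
#define __GFP_FS   0x80u                  /* illustrative bit values */
#define GFP_KERNEL (0x10u | 0x20u | __GFP_FS)

struct address_space { gfp_t gfp_mask; };

static void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
{
    m->gfp_mask = mask;
}

static gfp_t mapping_gfp_mask(const struct address_space *m)
{
    return m->gfp_mask;
}

/* What an XFS-style inode setup does once: clear __GFP_FS so that every
 * later page-cache (and, after Michal's patch, radix-tree) allocation
 * against this mapping inherits the non-recursing context. */
static void fs_setup_inode(struct address_space *m)
{
    mapping_set_gfp_mask(m, mapping_gfp_mask(m) & ~__GFP_FS);
}
```

With the context stored in one place, callers such as grab_cache_page_write_begin() no longer need a separate AOP_FLAG_NOFS channel or a hard-coded radix-tree mask — which is exactly the cleanup Michal's patch below performs.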
* Re: [PATCH] mm: get rid of radix tree gfp mask for pagecache_get_page (was: Re: How to handle TIF_MEMDIE stalls?) 2014-12-29 17:40 ` [PATCH] mm: get rid of radix tree gfp mask for pagecache_get_page (was: Re: How to handle TIF_MEMDIE stalls?) Michal Hocko @ 2014-12-29 18:45 ` Linus Torvalds 2014-12-29 19:33 ` Michal Hocko 0 siblings, 1 reply; 276+ messages in thread From: Linus Torvalds @ 2014-12-29 18:45 UTC (permalink / raw) To: Michal Hocko Cc: Dave Chinner, Tetsuo Handa, Dave Chinner, linux-mm, David Rientjes, Oleg Nesterov, Andrew Morton, Mel Gorman, Johannes Weiner So I think this patch is definitely going in the right direction, but at least the __GFP_WRITE handling is insane: (Patch edited to show the resulting code, without the old deleted lines) On Mon, Dec 29, 2014 at 9:40 AM, Michal Hocko <mhocko@suse.cz> wrote: > @@ -1105,13 +1102,11 @@ no_page: > if (!page && (fgp_flags & FGP_CREAT)) { > int err; > if ((fgp_flags & FGP_WRITE) && mapping_cap_account_dirty(mapping)) > + gfp_mask |= __GFP_WRITE; > + if (fgp_flags & FGP_NOFS) > + gfp_mask &= ~__GFP_FS; > > + page = __page_cache_alloc(gfp_mask); > if (!page) > return NULL; > > @@ -1122,7 +1117,7 @@ no_page: > if (fgp_flags & FGP_ACCESSED) > __SetPageReferenced(page); > > + err = add_to_page_cache_lru(page, mapping, offset, gfp_mask); Passing __GFP_WRITE into the radix tree allocation routines is not sane. So you'd have to mask the bit out again here (unconditionally is fine). But other than that this seems to be a sane cleanup. Linus
* Re: [PATCH] mm: get rid of radix tree gfp mask for pagecache_get_page (was: Re: How to handle TIF_MEMDIE stalls?) 2014-12-29 18:45 ` Linus Torvalds @ 2014-12-29 19:33 ` Michal Hocko 2014-12-30 13:42 ` Michal Hocko 0 siblings, 1 reply; 276+ messages in thread From: Michal Hocko @ 2014-12-29 19:33 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Chinner, Tetsuo Handa, Dave Chinner, linux-mm, David Rientjes, Oleg Nesterov, Andrew Morton, Mel Gorman, Johannes Weiner On Mon 29-12-14 10:45:22, Linus Torvalds wrote: > So I think this patch is definitely going in the right direction, but > at least the __GFP_WRITE handling is insane: > > (Patch edited to show the resulting code, without the old deleted lines) > > On Mon, Dec 29, 2014 at 9:40 AM, Michal Hocko <mhocko@suse.cz> wrote: > > @@ -1105,13 +1102,11 @@ no_page: > > if (!page && (fgp_flags & FGP_CREAT)) { > > int err; > > if ((fgp_flags & FGP_WRITE) && mapping_cap_account_dirty(mapping)) > > + gfp_mask |= __GFP_WRITE; > > + if (fgp_flags & FGP_NOFS) > > + gfp_mask &= ~__GFP_FS; > > > > + page = __page_cache_alloc(gfp_mask); > > if (!page) > > return NULL; > > > > @@ -1122,7 +1117,7 @@ no_page: > > if (fgp_flags & FGP_ACCESSED) > > __SetPageReferenced(page); > > > > + err = add_to_page_cache_lru(page, mapping, offset, gfp_mask); > > Passing __GFP_WRITE into the radix tree allocation routines is not > sane. So you'd have to mask the bit out again here (unconditionally is > fine). Good point! ---
* Re: [PATCH] mm: get rid of radix tree gfp mask for pagecache_get_page (was: Re: How to handle TIF_MEMDIE stalls?) 2014-12-29 19:33 ` Michal Hocko @ 2014-12-30 13:42 ` Michal Hocko 2014-12-30 21:45 ` Linus Torvalds 0 siblings, 1 reply; 276+ messages in thread From: Michal Hocko @ 2014-12-30 13:42 UTC (permalink / raw) To: Andrew Morton Cc: Dave Chinner, Tetsuo Handa, Dave Chinner, linux-mm, David Rientjes, Oleg Nesterov, Linus Torvalds, Mel Gorman, Johannes Weiner Andrew, I've noticed you have taken the patch to mm tree already. I have realized I haven't marked it for stable which is worth it IMO because debugging nasty reclaim recursion bugs is definitely a pain and might fix one and even if it doesn't it is rather straightforward and shouldn't break anything. So if nobody has anything against I would mark this for stable 3.16+ AFAICS. On Mon 29-12-14 20:33:12, Michal Hocko wrote: > From 3242f56ae8886a3c605d93960e77176dfe1dff43 Mon Sep 17 00:00:00 2001 > From: Michal Hocko <mhocko@suse.cz> > Date: Mon, 29 Dec 2014 20:30:35 +0100 > Subject: [PATCH] mm: get rid of radix tree gfp mask for pagecache_get_page > > 2457aec63745 (mm: non-atomically mark page accessed during page cache > allocation where possible) has added a separate parameter for specifying > gfp mask for radix tree allocations. > > Not only this is less than optimal from the API point of view > because it is error prone, it is also buggy currently because > grab_cache_page_write_begin is using GFP_KERNEL for radix tree and > if fgp_flags doesn't contain FGP_NOFS (mostly controlled by fs by > AOP_FLAG_NOFS flag) but the mapping_gfp_mask has __GFP_FS cleared then > the radix tree allocation wouldn't obey the restriction and might > recurse into filesystem and cause deadlocks. This is the case for > most filesystems unfortunately because only ext4 and gfs2 are using > AOP_FLAG_NOFS. 
>
> Let's simply remove the radix_gfp_mask parameter because the allocation
> context is the same for both the page cache and the radix tree. Just
> make sure that the radix tree gets only the sane subset of the mask
> (e.g. do not pass __GFP_WRITE).
>
> Long term it is more preferable to convert the remaining users of
> AOP_FLAG_NOFS to use mapping_gfp_mask instead and simplify this
> interface even further.
>
> Reported-by: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>  include/linux/pagemap.h | 13 ++++++-------
>  mm/filemap.c            | 29 ++++++++++++-----------------
>  2 files changed, 18 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 7ea069cd3257..4b3736f7065c 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -251,7 +251,7 @@ pgoff_t page_cache_prev_hole(struct address_space *mapping,
>  #define FGP_NOWAIT		0x00000020
>  
>  struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
> -		int fgp_flags, gfp_t cache_gfp_mask, gfp_t radix_gfp_mask);
> +		int fgp_flags, gfp_t cache_gfp_mask);
>  
>  /**
>   * find_get_page - find and get a page reference
> @@ -266,13 +266,13 @@ struct page *pagecache_get_page(struct address_space *mapping,
>  static inline struct page *find_get_page(struct address_space *mapping,
>  					pgoff_t offset)
>  {
> -	return pagecache_get_page(mapping, offset, 0, 0, 0);
> +	return pagecache_get_page(mapping, offset, 0, 0);
>  }
>  
>  static inline struct page *find_get_page_flags(struct address_space *mapping,
>  					pgoff_t offset, int fgp_flags)
>  {
> -	return pagecache_get_page(mapping, offset, fgp_flags, 0, 0);
> +	return pagecache_get_page(mapping, offset, fgp_flags, 0);
>  }
>  
>  /**
> @@ -292,7 +292,7 @@ static inline struct page *find_get_page_flags(struct address_space *mapping,
>  static inline struct page *find_lock_page(struct address_space *mapping,
>  					pgoff_t offset)
>  {
> -	return pagecache_get_page(mapping, offset, FGP_LOCK, 0, 0);
> +	return pagecache_get_page(mapping, offset, FGP_LOCK, 0);
>  }
>  
>  /**
> @@ -319,7 +319,7 @@ static inline struct page *find_or_create_page(struct address_space *mapping,
>  {
>  	return pagecache_get_page(mapping, offset,
>  					FGP_LOCK|FGP_ACCESSED|FGP_CREAT,
> -					gfp_mask, gfp_mask & GFP_RECLAIM_MASK);
> +					gfp_mask);
>  }
>  
>  /**
> @@ -340,8 +340,7 @@ static inline struct page *grab_cache_page_nowait(struct address_space *mapping,
>  {
>  	return pagecache_get_page(mapping, index,
>  			FGP_LOCK|FGP_CREAT|FGP_NOFS|FGP_NOWAIT,
> -			mapping_gfp_mask(mapping),
> -			GFP_NOFS);
> +			mapping_gfp_mask(mapping));
>  }
>  
>  struct page *find_get_entry(struct address_space *mapping, pgoff_t offset);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index e8905bc3cbd7..11477d3b7838 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1046,8 +1046,7 @@ EXPORT_SYMBOL(find_lock_entry);
>   * @mapping: the address_space to search
>   * @offset: the page index
>   * @fgp_flags: PCG flags
> - * @cache_gfp_mask: gfp mask to use for the page cache data page allocation
> - * @radix_gfp_mask: gfp mask to use for radix tree node allocation
> + * @gfp_mask: gfp mask to use for the page cache data page allocation
>   *
>   * Looks up the page cache slot at @mapping & @offset.
>   *
> @@ -1056,11 +1055,9 @@ EXPORT_SYMBOL(find_lock_entry);
>   * FGP_ACCESSED: the page will be marked accessed
>   * FGP_LOCK: Page is return locked
>   * FGP_CREAT: If page is not present then a new page is allocated using
> - *		@cache_gfp_mask and added to the page cache and the VM's LRU
> - *		list. If radix tree nodes are allocated during page cache
> - *		insertion then @radix_gfp_mask is used. The page is returned
> - *		locked and with an increased refcount. Otherwise, %NULL is
> - *		returned.
> + *		@gfp_mask and added to the page cache and the VM's LRU
> + *		list. The page is returned locked and with an increased
> + *		refcount. Otherwise, %NULL is returned.
>   *
>   * If FGP_LOCK or FGP_CREAT are specified then the function may sleep even
>   * if the GFP flags specified for FGP_CREAT are atomic.
> @@ -1068,7 +1065,7 @@ EXPORT_SYMBOL(find_lock_entry);
>   * If there is a page cache page, it is returned with an increased refcount.
>   */
>  struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
> -	int fgp_flags, gfp_t cache_gfp_mask, gfp_t radix_gfp_mask)
> +	int fgp_flags, gfp_t gfp_mask)
>  {
>  	struct page *page;
>  
> @@ -1105,13 +1102,11 @@ no_page:
>  	if (!page && (fgp_flags & FGP_CREAT)) {
>  		int err;
>  		if ((fgp_flags & FGP_WRITE) && mapping_cap_account_dirty(mapping))
> -			cache_gfp_mask |= __GFP_WRITE;
> -		if (fgp_flags & FGP_NOFS) {
> -			cache_gfp_mask &= ~__GFP_FS;
> -			radix_gfp_mask &= ~__GFP_FS;
> -		}
> +			gfp_mask |= __GFP_WRITE;
> +		if (fgp_flags & FGP_NOFS)
> +			gfp_mask &= ~__GFP_FS;
>  
> -		page = __page_cache_alloc(cache_gfp_mask);
> +		page = __page_cache_alloc(gfp_mask);
>  		if (!page)
>  			return NULL;
>  
> @@ -1122,7 +1117,8 @@ no_page:
>  		if (fgp_flags & FGP_ACCESSED)
>  			__SetPageReferenced(page);
>  
> -		err = add_to_page_cache_lru(page, mapping, offset, radix_gfp_mask);
> +		err = add_to_page_cache_lru(page, mapping, offset,
> +				gfp_mask & GFP_RECLAIM_MASK);
>  		if (unlikely(err)) {
>  			page_cache_release(page);
>  			page = NULL;
> @@ -2443,8 +2439,7 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
>  		fgp_flags |= FGP_NOFS;
>  
>  	page = pagecache_get_page(mapping, index, fgp_flags,
> -			mapping_gfp_mask(mapping),
> -			GFP_KERNEL);
> +			mapping_gfp_mask(mapping));
>  	if (page)
>  		wait_for_stable_page(page);
>  
> --
> 2.1.4
>
> --
> Michal Hocko
> SUSE Labs
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: email@kvack.org

--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm: get rid of radix tree gfp mask for pagecache_get_page (was: Re: How to handle TIF_MEMDIE stalls?)

From: Linus Torvalds @ 2014-12-30 21:45 UTC
To: Michal Hocko
Cc: Andrew Morton, Dave Chinner, Tetsuo Handa, Dave Chinner, linux-mm, David Rientjes, Oleg Nesterov, Mel Gorman, Johannes Weiner

On Tue, Dec 30, 2014 at 5:42 AM, Michal Hocko <mhocko@suse.cz> wrote:
>
> I've noticed you have taken the patch to mm tree already. I have
> realized I haven't marked it for stable which is worth it IMO because
> debugging nasty reclaim recursion bugs is definitely a pain and might
> fix one and even if it doesn't it is rather straightforward and
> shouldn't break anything. So if nobody has anything against I would mark
> this for stable 3.16+ AFAICS.

I already applied it (as commit 45f87de57f8f), so if you think it's
stable material - and I agree that it looks that way - you should just
email stable@vger.kernel.org about it.

I think it might be a good idea to wait a week or two to make sure it
doesn't have any unexpected side effects.

    Linus
end of thread, other threads:[~2015-03-14 13:53 UTC | newest]

Thread overview: 276+ messages (download: mbox.gz / follow: Atom feed)