* [RFC PATCH] oom: Don't count on mm-less current process.
@ 2014-12-12 13:54 Tetsuo Handa
  2014-12-16 12:47 ` Michal Hocko
  0 siblings, 1 reply; 276+ messages in thread
From: Tetsuo Handa @ 2014-12-12 13:54 UTC (permalink / raw)
  To: linux-mm; +Cc: mhocko, rientjes, oleg

From 29d0b34a1c60e91ace8e1208a415ca371e6851fe Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 12 Dec 2014 21:29:06 +0900
Subject: [PATCH] oom: Don't count on mm-less current process.

out_of_memory() doesn't trigger the OOM killer if the current task is already
exiting or has fatal signals pending, and instead gives the task access to
memory reserves. This is done to prevent the livelocks described by
commit 9ff4868e3051d912 ("mm, oom: allow exiting threads to have access to
memory reserves") and commit 7b98c2e402eaa1f2 ("oom: give current access to
memory reserves if it has been killed"), as well as to avoid unnecessarily
killing other tasks, on the heuristic that the current task will finish
soon and release its resources.

However, this heuristic doesn't work as expected when out_of_memory() is
triggered by an allocation after the current task has already released
its memory in exit_mm() (e.g. from exit_task_work()), because the task
might livelock waiting for memory which is never released while other
tasks sit on a lot of memory.

Therefore, perform the same checks as in the sysctl_oom_kill_allocating_task
case before giving the current task access to memory reserves.

Note that this patch cannot prevent somebody from calling oom_kill_process()
on a victim task that has already got the PF_EXITING flag and released its
memory. This means that the OOM killer stays disabled for an unpredictable
duration when the victim task is unkillable due to a dependency invisible
to the OOM killer (e.g. waiting for a lock held by somebody else) after
somebody set the TIF_MEMDIE flag on the victim task by calling
oom_kill_process(). Unfortunately, a local unprivileged user can make the
victim task unkillable on purpose. There are two approaches for mitigating
this problem. A workaround is a sysctl-tunable panic on TIF_MEMDIE timeout
(detect DoS attacks and react; easy to backport; works for memory depletion
bugs caused by kernel code). The preferred fix is complete kernel memory
allocation tracking (try to avoid the DoS, but do nothing when avoidance
fails; hard to backport; works for memory depletion attacks caused by user
programs). Either way, that is beyond what this patch can do.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/oom.h |  3 +++
 mm/memcontrol.c     |  8 +++++++-
 mm/oom_kill.c       | 12 +++++++++---
 3 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 4971874..eee5802 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -64,6 +64,9 @@ extern void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_flags);
 extern void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
 			       int order, const nodemask_t *nodemask);
 
+extern bool oom_unkillable_task(struct task_struct *p,
+				struct mem_cgroup *memcg,
+				const nodemask_t *nodemask);
 extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 		unsigned long totalpages, const nodemask_t *nodemask,
 		bool force_kill);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c6ac50e..6d9532d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1558,8 +1558,14 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * If current has a pending SIGKILL or is exiting, then automatically
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
+	 *
+	 * However, if current is calling out_of_memory() by doing memory
+	 * allocation from e.g. exit_task_work() in do_exit() after PF_EXITING
+	 * was set by exit_signals() and mm was released by exit_mm(), it is
+	 * wrong to expect current to exit and free its memory quickly.
 	 */
-	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
+	if ((fatal_signal_pending(current) || current->flags & PF_EXITING) &&
+	    current->mm && !oom_unkillable_task(current, memcg, NULL)) {
 		set_thread_flag(TIF_MEMDIE);
 		return;
 	}
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 481d550..01719d6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -118,8 +118,8 @@ found:
 }
 
 /* return true if the task is not adequate as candidate victim task. */
-static bool oom_unkillable_task(struct task_struct *p,
-		struct mem_cgroup *memcg, const nodemask_t *nodemask)
+bool oom_unkillable_task(struct task_struct *p, struct mem_cgroup *memcg,
+			 const nodemask_t *nodemask)
 {
 	if (is_global_init(p))
 		return true;
@@ -649,8 +649,14 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 * If current has a pending SIGKILL or is exiting, then automatically
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
+	 *
+	 * However, if current is calling out_of_memory() by doing memory
+	 * allocation from e.g. exit_task_work() in do_exit() after PF_EXITING
+	 * was set by exit_signals() and mm was released by exit_mm(), it is
+	 * wrong to expect current to exit and free its memory quickly.
 	 */
-	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
+	if ((fatal_signal_pending(current) || task_will_free_mem(current)) &&
+	    current->mm && !oom_unkillable_task(current, NULL, nodemask)) {
 		set_thread_flag(TIF_MEMDIE);
 		return;
 	}
-- 
1.8.3.1



* Re: [RFC PATCH] oom: Don't count on mm-less current process.
  2014-12-12 13:54 [RFC PATCH] oom: Don't count on mm-less current process Tetsuo Handa
@ 2014-12-16 12:47 ` Michal Hocko
  2014-12-17 11:54   ` Tetsuo Handa
  0 siblings, 1 reply; 276+ messages in thread
From: Michal Hocko @ 2014-12-16 12:47 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg

On Fri 12-12-14 22:54:53, Tetsuo Handa wrote:
> From 29d0b34a1c60e91ace8e1208a415ca371e6851fe Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Fri, 12 Dec 2014 21:29:06 +0900
> Subject: [PATCH] oom: Don't count on mm-less current process.
> 
> out_of_memory() doesn't trigger OOM killer if the current task is already
> exiting or it has fatal signals pending, and gives the task access to
> memory reserves instead. This is done to prevent from livelocks described by
> commit 9ff4868e3051d912 ("mm, oom: allow exiting threads to have access to
> memory reserves") and commit 7b98c2e402eaa1f2 ("oom: give current access to
> memory reserves if it has been killed") as well as to prevent from unnecessary
> killing of other tasks, with heuristic that the current task would finish
> soon and release its resources.
> 
> However, this heuristic doesn't work as expected when out_of_memory() is
> triggered by an allocation after the current task has already released
> its memory in exit_mm() (e.g. from exit_task_work()) because it might
> livelock waiting for a memory which gets never released while there are
> other tasks sitting on a lot of memory.
> 
> Therefore, consider doing checks as with sysctl_oom_kill_allocating_task
> case before giving the current task access to memory reserves.

The most important part is to check whether current still has its
address space. So please be explicit about that; referring to a sysctl
without saying what the check is doesn't help much. Besides that, I do
not think the oom_unkillable_task check you have added is really correct.
See below.

> Note that this patch cannot prevent somebody from calling oom_kill_process()
> with a victim task when the victim task already got PF_EXITING flag and
> released its memory. This means that the OOM killer is kept disabled for
> unpredictable duration when the victim task is unkillable due to dependency
> which is invisible to the OOM killer (e.g. waiting for lock held by somebody)
> after somebody set TIF_MEMDIE flag on the victim task by calling
> oom_kill_process(). What is unfortunate, a local unprivileged user can make
> the victim task unkillable on purpose. There are two approaches for mitigating
> this problem. Workaround is to use sysctl-tunable panic on TIF_MEMDIE timeout
> (Detect DoS attacks and react. Easy to backport. Works for memory depletion
> bugs caused by kernel code.) and preferred fix is to develop complete kernel
> memory allocation tracking (Try to avoid DoS but do nothing when failed to
> avoid. Hard to backport. Works for memory depletion attacks caused by user
> programs). Anyway that's beyond what this patch can do.

And I think the whole paragraph is not really relevant to the patch.

> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  include/linux/oom.h |  3 +++
>  mm/memcontrol.c     |  8 +++++++-
>  mm/oom_kill.c       | 12 +++++++++---
>  3 files changed, 19 insertions(+), 4 deletions(-)
> 
[...]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c6ac50e..6d9532d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1558,8 +1558,14 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 * If current has a pending SIGKILL or is exiting, then automatically
>  	 * select it.  The goal is to allow it to allocate so that it may
>  	 * quickly exit and free its memory.
> +	 *
> +	 * However, if current is calling out_of_memory() by doing memory
> +	 * allocation from e.g. exit_task_work() in do_exit() after PF_EXITING
> +	 * was set by exit_signals() and mm was released by exit_mm(), it is
> +	 * wrong to expect current to exit and free its memory quickly.
>  	 */
> -	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
> +	if ((fatal_signal_pending(current) || current->flags & PF_EXITING) &&
> +	    current->mm && !oom_unkillable_task(current, memcg, NULL)) {
>  		set_thread_flag(TIF_MEMDIE);
>  		return;
>  	}

Why do you check oom_unkillable_task for the memcg OOM killer?

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 481d550..01719d6 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
[...]
> @@ -649,8 +649,14 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  	 * If current has a pending SIGKILL or is exiting, then automatically
>  	 * select it.  The goal is to allow it to allocate so that it may
>  	 * quickly exit and free its memory.
> +	 *
> +	 * However, if current is calling out_of_memory() by doing memory
> +	 * allocation from e.g. exit_task_work() in do_exit() after PF_EXITING
> +	 * was set by exit_signals() and mm was released by exit_mm(), it is
> +	 * wrong to expect current to exit and free its memory quickly.
>  	 */
> -	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
> +	if ((fatal_signal_pending(current) || task_will_free_mem(current)) &&
> +	    current->mm && !oom_unkillable_task(current, NULL, nodemask)) {
>  		set_thread_flag(TIF_MEMDIE);
>  		return;
>  	}

Calling oom_unkillable_task doesn't make much sense to me. Even if it made
sense, it should be in a separate patch, no?
-- 
Michal Hocko
SUSE Labs



* Re: [RFC PATCH] oom: Don't count on mm-less current process.
  2014-12-16 12:47 ` Michal Hocko
@ 2014-12-17 11:54   ` Tetsuo Handa
  2014-12-17 13:08     ` Michal Hocko
  0 siblings, 1 reply; 276+ messages in thread
From: Tetsuo Handa @ 2014-12-17 11:54 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, oleg

Michal Hocko wrote:
> > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > ---
> >  include/linux/oom.h |  3 +++
> >  mm/memcontrol.c     |  8 +++++++-
> >  mm/oom_kill.c       | 12 +++++++++---
> >  3 files changed, 19 insertions(+), 4 deletions(-)
> >
> [...]
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index c6ac50e..6d9532d 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1558,8 +1558,14 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >       * If current has a pending SIGKILL or is exiting, then automatically
> >       * select it.  The goal is to allow it to allocate so that it may
> >       * quickly exit and free its memory.
> > +     *
> > +     * However, if current is calling out_of_memory() by doing memory
> > +     * allocation from e.g. exit_task_work() in do_exit() after PF_EXITING
> > +     * was set by exit_signals() and mm was released by exit_mm(), it is
> > +     * wrong to expect current to exit and free its memory quickly.
> >       */
> > -     if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
> > +     if ((fatal_signal_pending(current) || current->flags & PF_EXITING) &&
> > +         current->mm && !oom_unkillable_task(current, memcg, NULL)) {
> >            set_thread_flag(TIF_MEMDIE);
> >            return;
> >       }
>
> Why do you check oom_unkillable_task for memcg OOM killer?
>

I'm not familiar with memcg. But I think the condition for whether the
TIF_MEMDIE flag should be set ought to be the same between the memcg OOM
killer and the global OOM killer, because a thread inside some memcg with the
TIF_MEMDIE flag can prevent the global OOM killer from killing other threads
when the memcg OOM killer and the global OOM killer run concurrently (the
worst corner case). When a malicious user runs a memory consumer program that
deadlocks the memcg OOM killer inside some memcg, it will result in a global
OOM killer deadlock when the global OOM killer is triggered by another
user's tasks.

> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 481d550..01719d6 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> [...]
> > @@ -649,8 +649,14 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> >       * If current has a pending SIGKILL or is exiting, then automatically
> >       * select it.  The goal is to allow it to allocate so that it may
> >       * quickly exit and free its memory.
> > +     *
> > +     * However, if current is calling out_of_memory() by doing memory
> > +     * allocation from e.g. exit_task_work() in do_exit() after PF_EXITING
> > +     * was set by exit_signals() and mm was released by exit_mm(), it is
> > +     * wrong to expect current to exit and free its memory quickly.
> >       */
> > -     if (fatal_signal_pending(current) || task_will_free_mem(current)) {
> > +     if ((fatal_signal_pending(current) || task_will_free_mem(current)) &&
> > +         current->mm && !oom_unkillable_task(current, NULL, nodemask)) {
> >            set_thread_flag(TIF_MEMDIE);
> >            return;
> >       }
>
> Calling oom_unkillable_task doesn't make much sense to me. Even if it made
> sense it should be in a separate patch, no?

At least in the global OOM case, current may be a kernel thread, may it not?
Such a kernel thread can allocate memory from exit_task_work(), trigger the
global OOM killer, and thereby disable the global OOM killer and prevent
other threads from allocating memory, can't it?

We can utilize memcg to reduce the possibility of triggering the global
OOM killer. But if we fail to prevent the global OOM killer from triggering,
the global OOM killer is responsible for resolving the OOM condition rather
than keeping the system stalled, presumably forever. Panic on TIF_MEMDIE
timeout would act like /proc/sys/vm/panic_on_oom only when the OOM killer
chose (by chance or by a trap) a task that is unkillable due to e.g. a lock
dependency loop. Of course, for those who prefer the system kept stalled over
the OOM condition resolved, such action should be optional, and thus I'm
happy to propose a sysctl-tunable version.

I think that

    if (!task->mm && test_tsk_thread_flag(task, TIF_MEMDIE))
        return true;

check should be added to oom_unkillable_task(), because an mm-less thread
can release little memory (except invisible memory, if any). And if we add
a TIF_MEMDIE timeout check to oom_unkillable_task(), we can wait for an
mm-less TIF_MEMDIE thread for a short period before trying to kill other
threads (as with the with-mm TIF_MEMDIE threads which I demonstrated to you
off-list on Sat, 13 Dec 2014 23:28:33 +0900).
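
As a sketch of where this would sit (the surrounding tests are paraphrased
from mm/oom_kill.c of this era, and only the !p->mm test is the proposed
addition; this is illustrative, not a tested patch):

bool oom_unkillable_task(struct task_struct *p, struct mem_cgroup *memcg,
			 const nodemask_t *nodemask)
{
	if (is_global_init(p))		/* never select init */
		return true;
	if (p->flags & PF_KTHREAD)	/* never select kernel threads */
		return true;
	/* Proposed: an mm-less victim can release little memory anyway. */
	if (!p->mm && test_tsk_thread_flag(p, TIF_MEMDIE))
		return true;
	/* ... existing memcg and nodemask eligibility checks ... */
	return false;
}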

The post-exit_mm() issues will remain as long as OOM deadlocks from
pre-exit_mm() issues remain. And as I demonstrated to you off-list, an OOM
deadlock from pre-exit_mm() issues is too difficult to solve because you
would need to track every lock dependency like lockdep does. Thus, I think
this "oom: Don't count on mm-less current process." patch itself is junk,
and I added "the whole paragraph" to guide you toward how to handle
TIF_MEMDIE deadlocks caused by pre-exit_mm() issues.

Generally memcg should work, but memcg depends on coordination with
userspace, which the targets I'm troubleshooting (i.e. currently deployed
enterprise servers) do not have. The cause of a deadlock/slowdown may be not
a malicious user's attack but bugs in enterprise applications or kernel
modules. To debug trouble on currently deployed enterprise servers, I want a
solution that handles TIF_MEMDIE deadlocks caused by pre-exit_mm() issues
without depending on memcg. But to backport the solution to currently
deployed enterprise servers, it first needs to be accepted upstream. You say
"Upstream kernels do not need a TIF_MEMDIE timeout. Use memcg and you will
not see the global OOM condition.", but I can't force the targets to use
memcg. Well, it's a chicken-and-egg situation...



* Re: [RFC PATCH] oom: Don't count on mm-less current process.
  2014-12-17 11:54   ` Tetsuo Handa
@ 2014-12-17 13:08     ` Michal Hocko
  2014-12-18 12:11       ` Tetsuo Handa
  0 siblings, 1 reply; 276+ messages in thread
From: Michal Hocko @ 2014-12-17 13:08 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg

On Wed 17-12-14 20:54:53, Tetsuo Handa wrote:
[...]
> I'm not familiar with memcg.

This check doesn't make any sense for this path, because the task is part
of the memcg; otherwise it wouldn't trigger a charge against it and couldn't
cause the OOM killer. Kernel threads do not have their own address space, so
they cannot trigger the memcg OOM killer. As you provide a NULL nodemask,
this is basically a check for the task being part of the memcg. The check
for current->mm is not needed either, because the task will not trigger a
charge after exit_mm.

> But I think the condition whether TIF_MEMDIE
> flag should be set or not should be same between the memcg OOM killer and
> the global OOM killer, for a thread inside some memcg with TIF_MEMDIE flag
> can prevent the global OOM killer from killing other threads when the memcg
> OOM killer and the global OOM killer run concurrently (the worst corner case).
> When a malicious user runs a memory consumer program which triggers memcg OOM
> killer deadlock inside some memcg, it will result in the global OOM killer
> deadlock when the global OOM killer is triggered by other user's tasks.

Hope that the above addresses your concerns here.

> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index 481d550..01719d6 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > [...]
> > > @@ -649,8 +649,14 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> > >       * If current has a pending SIGKILL or is exiting, then automatically
> > >       * select it.  The goal is to allow it to allocate so that it may
> > >       * quickly exit and free its memory.
> > > +     *
> > > +     * However, if current is calling out_of_memory() by doing memory
> > > +     * allocation from e.g. exit_task_work() in do_exit() after PF_EXITING
> > > +     * was set by exit_signals() and mm was released by exit_mm(), it is
> > > +     * wrong to expect current to exit and free its memory quickly.
> > >       */
> > > -     if (fatal_signal_pending(current) || task_will_free_mem(current)) {
> > > +     if ((fatal_signal_pending(current) || task_will_free_mem(current)) &&
> > > +         current->mm && !oom_unkillable_task(current, NULL, nodemask)) {
> > >            set_thread_flag(TIF_MEMDIE);
> > >            return;
> > >       }
> >
> > Calling oom_unkillable_task doesn't make much sense to me. Even if it made
> > sense it should be in a separate patch, no?
> 
> At least for the global OOM case, current may be a kernel thread, doesn't it?

Then mm would be NULL most of the time, so the current->mm check wouldn't
give it TIF_MEMDIE, and the task itself will be excluded later on during
task scanning.

> Such kernel thread can do memory allocation from exit_task_work(), and trigger
> the global OOM killer, and disable the global OOM killer and prevent other
> threads from allocating memory, can't it?
> 
> We can utilize memcg for reducing the possibility of triggering the global
> OOM killer.

I do not get this. The memcg charge happens after the allocation is done, so
the global OOM killer would trigger before the memcg one.

> But if we failed to prevent the global OOM killer from triggering,
> the global OOM killer is responsible for solving the OOM condition than keeping
> the system stalled for presumably forever. Panic on TIF_MEMDIE timeout can act
> like /proc/sys/vm/panic_on_oom only when the OOM killer chose (by chance or
> by a trap) an unkillable (due to e.g. lock dependency loop) task. Of course,
> for those who prefer the system kept stalled over the OOM condition solved,
> such action should be optional and thus I'm happy to propose sysctl-tunable
> version.

You are getting off-topic again (which is pretty annoying, to be honest, as
it keeps coming around again and again). Please focus on a single thing at
a time.

> I think that
> 
>     if (!task->mm && test_tsk_thread_flag(task, TIF_MEMDIE))
>         return true;
> 
> check should be added to oom_unkillable_task() because mm-less thread can
> release little memory (except invisible memory if any).

Why do you think this makes more sense than handling this very special
case in out_of_memory? I really do not see any reason to make
oom_unkillable_task more complicated.

[...]
-- 
Michal Hocko
SUSE Labs



* Re: [RFC PATCH] oom: Don't count on mm-less current process.
  2014-12-17 13:08     ` Michal Hocko
@ 2014-12-18 12:11       ` Tetsuo Handa
  2014-12-18 15:33         ` Michal Hocko
  0 siblings, 1 reply; 276+ messages in thread
From: Tetsuo Handa @ 2014-12-18 12:11 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, oleg

Michal Hocko wrote:
> On Wed 17-12-14 20:54:53, Tetsuo Handa wrote:
> [...]
> > I'm not familiar with memcg.
>
> This check doesn't make any sense for this path because the task is part
> of the memcg, otherwise it wouldn't trigger charge for it and couldn't
> cause the OOM killer. Kernel threads do not have their address space
> they cannot trigger memcg OOM killer. As you provide NULL nodemask then
> this is basically a check for task being part of the memcg.

So !oom_unkillable_task(current, memcg, NULL) is always true in the
mem_cgroup_out_of_memory() case, isn't it?

>                                                             The check
> for current->mm is not needed as well because task will not trigger a
> charge after exit_mm.

So current->mm != NULL is always true in the mem_cgroup_out_of_memory()
case, isn't it?

>
> > But I think the condition whether TIF_MEMDIE
> > flag should be set or not should be same between the memcg OOM killer and
> > the global OOM killer, for a thread inside some memcg with TIF_MEMDIE flag
> > can prevent the global OOM killer from killing other threads when the memcg
> > OOM killer and the global OOM killer run concurrently (the worst corner case).
> > When a malicious user runs a memory consumer program which triggers memcg OOM
> > killer deadlock inside some memcg, it will result in the global OOM killer
> > deadlock when the global OOM killer is triggered by other user's tasks.
>
> Hope that the above exaplains your concerns here.
>

Thread1 in memcg1 asks for memory, gets the requested amount of memory
without triggering the global OOM killer, the requested amount of memory is
charged to memcg1, and the memcg OOM killer is triggered. While the memcg
OOM killer is searching for a victim among the threads in memcg1, thread2 in
memcg2 asks for memory. Thread2 fails to get the requested amount of memory
without triggering the global OOM killer. Now the global OOM killer starts
searching for a victim among all threads, while the memcg OOM killer chooses
thread1 in memcg1 and sets the TIF_MEMDIE flag on it. Then the global OOM
killer finds that thread1 in memcg1 already has TIF_MEMDIE set, and waits
for thread1 in memcg1 to terminate rather than choosing another victim from
all threads. However, when thread1 in memcg1 cannot terminate immediately
for some reason, thread2 in memcg2 is blocked by thread1 in memcg1.

> > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > index 481d550..01719d6 100644
> > > > --- a/mm/oom_kill.c
> > > > +++ b/mm/oom_kill.c
> > > [...]
> > > > @@ -649,8 +649,14 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> > > >       * If current has a pending SIGKILL or is exiting, then automatically
> > > >       * select it.  The goal is to allow it to allocate so that it may
> > > >       * quickly exit and free its memory.
> > > > +     *
> > > > +     * However, if current is calling out_of_memory() by doing memory
> > > > +     * allocation from e.g. exit_task_work() in do_exit() after PF_EXITING
> > > > +     * was set by exit_signals() and mm was released by exit_mm(), it is
> > > > +     * wrong to expect current to exit and free its memory quickly.
> > > >       */
> > > > -     if (fatal_signal_pending(current) || task_will_free_mem(current)) {
> > > > +     if ((fatal_signal_pending(current) || task_will_free_mem(current)) &&
> > > > +         current->mm && !oom_unkillable_task(current, NULL, nodemask)) {
> > > >            set_thread_flag(TIF_MEMDIE);
> > > >            return;
> > > >       }
> > >
> > > Calling oom_unkillable_task doesn't make much sense to me. Even if it made
> > > sense it should be in a separate patch, no?
> >
> > At least for the global OOM case, current may be a kernel thread, doesn't it?
>
> then mm would be NULL most of the time so current->mm check wouldn't
> give it TIF_MEMDIE and the task itself will be exluded later on during
> tasks scanning.
>
> > Such kernel thread can do memory allocation from exit_task_work(), and trigger
> > the global OOM killer, and disable the global OOM killer and prevent other
> > threads from allocating memory, can't it?
> >
> > We can utilize memcg for reducing the possibility of triggering the global
> > OOM killer.
>
> I do not get this. Memcg charge happens after the allocation is done so
> the global OOM killer would trigger before memcg one.

I mean, someone triggers the global OOM killer after somebody else has
triggered the memcg OOM killer but before the memcg OOM killer finishes.

> > But if we failed to prevent the global OOM killer from triggering,
> > the global OOM killer is responsible for solving the OOM condition than keeping
> > the system stalled for presumably forever. Panic on TIF_MEMDIE timeout can act
> > like /proc/sys/vm/panic_on_oom only when the OOM killer chose (by chance or
> > by a trap) an unkillable (due to e.g. lock dependency loop) task. Of course,
> > for those who prefer the system kept stalled over the OOM condition solved,
> > such action should be optional and thus I'm happy to propose sysctl-tunable
> > version.
>
> You are getting offtopic again (which is pretty annoying to be honest as
> it is going all over again and again). Please focus on a single thing at
> a time.
>

I think focusing on only the mm-less case makes no sense, because the
with-mm case ruins the effort made for the mm-less case. My question is
quite simple: how can we avoid memory allocation stalls when

  System has 2048MB of RAM and no swap.
  Memcg1 for task1 has quota 512MB and 400MB in use.
  Memcg2 for task2 has quota 512MB and 400MB in use.
  Memcg3 for task3 has quota 512MB and 400MB in use.
  Memcg4 for task4 has quota 512MB and 400MB in use.
  Memcg5 for task5 has quota 512MB and 1MB in use.

and task5 launches the memory consumption program below, which would trigger
the global OOM killer before triggering the memcg OOM killer?

---------- XFS + OOM killer dependency stall reproducer start ----------
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>
#include <sched.h>

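/*
 * Writer: append to "file" with fsync() in a loop, so that the filesystem
 * (XFS in this report) keeps doing I/O under its own locks while memory
 * is tight.
 */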
static int file_writer(void *unused)
{
	static char buf[4096];
	const int fd = open("file", O_CREAT | O_WRONLY, 0600);
	while (write(fd, buf, sizeof(buf)) == sizeof(buf))
		fsync(fd);
	close(fd);
	return 0;
}

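/*
 * Grow an anonymous allocation until realloc() fails, start 128 writer
 * clones sharing this address space, then fault the whole buffer in.
 */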
int main(int argc, char *argv[])
{
	int i;
	unsigned long size;
	const int fd = open("/dev/zero", O_RDONLY);
	char *buf = NULL;
	if (fd == -1)
		return 1;
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp)
			break;
		buf = cp;
	}
	for (i = 0; i < 128; i++) {
		char *cp = malloc(4096);
		if (!cp || clone(file_writer, cp + 4096,
				 CLONE_SIGHAND | CLONE_VM, NULL) == -1)
			break;
	}
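	/* Fault the buffer in from /dev/zero, consuming RAM until the OOM
	 * killer fires (note: size is one doubling past the last successful
	 * realloc(), so the read also runs past the end of buf). */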
	read(fd, buf, size);
	return 0;
}
---------- XFS + OOM killer dependency stall reproducer end ----------

The global OOM killer will try to kill this program because it will be using
400MB+ of RAM by the time the global OOM killer is triggered. But sometimes
this program cannot be terminated by the global OOM killer due to an XFS
lock dependency.

You can see what is happening from the OOM traces after uptime > 320 seconds
in http://I-love.SAKURA.ne.jp/tmp/serial-20141213.txt.xz, though memcg is
not configured for this program.

Trying to apply quotas using memcg as a safeguard is fine. But don't forget
to prepare for the global OOM killer. And please don't reject this with "use
memcg and never over-commit", for my proposal is for analyzing/avoiding

  stalls caused not only by a malicious user's attacks but also by bugs in
  enterprise applications or kernel modules

and/or

  stalls of servers where coordination with userspace is impossible.

> > I think that
> >
> >     if (!task->mm && test_tsk_thread_flag(task, TIF_MEMDIE))
> >         return true;
> >
> > check should be added to oom_unkillable_task() because mm-less thread can
> > release little memory (except invisible memory if any).
>
> Why do you think this makes more sense than handling this very special
> case in out_of_memory? I really do not see any reason to to make
> oom_unkillable_task more complicated.

Because everyone can then safely skip victim threads that don't have an mm.
Handling the setting of TIF_MEMDIE in the caller is racy. Somebody may set
TIF_MEMDIE in oom_kill_process() even if we avoided setting TIF_MEMDIE in
out_of_memory(). There will be more locations where TIF_MEMDIE is set; even
out-of-tree modules might set TIF_MEMDIE.

Nonetheless, I don't think

    if (!task->mm && test_tsk_thread_flag(task, TIF_MEMDIE))
        return true;

check is perfect, because we need to prepare for both the mm-less and
with-mm cases anyway.

My concern is not "whether the TIF_MEMDIE flag should be set or not". My
concern is not "whether task->mm is NULL or not". My concern is "whether
threads with the TIF_MEMDIE flag retard other processes' memory allocation
or not". The above-mentioned program is an example of with-mm threads
retarding other processes' memory allocation.

I know you don't like the timeout approach, but adding

    if (sysctl_memdie_timeout_secs && test_tsk_thread_flag(task, TIF_MEMDIE) &&
        time_after(jiffies, task->memdie_start + sysctl_memdie_timeout_secs * HZ))
        return true;

check to oom_unkillable_task() would take care of both the mm-less and
with-mm cases, because everyone could safely skip TIF_MEMDIE victim threads
that cannot terminate immediately for some reason.
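
To make that concrete, a hypothetical sketch (memdie_start and
sysctl_memdie_timeout_secs do not exist in the kernel; they would be a new
task_struct field and a new sysctl, and this is only an illustration):

/* Record when the victim was granted TIF_MEMDIE. */
static void set_memdie_flag(struct task_struct *tsk)
{
	tsk->memdie_start = jiffies;	/* hypothetical new field */
	set_tsk_thread_flag(tsk, TIF_MEMDIE);
}

/* In oom_unkillable_task(): give up on a victim that has held TIF_MEMDIE
 * longer than the configured timeout, so victim selection can move on. */
if (sysctl_memdie_timeout_secs &&
    test_tsk_thread_flag(task, TIF_MEMDIE) &&
    time_after(jiffies,
	       task->memdie_start + sysctl_memdie_timeout_secs * HZ))
	return true;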



* Re: [RFC PATCH] oom: Don't count on mm-less current process.
  2014-12-18 12:11       ` Tetsuo Handa
@ 2014-12-18 15:33         ` Michal Hocko
  2014-12-19 12:07           ` Tetsuo Handa
  2014-12-19 12:22           ` How to handle TIF_MEMDIE stalls? Tetsuo Handa
  0 siblings, 2 replies; 276+ messages in thread
From: Michal Hocko @ 2014-12-18 15:33 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg

On Thu 18-12-14 21:11:26, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 17-12-14 20:54:53, Tetsuo Handa wrote:
> > [...]
> > > I'm not familiar with memcg.
> >
> > This check doesn't make any sense for this path because the task is part
> > of the memcg, otherwise it wouldn't trigger charge for it and couldn't
> > cause the OOM killer. Kernel threads do not have their address space
> > they cannot trigger memcg OOM killer. As you provide NULL nodemask then
> > this is basically a check for task being part of the memcg.
> 
> So !oom_unkillable_task(current, memcg, NULL) is always true for
> mem_cgroup_out_of_memory() case, isn't it?

Yes, unless the task has moved away from the memcg since the charge
happened, but that is not important, because the charge happened against the
given memcg and so the OOM should happen there.

> >                                                             The check
> > for current->mm is not needed as well because task will not trigger a
> > charge after exit_mm.
> 
> So current->mm != NULL is always true for mem_cgroup_out_of_memory()
> case, isn't it?

yes

> > > But I think the condition whether TIF_MEMDIE
> > > flag should be set or not should be same between the memcg OOM killer and
> > > the global OOM killer, for a thread inside some memcg with TIF_MEMDIE flag
> > > can prevent the global OOM killer from killing other threads when the memcg
> > > OOM killer and the global OOM killer run concurrently (the worst corner case).
> > > When a malicious user runs a memory consumer program which triggers memcg OOM
> > > killer deadlock inside some memcg, it will result in the global OOM killer
> > > deadlock when the global OOM killer is triggered by other user's tasks.
> >
> > Hope that the above exaplains your concerns here.
> >
> 
> Thread1 in memcg1 asks for memory, and thread1 gets requested amount of
> memory without triggering the global OOM killer, and requested amount of
> memory is charged to memcg1, and the memcg OOM killer is triggered.
> While the memcg OOM killer is searching for a victim from threads in
> memcg1, thread2 in memcg2 asks for the memory. Thread2 fails to get
> requested amount of memory without triggering the global OOM killer.
> Now the global OOM killer starts searching for a victim from all threads
> whereas the memcg OOM killer chooses thread1 in memcg1 and sets TIF_MEMDIE
> flag on thread1 in memcg1. Then, the global OOM killer finds that thread1
> in memcg1 already has TIF_MEMDIE flag set, and waits for thread1 in memcg1
> to terminate than chooses another victim from all threads. However, when
> thread1 in memcg1 cannot be terminated immediately for some reason, thread2
> in memcg2 is blocked by thread1 in memcg1.

Sigh... T1 triggers the memcg OOM killer _only_ from the page fault path, so
it will get to signal processing right away and eventually reach exit_mm,
where it releases its memory. If that doesn't release enough memory, then we
are back to the original problem. So I do not think memcg adds anything new
to the problem.

[...]
> I think focusing on only mm-less case makes no sense, for with-mm case
> ruins efforts made for mm-less case.

No, it is quite the opposite. Excluding an mm-less current from the
PF_EXITING resp. fatal_signal_pending heuristics makes perfect sense from
the OOM killer's POV. The reasons are described in the changelog.

> My question is quite simple. How can we avoid memory allocation stalls when
> 
>   System has 2048MB of RAM and no swap.
>   Memcg1 for task1 has quota 512MB and 400MB in use.
>   Memcg2 for task2 has quota 512MB and 400MB in use.
>   Memcg3 for task3 has quota 512MB and 400MB in use.
>   Memcg4 for task4 has quota 512MB and 400MB in use.
>   Memcg5 for task5 has quota 512MB and 1MB in use.
> 
> and task5 launches below memory consumption program which would trigger
> the global OOM killer before triggering the memcg OOM killer?
> 
[...]
> The global OOM killer will try to kill this program because this program
> will be using 400MB+ of RAM by the time the global OOM killer is triggered.
> But sometimes this program cannot be terminated by the global OOM killer
> due to XFS lock dependency.
> 
> You can see what is happening from OOM traces after uptime > 320 seconds of
> http://I-love.SAKURA.ne.jp/tmp/serial-20141213.txt.xz though memcg is not
> configured on this program.

This is clearly a separate issue. It is a lock dependency, and that alone
_cannot_ be handled by the OOM killer, as it doesn't understand lock
dependencies. This should be addressed from the xfs point of view IMHO,
but I am not familiar enough with this filesystem to tell you how or whether
it is possible.

[...]
> > >     if (!task->mm && test_tsk_thread_flag(task, TIF_MEMDIE))
> > >         return true;
> > >
> > > check should be added to oom_unkillable_task() because mm-less thread can
> > > release little memory (except invisible memory if any).
> >
> > Why do you think this makes more sense than handling this very special
> > case in out_of_memory? I really do not see any reason to to make
> > oom_unkillable_task more complicated.
> 
> Because everyone can safely skip victim threads who don't have mm.

And that is handled already. Check oom_badness and its find_lock_task_mm,
oom_scan_process_thread and its task->mm check, and out_of_memory with the
complete sysctl_oom_kill_allocating_task check.

> Handling setting of TIF_MEMDIE in the caller is racy.

Any operation on another task is racy; that's why I prefer the current->mm
check in out_of_memory.

> Somebody may set
> TIF_MEMDIE at oom_kill_process() even if we avoided setting TIF_MEMDIE at
> out_of_memory(). There will be more locations where TIF_MEMDIE is set; even
> out-of-tree modules might set TIF_MEMDIE.

TIF_MEMDIE should be set only when we _know_ the task will free _some_
memory and when we are killing the OOM victim. The only place I can see
that would break the first condition is out_of_memory for a current which
has passed exit_mm(). That is why I've suggested this patch to you, and it
would be much easier if we could simply finish that one without pulling
other things in.

Out-of-tree and even in-tree modules have no business setting the flag. The
lowmemory killer is doing that, but that is an abuse and should be fixed in
another way. TIF_MEMDIE is not a flag anybody can touch.

> Nonetheless, I don't think
> 
>     if (!task->mm && test_tsk_thread_flag(task, TIF_MEMDIE))
>         return true;
> 
> check is perfect because we anyway need to prepare for both mm-less and
> with-mm cases.
> 
> My concern is not "whether TIF_MEMDIE flag should be set or not". My concern
> is not "whether task->mm is NULL or not". My concern is "whether threads with
> TIF_MEMDIE flag retard other process' memory allocation or not".
> Above-mentioned program is an example of with-mm threads retarding
> other process' memory allocation.

There is no way you can guarantee something like that. OOM is the _last_
resort. Things are in a pretty bad state already when it hits. It is the
last attempt to reclaim some memory. The system might be in an arbitrary
state at this time. I really hate to repeat myself, but you are trying to
"fix" your problem at the wrong level.

> I know you don't like timeout approach, but adding
> 
>     if (sysctl_memdie_timeout_secs && test_tsk_thread_flag(task, TIF_MEMDIE) &&
>         time_after(jiffies, task->memdie_start + sysctl_memdie_timeout_secs * HZ))
>         return true;
> 
> check to oom_unkillable_task() will take care of both mm-less and with-mm
> cases because everyone can safely skip the TIF_MEMDIE victim threads who
> cannot be terminated immediately for some reason.

It will not take care of anything. It will start shooting at more processes
after some timeout, which is hard to get right, and there wouldn't be any
guarantee that multiple victims will help, because they might end up
blocking on the very same lock, or another one, on the way out. Jeez, are
you even reading the feedback you are getting?

-- 
Michal Hocko
SUSE Labs



* Re: [RFC PATCH] oom: Don't count on mm-less current process.
  2014-12-18 15:33         ` Michal Hocko
@ 2014-12-19 12:07           ` Tetsuo Handa
  2014-12-19 12:49             ` Michal Hocko
  2014-12-19 12:22           ` How to handle TIF_MEMDIE stalls? Tetsuo Handa
  1 sibling, 1 reply; 276+ messages in thread
From: Tetsuo Handa @ 2014-12-19 12:07 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, oleg

Michal Hocko wrote:
> On Thu 18-12-14 21:11:26, Tetsuo Handa wrote:
> > > > But I think the condition whether TIF_MEMDIE
> > > > flag should be set or not should be same between the memcg OOM killer and
> > > > the global OOM killer, for a thread inside some memcg with TIF_MEMDIE flag
> > > > can prevent the global OOM killer from killing other threads when the memcg
> > > > OOM killer and the global OOM killer run concurrently (the worst corner case).
> > > > When a malicious user runs a memory consumer program which triggers memcg OOM
> > > > killer deadlock inside some memcg, it will result in the global OOM killer
> > > > deadlock when the global OOM killer is triggered by other user's tasks.
> > >
> > > Hope that the above exaplains your concerns here.
> > >
> >
> > Thread1 in memcg1 asks for memory, and thread1 gets requested amount of
> > memory without triggering the global OOM killer, and requested amount of
> > memory is charged to memcg1, and the memcg OOM killer is triggered.
> > While the memcg OOM killer is searching for a victim from threads in
> > memcg1, thread2 in memcg2 asks for the memory. Thread2 fails to get
> > requested amount of memory without triggering the global OOM killer.
> > Now the global OOM killer starts searching for a victim from all threads
> > whereas the memcg OOM killer chooses thread1 in memcg1 and sets TIF_MEMDIE
> > flag on thread1 in memcg1. Then, the global OOM killer finds that thread1
> > in memcg1 already has TIF_MEMDIE flag set, and waits for thread1 in memcg1
> > to terminate than chooses another victim from all threads. However, when
> > thread1 in memcg1 cannot be terminated immediately for some reason, thread2
> > in memcg2 is blocked by thread1 in memcg1.
>
> Sigh... T1 triggers memcg OOM killer _only_ from the page fault path and so it
> will get to signal processing right away and eventually gets to exit_mm
> where it releases its memory. If that doesn't suffice to release enough
> memory then we are back to the original problem. So I do not think memcg
> adds anything new to the problem.
>
The memcg OOM killer is triggered upon page fault rather than at memory
charge time, I see. But the memcg OOM killer is not relevant to my concern.
It's a matter of which OOM killer sets the TIF_MEMDIE flag.

> > [...]
> > > I think focusing on only mm-less case makes no sense, for with-mm case
> > ruins efforts made for mm-less case.
>
> No. It is quite opposite. Excluding mm less current from PF_EXITING
> resp. fatal_signal_pending heuristics makes perfect sense from the OOM
> killer POV. The reasons are described in the changelog.
>

OK. Below is an updated patch.
----------------------------------------
From 3c68c66a72f0dbfc66f9799a00fbaa1f0217befb Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 19 Dec 2014 20:49:06 +0900
Subject: [PATCH v2] oom: Don't count on mm-less current process.

out_of_memory() doesn't trigger the OOM killer if the current task is already
exiting or has fatal signals pending, and instead gives the task access to
memory reserves. However, doing so is wrong if out_of_memory() is called by
an allocation (e.g. from exit_task_work()) after the current task has already
released its memory and cleared TIF_MEMDIE in exit_mm(). If we again set
TIF_MEMDIE on a post-exit_mm() current task, the OOM killer will be blocked
by the task sitting in the final schedule() waiting for its parent to reap
it. This will trigger an OOM livelock if the parent is unable to reap it
because it is itself doing an allocation and waiting for the OOM killer to
kill it.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 481d550..e87391f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -649,8 +649,12 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 * If current has a pending SIGKILL or is exiting, then automatically
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
+	 *
+	 * But don't select if current has already released its mm and cleared
+	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
 	 */
-	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
+	if ((fatal_signal_pending(current) || task_will_free_mem(current)) &&
+	    current->mm) {
 		set_thread_flag(TIF_MEMDIE);
 		return;
 	}
-- 
1.8.3.1



* How to handle TIF_MEMDIE stalls?
  2014-12-18 15:33         ` Michal Hocko
  2014-12-19 12:07           ` Tetsuo Handa
@ 2014-12-19 12:22           ` Tetsuo Handa
  2014-12-20  2:03             ` Dave Chinner
  1 sibling, 1 reply; 276+ messages in thread
From: Tetsuo Handa @ 2014-12-19 12:22 UTC (permalink / raw)
  To: mhocko, dchinner; +Cc: linux-mm, rientjes, oleg

(Renamed the thread's title and invited Dave Chinner. A memory-stressing
program at http://marc.info/?l=linux-mm&m=141890469424353&w=2 can trigger
stalls on a system with 4 CPUs/2048MB of RAM/no swap. I want to hear your
opinion.)

Michal Hocko wrote:
> > My question is quite simple. How can we avoid memory allocation stalls when
> >
> >   System has 2048MB of RAM and no swap.
> >   Memcg1 for task1 has quota 512MB and 400MB in use.
> >   Memcg2 for task2 has quota 512MB and 400MB in use.
> >   Memcg3 for task3 has quota 512MB and 400MB in use.
> >   Memcg4 for task4 has quota 512MB and 400MB in use.
> >   Memcg5 for task5 has quota 512MB and 1MB in use.
> >
> > and task5 launches below memory consumption program which would trigger
> > the global OOM killer before triggering the memcg OOM killer?
> >
> [...]
> > The global OOM killer will try to kill this program because this program
> > will be using 400MB+ of RAM by the time the global OOM killer is triggered.
> > But sometimes this program cannot be terminated by the global OOM killer
> > due to XFS lock dependency.
> >
> > You can see what is happening from OOM traces after uptime > 320 seconds of
> > http://I-love.SAKURA.ne.jp/tmp/serial-20141213.txt.xz though memcg is not
> > configured on this program.
>
> This is clearly a separate issue. It is a lock dependency and that alone
> _cannot_ be handled from OOM killer as it doesn't understand lock
> dependencies. This should be addressed from the xfs point of view IMHO
> but I am not familiar with this filesystem to tell you how or whether it
> is possible.
>
Then let's ask Dave Chinner whether he can address it. My opinion is that
everybody is doing __GFP_WAIT memory allocations without understanding the
full dependency chain. Everybody is prepared only for allocation failures,
because everybody expects that the OOM killer will somehow resolve the OOM
condition (except that some expect that memory stress which would trigger
the OOM killer must never be applied). I am not familiar with XFS either,
but I don't think this issue can be addressed from the XFS point of view.

For example, https://lkml.org/lkml/2014/7/2/249 stalls at blk_rq_map_kern(),
which I suspect is one of the causes of the stall because it happens inside
a disk I/O event on an XFS partition. If XFS were responsible for avoiding
the stall at blk_rq_map_kern() (on the assumption that XFS triggered that
disk I/O event), XFS (the filesystem layer) would somehow need to drop the
__GFP_WAIT flag from scsi_execute() (the SCSI layer). We would end up
passing gfp flags down to every function which might allocate memory.
Is everybody happy with such code complication/bloat?

----------
int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
                 int data_direction, void *buffer, unsigned bufflen,
                 unsigned char *sense, int timeout, int retries, u64 flags,
                 int *resid)
{
        struct request *req;
        int write = (data_direction == DMA_TO_DEVICE);
        int ret = DRIVER_ERROR << 24;

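        /* The two __GFP_WAIT allocations below are where the stall
         * discussed above can occur. */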
        req = blk_get_request(sdev->request_queue, write, __GFP_WAIT);
        if (IS_ERR(req))
                return ret;
        blk_rq_set_block_pc(req);

        if (bufflen &&  blk_rq_map_kern(sdev->request_queue, req,
                                        buffer, bufflen, __GFP_WAIT))
                goto out;

        req->cmd_len = COMMAND_SIZE(cmd[0]);
        memcpy(req->cmd, cmd, req->cmd_len);
        req->sense = sense;
        req->sense_len = 0;
        req->retries = retries;
        req->timeout = timeout;
        req->cmd_flags |= flags | REQ_QUIET | REQ_PREEMPT;

        /*
         * head injection *required* here otherwise quiesce won't work
         */
        blk_execute_rq(req->q, NULL, req, 1);

        /*
         * Some devices (USB mass-storage in particular) may transfer
         * garbage data together with a residue indicating that the data
         * is invalid.  Prevent the garbage from being misinterpreted
         * and prevent security leaks by zeroing out the excess data.
         */
        if (unlikely(req->resid_len > 0 && req->resid_len <= bufflen))
                memset(buffer + (bufflen - req->resid_len), 0, req->resid_len);

        if (resid)
                *resid = req->resid_len;
        ret = req->errors;
 out:
        blk_put_request(req);

        return ret;
}
----------

By the way, if __GFP_WAIT requests had higher priority (a lower watermark,
or ignoring the watermark?) than GFP_NOIO or GFP_NOFS or GFP_KERNEL
requests, could blk_rq_map_kern() avoid the stall and allow XFS to proceed
(release the XFS lock and let the OOM victim terminate)?

> > Somebody may set
> > TIF_MEMDIE at oom_kill_process() even if we avoided setting TIF_MEMDIE at
> > out_of_memory(). There will be more locations where TIF_MEMDIE is set; even
> > out-of-tree modules might set TIF_MEMDIE.
>
> TIF_MEMDIE should be set only when we _know_ the task will free _some_
> memory and when we are killing the OOM victim. The only place I can see
> that would break the first condition is out_of_memory for the current
> which passed exit_mm(). That is the point why I've suggested you this
> patch and it would be much more easier if we could simply finished that
> one without pulling other things in.

I agree that TIF_MEMDIE should be set only when we know the task will free
some memory, but currently setting TIF_MEMDIE on the OOM victim is causing
stalls which I want to analyze/debug via the patchset posted at
http://marc.info/?l=linux-mm&m=141671817211121&w=2, because we wait forever
for the OOM victim to terminate. In serial-20141213.txt.xz, TIF_MEMDIE was
set on an OOM victim which is not even killable by SysRq-f.

> > Nonetheless, I don't think
> >
> >     if (!task->mm && test_tsk_thread_flag(task, TIF_MEMDIE))
> >         return true;
> >
> > check is perfect because we anyway need to prepare for both mm-less and
> > with-mm cases.
> >
> > My concern is not "whether TIF_MEMDIE flag should be set or not". My concern
> > is not "whether task->mm is NULL or not". My concern is "whether threads with
> > TIF_MEMDIE flag retard other process' memory allocation or not".
> > Above-mentioned program is an example of with-mm threads retarding
> > other process' memory allocation.
>
> There is no way you can guarantee something like that. OOM is the _last_
> resort. Things are in a pretty bad state already when it hits. It is the
> last attempt to reclaim some memory. System might be in an arbitrary
> state at this time.
> I really hate to repeat myself but you are trying to "fix" your problem
> at a wrong level.

I think that the OOM killer is responsible for resolving the OOM condition
or triggering a kernel panic. I don't like that the OOM killer fails to
resolve the OOM condition as it claims to.

>
> > I know you don't like timeout approach, but adding
> >
> >     if (sysctl_memdie_timeout_secs && test_tsk_thread_flag(task, TIF_MEMDIE) &&
> >         time_after(jiffies, task->memdie_start + sysctl_memdie_timeout_secs * HZ))
> >         return true;
> >
> > check to oom_unkillable_task() will take care of both mm-less and with-mm
> > cases because everyone can safely skip the TIF_MEMDIE victim threads who
> > cannot be terminated immediately for some reason.
>
> It will not take care of anything. It will start shooting to more
> processes after some timeout, which is hard to get right, and there
> wouldn't be any guaratee multiple victims will help because they might
> end up blocking on the very same or other lock on the way out.

If you don't like the skip-on-timeout approach, I'm OK with a
panic-on-timeout approach. Analyzing the vmcore will give us some hints
about what was happening.

>                                                                Jeez are
> you even reading feedback you are getting?

Of course, I'm reading your feedback.

The "[RFC PATCH 0/5] mm: Patches for mitigating memory allocation stalls."
will become unnecessary after all bugs are identified and fixed. I agree
that bugs should be identified and fixed, but XFS stall is nothing but an
example which I can reproduce on my desktop. My role is to analyze and
respond to kernel troubles such as unexpected stalls, panics, reboots
occurred on customer's servers which I don't have access. I will encounter
various different troubles which I can't predict how to obtain information.
Therefore, I want some unattended built-in assistance for understanding
what was happening in chronological order and identifying/fixing the bugs.
Existing built-in debugging hooks which requires administrator's operation
might help after understanding what was happening.



* Re: [RFC PATCH] oom: Don't count on mm-less current process.
  2014-12-19 12:07           ` Tetsuo Handa
@ 2014-12-19 12:49             ` Michal Hocko
  2014-12-20  9:13               ` Tetsuo Handa
  0 siblings, 1 reply; 276+ messages in thread
From: Michal Hocko @ 2014-12-19 12:49 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg

On Fri 19-12-14 21:07:53, Tetsuo Handa wrote:
[...]
> >From 3c68c66a72f0dbfc66f9799a00fbaa1f0217befb Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Fri, 19 Dec 2014 20:49:06 +0900
> Subject: [PATCH v2] oom: Don't count on mm-less current process.
> 
> out_of_memory() doesn't trigger the OOM killer if the current task is already
> exiting or has fatal signals pending, and gives the task access to memory
> reserves instead. However, doing so is wrong if out_of_memory() is called by
> an allocation (e.g. from exit_task_work()) after the current task has already
> released its memory and cleared TIF_MEMDIE at exit_mm(). If we set TIF_MEMDIE
> on the post-exit_mm() current task again, the OOM killer will be blocked by
> the task sitting in the final schedule() waiting for its parent to reap it.
> This will trigger an OOM livelock if the parent is unable to reap it because
> the parent itself is doing an allocation and waiting for the OOM killer to
> kill it.
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

Acked-by: Michal Hocko <mhocko@suse.cz>

Just a nit, you could start the condition with current->mm because it
is the simplest check. We do not have to check for pending signals or
PF_EXITING at all if it is NULL. But this is not a hot path, so it
doesn't matter much. It is just good practice to start with the
simplest tests first.
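
In other words, something like this (just an illustrative sketch of the
reordered check):

	if (current->mm &&
	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
		set_thread_flag(TIF_MEMDIE);
		return;
	}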

Please also make sure to add Andrew to CC when sending the patch again
so that he knows about it and picks it up.

Thanks!

> ---
>  mm/oom_kill.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 481d550..e87391f 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -649,8 +649,12 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  	 * If current has a pending SIGKILL or is exiting, then automatically
>  	 * select it.  The goal is to allow it to allocate so that it may
>  	 * quickly exit and free its memory.
> +	 *
> +	 * But don't select if current has already released its mm and cleared
> +	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
>  	 */
> -	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
> +	if ((fatal_signal_pending(current) || task_will_free_mem(current)) &&
> +	    current->mm) {
>  		set_thread_flag(TIF_MEMDIE);
>  		return;
>  	}
> -- 
> 1.8.3.1
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2014-12-19 12:22           ` How to handle TIF_MEMDIE stalls? Tetsuo Handa
@ 2014-12-20  2:03             ` Dave Chinner
  2014-12-20 12:41               ` Tetsuo Handa
  0 siblings, 1 reply; 276+ messages in thread
From: Dave Chinner @ 2014-12-20  2:03 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: mhocko, linux-mm, rientjes, oleg, david

On Fri, Dec 19, 2014 at 09:22:49PM +0900, Tetsuo Handa wrote:
> (Renamed the thread's title and invited Dave Chinner. A memory-stressing
> program at http://marc.info/?l=linux-mm&m=141890469424353&w=2 can trigger
> stalls on a system with 4 CPUs/2048MB of RAM/no swap. I want to hear your
> opinion.)
> 
> Michal Hocko wrote:
> > > My question is quite simple. How can we avoid memory allocation stalls when
> > >
> > >   System has 2048MB of RAM and no swap.
> > >   Memcg1 for task1 has quota 512MB and 400MB in use.
> > >   Memcg2 for task2 has quota 512MB and 400MB in use.
> > >   Memcg3 for task3 has quota 512MB and 400MB in use.
> > >   Memcg4 for task4 has quota 512MB and 400MB in use.
> > >   Memcg5 for task5 has quota 512MB and 1MB in use.
> > >
> > > and task5 launches below memory consumption program which would trigger
> > > the global OOM killer before triggering the memcg OOM killer?
> > >
> > [...]
> > > The global OOM killer will try to kill this program because this program
> > > will be using 400MB+ of RAM by the time the global OOM killer is triggered.
> > > But sometimes this program cannot be terminated by the global OOM killer
> > > due to XFS lock dependency.
> > >
> > > You can see what is happening from OOM traces after uptime > 320 seconds of
> > > http://I-love.SAKURA.ne.jp/tmp/serial-20141213.txt.xz though memcg is not
> > > configured on this program.
> >
> > This is clearly a separate issue. It is a lock dependency and that alone
> > _cannot_ be handled from the OOM killer as it doesn't understand lock
> > dependencies. This should be addressed from the xfs point of view IMHO,
> > but I am not familiar enough with this filesystem to tell you how or
> > whether it is possible.

What XFS lock dependency? I see nothing in that output file that indicates a
lock dependency problem - can you point out what the issue is here?

> Then, let's ask Dave Chinner whether he can address it. My opinion is that
> everybody is doing __GFP_WAIT memory allocations without understanding the
> entire dependencies. Everybody is only prepared for allocation failures,
> because everybody expects that the OOM killer shall somehow resolve the
> OOM condition (except that some expect that memory stress heavy enough to
> trigger the OOM killer must simply never be applied). I am not familiar
> with XFS either, but I don't think this issue can be addressed from the
> XFS point of view.

Well, I can't comment (nor am I going to waste time speculating)
until someone actually explains the XFS lock dependency that is
apparently causing reclaim problems.

Has lockdep reported any problems?

Cheers,

Dave.
-- 
Dave Chinner
dchinner@redhat.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: [RFC PATCH] oom: Don't count on mm-less current process.
  2014-12-19 12:49             ` Michal Hocko
@ 2014-12-20  9:13               ` Tetsuo Handa
  2014-12-20 11:42                 ` Tetsuo Handa
  0 siblings, 1 reply; 276+ messages in thread
From: Tetsuo Handa @ 2014-12-20  9:13 UTC (permalink / raw)
  To: mhocko, akpm; +Cc: linux-mm, rientjes, oleg

Michal Hocko wrote:
> On Fri 19-12-14 21:07:53, Tetsuo Handa wrote:
> [...]
> > >From 3c68c66a72f0dbfc66f9799a00fbaa1f0217befb Mon Sep 17 00:00:00 2001
> > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Date: Fri, 19 Dec 2014 20:49:06 +0900
> > Subject: [PATCH v2] oom: Don't count on mm-less current process.
> > 
> > out_of_memory() doesn't trigger the OOM killer if the current task is already
> > exiting or has fatal signals pending, and gives the task access to memory
> > reserves instead. However, doing so is wrong if out_of_memory() is called by
> > an allocation (e.g. from exit_task_work()) after the current task has already
> > released its memory and cleared TIF_MEMDIE at exit_mm(). If we set TIF_MEMDIE
> > on the post-exit_mm() current task again, the OOM killer will be blocked by
> > the task sitting in the final schedule() waiting for its parent to reap it.
> > This will trigger an OOM livelock if the parent is unable to reap it because
> > the parent itself is doing an allocation and waiting for the OOM killer to
> > kill it.
> > 
> > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> 
> Acked-by: Michal Hocko <mhocko@suse.cz>
> 
> Just a nit, you could start the condition with current->mm because it
> is the simplest check. We do not have to check for pending signals or
> PF_EXITING at all if it is NULL. But this is not a hot path, so it
> doesn't matter much. It is just good practice to start with the
> simplest tests first.
> 
> Please also make sure to add Andrew to CC when sending the patch again
> so that he knows about it and picks it up.
> 
> Thanks!
> 
I see. Here is the v3 patch. Andrew, would you please pick this up?

By the way, Michal, I think there is still an unlikely race window at
set_tsk_thread_flag(p, TIF_MEMDIE) in oom_kill_process(). For example,
task1 calls out_of_memory() and select_bad_process() is called from
out_of_memory(). oom_scan_process_thread(task2) is called from
select_bad_process(). oom_scan_process_thread() returns OOM_SCAN_OK
because task2->mm != NULL and task_will_free_mem(task2) == false.
select_bad_process() calls get_task_struct(task2) and returns task2.
Task1 goes to sleep and task2 is woken up. Task2 enters do_exit(),
gets PF_EXITING at exit_signals() and releases its mm at exit_mm().
Task2 goes to sleep and task1 is woken up. Task1 calls
oom_kill_process(task2). oom_kill_process() sets TIF_MEMDIE on task2
because task_will_free_mem(task2) == true due to PF_EXITING already
being set... Should we do something like

        if (task_will_free_mem(p)) {
                if (p->mm)
                        set_tsk_thread_flag(p, TIF_MEMDIE);
                put_task_struct(p);
                return;
        }

at oom_kill_process()? Or even if we do so, how do we detect that task1
went to sleep between checking task2->mm and calling
set_tsk_thread_flag(task2, TIF_MEMDIE)? This race window is very unlikely
because releasing task2->mm is expected to release some memory. But if
somebody else consumed the memory released by exit_mm(task2), I think
there is nothing to protect us.
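
For comparison, a task_lock()-based variant (again only an untested sketch)
might avoid testing ->mm without synchronization at all, because
find_lock_task_mm() returns a thread whose ->mm is still non-NULL with
task_lock() held:

	if (task_will_free_mem(p)) {
		struct task_struct *t = find_lock_task_mm(p);

		/*
		 * exit_mm() clears ->mm under task_lock(), so the mm
		 * cannot go away while we set TIF_MEMDIE here.
		 */
		if (t) {
			set_tsk_thread_flag(t, TIF_MEMDIE);
			task_unlock(t);
		}
		put_task_struct(p);
		return;
	}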
----------------------------------------
>From 3a75c92a03cf17d9505bbb7fc9c81603daac9da0 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sat, 20 Dec 2014 17:18:37 +0900
Subject: [PATCH v3] oom: Don't count on mm-less current process.

out_of_memory() doesn't trigger the OOM killer if the current task is already
exiting or has fatal signals pending, and gives the task access to memory
reserves instead. However, doing so is wrong if out_of_memory() is called by
an allocation (e.g. from exit_task_work()) after the current task has already
released its memory and cleared TIF_MEMDIE at exit_mm(). If we set TIF_MEMDIE
on the post-exit_mm() current task again, the OOM killer will be blocked by
the task sitting in the final schedule() waiting for its parent to reap it.
This will trigger an OOM livelock if the parent is unable to reap it because
the parent itself is doing an allocation and waiting for the OOM killer to
kill it.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.cz>

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d503e9c..f82dd13 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -643,8 +643,12 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 * If current has a pending SIGKILL or is exiting, then automatically
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
+	 *
+	 * But don't select if current has already released its mm and cleared
+	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
 	 */
-	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
+	if (current->mm &&
+	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
 		set_thread_flag(TIF_MEMDIE);
 		return;
 	}
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 276+ messages in thread

* Re: [RFC PATCH] oom: Don't count on mm-less current process.
  2014-12-20  9:13               ` Tetsuo Handa
@ 2014-12-20 11:42                 ` Tetsuo Handa
  2014-12-22 20:25                   ` Michal Hocko
  0 siblings, 1 reply; 276+ messages in thread
From: Tetsuo Handa @ 2014-12-20 11:42 UTC (permalink / raw)
  To: mhocko, akpm; +Cc: linux-mm, rientjes, oleg

Tetsuo Handa wrote:
> By the way, Michal, I think there is still an unlikely race window at
> set_tsk_thread_flag(p, TIF_MEMDIE) in oom_kill_process(). For example,
> task1 calls out_of_memory() and select_bad_process() is called from
> out_of_memory(). oom_scan_process_thread(task2) is called from
> select_bad_process(). oom_scan_process_thread() returns OOM_SCAN_OK
> because task2->mm != NULL and task_will_free_mem(task2) == false.
> select_bad_process() calls get_task_struct(task2) and returns task2.
> Task1 goes to sleep and task2 is woken up. Task2 enters do_exit(),
> gets PF_EXITING at exit_signals() and releases its mm at exit_mm().
> Task2 goes to sleep and task1 is woken up. Task1 calls
> oom_kill_process(task2). oom_kill_process() sets TIF_MEMDIE on task2
> because task_will_free_mem(task2) == true due to PF_EXITING already
> being set... Should we do something like
> 
>         if (task_will_free_mem(p)) {
>                 if (p->mm)
>                         set_tsk_thread_flag(p, TIF_MEMDIE);
>                 put_task_struct(p);
>                 return;
>         }
> 
> at oom_kill_process()? Or even if we do so, how do we detect that task1
> went to sleep between checking task2->mm and calling
> set_tsk_thread_flag(task2, TIF_MEMDIE)? This race window is very unlikely
> because releasing task2->mm is expected to release some memory. But if
> somebody else consumed the memory released by exit_mm(task2), I think
> there is nothing to protect us.
Well, this could happen if task2 is one of the threads in a multi-threaded
process like Java, where exit_mm(task2) merely decrements the mm's refcount
rather than releasing memory. Below is a patch. Michal, please check it.
----------------------------------------
>From a2ebb5b873ec5af45e0bea9ea6da2a93c0f06c35 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sat, 20 Dec 2014 20:05:14 +0900
Subject: [PATCH] oom: Close race of setting TIF_MEMDIE to mm-less process.

exit_mm() and oom_kill_process() could race with regard to handling of the
TIF_MEMDIE flag if the sequence described below occurs.

P1 calls out_of_memory(). out_of_memory() calls select_bad_process().
select_bad_process() calls oom_scan_process_thread(P2). If P2->mm != NULL
and task_will_free_mem(P2) == false, oom_scan_process_thread(P2) returns
OOM_SCAN_OK. And if P2 is chosen as a victim task, select_bad_process()
returns P2 after calling get_task_struct(P2). Then, P1 goes to sleep and
P2 is woken up. P2 enters into do_exit() and gets PF_EXITING at exit_signals()
and releases mm at exit_mm(). Then, P2 goes to sleep and P1 is woken up.
P1 calls oom_kill_process(P2). oom_kill_process() sets TIF_MEMDIE on P2
because task_will_free_mem(P2) == true due to PF_EXITING already set.
Afterward, oom_scan_process_thread(P2) will return OOM_SCAN_ABORT because
test_tsk_thread_flag(P2, TIF_MEMDIE) is checked before P2->mm is checked.

If TIF_MEMDIE is set on P2 again, the OOM killer will be blocked by P2
sitting in the final schedule() waiting for P2's parent to reap P2.
This will trigger an OOM livelock if P2's parent is unable to reap P2
because the parent itself is doing an allocation and waiting for the OOM
killer to kill P2.

To close this race window, clear TIF_MEMDIE if P2->mm == NULL after
set_tsk_thread_flag(P2, TIF_MEMDIE) is done.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 kernel/exit.c | 1 +
 mm/oom_kill.c | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/kernel/exit.c b/kernel/exit.c
index 1ea4369..46d72e6 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -435,6 +435,7 @@ static void exit_mm(struct task_struct *tsk)
 	task_unlock(tsk);
 	mm_update_next_owner(mm);
 	mmput(mm);
+	smp_wmb(); /* Avoid race with oom_kill_process(). */
 	clear_thread_flag(TIF_MEMDIE);
 }
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f82dd13..c8ae445 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -440,6 +440,9 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 */
 	if (task_will_free_mem(p)) {
 		set_tsk_thread_flag(p, TIF_MEMDIE);
+		smp_rmb(); /* Avoid race with exit_mm(). */
+		if (unlikely(!p->mm))
+			clear_tsk_thread_flag(p, TIF_MEMDIE);
 		put_task_struct(p);
 		return;
 	}
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2014-12-20  2:03             ` Dave Chinner
@ 2014-12-20 12:41               ` Tetsuo Handa
  2014-12-20 22:35                 ` Dave Chinner
  0 siblings, 1 reply; 276+ messages in thread
From: Tetsuo Handa @ 2014-12-20 12:41 UTC (permalink / raw)
  To: dchinner; +Cc: mhocko, linux-mm, rientjes, oleg, david

Dave Chinner wrote:
> On Fri, Dec 19, 2014 at 09:22:49PM +0900, Tetsuo Handa wrote:
> > > > The global OOM killer will try to kill this program because this program
> > > > will be using 400MB+ of RAM by the time the global OOM killer is triggered.
> > > > But sometimes this program cannot be terminated by the global OOM killer
> > > > due to XFS lock dependency.
> > > >
> > > > You can see what is happening from OOM traces after uptime > 320 seconds of
> > > > http://I-love.SAKURA.ne.jp/tmp/serial-20141213.txt.xz though memcg is not
> > > > configured on this program.
> > >
> > > This is clearly a separate issue. It is a lock dependency and that alone
> > > _cannot_ be handled from the OOM killer as it doesn't understand lock
> > > dependencies. This should be addressed from the xfs point of view IMHO,
> > > but I am not familiar enough with this filesystem to tell you how or
> > > whether it is possible.
> 
> What XFS lock dependency? I see nothing in that output file that indicates a
> lock dependency problem - can you point out what the issue is here?

This is a problem which lockdep cannot report.

The problem is that an OOM-victim task is unable to terminate because it is
blocked waiting for one of the locks used by XFS (I don't know which one).

----------
[  320.788387] Kill process 10732 (a.out) sharing same memory
(...snipped...)
[  398.641724] a.out           D ffff880077e42638     0 10732      1 0x00000084
[  398.643705]  ffff8800770ebcb8 0000000000000082 ffff8800770ebc88 ffff880077e42210
[  398.645819]  0000000000012500 ffff8800770ebfd8 0000000000012500 ffff880077e42210
[  398.647917]  ffff8800770ebcb8 ffff88007b4a2a48 ffff88007b4a2a4c ffff880077e42210
[  398.650009] Call Trace:
[  398.651094]  [<ffffffff8159f954>] schedule_preempt_disabled+0x24/0x70
[  398.652913]  [<ffffffff815a1705>] __mutex_lock_slowpath+0xb5/0x120
[  398.654679]  [<ffffffff815a178e>] mutex_lock+0x1e/0x32
[  398.656262]  [<ffffffffa023b58a>] xfs_file_buffered_aio_write.isra.15+0x6a/0x200 [xfs]
[  398.658350]  [<ffffffffa023b79e>] xfs_file_write_iter+0x7e/0x120 [xfs]
[  398.660191]  [<ffffffff8117edd9>] new_sync_write+0x89/0xd0
[  398.661829]  [<ffffffff8117f742>] vfs_write+0xb2/0x1f0
[  398.663397]  [<ffffffff8101a9f4>] ? do_audit_syscall_entry+0x64/0x70
[  398.665190]  [<ffffffff81180200>] SyS_write+0x50/0xc0
[  398.666745]  [<ffffffff810f729e>] ? __audit_syscall_exit+0x22e/0x2d0
[  398.668539]  [<ffffffff815a38e9>] system_call_fastpath+0x12/0x17
(...snipped...)
[  897.190487] Out of memory: Kill process 10732 (a.out) score 898 or sacrifice child
[  897.192236] Killed process 10732 (a.out) total-vm:2166864kB, anon-rss:1727976kB, file-rss:0kB
(...snipped...)
[  904.819053] a.out           D ffff880077e42638     0 10732      1 0x00100084
[  904.820967]  ffff8800770ebcb8 0000000000000082 ffff8800770ebc88 ffff880077e42210
[  904.823011]  0000000000012500 ffff8800770ebfd8 0000000000012500 ffff880077e42210
[  904.825054]  ffff8800770ebcb8 ffff88007b4a2a48 ffff88007b4a2a4c ffff880077e42210
[  904.827137] Call Trace:
[  904.828174]  [<ffffffff8159f954>] schedule_preempt_disabled+0x24/0x70
[  904.829924]  [<ffffffff815a1705>] __mutex_lock_slowpath+0xb5/0x120
[  904.831634]  [<ffffffff815a178e>] mutex_lock+0x1e/0x32
[  904.833148]  [<ffffffffa023b58a>] xfs_file_buffered_aio_write.isra.15+0x6a/0x200 [xfs]
[  904.835178]  [<ffffffffa023b79e>] xfs_file_write_iter+0x7e/0x120 [xfs]
[  904.836980]  [<ffffffff8117edd9>] new_sync_write+0x89/0xd0
[  904.838561]  [<ffffffff8117f742>] vfs_write+0xb2/0x1f0
[  904.840094]  [<ffffffff8101a9f4>] ? do_audit_syscall_entry+0x64/0x70
[  904.841846]  [<ffffffff81180200>] SyS_write+0x50/0xc0
[  904.844026]  [<ffffffff810f729e>] ? __audit_syscall_exit+0x22e/0x2d0
[  904.845826]  [<ffffffff815a38e9>] system_call_fastpath+0x12/0x17
----------

I don't know how block layer requests are issued by the filesystem layer's
activities, but PID=10832 has been blocked for a long time at
blk_rq_map_kern() doing a __GFP_WAIT allocation. I'm sure that this
blk_rq_map_kern() call is issued by the XFS filesystem's activities because
this system has only /dev/sda1, formatted as XFS, and there is no swap.

----------
[  393.696527] kworker/1:1     R  running task        0    43      2 0x00000000
[  393.698561] Workqueue: events_freezable_power_ disk_events_workfn
[  393.700339]  ffff88007c5437d8 0000000000000046 ffff88007c5438a0 ffff88007c4b4cc0
[  393.702513]  0000000000012500 ffff88007c543fd8 0000000000012500 ffff88007c4b4cc0
[  393.704631]  0000000000000020 ffff88007c5438b0 0000000000000002 ffffffff81848408
[  393.706748] Call Trace:
[  393.707924]  [<ffffffff8159f814>] _cond_resched+0x24/0x40
[  393.709572]  [<ffffffff81122119>] shrink_slab+0x139/0x150
[  393.711206]  [<ffffffff811252bf>] do_try_to_free_pages+0x35f/0x4d0
[  393.713001]  [<ffffffff811254c4>] try_to_free_pages+0x94/0xc0
[  393.714679]  [<ffffffff8111a793>] __alloc_pages_nodemask+0x4e3/0xa40
[  393.716538]  [<ffffffff8115a8ce>] alloc_pages_current+0x8e/0x100
[  393.718262]  [<ffffffff8125bed6>] bio_copy_user_iov+0x1d6/0x380
[  393.719959]  [<ffffffff8125e4cd>] ? blk_rq_init+0xed/0x160
[  393.721628]  [<ffffffff8125c119>] bio_copy_kern+0x49/0x100
[  393.723240]  [<ffffffff810a14a0>] ? prepare_to_wait_event+0x100/0x100
[  393.725043]  [<ffffffff81265e6f>] blk_rq_map_kern+0x6f/0x130
[  393.726695]  [<ffffffff8116393e>] ? kmem_cache_alloc+0x48e/0x4b0
[  393.728407]  [<ffffffff813a66cf>] scsi_execute+0x12f/0x160
[  393.730021]  [<ffffffff813a7f14>] scsi_execute_req_flags+0x84/0xf0
[  393.731776]  [<ffffffffa01e29cc>] sr_check_events+0xbc/0x2e0 [sr_mod]
[  393.733561]  [<ffffffff8109834c>] ? put_prev_entity+0x2c/0x3b0
[  393.735235]  [<ffffffffa01d6177>] cdrom_check_events+0x17/0x30 [cdrom]
[  393.737027]  [<ffffffffa01e2e5d>] sr_block_check_events+0x2d/0x30 [sr_mod]
[  393.738918]  [<ffffffff812701c6>] disk_check_events+0x56/0x1b0
[  393.740602]  [<ffffffff81270331>] disk_events_workfn+0x11/0x20
[  393.742254]  [<ffffffff8107ceaf>] process_one_work+0x13f/0x370
[  393.743898]  [<ffffffff8107de99>] worker_thread+0x119/0x500
[  393.745495]  [<ffffffff8107dd80>] ? rescuer_thread+0x350/0x350
[  393.747152]  [<ffffffff81082f7c>] kthread+0xdc/0x100
[  393.748637]  [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0
[  393.750438]  [<ffffffff815a383c>] ret_from_fork+0x7c/0xb0
[  393.752004]  [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0
(...snipped...)
[  525.157216] kworker/1:0     R  running task        0 10832      2 0x00000080
[  525.159187] Workqueue: events_freezable_power_ disk_events_workfn
[  525.160907]  ffff88007c8ab7d8 0000000000000046 ffff88007c8ab8a0 ffff88007c894190
[  525.162956]  0000000000012500 ffff88007c8abfd8 0000000000012500 ffff88007c894190
[  525.165010]  0000000000000020 ffff88007c8ab8b0 0000000000000002 ffffffff81848408
[  525.167068] Call Trace:
[  525.168100]  [<ffffffff8159f814>] _cond_resched+0x24/0x40
[  525.169679]  [<ffffffff81122119>] shrink_slab+0x139/0x150
[  525.171241]  [<ffffffff811252bf>] do_try_to_free_pages+0x35f/0x4d0
[  525.172960]  [<ffffffff811254c4>] try_to_free_pages+0x94/0xc0
[  525.174580]  [<ffffffff8111a793>] __alloc_pages_nodemask+0x4e3/0xa40
[  525.176302]  [<ffffffff8115a8ce>] alloc_pages_current+0x8e/0x100
[  525.177982]  [<ffffffff8125bed6>] bio_copy_user_iov+0x1d6/0x380
[  525.179631]  [<ffffffff8125e4cd>] ? blk_rq_init+0xed/0x160
[  525.181215]  [<ffffffff8125c119>] bio_copy_kern+0x49/0x100
[  525.182785]  [<ffffffff810a14a0>] ? prepare_to_wait_event+0x100/0x100
[  525.184545]  [<ffffffff81265e6f>] blk_rq_map_kern+0x6f/0x130
[  525.186156]  [<ffffffff8116393e>] ? kmem_cache_alloc+0x48e/0x4b0
[  525.187831]  [<ffffffff813a66cf>] scsi_execute+0x12f/0x160
[  525.189418]  [<ffffffff813a7f14>] scsi_execute_req_flags+0x84/0xf0
[  525.191148]  [<ffffffffa01e29cc>] sr_check_events+0xbc/0x2e0 [sr_mod]
[  525.192969]  [<ffffffff8109834c>] ? put_prev_entity+0x2c/0x3b0
[  525.194688]  [<ffffffffa01d6177>] cdrom_check_events+0x17/0x30 [cdrom]
[  525.196455]  [<ffffffffa01e2e5d>] sr_block_check_events+0x2d/0x30 [sr_mod]
[  525.198291]  [<ffffffff812701c6>] disk_check_events+0x56/0x1b0
[  525.199984]  [<ffffffff81270331>] disk_events_workfn+0x11/0x20
[  525.201616]  [<ffffffff8107ceaf>] process_one_work+0x13f/0x370
[  525.203264]  [<ffffffff8107de99>] worker_thread+0x119/0x500
[  525.204799]  [<ffffffff8107dd80>] ? rescuer_thread+0x350/0x350
[  525.206436]  [<ffffffff81082f7c>] kthread+0xdc/0x100
[  525.207902]  [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0
[  525.209655]  [<ffffffff815a383c>] ret_from_fork+0x7c/0xb0
[  525.211206]  [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0
(...snipped...)
[  619.934144] kworker/1:0     R  running task        0 10832      2 0x00000080
[  619.936060] Workqueue: events_freezable_power_ disk_events_workfn
[  619.937833]  ffff88007c8ab7d8 0000000000000046 ffff88007c8ab8a0 ffff88007c894190
[  619.939912]  0000000000012500 ffff88007c8abfd8 0000000000012500 ffff88007c894190
[  619.942010]  0000000000000020 ffff88007c8ab8b0 0000000000000002 ffffffff81848408
[  619.944123] Call Trace:
[  619.945168]  [<ffffffff8159f814>] _cond_resched+0x24/0x40
[  619.946697]  [<ffffffff81122119>] shrink_slab+0x139/0x150
[  619.948271]  [<ffffffff811252bf>] do_try_to_free_pages+0x35f/0x4d0
[  619.949968]  [<ffffffff811254c4>] try_to_free_pages+0x94/0xc0
[  619.951576]  [<ffffffff8111a793>] __alloc_pages_nodemask+0x4e3/0xa40
[  619.953387]  [<ffffffff8115a8ce>] alloc_pages_current+0x8e/0x100
[  619.955062]  [<ffffffff8125bed6>] bio_copy_user_iov+0x1d6/0x380
[  619.956726]  [<ffffffff8125e4cd>] ? blk_rq_init+0xed/0x160
[  619.958289]  [<ffffffff8125c119>] bio_copy_kern+0x49/0x100
[  619.959886]  [<ffffffff810a14a0>] ? prepare_to_wait_event+0x100/0x100
[  619.961641]  [<ffffffff81265e6f>] blk_rq_map_kern+0x6f/0x130
[  619.963229]  [<ffffffff8116393e>] ? kmem_cache_alloc+0x48e/0x4b0
[  619.964904]  [<ffffffff813a66cf>] scsi_execute+0x12f/0x160
[  619.966499]  [<ffffffff813a7f14>] scsi_execute_req_flags+0x84/0xf0
[  619.968182]  [<ffffffffa01e29cc>] sr_check_events+0xbc/0x2e0 [sr_mod]
[  619.969936]  [<ffffffff8109834c>] ? put_prev_entity+0x2c/0x3b0
[  619.971583]  [<ffffffffa01d6177>] cdrom_check_events+0x17/0x30 [cdrom]
[  619.973346]  [<ffffffffa01e2e5d>] sr_block_check_events+0x2d/0x30 [sr_mod]
[  619.975213]  [<ffffffff812701c6>] disk_check_events+0x56/0x1b0
[  619.976865]  [<ffffffff81270331>] disk_events_workfn+0x11/0x20
[  619.978497]  [<ffffffff8107ceaf>] process_one_work+0x13f/0x370
[  619.980179]  [<ffffffff8107de99>] worker_thread+0x119/0x500
[  619.981793]  [<ffffffff8107dd80>] ? rescuer_thread+0x350/0x350
[  619.983468]  [<ffffffff81082f7c>] kthread+0xdc/0x100
[  619.984939]  [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0
[  619.986684]  [<ffffffff815a383c>] ret_from_fork+0x7c/0xb0
[  619.988231]  [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0
(...snipped...)
[  715.930998] kworker/1:0     R  running task        0 10832      2 0x00000080
[  715.932930] Workqueue: events_freezable_power_ disk_events_workfn
[  715.934670]  ffff880076fb9b40 0000000000000400 ffff88007c8ab8a0 0000000000000000
[  715.936814]  ffff88007c8ab7e8 ffff88007c8abfd8 0000000000012500 ffff88007c894190
[  715.938869]  0000000000000020 ffff88007c8ab8b0 0000000000000002 ffffffff81848408
[  715.940909] Call Trace:
[  715.942017]  [<ffffffff8159f814>] _cond_resched+0x24/0x40
[  715.943638]  [<ffffffff81122119>] shrink_slab+0x139/0x150
[  715.945256]  [<ffffffff811252bf>] do_try_to_free_pages+0x35f/0x4d0
[  715.947001]  [<ffffffff811254c4>] try_to_free_pages+0x94/0xc0
[  715.948603]  [<ffffffff8111a793>] __alloc_pages_nodemask+0x4e3/0xa40
[  715.950298]  [<ffffffff8115a8ce>] alloc_pages_current+0x8e/0x100
[  715.952010]  [<ffffffff8125bed6>] bio_copy_user_iov+0x1d6/0x380
[  715.953658]  [<ffffffff8125e4cd>] ? blk_rq_init+0xed/0x160
[  715.955324]  [<ffffffff8125c119>] bio_copy_kern+0x49/0x100
[  715.956929]  [<ffffffff810a14a0>] ? prepare_to_wait_event+0x100/0x100
[  715.958693]  [<ffffffff81265e6f>] blk_rq_map_kern+0x6f/0x130
[  715.960722]  [<ffffffff8116393e>] ? kmem_cache_alloc+0x48e/0x4b0
[  715.962488]  [<ffffffff813a66cf>] scsi_execute+0x12f/0x160
[  715.964142]  [<ffffffff813a7f14>] scsi_execute_req_flags+0x84/0xf0
[  715.965870]  [<ffffffffa01e29cc>] sr_check_events+0xbc/0x2e0 [sr_mod]
[  715.967615]  [<ffffffff8109834c>] ? put_prev_entity+0x2c/0x3b0
[  715.969255]  [<ffffffffa01d6177>] cdrom_check_events+0x17/0x30 [cdrom]
[  715.971061]  [<ffffffffa01e2e5d>] sr_block_check_events+0x2d/0x30 [sr_mod]
[  715.972981]  [<ffffffff812701c6>] disk_check_events+0x56/0x1b0
[  715.974692]  [<ffffffff81270331>] disk_events_workfn+0x11/0x20
[  715.976330]  [<ffffffff8107ceaf>] process_one_work+0x13f/0x370
[  715.978090]  [<ffffffff8107de99>] worker_thread+0x119/0x500
[  715.979723]  [<ffffffff8107dd80>] ? rescuer_thread+0x350/0x350
[  715.981361]  [<ffffffff81082f7c>] kthread+0xdc/0x100
[  715.982794]  [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0
[  715.984554]  [<ffffffff815a383c>] ret_from_fork+0x7c/0xb0
[  715.986116]  [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0
(...snipped...)
[  798.788405] kworker/1:0     R  running task        0 10832      2 0x00000088
[  798.790344] Workqueue: events_freezable_power_ disk_events_workfn
[  798.792191]  ffff880035e3f340 0000000000000400 ffff88007c8ab8a0 0000000000000000
[  798.794328]  ffff88007c8ab7e8 ffffffff8112132a ffff88007c8ab908 ffff88007cfee800
[  798.796395]  0000000000000020 0000000000000000 ffff88007c8ab838 ffff88007c8ab8b0
[  798.798458] Call Trace:
[  798.799525]  [<ffffffff8112132a>] ? shrink_slab_node+0x3a/0x1b0
[  798.801229]  [<ffffffff81122063>] ? shrink_slab+0x83/0x150
[  798.802809]  [<ffffffff811252bf>] ? do_try_to_free_pages+0x35f/0x4d0
[  798.804586]  [<ffffffff811254c4>] ? try_to_free_pages+0x94/0xc0
[  798.806250]  [<ffffffff8111a793>] ? __alloc_pages_nodemask+0x4e3/0xa40
[  798.808050]  [<ffffffff8115a8ce>] ? alloc_pages_current+0x8e/0x100
[  798.809759]  [<ffffffff8125bed6>] ? bio_copy_user_iov+0x1d6/0x380
[  798.811500]  [<ffffffff8125e4cd>] ? blk_rq_init+0xed/0x160
[  798.813053]  [<ffffffff8125c119>] ? bio_copy_kern+0x49/0x100
[  798.814699]  [<ffffffff810a14a0>] ? prepare_to_wait_event+0x100/0x100
[  798.816494]  [<ffffffff81265e6f>] ? blk_rq_map_kern+0x6f/0x130
[  798.818421]  [<ffffffff8116393e>] ? kmem_cache_alloc+0x48e/0x4b0
[  798.820083]  [<ffffffff813a66cf>] ? scsi_execute+0x12f/0x160
[  798.821733]  [<ffffffff813a7f14>] ? scsi_execute_req_flags+0x84/0xf0
[  798.823454]  [<ffffffffa01e29cc>] ? sr_check_events+0xbc/0x2e0 [sr_mod]
[  798.825312]  [<ffffffff8109834c>] ? put_prev_entity+0x2c/0x3b0
[  798.826930]  [<ffffffffa01d6177>] ? cdrom_check_events+0x17/0x30 [cdrom]
[  798.828733]  [<ffffffffa01e2e5d>] ? sr_block_check_events+0x2d/0x30 [sr_mod]
[  798.830594]  [<ffffffff812701c6>] ? disk_check_events+0x56/0x1b0
[  798.832338]  [<ffffffff81270331>] ? disk_events_workfn+0x11/0x20
[  798.834013]  [<ffffffff8107ceaf>] ? process_one_work+0x13f/0x370
[  798.835682]  [<ffffffff8107de99>] ? worker_thread+0x119/0x500
[  798.837350]  [<ffffffff8107dd80>] ? rescuer_thread+0x350/0x350
[  798.838990]  [<ffffffff81082f7c>] ? kthread+0xdc/0x100
[  798.840489]  [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0
[  798.842258]  [<ffffffff815a383c>] ? ret_from_fork+0x7c/0xb0
[  798.843837]  [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0
(...snipped...)
[  850.354473] kworker/1:0     R  running task        0 10832      2 0x00000080
[  850.356549] Workqueue: events_freezable_power_ disk_events_workfn
[  850.358273]  ffff88007c8ab7d8 0000000000000046 ffff88007c8ab8a0 ffff88007c894190
[  850.360359]  0000000000012500 ffff88007c8abfd8 0000000000012500 ffff88007c894190
[  850.362427]  0000000000000020 ffff88007c8ab8b0 0000000000000002 ffffffff81848408
[  850.364505] Call Trace:
[  850.365504]  [<ffffffff8159f814>] _cond_resched+0x24/0x40
[  850.369185]  [<ffffffff81122119>] shrink_slab+0x139/0x150
[  850.371553]  [<ffffffff811252bf>] do_try_to_free_pages+0x35f/0x4d0
[  850.373384]  [<ffffffff811254c4>] try_to_free_pages+0x94/0xc0
[  850.375503]  [<ffffffff8111a793>] __alloc_pages_nodemask+0x4e3/0xa40
[  850.377333]  [<ffffffff8115a8ce>] alloc_pages_current+0x8e/0x100
[  850.379100]  [<ffffffff8125bed6>] bio_copy_user_iov+0x1d6/0x380
[  850.380763]  [<ffffffff8125e4cd>] ? blk_rq_init+0xed/0x160
[  850.382362]  [<ffffffff8125c119>] bio_copy_kern+0x49/0x100
[  850.384008]  [<ffffffff810a14a0>] ? prepare_to_wait_event+0x100/0x100
[  850.385799]  [<ffffffff81265e6f>] blk_rq_map_kern+0x6f/0x130
[  850.387572]  [<ffffffff8116393e>] ? kmem_cache_alloc+0x48e/0x4b0
[  850.389995]  [<ffffffff813a66cf>] scsi_execute+0x12f/0x160
[  850.391575]  [<ffffffff813a7f14>] scsi_execute_req_flags+0x84/0xf0
[  850.393298]  [<ffffffffa01e29cc>] sr_check_events+0xbc/0x2e0 [sr_mod]
[  850.395050]  [<ffffffff8109834c>] ? put_prev_entity+0x2c/0x3b0
[  850.396696]  [<ffffffffa01d6177>] cdrom_check_events+0x17/0x30 [cdrom]
[  850.398459]  [<ffffffffa01e2e5d>] sr_block_check_events+0x2d/0x30 [sr_mod]
[  850.400321]  [<ffffffff812701c6>] disk_check_events+0x56/0x1b0
[  850.401986]  [<ffffffff81270331>] disk_events_workfn+0x11/0x20
[  850.403621]  [<ffffffff8107ceaf>] process_one_work+0x13f/0x370
[  850.405618]  [<ffffffff8107de99>] worker_thread+0x119/0x500
[  850.407336]  [<ffffffff8107dd80>] ? rescuer_thread+0x350/0x350
[  850.411190]  [<ffffffff81082f7c>] kthread+0xdc/0x100
[  850.412677]  [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0
[  850.414454]  [<ffffffff815a383c>] ret_from_fork+0x7c/0xb0
[  850.416010]  [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0
(...snipped...)
[  907.302050] kworker/1:0     R  running task        0 10832      2 0x00000080
[  907.303961] Workqueue: events_freezable_power_ disk_events_workfn
[  907.305706]  ffff88007c8ab7d8 0000000000000046 ffff88007c8ab8a0 ffff88007c894190
[  907.307761]  0000000000012500 ffff88007c8abfd8 0000000000012500 ffff88007c894190
[  907.309894]  0000000000000020 ffff88007c8ab8b0 0000000000000002 ffffffff81848408
[  907.311949] Call Trace:
[  907.312989]  [<ffffffff8159f814>] _cond_resched+0x24/0x40
[  907.314578]  [<ffffffff81122119>] shrink_slab+0x139/0x150
[  907.316182]  [<ffffffff811252bf>] do_try_to_free_pages+0x35f/0x4d0
[  907.317889]  [<ffffffff811254c4>] try_to_free_pages+0x94/0xc0
[  907.319535]  [<ffffffff8111a793>] __alloc_pages_nodemask+0x4e3/0xa40
[  907.321259]  [<ffffffff8115a8ce>] alloc_pages_current+0x8e/0x100
[  907.322945]  [<ffffffff8125bed6>] bio_copy_user_iov+0x1d6/0x380
[  907.324606]  [<ffffffff8125e4cd>] ? blk_rq_init+0xed/0x160
[  907.326196]  [<ffffffff8125c119>] bio_copy_kern+0x49/0x100
[  907.327788]  [<ffffffff810a14a0>] ? prepare_to_wait_event+0x100/0x100
[  907.329549]  [<ffffffff81265e6f>] blk_rq_map_kern+0x6f/0x130
[  907.331184]  [<ffffffff8116393e>] ? kmem_cache_alloc+0x48e/0x4b0
[  907.332877]  [<ffffffff813a66cf>] scsi_execute+0x12f/0x160
[  907.334452]  [<ffffffff813a7f14>] scsi_execute_req_flags+0x84/0xf0
[  907.336156]  [<ffffffffa01e29cc>] sr_check_events+0xbc/0x2e0 [sr_mod]
[  907.337893]  [<ffffffff8109834c>] ? put_prev_entity+0x2c/0x3b0
[  907.339539]  [<ffffffffa01d6177>] cdrom_check_events+0x17/0x30 [cdrom]
[  907.341289]  [<ffffffffa01e2e5d>] sr_block_check_events+0x2d/0x30 [sr_mod]
[  907.343115]  [<ffffffff812701c6>] disk_check_events+0x56/0x1b0
[  907.344771]  [<ffffffff81270331>] disk_events_workfn+0x11/0x20
[  907.346421]  [<ffffffff8107ceaf>] process_one_work+0x13f/0x370
[  907.348057]  [<ffffffff8107de99>] worker_thread+0x119/0x500
[  907.349650]  [<ffffffff8107dd80>] ? rescuer_thread+0x350/0x350
[  907.351295]  [<ffffffff81082f7c>] kthread+0xdc/0x100
[  907.352765]  [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0
[  907.354520]  [<ffffffff815a383c>] ret_from_fork+0x7c/0xb0
[  907.356097]  [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0
----------

I don't know which process is holding the mutex that PID=10732 is waiting
for, but I suspect that the holder of that mutex is itself waiting for
completion of disk I/O which is processed by PID=10832.

If my suspicion is correct, it's an AB-BA livelock, because the OOM killer
is waiting for PID=10732 to terminate whereas PID=10832 cannot complete disk
I/O because it is waiting for the OOM killer. Unfortunately I'm not familiar
with XFS, so I can't find out who the holder is.

Maybe PID=10802 rather than PID=10832? Then why are both PID=10802 and
PID=10832 blocked in memory allocation?

----------
[  715.162520] a.out           R  running task        0 10802      1 0x00000084
[  715.164482]  ffff88007b877898 0000000000000082 ffff88007b877960 ffff8800751bc050
[  715.166574]  0000000000012500 ffff88007b877fd8 0000000000012500 ffff8800751bc050
[  715.169036]  0000000000000020 ffff88007b877970 0000000000000003 ffffffff81848408
[  715.171125] Call Trace:
[  715.172185]  [<ffffffff8159f814>] _cond_resched+0x24/0x40
[  715.173773]  [<ffffffff81122119>] shrink_slab+0x139/0x150
[  715.175356]  [<ffffffff811252bf>] do_try_to_free_pages+0x35f/0x4d0
[  715.177088]  [<ffffffff811254c4>] try_to_free_pages+0x94/0xc0
[  715.178721]  [<ffffffff8111a793>] __alloc_pages_nodemask+0x4e3/0xa40
[  715.180583]  [<ffffffff8115a8ce>] alloc_pages_current+0x8e/0x100
[  715.182203]  [<ffffffff81111b27>] __page_cache_alloc+0xa7/0xc0
[  715.183864]  [<ffffffff8111263b>] pagecache_get_page+0x6b/0x1e0
[  715.185533]  [<ffffffffa02522ae>] ? xfs_trans_commit+0x13e/0x230 [xfs]
[  715.187314]  [<ffffffff811127de>] grab_cache_page_write_begin+0x2e/0x50
[  715.189108]  [<ffffffffa02301cf>] xfs_vm_write_begin+0x2f/0xe0 [xfs]
[  715.190876]  [<ffffffff8111188c>] generic_perform_write+0xcc/0x1d0
[  715.192610]  [<ffffffffa023b50f>] ? xfs_file_aio_write_checks+0xdf/0xf0 [xfs]
[  715.194526]  [<ffffffffa023b5ef>] xfs_file_buffered_aio_write.isra.15+0xcf/0x200 [xfs]
[  715.196580]  [<ffffffffa023b79e>] xfs_file_write_iter+0x7e/0x120 [xfs]
[  715.198368]  [<ffffffff8117edd9>] new_sync_write+0x89/0xd0
[  715.200029]  [<ffffffff8117f742>] vfs_write+0xb2/0x1f0
[  715.201576]  [<ffffffff8101a9f4>] ? do_audit_syscall_entry+0x64/0x70
[  715.203309]  [<ffffffff81180200>] SyS_write+0x50/0xc0
[  715.204866]  [<ffffffff810f729e>] ? __audit_syscall_exit+0x22e/0x2d0
[  715.206613]  [<ffffffff815a38e9>] system_call_fastpath+0x12/0x17
(...snipped...)
[  906.533722] a.out           R  running task        0 10802      1 0x00000084
[  906.535671]  ffff88007b877898 0000000000000082 ffff88007b877960 ffff8800751bc050
[  906.537699]  0000000000012500 ffff88007b877fd8 0000000000012500 ffff8800751bc050
[  906.539838]  0000000000000020 ffff88007b877970 0000000000000003 ffffffff81848408
[  906.541916] Call Trace:
[  906.543075]  [<ffffffff8159f814>] _cond_resched+0x24/0x40
[  906.544610]  [<ffffffff81122119>] shrink_slab+0x139/0x150
[  906.546223]  [<ffffffff811252bf>] do_try_to_free_pages+0x35f/0x4d0
[  906.547941]  [<ffffffff811254c4>] try_to_free_pages+0x94/0xc0
[  906.549622]  [<ffffffff8111a793>] __alloc_pages_nodemask+0x4e3/0xa40
[  906.551357]  [<ffffffff8115a8ce>] alloc_pages_current+0x8e/0x100
[  906.553070]  [<ffffffff81111b27>] __page_cache_alloc+0xa7/0xc0
[  906.554748]  [<ffffffff8111263b>] pagecache_get_page+0x6b/0x1e0
[  906.556409]  [<ffffffffa02522ae>] ? xfs_trans_commit+0x13e/0x230 [xfs]
[  906.558180]  [<ffffffff811127de>] grab_cache_page_write_begin+0x2e/0x50
[  906.560242]  [<ffffffffa02301cf>] xfs_vm_write_begin+0x2f/0xe0 [xfs]
[  906.562027]  [<ffffffff8111188c>] generic_perform_write+0xcc/0x1d0
[  906.563851]  [<ffffffffa023b50f>] ? xfs_file_aio_write_checks+0xdf/0xf0 [xfs]
[  906.565838]  [<ffffffffa023b5ef>] xfs_file_buffered_aio_write.isra.15+0xcf/0x200 [xfs]
[  906.567892]  [<ffffffffa023b79e>] xfs_file_write_iter+0x7e/0x120 [xfs]
[  906.569719]  [<ffffffff8117edd9>] new_sync_write+0x89/0xd0
[  906.571300]  [<ffffffff8117f742>] vfs_write+0xb2/0x1f0
[  906.572836]  [<ffffffff8101a9f4>] ? do_audit_syscall_entry+0x64/0x70
[  906.574578]  [<ffffffff81180200>] SyS_write+0x50/0xc0
[  906.576198]  [<ffffffff810f729e>] ? __audit_syscall_exit+0x22e/0x2d0
[  906.577929]  [<ffffffff815a38e9>] system_call_fastpath+0x12/0x17
----------

Anyway, stalling for 10 minutes upon OOM (and not being able to resolve it
with SysRq-f) is unusable for me.

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2014-12-20 12:41               ` Tetsuo Handa
@ 2014-12-20 22:35                 ` Dave Chinner
  2014-12-21  8:45                   ` Tetsuo Handa
  2014-12-29 17:40                   ` [PATCH] mm: get rid of radix tree gfp mask for pagecache_get_page (was: Re: How to handle TIF_MEMDIE stalls?) Michal Hocko
  0 siblings, 2 replies; 276+ messages in thread
From: Dave Chinner @ 2014-12-20 22:35 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: dchinner, mhocko, linux-mm, rientjes, oleg

On Sat, Dec 20, 2014 at 09:41:22PM +0900, Tetsuo Handa wrote:
> Dave Chinner wrote:
> > On Fri, Dec 19, 2014 at 09:22:49PM +0900, Tetsuo Handa wrote:
> > > > > The global OOM killer will try to kill this program because this program
> > > > > will be using 400MB+ of RAM by the time the global OOM killer is triggered.
> > > > > But sometimes this program cannot be terminated by the global OOM killer
> > > > > due to XFS lock dependency.
> > > > >
> > > > > You can see what is happening from OOM traces after uptime > 320 seconds of
> > > > > http://I-love.SAKURA.ne.jp/tmp/serial-20141213.txt.xz though memcg is not
> > > > > configured on this program.
> > > >
> > > > This is clearly a separate issue. It is a lock dependency and that alone
> > > > _cannot_ be handled from the OOM killer as it doesn't understand lock
> > > > dependencies. This should be addressed from the xfs point of view IMHO,
> > > > but I am not familiar enough with this filesystem to tell you how or
> > > > whether it is possible.
> > 
> > What XFS lock dependency? I see nothing in that output file that indicates a
> > lock dependency problem - can you point out what the issue is here?
> 
> This is a problem which lockdep cannot report.
> 
> The problem is that an OOM-victim task is unable to terminate because it is
> blocked waiting for one of the locks used by XFS (I don't know which one).

That's not an XFS problem - XFS relies on the memory reclaim
subsystem being able to make progress. If the memory reclaim
subsystem cannot make progress, then there's a bug in the memory
reclaim subsystem, not a problem with the OOM killer.

IOWs, you're not looking at the right place to solve the problem.

> ----------
> [  320.788387] Kill process 10732 (a.out) sharing same memory
> (...snipped...)
> [  398.641724] a.out           D ffff880077e42638     0 10732      1 0x00000084
> [  398.643705]  ffff8800770ebcb8 0000000000000082 ffff8800770ebc88 ffff880077e42210
> [  398.645819]  0000000000012500 ffff8800770ebfd8 0000000000012500 ffff880077e42210
> [  398.647917]  ffff8800770ebcb8 ffff88007b4a2a48 ffff88007b4a2a4c ffff880077e42210
> [  398.650009] Call Trace:
> [  398.651094]  [<ffffffff8159f954>] schedule_preempt_disabled+0x24/0x70
> [  398.652913]  [<ffffffff815a1705>] __mutex_lock_slowpath+0xb5/0x120
> [  398.654679]  [<ffffffff815a178e>] mutex_lock+0x1e/0x32
> [  398.656262]  [<ffffffffa023b58a>] xfs_file_buffered_aio_write.isra.15+0x6a/0x200 [xfs]
> [  398.658350]  [<ffffffffa023b79e>] xfs_file_write_iter+0x7e/0x120 [xfs]
> [  398.660191]  [<ffffffff8117edd9>] new_sync_write+0x89/0xd0
> [  398.661829]  [<ffffffff8117f742>] vfs_write+0xb2/0x1f0
> [  398.663397]  [<ffffffff8101a9f4>] ? do_audit_syscall_entry+0x64/0x70
> [  398.665190]  [<ffffffff81180200>] SyS_write+0x50/0xc0
> [  398.666745]  [<ffffffff810f729e>] ? __audit_syscall_exit+0x22e/0x2d0
> [  398.668539]  [<ffffffff815a38e9>] system_call_fastpath+0x12/0x17

These processes are blocked because some other process is holding the
i_mutex - likely another write that is blocked in memory reclaim
during page cache allocation. Yup:

[  398.852364] a.out           R  running task        0 10739      1 0x00000084
[  398.854312]  ffff8800751d3898 0000000000000082 ffff8800751d3960 ffff880035c42a80
[  398.856369]  0000000000012500 ffff8800751d3fd8 0000000000012500 ffff880035c42a80
[  398.858440]  0000000000000020 ffff8800751d3970 0000000000000003 ffffffff81848408
[  398.860497] Call Trace:
[  398.861602]  [<ffffffff8159f814>] _cond_resched+0x24/0x40
[  398.863195]  [<ffffffff81122119>] shrink_slab+0x139/0x150
[  398.864799]  [<ffffffff811252bf>] do_try_to_free_pages+0x35f/0x4d0
[  398.866536]  [<ffffffff811254c4>] try_to_free_pages+0x94/0xc0
[  398.868177]  [<ffffffff8111a793>] __alloc_pages_nodemask+0x4e3/0xa40
[  398.869920]  [<ffffffff8115a8ce>] alloc_pages_current+0x8e/0x100
[  398.871647]  [<ffffffff81111b27>] __page_cache_alloc+0xa7/0xc0
[  398.873785]  [<ffffffff8111263b>] pagecache_get_page+0x6b/0x1e0
[  398.875468]  [<ffffffff811127de>] grab_cache_page_write_begin+0x2e/0x50
[  398.881857]  [<ffffffffa02301cf>] xfs_vm_write_begin+0x2f/0xe0 [xfs]
[  398.883553]  [<ffffffff8111188c>] generic_perform_write+0xcc/0x1d0
[  398.885210]  [<ffffffffa023b50f>] ? xfs_file_aio_write_checks+0xdf/0xf0 [xfs]
[  398.887100]  [<ffffffffa023b5ef>] xfs_file_buffered_aio_write.isra.15+0xcf/0x200 [xfs]
[  398.889135]  [<ffffffffa023b79e>] xfs_file_write_iter+0x7e/0x120 [xfs]
[  398.890907]  [<ffffffff8117edd9>] new_sync_write+0x89/0xd0
[  398.892495]  [<ffffffff8117f742>] vfs_write+0xb2/0x1f0
[  398.894017]  [<ffffffff8101a9f4>] ? do_audit_syscall_entry+0x64/0x70
[  398.895768]  [<ffffffff81180200>] SyS_write+0x50/0xc0
[  398.897273]  [<ffffffff810f729e>] ? __audit_syscall_exit+0x22e/0x2d0
[  398.899013]  [<ffffffff815a38e9>] system_call_fastpath+0x12/0x17

That's what's holding the i_mutex. This is normal, and *every*
filesystem holds the i_mutex here for buffered writes. Stop
trying to shoot the messenger...

Oh, boy.

struct page *grab_cache_page_write_begin(struct address_space *mapping,
                                        pgoff_t index, unsigned flags)
{
        struct page *page;
        int fgp_flags = FGP_LOCK|FGP_ACCESSED|FGP_WRITE|FGP_CREAT;

        if (flags & AOP_FLAG_NOFS)
                fgp_flags |= FGP_NOFS;

        page = pagecache_get_page(mapping, index, fgp_flags,
                        mapping_gfp_mask(mapping),
                        GFP_KERNEL);
        if (page)
                wait_for_stable_page(page);

        return page;
}

There are *3* different memory allocation controls passed to
pagecache_get_page. The first is via AOP_FLAG_NOFS, where the caller
explicitly says this allocation is in filesystem context with locks
held, and so all allocations need to be done in GFP_NOFS context.
This is used to override the second and third gfp parameters.

The second is mapping_gfp_mask(mapping), which is the *default
allocation context* the filesystem wants the page cache to use for
allocating pages to the mapping.

The third is a hard coded GFP_KERNEL, which is used for radix tree
node allocation.

Why are there separate allocation contexts for the radix tree nodes
and the page cache pages when they are done under *exactly the same
caller context*? Either we are allowed to recurse into the
filesystem or we aren't, and the inode mapping mask defines that
context for all page cache allocations, not just the pages
themselves.

And to point out how many filesystems this affects:
the loop device, btrfs, f2fs, gfs2, jfs, logfs, nilfs2, reiserfs
and XFS all use this mapping default to clear __GFP_FS from
page cache allocations. Only ext4 and gfs2 use AOP_FLAG_NOFS in
their ->write_begin callouts to prevent recursion.

IOWs, grab_cache_page_write_begin/pagecache_get_page's multiple
allocation contexts are just wrong.  They do not match the way
filesystems inform the page cache of allocation context to
avoid recursion (for avoiding stack overflow and/or deadlock).
AOP_FLAG_NOFS should go away, and all filesystems should modify the
mapping gfp mask to set their allocation context. It should be used
*everywhere* pages are allocated into the page cache, and for all
allocations related to tracking those allocated pages.
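
For reference, that convention is a one-liner at inode setup time. A sketch
(the function name is made up, but this is effectively what the filesystems
listed above do when they initialise an inode):

static void example_setup_inode(struct inode *inode)
{
	/*
	 * Clear __GFP_FS once; every later page cache allocation for
	 * this mapping then inherits the GFP_NOFS context.
	 */
	mapping_set_gfp_mask(inode->i_mapping,
			     mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS);
}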

Now, that's not the problem directly related to this lockup, but
it's indicative of how far the page cache code has drifted from
reality over the past few years...

So, going back to the lockup, doesn't the fact that so many
processes are spinning in the shrinker tell you that there's a
problem in that area? i.e. this:

[  398.861602]  [<ffffffff8159f814>] _cond_resched+0x24/0x40
[  398.863195]  [<ffffffff81122119>] shrink_slab+0x139/0x150
[  398.864799]  [<ffffffff811252bf>] do_try_to_free_pages+0x35f/0x4d0

tells me a shrinker is not making progress for some reason.  I'd
suggest that you run some tracing to find out which shrinker it is
stuck in. There are tracepoints in shrink_slab that will tell you
which shrinker is iterating for long periods of time. I.e. instead of
ranting and pointing fingers at everyone, you need to keep digging
until you know exactly where reclaim progress is stalling.

> I don't know how block layer requests are issued by the filesystem layer's
> activities, but PID=10832 has been blocked for a long time at
> blk_rq_map_kern() doing a __GFP_WAIT allocation. I'm sure that this
> blk_rq_map_kern() call is issued by the XFS filesystem's activities because
> this system has only /dev/sda1, formatted as XFS, and there is no swap.

Sorry, what?

[  525.184545]  [<ffffffff81265e6f>] blk_rq_map_kern+0x6f/0x130
[  525.186156]  [<ffffffff8116393e>] ? kmem_cache_alloc+0x48e/0x4b0
[  525.187831]  [<ffffffff813a66cf>] scsi_execute+0x12f/0x160
[  525.189418]  [<ffffffff813a7f14>] scsi_execute_req_flags+0x84/0xf0
[  525.191148]  [<ffffffffa01e29cc>] sr_check_events+0xbc/0x2e0 [sr_mod]
[  525.192969]  [<ffffffff8109834c>] ? put_prev_entity+0x2c/0x3b0
[  525.194688]  [<ffffffffa01d6177>] cdrom_check_events+0x17/0x30 [cdrom]
[  525.196455]  [<ffffffffa01e2e5d>] sr_block_check_events+0x2d/0x30 [sr_mod]
[  525.198291]  [<ffffffff812701c6>] disk_check_events+0x56/0x1b0
[  525.199984]  [<ffffffff81270331>] disk_events_workfn+0x11/0x20
[  525.201616]  [<ffffffff8107ceaf>] process_one_work+0x13f/0x370
[  525.203264]  [<ffffffff8107de99>] worker_thread+0x119/0x500
[  525.204799]  [<ffffffff8107dd80>] ? rescuer_thread+0x350/0x350
[  525.206436]  [<ffffffff81082f7c>] kthread+0xdc/0x100
[  525.207902]  [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0
[  525.209655]  [<ffffffff815a383c>] ret_from_fork+0x7c/0xb0
[  525.211206]  [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0

That's a CDROM event through the SCSI stack via a raw scsi device.
If you read the code you'd see that scsi_execute() is the function
using __GFP_WAIT semantics. This has *absolutely nothing* to do with
XFS, and clearly has nothing to do with anything related to the
problem you are seeing.

> Anyway, stalling for 10 minutes upon OOM (and not being able to resolve
> it with SysRq-f) is unusable for me.

OOM-killing is not a magic button that will miraculously make the
system work when you oversubscribe it severely.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2014-12-20 22:35                 ` Dave Chinner
@ 2014-12-21  8:45                   ` Tetsuo Handa
  2014-12-21 20:42                     ` Dave Chinner
  2014-12-29 18:19                     ` Michal Hocko
  2014-12-29 17:40                   ` [PATCH] mm: get rid of radix tree gfp mask for pagecache_get_page (was: Re: How to handle TIF_MEMDIE stalls?) Michal Hocko
  1 sibling, 2 replies; 276+ messages in thread
From: Tetsuo Handa @ 2014-12-21  8:45 UTC (permalink / raw)
  To: david; +Cc: dchinner, mhocko, linux-mm, rientjes, oleg

Thank you for the detailed explanation.

Dave Chinner wrote:
> So, going back to the lockup, doesn't the fact that so many
> processes are spinning in the shrinker tell you that there's a
> problem in that area? i.e. this:
> 
> [  398.861602]  [<ffffffff8159f814>] _cond_resched+0x24/0x40
> [  398.863195]  [<ffffffff81122119>] shrink_slab+0x139/0x150
> [  398.864799]  [<ffffffff811252bf>] do_try_to_free_pages+0x35f/0x4d0
> 
> tells me a shrinker is not making progress for some reason.  I'd
> suggest that you run some tracing to find out which shrinker it is
> stuck in. There are tracepoints in shrink_slab that will tell you
> which shrinker is iterating for long periods of time. I.e. instead of
> ranting and pointing fingers at everyone, you need to keep digging
> until you know exactly where reclaim progress is stalling.

Using the patch below, I checked that shrink_slab() is called many times,
but each call took 0 jiffies and freed 0 objects. I think shrink_slab() is
merely reported because it works as a convenient location for yielding the
CPU.

----------
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5e344bb..ac8b46a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1661,6 +1661,14 @@ struct task_struct {
 	unsigned int	sequential_io;
 	unsigned int	sequential_io_avg;
 #endif
+	/* Jiffies spent since the start of outermost memory allocation */
+	unsigned long gfp_start;
+	/* GFP flags passed to innermost memory allocation */
+	gfp_t gfp_flags;
+	/* # of shrink_slab() calls since outermost memory allocation. */
+	unsigned int shrink_slab_counter;
+	/* # of OOM-killer skipped. */
+	atomic_t oom_killer_skip_counter;
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 89e7283..26dcdf8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4522,6 +4522,22 @@ out_unlock:
 	return retval;
 }
 
+static void print_memalloc_info(const struct task_struct *p)
+{
+	const gfp_t gfp = p->gfp_flags & __GFP_WAIT;
+
+	/*
+	 * __alloc_pages_nodemask() doesn't use smp_wmb() between
+	 * updating ->gfp_start and ->gfp_flags. But reading stale
+	 * ->gfp_start value harms nothing but printing bogus duration.
+	 * Correct duration will be printed when this function is
+	 * called for the next time.
+	 */
+	if (unlikely(gfp))
+		printk(KERN_INFO "MemAlloc: %ld jiffies on 0x%x\n",
+		       jiffies - p->gfp_start, gfp);
+}
+
 static const char stat_nam[] = TASK_STATE_TO_CHAR_STR;
 
 void sched_show_task(struct task_struct *p)
@@ -4554,6 +4570,7 @@ void sched_show_task(struct task_struct *p)
 		task_pid_nr(p), ppid,
 		(unsigned long)task_thread_info(p)->flags);
 
+	print_memalloc_info(p);
 	print_worker_info(KERN_INFO, p);
 	show_stack(p, NULL);
 }
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5340f6b..5b014d0 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -319,6 +319,10 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 		case OOM_SCAN_CONTINUE:
 			continue;
 		case OOM_SCAN_ABORT:
+			if (atomic_inc_return(&p->oom_killer_skip_counter) % 1000 == 0)
+				printk(KERN_INFO "%s(%d) the OOM killer was skipped "
+				       "for %u times.\n", p->comm, p->pid,
+				       atomic_read(&p->oom_killer_skip_counter));
 			rcu_read_unlock();
 			return (struct task_struct *)(-1UL);
 		case OOM_SCAN_OK:
@@ -444,6 +448,10 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
 	if (p->flags & PF_EXITING) {
+		if (atomic_inc_return(&p->oom_killer_skip_counter) % 1000 == 0)
+			printk(KERN_INFO "%s(%d) the OOM killer was skipped "
+			       "for %u times.\n", p->comm, p->pid,
+			       atomic_read(&p->oom_killer_skip_counter));
 		set_tsk_thread_flag(p, TIF_MEMDIE);
 		put_task_struct(p);
 		return;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 616a2c9..d1c872f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2790,6 +2790,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	unsigned int cpuset_mems_cookie;
 	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
 	int classzone_idx;
+	const gfp_t old_gfp_flags = current->gfp_flags;
+
+	if (!old_gfp_flags) {
+		current->gfp_start = jiffies;
+		current->shrink_slab_counter = 0;
+	}
+	current->gfp_flags = gfp_mask;
 
 	gfp_mask &= gfp_allowed_mask;
 
@@ -2798,7 +2805,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	might_sleep_if(gfp_mask & __GFP_WAIT);
 
 	if (should_fail_alloc_page(gfp_mask, order))
-		return NULL;
+		goto nopage;
 
 	/*
 	 * Check the zones suitable for the gfp_mask contain at least one
@@ -2806,7 +2813,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	 * of GFP_THISNODE and a memoryless node
 	 */
 	if (unlikely(!zonelist->_zonerefs->zone))
-		return NULL;
+		goto nopage;
 
 	if (IS_ENABLED(CONFIG_CMA) && migratetype == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;
@@ -2850,6 +2857,9 @@ out:
 	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
 		goto retry_cpuset;
 
+nopage:
+	current->gfp_flags = old_gfp_flags;
+
 	return page;
 }
 EXPORT_SYMBOL(__alloc_pages_nodemask);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index dcb4707..5690f2d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -365,6 +365,7 @@ unsigned long shrink_slab(struct shrink_control *shrinkctl,
 {
 	struct shrinker *shrinker;
 	unsigned long freed = 0;
+	const unsigned long start = jiffies;
 
 	if (nr_pages_scanned == 0)
 		nr_pages_scanned = SWAP_CLUSTER_MAX;
@@ -397,6 +398,15 @@ unsigned long shrink_slab(struct shrink_control *shrinkctl,
 	}
 	up_read(&shrinker_rwsem);
 out:
+	{
+		struct task_struct *p = current;
+		if (++p->shrink_slab_counter % 100000 == 0)
+			printk(KERN_INFO "%s(%d) shrink_slab() was called "
+			       "%u times. This time freed %lu objects and took "
+			       "%lu jiffies. Spent %lu jiffies till now.\n",
+			       p->comm, p->pid, p->shrink_slab_counter, freed,
+			       jiffies - start, jiffies - p->gfp_start);
+	}
 	cond_resched();
 	return freed;
 }
----------

The traces from uptime > 484 seconds in
http://I-love.SAKURA.ne.jp/tmp/serial-20141221.txt.xz show a stalled case.
PID=12718 got SIGKILL for the first time when PID=12716 got SIGKILL with
TIF_MEMDIE at 484 sec. When PID=12717 got TIF_MEMDIE at 540 sec, the OOM
killer was skipped 28000 times until 547 sec, but PID=12717 was able
to terminate because somebody had released enough memory for PID=12717 to
call exit_mm(). When PID=12718 got TIF_MEMDIE at 548 sec, the OOM killer was
skipped 2059000 times until 983 sec, indicating that PID=12718 was not
able to terminate because nobody released enough memory for PID=12718
to call exit_mm(). Is this interpretation correct?

> That's not an XFS problem - XFS relies on the memory reclaim
> subsystem being able to make progress. If the memory reclaim
> subsystem cannot make progress, then there's a bug in the memory
> reclaim subsystem, not a problem with the OOM killer.

Since trying to trigger the OOM killer means that the memory reclaim
subsystem has given up, the memory reclaim subsystem must have been unable
to find reclaimable memory after PID=12718 got TIF_MEMDIE at 548 sec.
Is this interpretation correct?

And traces of PID=12718 after 548 sec remained unchanged.
Does this mean that there is a bug in the memory reclaim subsystem?

----------
[  799.490009] a.out           D ffff8800764918a0     0 12718      1 0x00100084
[  799.491903]  ffff880077d7fca8 0000000000000086 ffff880076491470 ffff880077d7ffd8
[  799.493924]  0000000000013640 0000000000013640 ffff8800358c8210 ffff880076491470
[  799.495938]  0000000000000000 ffff88007c8a3e48 ffff88007c8a3e4c ffff880076491470
[  799.497964] Call Trace:
[  799.498971]  [<ffffffff81618669>] schedule_preempt_disabled+0x29/0x70
[  799.500746]  [<ffffffff8161a555>] __mutex_lock_slowpath+0xb5/0x120
[  799.502402]  [<ffffffff8161a5e3>] mutex_lock+0x23/0x37
[  799.503944]  [<ffffffffa025fb47>] xfs_file_buffered_aio_write.isra.9+0x77/0x270 [xfs]
[  799.505939]  [<ffffffff8109e274>] ? finish_task_switch+0x54/0x150
[  799.507638]  [<ffffffffa025fdc3>] xfs_file_write_iter+0x83/0x130 [xfs]
[  799.509416]  [<ffffffff811ce76e>] new_sync_write+0x8e/0xd0
[  799.510990]  [<ffffffff811cf0f7>] vfs_write+0xb7/0x1f0
[  799.512484]  [<ffffffff81022d9c>] ? do_audit_syscall_entry+0x6c/0x70
[  799.514226]  [<ffffffff811cfbe5>] SyS_write+0x55/0xd0
[  799.515752]  [<ffffffff8161c9e9>] system_call_fastpath+0x12/0x17
(...snipped...)
[  954.595576] a.out           D ffff8800764918a0     0 12718      1 0x00100084
[  954.597544]  ffff880077d7fca8 0000000000000086 ffff880076491470 ffff880077d7ffd8
[  954.599565]  0000000000013640 0000000000013640 ffff8800358c8210 ffff880076491470
[  954.601634]  0000000000000000 ffff88007c8a3e48 ffff88007c8a3e4c ffff880076491470
[  954.604091] Call Trace:
[  954.607766]  [<ffffffff81618669>] schedule_preempt_disabled+0x29/0x70
[  954.609792]  [<ffffffff8161a555>] __mutex_lock_slowpath+0xb5/0x120
[  954.611644]  [<ffffffff8161a5e3>] mutex_lock+0x23/0x37
[  954.613256]  [<ffffffffa025fb47>] xfs_file_buffered_aio_write.isra.9+0x77/0x270 [xfs]
[  954.615261]  [<ffffffff8109e274>] ? finish_task_switch+0x54/0x150
[  954.616990]  [<ffffffffa025fdc3>] xfs_file_write_iter+0x83/0x130 [xfs]
[  954.619180]  [<ffffffff811ce76e>] new_sync_write+0x8e/0xd0
[  954.620798]  [<ffffffff811cf0f7>] vfs_write+0xb7/0x1f0
[  954.622345]  [<ffffffff81022d9c>] ? do_audit_syscall_entry+0x6c/0x70
[  954.624073]  [<ffffffff811cfbe5>] SyS_write+0x55/0xd0
[  954.625549]  [<ffffffff8161c9e9>] system_call_fastpath+0x12/0x17
----------

I guess __alloc_pages_direct_reclaim() returns NULL with did_some_progress > 0
so that __alloc_pages_may_oom() will not be called easily. As long as
try_to_free_pages() returns non-zero, __alloc_pages_direct_reclaim() might
return NULL with did_some_progress > 0. So, do_try_to_free_pages() is called
many times and is likely to return non-zero. And when
__alloc_pages_may_oom() is called, TIF_MEMDIE is set on the thread waiting
for mutex_lock(&"struct inode"->i_mutex) at xfs_file_buffered_aio_write()
and I see no further progress.
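
(If I read mm/page_alloc.c correctly, the slowpath behaves roughly like the
sketch below. This is my simplified illustration with placeholder helper
names, not the actual code, but it shows why reclaim progress without an
allocated page loops back to reclaim instead of reaching the OOM path.)

----------
	/*
	 * Simplified sketch of __alloc_pages_slowpath() (placeholder
	 * helper names, not real functions): the OOM path is reached
	 * only when direct reclaim reports no progress at all.
	 */
	for (;;) {
		did_some_progress = run_direct_reclaim(); /* try_to_free_pages() */
		page = attempt_allocation();	/* get_page_from_freelist() */
		if (page)
			break;			/* success */
		if (did_some_progress)
			continue;		/* progress > 0: retry, never OOM */
		page = oom_kill_and_retry();	/* only when reclaim gave up */
		break;
	}
----------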

I don't know where to examine next. Would you please teach me the command
lines for the tracepoints to examine?
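
For reference, my guess (untested) is that the tracepoints Dave mentioned
are the vmscan events, enabled with something like the following (assuming
debugfs is mounted at /sys/kernel/debug), but I do not know what to look
for in the output:

----------
echo 1 > /sys/kernel/debug/tracing/events/vmscan/mm_shrink_slab_start/enable
echo 1 > /sys/kernel/debug/tracing/events/vmscan/mm_shrink_slab_end/enable
cat /sys/kernel/debug/tracing/trace_pipe
----------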


> That's a CDROM event through the SCSI stack via a raw scsi device.
> If you read the code you'd see that scsi_execute() is the function
> using __GFP_WAIT semantics. This has *absolutely nothing* to do with
> XFS, and clearly has nothing to do with anything related to the
> problem you are seeing.

Oops, sorry. I was misunderstanding that

[  907.336156]  [<ffffffffa01e29cc>] sr_check_events+0xbc/0x2e0 [sr_mod]
[  907.337893]  [<ffffffff8109834c>] ? put_prev_entity+0x2c/0x3b0
[  907.339539]  [<ffffffffa01d6177>] cdrom_check_events+0x17/0x30 [cdrom]
[  907.341289]  [<ffffffffa01e2e5d>] sr_block_check_events+0x2d/0x30 [sr_mod]

lines are garbage. But indeed there is a chain

  disk_check_events() =>
    disk->fops->check_events(disk, clearing) == sr_block_check_events() =>
      cdrom_check_events() =>
        cdrom_update_events() =>
          cdi->ops->check_events() == sr_check_events() =>
            sr_get_events() =>
              scsi_execute_req()

that indicates it is blocked on a CDROM event.

^ permalink raw reply related	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2014-12-21  8:45                   ` Tetsuo Handa
@ 2014-12-21 20:42                     ` Dave Chinner
  2014-12-22 16:57                       ` Michal Hocko
  2014-12-29 18:19                     ` Michal Hocko
  1 sibling, 1 reply; 276+ messages in thread
From: Dave Chinner @ 2014-12-21 20:42 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: dchinner, mhocko, linux-mm, rientjes, oleg

On Sun, Dec 21, 2014 at 05:45:32PM +0900, Tetsuo Handa wrote:
> Thank you for detailed explanation.
> 
> Dave Chinner wrote:
> > So, going back to the lockup, doesn't hte fact that so many
> > processes are spinning in the shrinker tell you that there's a
> > problem in that area? i.e. this:
> > 
> > [  398.861602]  [<ffffffff8159f814>] _cond_resched+0x24/0x40
> > [  398.863195]  [<ffffffff81122119>] shrink_slab+0x139/0x150
> > [  398.864799]  [<ffffffff811252bf>] do_try_to_free_pages+0x35f/0x4d0
> > 
> > tells me a shrinker is not making progress for some reason.  I'd
> > suggest that you run some tracing to find out what shrinker it is
> > stuck in. there are tracepoints in shrink_slab that will tell you
> > what shrinker is iterating for long periods of time. i.e instead of
> > ranting and pointing fingers at everyone, you need to keep digging
> > until you know exactly where reclaim progress is stalling.
> 
> I checked, using the patch below, that shrink_slab() is called many times but
> each call took 0 jiffies and freed 0 objects. I think shrink_slab() merely
> shows up in the traces because it works as a location for yielding the CPU.

So we've got a situation where memory reclaim is not making
progress because there's nothing left to free, and everything is
backed up waiting for memory allocation to complete so that locks
can be released.


> Since trying to trigger the OOM killer means that the memory reclaim
> subsystem has given up, the memory reclaim subsystem must have been unable
> to find reclaimable memory after PID=12718 got TIF_MEMDIE at 548 sec.
> Is this interpretation correct?

"memory reclaim gave up"? So why the hell isn't it returning a
failure to the caller?

i.e. We have a perfectly good page cache allocation failure error
path here all the way back to userspace, but we're invoking the
OOM-killer to kill random processes rather than returning ENOMEM to
the processes that are generating the memory demand?
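
(By that error path I mean something along these lines in the buffered
write path - a sketch, not verbatim code:)

----------
	/*
	 * Sketch of the existing page cache failure path: a failed
	 * allocation propagates -ENOMEM back to the write(2) caller
	 * instead of OOM-killing anything.
	 */
	page = grab_cache_page_write_begin(mapping, index, flags);
	if (!page)
		return -ENOMEM;	/* surfaces to userspace as a write() error */
----------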

Further: when did the oom-killer become the primary method
of handling situations when memory allocation needs to fail?
__GFP_WAIT does *not* mean memory allocation can't fail - that's what
__GFP_NOFAIL means. And none of the page cache allocations use
__GFP_NOFAIL, so why aren't we getting an allocation failure before
the oom-killer is kicked?

> I guess __alloc_pages_direct_reclaim() returns NULL with did_some_progress > 0
> so that __alloc_pages_may_oom() will not be called easily. As long as
> try_to_free_pages() returns non-zero, __alloc_pages_direct_reclaim() might
> return NULL with did_some_progress > 0. So, do_try_to_free_pages() is called
> many times and is likely to return non-zero. And when
> __alloc_pages_may_oom() is called, TIF_MEMDIE is set on the thread waiting
> for mutex_lock(&"struct inode"->i_mutex) at xfs_file_buffered_aio_write()
> and I see no further progress.

Of course - TIF_MEMDIE doesn't do anything to the task that is
blocked, and the SIGKILL signal can't be delivered until the syscall
completes or the kernel code checks for pending signals and handles
EINTR directly. Mutexes are uninterruptible by design so there's no
EINTR processing, hence the oom killer cannot make progress when
everything is blocked on mutexes waiting for memory allocation to
succeed or fail.
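
(For illustration only, not a proposal for this particular lock: a
killable lock variant is what an interruptible path would need, e.g.:)

----------
	/*
	 * mutex_lock() sleeps uninterruptibly, so a SIGKILLed task
	 * stays stuck; mutex_lock_killable() returns -EINTR when a
	 * fatal signal is pending, letting the caller unwind.
	 */
	if (mutex_lock_killable(&inode->i_mutex))
		return -EINTR;	/* fatal signal pending, back out */
	/* ... critical section ... */
	mutex_unlock(&inode->i_mutex);
----------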

i.e. until the lock holder exits from direct memory reclaim and
releases the locks it holds, the oom killer will not be able to save
the system. IOWs, the problem is that memory allocation is not
failing when it should....

Focussing on the OOM killer here is the wrong way to solve this
problem - the problem that needs to be solved is sane handling of
OOM conditions to avoid needing to invoke the OOM-killer...

> I don't know where to examine next. Would you please teach me the command
> lines for the tracepoints to examine?

Tracepoints for what purpose?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2014-12-21 20:42                     ` Dave Chinner
@ 2014-12-22 16:57                       ` Michal Hocko
  2014-12-22 21:30                         ` Dave Chinner
  0 siblings, 1 reply; 276+ messages in thread
From: Michal Hocko @ 2014-12-22 16:57 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Tetsuo Handa, dchinner, linux-mm, rientjes, oleg

On Mon 22-12-14 07:42:49, Dave Chinner wrote:
[...]
> "memory reclaim gave up"? So why the hell isn't it returning a
> failure to the caller?
> 
> i.e. We have a perfectly good page cache allocation failure error
> path here all the way back to userspace, but we're invoking the
> OOM-killer to kill random processes rather than returning ENOMEM to
> the processes that are generating the memory demand?
> 
> Further: when did the oom-killer become the primary method
> of handling situations when memory allocation needs to fail?
> __GFP_WAIT does *not* mean memory allocation can't fail - that's what
> __GFP_NOFAIL means. And none of the page cache allocations use
> __GFP_NOFAIL, so why aren't we getting an allocation failure before
> the oom-killer is kicked?

Well, it has been an unwritten rule that GFP_KERNEL allocations for
low-order (<=PAGE_ALLOC_COSTLY_ORDER) never fail. This is a long ago
decision which would be tricky to fix now without silently breaking a
lot of code. Sad...
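
(From memory, the rule is encoded in the allocator's retry heuristic,
roughly as in the simplified, not verbatim, sketch below; low orders keep
looping unless the caller opts out:)

----------
/* Simplified sketch, not verbatim mm/page_alloc.c: this is where
 * the unwritten "never fail" rule lives. */
static int should_retry_sketch(gfp_t gfp_mask, unsigned int order)
{
	if (gfp_mask & __GFP_NORETRY)
		return 0;	/* caller explicitly opted out */
	if (order <= PAGE_ALLOC_COSTLY_ORDER)
		return 1;	/* low order: keep looping, never fail */
	return 0;		/* costly orders are allowed to fail */
}
----------
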
Nevertheless the caller can prevent an endless loop by using
__GFP_NORETRY so this could be used as a workaround. The default should
be opposite IMO and only those who really require some guarantee should
use a special flag for that purpose.
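
(i.e. a caller which can cope with failure would do something like the
sketch below; with the flag, the allocator bails out instead of looping:)

----------
	/* Sketch of the __GFP_NORETRY workaround: opt out of the
	 * retry loop and handle the failure in the caller. */
	struct page *page = alloc_page(GFP_KERNEL | __GFP_NORETRY);

	if (!page)
		return -ENOMEM;	/* propagate instead of looping */
----------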

> > I guess __alloc_pages_direct_reclaim() returns NULL with did_some_progress > 0
> > so that __alloc_pages_may_oom() will not be called easily. As long as
> > try_to_free_pages() returns non-zero, __alloc_pages_direct_reclaim() might
> > return NULL with did_some_progress > 0. So, do_try_to_free_pages() is called
> > many times and is likely to return non-zero. And when
> > __alloc_pages_may_oom() is called, TIF_MEMDIE is set on the thread waiting
> > for mutex_lock(&"struct inode"->i_mutex) at xfs_file_buffered_aio_write()
> > and I see no further progress.
> 
> Of course - TIF_MEMDIE doesn't do anything to the task that is
> blocked, and the SIGKILL signal can't be delivered until the syscall
> completes or the kernel code checks for pending signals and handles
> EINTR directly. Mutexes are uninterruptible by design so there's no
> EINTR processing, hence the oom killer cannot make progress when
> everything is blocked on mutexes waiting for memory allocation to
> succeed or fail.
> 
> i.e. until the lock holder exits from direct memory reclaim and
> releases the locks it holds, the oom killer will not be able to save
> the system. IOWs, the problem is that memory allocation is not
> failing when it should....
> 
> Focussing on the OOM killer here is the wrong way to solve this
> problem - the problem that needs to be solved is sane handling of
> OOM conditions to avoid needing to invoke the OOM-killer...

Completely agreed!

[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: [RFC PATCH] oom: Don't count on mm-less current process.
  2014-12-20 11:42                 ` Tetsuo Handa
@ 2014-12-22 20:25                   ` Michal Hocko
  2014-12-23  1:00                     ` Tetsuo Handa
  0 siblings, 1 reply; 276+ messages in thread
From: Michal Hocko @ 2014-12-22 20:25 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, linux-mm, rientjes, oleg

On Sat 20-12-14 20:42:08, Tetsuo Handa wrote:
[...]
> >From a2ebb5b873ec5af45e0bea9ea6da2a93c0f06c35 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Sat, 20 Dec 2014 20:05:14 +0900
> Subject: [PATCH] oom: Close race of setting TIF_MEMDIE to mm-less process.
> 
> exit_mm() and oom_kill_process() could race with regard to handling of
> TIF_MEMDIE flag if sequence described below occurred.
> 
> P1 calls out_of_memory(). out_of_memory() calls select_bad_process().
> select_bad_process() calls oom_scan_process_thread(P2). If P2->mm != NULL
> and task_will_free_mem(P2) == false, oom_scan_process_thread(P2) returns
> OOM_SCAN_OK. And if P2 is chosen as a victim task, select_bad_process()
> returns P2 after calling get_task_struct(P2). Then, P1 goes to sleep and
> P2 is woken up. P2 enters into do_exit() and gets PF_EXITING at exit_signals()
> and releases mm at exit_mm(). Then, P2 goes to sleep and P1 is woken up.
> P1 calls oom_kill_process(P2). oom_kill_process() sets TIF_MEMDIE on P2
> because task_will_free_mem(P2) == true due to PF_EXITING already set.
> Afterward, oom_scan_process_thread(P2) will return OOM_SCAN_ABORT because
> test_tsk_thread_flag(P2, TIF_MEMDIE) is checked before P2->mm is checked.
> 
> If TIF_MEMDIE was again set to P2, the OOM killer will be blocked by P2
> sitting in the final schedule() waiting for P2's parent to reap P2.
> It will trigger an OOM livelock if P2's parent is unable to reap P2 due to
> doing an allocation and waiting for the OOM killer to kill P2.
>
> To close this race window, clear TIF_MEMDIE if P2->mm == NULL after
> set_tsk_thread_flag(P2, TIF_MEMDIE) is done.

I do not think this patch is sufficient. P2 could pass exit_mm() right
after task_unlock in oom_kill_process and we would set TIF_MEMDIE to
this task as well. Something like the following should work and it
doesn't add any memory barrier trickery.
---

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2014-12-22 16:57                       ` Michal Hocko
@ 2014-12-22 21:30                         ` Dave Chinner
  2014-12-23  9:41                           ` Johannes Weiner
  0 siblings, 1 reply; 276+ messages in thread
From: Dave Chinner @ 2014-12-22 21:30 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tetsuo Handa, dchinner, linux-mm, rientjes, oleg

On Mon, Dec 22, 2014 at 05:57:36PM +0100, Michal Hocko wrote:
> On Mon 22-12-14 07:42:49, Dave Chinner wrote:
> [...]
> > "memory reclaim gave up"? So why the hell isn't it returning a
> > failure to the caller?
> > 
> > i.e. We have a perfectly good page cache allocation failure error
> > path here all the way back to userspace, but we're invoking the
> > OOM-killer to kill random processes rather than returning ENOMEM to
> > the processes that are generating the memory demand?
> > 
> > Further: when did the oom-killer become the primary method
> > of handling situations when memory allocation needs to fail?
> > __GFP_WAIT does *not* mean memory allocation can't fail - that's what
> > __GFP_NOFAIL means. And none of the page cache allocations use
> > __GFP_NOFAIL, so why aren't we getting an allocation failure before
> > the oom-killer is kicked?
> 
> Well, it has been an unwritten rule that GFP_KERNEL allocations for
> low-order (<=PAGE_ALLOC_COSTLY_ORDER) never fail. This is a long ago
> decision which would be tricky to fix now without silently breaking a
> lot of code. Sad...

Wow.

We have *always* been told memory allocations are not guaranteed to
succeed, ever, unless __GFP_NOFAIL is set, but that's deprecated and
nobody is allowed to use it any more.

Lots of code has dependencies on memory allocation making progress
or failing for the system to work in low memory situations. The page
cache is one of them, which means all filesystems have that
dependency. We don't explicitly ask memory allocations to fail, we
*expect* the memory allocation failures will occur in low memory
conditions. We've been designing and writing code with this in mind
for the past 15 years.

How did we get so far away from the message of "the memory allocator
never guarantees success" that it will never fail to allocate memory
even if it means we livelock the entire system?

> Nevertheless the caller can prevent an endless loop by using
> __GFP_NORETRY so this could be used as a workaround.

That's just a never-ending game of whack-a-mole that we will
continually lose. It's not a workable solution.

> The default should be opposite IMO and only those who really
> require some guarantee should use a special flag for that purpose.

Yup, totally agree.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: [RFC PATCH] oom: Don't count on mm-less current process.
  2014-12-22 20:25                   ` Michal Hocko
@ 2014-12-23  1:00                     ` Tetsuo Handa
  2014-12-23  9:51                       ` Michal Hocko
  0 siblings, 1 reply; 276+ messages in thread
From: Tetsuo Handa @ 2014-12-23  1:00 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, linux-mm, rientjes, oleg

Michal Hocko wrote:
> OOM killer tries to exlude tasks which do not have mm_struct associated
s/exlude/exclude/

> Fix this by checking task->mm and setting TIF_MEMDIE flag under task_lock
> which will serialize the OOM killer with exit_mm which sets task->mm to
> NULL.
Nice idea.

By the way, find_lock_task_mm(victim) may succeed if victim->mm == NULL and
one of the threads in the victim thread-group has a non-NULL mm. That case is
handled by the victim != p branch below. But where was p->signal->oom_score_adj
!= OOM_SCORE_ADJ_MIN checked? (In other words, don't we need a check like
t->mm && t->signal->oom_score_adj != OOM_SCORE_ADJ_MIN in find_lock_task_mm()
for the OOM-kill case?)

Also, why not call set_tsk_thread_flag() and do_send_sig_info() together
like below

 	p = find_lock_task_mm(victim);
 	if (!p) {
 		put_task_struct(victim);
 		return;
 	} else if (victim != p) {
 		get_task_struct(p);
 		put_task_struct(victim);
 		victim = p;
 	}
 
 	/* mm cannot safely be dereferenced after task_unlock(victim) */
 	mm = victim->mm;
+	set_tsk_thread_flag(victim, TIF_MEMDIE);
+	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
 		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
 		K(get_mm_counter(victim->mm, MM_FILEPAGES)));
 	task_unlock(victim);

rather than wait for the for_each_process() loop, in case the current task
goes to sleep immediately after task_unlock(victim)? Or is there a reason we
have been setting TIF_MEMDIE after the for_each_process() loop? If the reason
was to minimize the duration of the OOM killer being disabled due to
TIF_MEMDIE, shouldn't we do something like below?

 	rcu_read_unlock();
 
-	set_tsk_thread_flag(victim, TIF_MEMDIE);
+	task_lock(victim);
+	if (victim->mm)
+		set_tsk_thread_flag(victim, TIF_MEMDIE);
+	task_unlock(victim);
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 	put_task_struct(victim);

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2014-12-22 21:30                         ` Dave Chinner
@ 2014-12-23  9:41                           ` Johannes Weiner
  2014-12-24  1:06                             ` Dave Chinner
  0 siblings, 1 reply; 276+ messages in thread
From: Johannes Weiner @ 2014-12-23  9:41 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Michal Hocko, Tetsuo Handa, dchinner, linux-mm, rientjes, oleg,
	Andrew Morton, Linus Torvalds

On Tue, Dec 23, 2014 at 08:30:58AM +1100, Dave Chinner wrote:
> On Mon, Dec 22, 2014 at 05:57:36PM +0100, Michal Hocko wrote:
> > On Mon 22-12-14 07:42:49, Dave Chinner wrote:
> > [...]
> > > "memory reclaim gave up"? So why the hell isn't it returning a
> > > failure to the caller?
> > > 
> > > i.e. We have a perfectly good page cache allocation failure error
> > > path here all the way back to userspace, but we're invoking the
> > > OOM-killer to kill random processes rather than returning ENOMEM to
> > > the processes that are generating the memory demand?
> > > 
> > > Further: when did the oom-killer become the primary method
> > > of handling situations when memory allocation needs to fail?
> > > __GFP_WAIT does *not* mean memory allocation can't fail - that's what
> > > __GFP_NOFAIL means. And none of the page cache allocations use
> > > __GFP_NOFAIL, so why aren't we getting an allocation failure before
> > > the oom-killer is kicked?
> > 
> > Well, it has been an unwritten rule that GFP_KERNEL allocations for
> > low-order (<=PAGE_ALLOC_COSTLY_ORDER) never fail. This is a long ago
> > decision which would be tricky to fix now without silently breaking a
> > lot of code. Sad...
> 
> Wow.
> 
> We have *always* been told memory allocations are not guaranteed to
> succeed, ever, unless __GFP_NOFAIL is set, but that's deprecated and
> nobody is allowed to use it any more.
> 
> Lots of code has dependencies on memory allocation making progress
> or failing for the system to work in low memory situations. The page
> cache is one of them, which means all filesystems have that
> dependency. We don't explicitly ask memory allocations to fail, we
> *expect* the memory allocation failures will occur in low memory
> conditions. We've been designing and writing code with this in mind
> for the past 15 years.
> 
> How did we get so far away from the message of "the memory allocator
> never guarantees success" that it will never fail to allocate memory
> even if it means we livelock the entire system?

I think this isn't as much an allocation guarantee as it is based on
the thought that once we can't satisfy such low orders anymore the
system is so entirely unusable that the only remaining thing to do is
to kill processes one by one until the situation is resolved.

Hard to say, though, because this has been the behavior for longer
than the initial git import of the tree, without any code comment.

And yes, it's flawed, because the allocating task looping might be
what's holding up progress, as we can see here.

> > Nevertheless the caller can prevent an endless loop by using
> > __GFP_NORETRY so this could be used as a workaround.
> 
> That's just a never-ending game of whack-a-mole that we will
> continually lose. It's not a workable solution.

Agreed.

> > The default should be opposite IMO and only those who really
> > require some guarantee should use a special flag for that purpose.
> 
> Yup, totally agree.

So how about something like the following change?  It restricts the
allocator's endless OOM killing loop to __GFP_NOFAIL contexts, which
are annotated at the callsite and thus easier to review for locks etc.
Otherwise, the allocator tries only as long as page reclaim makes
progress, the idea being that failures are handled gracefully in the
callsites, and page faults restart automatically anyway.  The OOM
killing in that case is deferred to the end of the exception handler.
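
(In outline, the retry policy becomes something like the sketch below;
an illustration with placeholder helpers, not the patch itself:)

----------
	/*
	 * Illustration only (placeholder helper names): retry while
	 * reclaim makes progress; give up otherwise, unless the caller
	 * insisted on __GFP_NOFAIL, which keeps the OOM killing loop.
	 */
	do {
		progress = run_direct_reclaim();	/* placeholder */
		page = attempt_allocation();		/* placeholder */
		if (!page && !progress && (gfp_mask & __GFP_NOFAIL))
			oom_kill_and_keep_trying();	/* placeholder */
	} while (!page && (progress || (gfp_mask & __GFP_NOFAIL)));

	return page;	/* may be NULL; a page fault defers the OOM kill */
----------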

Preliminary testing confirms that the system is indeed trying just as
hard before OOM killing in the page fault case.  However, it doesn't
look like all callsites are prepared for failing smaller allocations:

[   55.553822] Out of memory: Kill process 240 (anonstress) score 158 or sacrifice child
[   55.561787] Killed process 240 (anonstress) total-vm:1540044kB, anon-rss:1284068kB, file-rss:468kB
[   55.571083] BUG: unable to handle kernel paging request at 00000000004006bd
[   55.578156] IP: [<00000000004006bd>] 0x4006bd
[   55.582584] PGD c8f3f067 PUD c8f48067 PMD c8f15067 PTE 0
[   55.588016] Oops: 0014 [#1] SMP 
[   55.591337] CPU: 1 PID: 240 Comm: anonstress Not tainted 3.18.0-mm1-00081-gf6137925fc97-dirty #188
[   55.600435] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./H61M-DGS, BIOS P1.30 05/10/2012
[   55.610030] task: ffff8802139b9a10 ti: ffff8800c8f64000 task.ti: ffff8800c8f64000
[   55.617623] RIP: 0033:[<00000000004006bd>]  [<00000000004006bd>] 0x4006bd
[   55.624512] RSP: 002b:00007fffd43b7220  EFLAGS: 00010206
[   55.629901] RAX: 00007f87e6e01000 RBX: 0000000000000000 RCX: 00007f87f64fe25a
[   55.637104] RDX: 00007f879881a000 RSI: 000000005dc00000 RDI: 0000000000000000
[   55.644331] RBP: 00007fffd43b7240 R08: 00000000ffffffff R09: 0000000000000000
[   55.651569] R10: 0000000000000022 R11: 0000000000000283 R12: 0000000000400570
[   55.658796] R13: 00007fffd43b7340 R14: 0000000000000000 R15: 0000000000000000
[   55.666040] FS:  00007f87f69d1700(0000) GS:ffff88021f280000(0000) knlGS:0000000000000000
[   55.674221] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   55.680055] CR2: 00007fdd676ad480 CR3: 00000000c8f3e000 CR4: 00000000000407e0
[   55.687272] 
[   55.688780] RIP  [<00000000004006bd>] 0x4006bd
[   55.693304]  RSP <00007fffd43b7220>
[   55.696850] CR2: 00000000004006bd
[   55.700207] ---[ end trace b9cb4f44f8e47bc3 ]---
[   55.704903] Kernel panic - not syncing: Fatal exception
[   55.710208] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[   55.720517] Rebooting in 30 seconds..

Obvious bugs aside, though, the thought of failing order-0 allocations
after such a long time is scary...

---

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: [RFC PATCH] oom: Don't count on mm-less current process.
  2014-12-23  1:00                     ` Tetsuo Handa
@ 2014-12-23  9:51                       ` Michal Hocko
  2014-12-23 11:46                         ` Tetsuo Handa
  0 siblings, 1 reply; 276+ messages in thread
From: Michal Hocko @ 2014-12-23  9:51 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, linux-mm, rientjes, oleg

On Tue 23-12-14 10:00:00, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > OOM killer tries to exlude tasks which do not have mm_struct associated
> s/exlude/exclude/

Fixed

> > Fix this by checking task->mm and setting TIF_MEMDIE flag under task_lock
> > which will serialize the OOM killer with exit_mm which sets task->mm to
> > NULL.
> Nice idea.
> 
> By the way, find_lock_task_mm(victim) may succeed if victim->mm == NULL and
> one of the threads in the victim thread-group has a non-NULL mm. That case is
> handled by the victim != p branch below. But where was p->signal->oom_score_adj
> != OOM_SCORE_ADJ_MIN checked?
>
> (In other words, don't we need a check like
> t->mm && t->signal->oom_score_adj != OOM_SCORE_ADJ_MIN in find_lock_task_mm()
> for the OOM-kill case?)

oom_score_adj is shared between threads.

> Also, why not call set_tsk_thread_flag() and do_send_sig_info() together
> like below

What would be an advantage? I am not really sure whether the two locks
might nest as well.

>  	p = find_lock_task_mm(victim);
>  	if (!p) {
>  		put_task_struct(victim);
>  		return;
>  	} else if (victim != p) {
>  		get_task_struct(p);
>  		put_task_struct(victim);
>  		victim = p;
>  	}
>  
>  	/* mm cannot safely be dereferenced after task_unlock(victim) */
>  	mm = victim->mm;
> +	set_tsk_thread_flag(victim, TIF_MEMDIE);
> +	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
>  	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
>  		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
>  		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
>  		K(get_mm_counter(victim->mm, MM_FILEPAGES)));
>  	task_unlock(victim);
> 
> rather than wait for the for_each_process() loop, in case the current task
> goes to sleep immediately after task_unlock(victim)? Or is there a reason we
> have been setting TIF_MEMDIE after the for_each_process() loop? If the reason
> was to minimize the duration of the OOM killer being disabled due to
> TIF_MEMDIE, shouldn't we do something like below?

No, the global parallel OOM killer is disabled by the oom zonelist lock at
this moment for most paths, so setting TIF_MEMDIE a little bit earlier
doesn't make any difference.

>  	rcu_read_unlock();
>  
> -	set_tsk_thread_flag(victim, TIF_MEMDIE);
> +	task_lock(victim);
> +	if (victim->mm)
> +		set_tsk_thread_flag(victim, TIF_MEMDIE);
> +	task_unlock(victim);
>  	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
>  	put_task_struct(victim);

This would work as well but I am not sure it is much nicer. It is
the find_lock_task_mm() part which determines the final victim, so setting
TIF_MEMDIE there is logical.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: [RFC PATCH] oom: Don't count on mm-less current process.
  2014-12-23  9:51                       ` Michal Hocko
@ 2014-12-23 11:46                         ` Tetsuo Handa
  2014-12-23 11:57                           ` Tetsuo Handa
  2014-12-23 12:24                           ` Michal Hocko
  0 siblings, 2 replies; 276+ messages in thread
From: Tetsuo Handa @ 2014-12-23 11:46 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, linux-mm, rientjes, oleg

Michal Hocko wrote:
> > Also, why not call set_tsk_thread_flag() and do_send_sig_info() together
> > like below
> 
> What would be an advantage? I am not really sure whether the two locks
> might nest as well.

I imagined that current thread sets TIF_MEMDIE on a victim thread, then
sleeps for 30 seconds immediately after task_unlock() (it's an overdone
delay), and finally sets SIGKILL on that victim thread. If such a delay
happened, that victim thread is free to abuse TIF_MEMDIE for that period.
Thus, I thought sending SIGKILL followed by setting TIF_MEMDIE is better.

 	rcu_read_unlock();
 
-	set_tsk_thread_flag(victim, TIF_MEMDIE);
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
+	task_lock(victim);
+	if (victim->mm)
+		set_tsk_thread_flag(victim, TIF_MEMDIE);
+	task_unlock(victim);
 	put_task_struct(victim);

If such a delay is theoretically impossible, I'm OK with your patch.

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: [RFC PATCH] oom: Don't count on mm-less current process.
  2014-12-23 11:46                         ` Tetsuo Handa
@ 2014-12-23 11:57                           ` Tetsuo Handa
  2014-12-23 12:12                             ` Tetsuo Handa
  2014-12-23 12:27                             ` Michal Hocko
  2014-12-23 12:24                           ` Michal Hocko
  1 sibling, 2 replies; 276+ messages in thread
From: Tetsuo Handa @ 2014-12-23 11:57 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, linux-mm, rientjes, oleg

Tetsuo Handa wrote:
> If such a delay is theoretically impossible, I'm OK with your patch.
> 

Oops, I forgot to mention that task_unlock(p) should be called before
put_task_struct(p), in case p->usage == 1 at put_task_struct(p).

 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
-	if (task_will_free_mem(p)) {
+	task_lock(p);
+	if (p->mm && task_will_free_mem(p)) {
 		set_tsk_thread_flag(p, TIF_MEMDIE);
 		put_task_struct(p);
+		task_unlock(p);
 		return;
 	}
+	task_unlock(p);
 
 	if (__ratelimit(&oom_rs))
 		dump_header(p, gfp_mask, order, memcg, nodemask);

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: [RFC PATCH] oom: Don't count on mm-less current process.
  2014-12-23 11:57                           ` Tetsuo Handa
@ 2014-12-23 12:12                             ` Tetsuo Handa
  2014-12-23 12:27                             ` Michal Hocko
  1 sibling, 0 replies; 276+ messages in thread
From: Tetsuo Handa @ 2014-12-23 12:12 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, linux-mm, rientjes, oleg

Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > If such a delay is theoretically impossible, I'm OK with your patch.
> > 
> 
> Oops, I forgot to mention that task_unlock(p) should be called before
> put_task_struct(p), in case p->usage == 1 at put_task_struct(p).
> 
After all, something like below?
----------------------------------------
>From 63e9317553688944e27b6054ccc059b82064605e Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Tue, 23 Dec 2014 21:04:43 +0900
Subject: [PATCH] oom: Make sure that TIF_MEMDIE is set under task_lock

OOM killer tries to exclude tasks which do not have mm_struct associated
because killing such a task wouldn't help much. The OOM victim gets
TIF_MEMDIE set to disable OOM killer while the current victim releases
the memory and then enables the OOM killer again by dropping the flag.

oom_kill_process is currently prone to a race condition when the OOM
victim is already exiting and TIF_MEMDIE is set after the task has
released its address space. This might theoretically lead to OOM
livelock if the OOM victim blocks on an allocation later during exiting
because it wouldn't kill any other process and the exiting one won't be
able to exit. The situation is highly unlikely because the OOM victim is
expected to release some memory which should help to sort out OOM
situation.

Fix this by checking task->mm and setting TIF_MEMDIE flag under task_lock
which will serialize the OOM killer with exit_mm which sets task->mm to
NULL. Also, reverse the order of sending SIGKILL and setting TIF_MEMDIE
so that preemption will not allow the victim task to abuse TIF_MEMDIE.

Setting the flag for current is not necessary because check and set is
not racy.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/oom_kill.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d503e9c..91079ec 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -438,11 +438,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
-	if (task_will_free_mem(p)) {
-		set_tsk_thread_flag(p, TIF_MEMDIE);
-		put_task_struct(p);
-		return;
-	}
+	if (task_will_free_mem(victim))
+		goto set_memdie_flag;
 
 	if (__ratelimit(&oom_rs))
 		dump_header(p, gfp_mask, order, memcg, nodemask);
@@ -522,8 +519,12 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		}
 	rcu_read_unlock();
 
-	set_tsk_thread_flag(victim, TIF_MEMDIE);
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
+ set_memdie_flag:
+	task_lock(victim);
+	if (victim->mm)
+		set_tsk_thread_flag(victim, TIF_MEMDIE);
+	task_unlock(victim);
 	put_task_struct(victim);
 }
 #undef K
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 276+ messages in thread

* Re: [RFC PATCH] oom: Don't count on mm-less current process.
  2014-12-23 11:46                         ` Tetsuo Handa
  2014-12-23 11:57                           ` Tetsuo Handa
@ 2014-12-23 12:24                           ` Michal Hocko
  2014-12-23 13:00                             ` Tetsuo Handa
  1 sibling, 1 reply; 276+ messages in thread
From: Michal Hocko @ 2014-12-23 12:24 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, linux-mm, rientjes, oleg

On Tue 23-12-14 20:46:07, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > Also, why not call set_tsk_thread_flag() and do_send_sig_info() together
> > > like below
> > 
> > What would be an advantage? I am not really sure whether the two locks
> > might nest as well.
> 
> I imagined that current thread sets TIF_MEMDIE on a victim thread, then
> sleeps for 30 seconds immediately after task_unlock() (it's an overdone
> delay),

Only if the current task was preempted for such a long time. Which
doesn't sound too probable to me.

> and finally sets SIGKILL on that victim thread. If such a delay
> happened, that victim thread is free to abuse TIF_MEMDIE for that period.
> Thus, I thought sending SIGKILL followed by setting TIF_MEMDIE is better.

I don't know, I can hardly find a scenario where it would make any
difference in real life. If the victim needs to allocate memory to
finish then it would trigger OOM again and have to wait/loop until this
OOM killer releases the oom zonelist lock just to find out it already
has TIF_MEMDIE set and can dive into memory reserves. Which way is more
correct is a question but I wouldn't change it without having a really
good reason. This whole code is subtle already, let's not make it even
more so.

> 
>  	rcu_read_unlock();
>  
> -	set_tsk_thread_flag(victim, TIF_MEMDIE);
>  	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
> +	task_lock(victim);
> +	if (victim->mm)
> +		set_tsk_thread_flag(victim, TIF_MEMDIE);
> +	task_unlock(victim);
>  	put_task_struct(victim);
> 
> If such a delay is theoretically impossible, I'm OK with your patch.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: [RFC PATCH] oom: Don't count on mm-less current process.
  2014-12-23 11:57                           ` Tetsuo Handa
  2014-12-23 12:12                             ` Tetsuo Handa
@ 2014-12-23 12:27                             ` Michal Hocko
  1 sibling, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2014-12-23 12:27 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, linux-mm, rientjes, oleg

On Tue 23-12-14 20:57:23, Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > If such a delay is theoretically impossible, I'm OK with your patch.
> > 
> 
> Oops, I forgot to mention that task_unlock(p) should be called before
> put_task_struct(p), in case p->usage == 1 at put_task_struct(p).

True. It would be quite surprising to see p->mm != NULL if the OOM
killer was the only one to hold a reference to the task. So it shouldn't
make any difference AFAICS. It is a good practice to change that though.
Fixed.

[...]

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: [RFC PATCH] oom: Don't count on mm-less current process.
  2014-12-23 12:24                           ` Michal Hocko
@ 2014-12-23 13:00                             ` Tetsuo Handa
  2014-12-23 13:09                               ` Michal Hocko
  0 siblings, 1 reply; 276+ messages in thread
From: Tetsuo Handa @ 2014-12-23 13:00 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, linux-mm, rientjes, oleg

Michal Hocko wrote:
> > and finally sets SIGKILL on that victim thread. If such a delay
> > happened, that victim thread is free to abuse TIF_MEMDIE for that period.
> > Thus, I thought sending SIGKILL followed by setting TIF_MEMDIE is better.
> 
> I don't know, I can hardly find a scenario where it would make any
> difference in real life. If the victim needs to allocate memory to
> finish then it would trigger OOM again and have to wait/loop until this
> OOM killer releases the oom zonelist lock just to find out it already
> has TIF_MEMDIE set and can dive into memory reserves. Which way is more
> correct is a question but I wouldn't change it without having a really
> good reason. This whole code is subtle already, let's not make it even
> more so.

gfp_to_alloc_flags() in mm/page_alloc.c sets ALLOC_NO_WATERMARKS if
the victim task has TIF_MEMDIE flag, doesn't it?

        if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
                if (gfp_mask & __GFP_MEMALLOC)
                        alloc_flags |= ALLOC_NO_WATERMARKS;
                else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
                        alloc_flags |= ALLOC_NO_WATERMARKS;
                else if (!in_interrupt() &&
                                ((current->flags & PF_MEMALLOC) ||
                                 unlikely(test_thread_flag(TIF_MEMDIE))))
                        alloc_flags |= ALLOC_NO_WATERMARKS;
        }

Then, I think deferring SIGKILL might widen race window for abusing TIF_MEMDIE.

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: [RFC PATCH] oom: Don't count on mm-less current process.
  2014-12-23 13:00                             ` Tetsuo Handa
@ 2014-12-23 13:09                               ` Michal Hocko
  2014-12-23 13:20                                 ` Tetsuo Handa
  0 siblings, 1 reply; 276+ messages in thread
From: Michal Hocko @ 2014-12-23 13:09 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, linux-mm, rientjes, oleg

On Tue 23-12-14 22:00:52, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > and finally sets SIGKILL on that victim thread. If such a delay
> > > happened, that victim thread is free to abuse TIF_MEMDIE for that period.
> > > Thus, I thought sending SIGKILL followed by setting TIF_MEMDIE is better.
> > 
> > I don't know, I can hardly find a scenario where it would make any
> > difference in real life. If the victim needs to allocate memory to
> > finish then it would trigger OOM again and have to wait/loop until this
> > OOM killer releases the oom zonelist lock just to find out it already
> > has TIF_MEMDIE set and can dive into memory reserves. Which way is more
> > correct is a question but I wouldn't change it without having a really
> > good reason. This whole code is subtle already, let's not make it even
> > more so.
> 
> gfp_to_alloc_flags() in mm/page_alloc.c sets ALLOC_NO_WATERMARKS if
> the victim task has TIF_MEMDIE flag, doesn't it?

This is the whole point of TIF_MEMDIE.

[...]

> Then, I think deferring SIGKILL might widen race window for abusing TIF_MEMDIE.

How would it abuse the flag? The OOM victim has to die and if it needs
to allocate then we have to allow it to do so otherwise the whole
exercise was pointless. fatal_signal_pending check is not so widespread
in the kernel that the task would notice it immediately.
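
(i.e. only code paths which explicitly do something like the short sketch
below would bail out quickly:)

----------
	/* Sketch: the explicit check a long-running kernel path needs
	 * in order to react to SIGKILL promptly. */
	if (fatal_signal_pending(current))
		return -EINTR;
----------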

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: [RFC PATCH] oom: Don't count on mm-less current process.
  2014-12-23 13:09                               ` Michal Hocko
@ 2014-12-23 13:20                                 ` Tetsuo Handa
  2014-12-23 13:43                                   ` Michal Hocko
  0 siblings, 1 reply; 276+ messages in thread
From: Tetsuo Handa @ 2014-12-23 13:20 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, linux-mm, rientjes, oleg

Michal Hocko wrote:
> On Tue 23-12-14 22:00:52, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > > and finally sets SIGKILL on that victim thread. If such a delay
> > > > happened, that victim thread is free to abuse TIF_MEMDIE for that period.
> > > > Thus, I thought sending SIGKILL followed by setting TIF_MEMDIE is better.
> > > 
> > > I don't know, I can hardly find a scenario where it would make any
> > > difference in real life. If the victim needs to allocate memory to
> > > finish then it would trigger OOM again and have to wait/loop until this
> > > OOM killer releases the oom zonelist lock just to find out it already
> > > has TIF_MEMDIE set and can dive into memory reserves. Which way is more
> > > correct is a question but I wouldn't change it without having a really
> > > good reason. This whole code is subtle already, let's not make it even
> > > more so.
> > 
> > gfp_to_alloc_flags() in mm/page_alloc.c sets ALLOC_NO_WATERMARKS if
> > the victim task has TIF_MEMDIE flag, doesn't it?
> 
> This is the whole point of TIF_MEMDIE.
> 
> [...]
> 
> > Then, I think deferring SIGKILL might widen race window for abusing TIF_MEMDIE.
> 
> How would it abuse the flag? The OOM victim has to die and if it needs
> to allocate then we have to allow it to do so otherwise the whole
> exercise was pointless. fatal_signal_pending check is not so widespread
> in the kernel that the task would notice it immediately.

I'm talking about the possible delay between the time TIF_MEMDIE is set on
the victim and the time SIGKILL is delivered to the victim. Why does the
victim have to die before receiving SIGKILL? The victim can access memory
reserves until SIGKILL is delivered, can't it?

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: [RFC PATCH] oom: Don't count on mm-less current process.
  2014-12-23 13:20                                 ` Tetsuo Handa
@ 2014-12-23 13:43                                   ` Michal Hocko
  2014-12-23 14:11                                     ` Tetsuo Handa
  0 siblings, 1 reply; 276+ messages in thread
From: Michal Hocko @ 2014-12-23 13:43 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, linux-mm, rientjes, oleg

On Tue 23-12-14 22:20:57, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 23-12-14 22:00:52, Tetsuo Handa wrote:
[...]
> > > Then, I think deferring SIGKILL might widen race window for abusing TIF_MEMDIE.
> > 
> > How would it abuse the flag? The OOM victim has to die and if it needs
> > to allocate then we have to allow it to do so otherwise the whole
> > exercise was pointless. fatal_signal_pending check is not so widespread
> > in the kernel that the task would notice it immediately.
> 
> I'm talking about the possible delay between the time TIF_MEMDIE is set on
> the victim and the time SIGKILL is delivered to the victim.

I can read what you wrote. You are just ignoring my questions it seems
because I haven't got any reason _why it matters_. My point was that the
victim might be looping in the kernel and doing other allocations until
it notices it has fatal_signal_pending and bail out. So the delay
between setting the flag and sending the signal is not that important
AFAICS.

> Why does the victim have to die before receiving SIGKILL?

It has to die to resolve the current OOM condition. I haven't written
anything about dying before receiving SIGKILL.

> The victim can access memory reserves until SIGKILL is delivered,
> can't it?

And why does that matter? It would have to do such an allocation anyway
because it wouldn't proceed without it... And the only difference
between having the flag and not having it is that the allocation has a
higher chance to succeed with the flag, so it will not trigger the OOM
killer again right away. Do you see the point, or am I missing something here?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: [RFC PATCH] oom: Don't count on mm-less current process.
  2014-12-23 13:43                                   ` Michal Hocko
@ 2014-12-23 14:11                                     ` Tetsuo Handa
  2014-12-23 14:57                                       ` Michal Hocko
  0 siblings, 1 reply; 276+ messages in thread
From: Tetsuo Handa @ 2014-12-23 14:11 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, linux-mm, rientjes, oleg

Michal Hocko wrote:
> On Tue 23-12-14 22:20:57, Tetsuo Handa wrote:
> > I'm talking about possible delay between TIF_MEMDIE was set on the victim
> > and SIGKILL is delivered to the victim.
> 
> I can read what you wrote. You are just ignoring my questions it seems
> because I haven't got any reason _why it matters_. My point was that the
> victim might be looping in the kernel and doing other allocations until
> it notices it has fatal_signal_pending and bail out. So the delay
> between setting the flag and sending the signal is not that important
> AFAICS.

My point is that the victim might not be looping in the kernel
when it gets TIF_MEMDIE.

Situation:

  P1: A process who called the OOM killer
  P2: A process who is chosen by the OOM killer

  P2 is running a program shown below.
----------
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

int main(int argc, char *argv[])
{
	const int fd = open("/dev/zero", O_RDONLY);
	char *buf = malloc(1024 * 1048576);
	if (fd == -1 || !buf)
		return 1;
	memset(buf, 0, 512 * 1048576);
	sleep(10);
	read(fd, buf, 1024 * 1048576);
	return 0;
}
----------

Sequence:

  (1) P2 is sleeping at sleep(10).
  (2) P1 triggers the OOM killer and P2 is chosen.
  (3) The OOM killer sets TIF_MEMDIE on P2.
  (4) P2 wakes up as sleep(10) expired.
  (5) P2 calls read().
  (6) P2 triggers page fault inside read().
  (7) P2 allocates from memory reserves for handling page fault.
  (8) The OOM killer sends SIGKILL to P2.
  (9) P2 receives SIGKILL after all memory reserves were
      allocated for handling page fault.
  (10) P2 starts steps for die, but memory reserves may be
       already empty.

My worry:

  The longer the delay between (3) and (8) becomes (e.g. 30 seconds
  in an extreme case), the more likely memory reserves are to be
  consumed before (9). If (3) and (8) are reversed, P2 will notice
  fatal_signal_pending() and bail out before allocating a lot of
  memory from memory reserves.

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: [RFC PATCH] oom: Don't count on mm-less current process.
  2014-12-23 14:11                                     ` Tetsuo Handa
@ 2014-12-23 14:57                                       ` Michal Hocko
  0 siblings, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2014-12-23 14:57 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, linux-mm, rientjes, oleg

On Tue 23-12-14 23:11:01, Tetsuo Handa wrote:
[...]
>   (1) P2 is sleeping at sleep(10).
>   (2) P1 triggers the OOM killer and P2 is chosen.
>   (3) The OOM killer sets TIF_MEMDIE on P2.
>   (4) P2 wakes up as sleep(10) expired.
>   (5) P2 calls read().
>   (6) P2 triggers page fault inside read().
>   (7) P2 allocates from memory reserves for handling page fault.
>   (8) The OOM killer sends SIGKILL to P2.
>   (9) P2 receives SIGKILL after all memory reserves were
>       allocated for handling page fault.
>   (10) P2 starts steps for die, but memory reserves may be
>        already empty.

How is that any different from any other task which allocates with
TIF_MEMDIE already set and fatal_signal_pending without checking for
the latter?
 
> My worry:
> 
>   The longer the delay between (3) and (8) becomes (e.g. 30 seconds
>   in an extreme case), the more likely memory reserves are to be
>   consumed before (9). If (3) and (8) are reversed, P2 will notice
>   fatal_signal_pending() and bail out before allocating a lot of
>   memory from memory reserves.

And my suspicion is that this has never been a real problem and I really
do not like to fiddle with the code for non-existing problems. If
you are sure that the reverse order is correct and doesn't cause any
other issues then you are free to send a separate patch with a proper
justification. The patch I've posted fixes a different problem and
putting more stuff in it is just not right! I really hate how you
conflate different issues all the time, TBH.

Thanks!
-- 
Michal Hocko
SUSE Labs


* Re: How to handle TIF_MEMDIE stalls?
  2014-12-23  9:41                           ` Johannes Weiner
@ 2014-12-24  1:06                             ` Dave Chinner
  2014-12-24  2:40                               ` Linus Torvalds
  0 siblings, 1 reply; 276+ messages in thread
From: Dave Chinner @ 2014-12-24  1:06 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Tetsuo Handa, dchinner, linux-mm, rientjes, oleg,
	Andrew Morton, Linus Torvalds

On Tue, Dec 23, 2014 at 04:41:32AM -0500, Johannes Weiner wrote:
> On Tue, Dec 23, 2014 at 08:30:58AM +1100, Dave Chinner wrote:
> > On Mon, Dec 22, 2014 at 05:57:36PM +0100, Michal Hocko wrote:
> > > On Mon 22-12-14 07:42:49, Dave Chinner wrote:
> > > [...]
> > > > "memory reclaim gave up"? So why the hell isn't it returning a
> > > > failure to the caller?
> > > > 
> > > > i.e. We have a perfectly good page cache allocation failure error
> > > > path here all the way back to userspace, but we're invoking the
> > > > OOM-killer to kill random processes rather than returning ENOMEM to
> > > > the processes that are generating the memory demand?
> > > > 
> > > > Further: when did the oom-killer become the primary method
> > > > of handling situations when memory allocation needs to fail?
> > > > __GFP_WAIT does *not* mean memory allocation can't fail - that's what
> > > > __GFP_NOFAIL means. And none of the page cache allocations use
> > > > __GFP_NOFAIL, so why aren't we getting an allocation failure before
> > > > the oom-killer is kicked?
> > > 
> > > Well, it has been an unwritten rule that GFP_KERNEL allocations for
> > > low-order (<=PAGE_ALLOC_COSTLY_ORDER) never fail. This is a long ago
> > > decision which would be tricky to fix now without silently breaking a
> > > lot of code. Sad...
> > 
> > Wow.
> > 
> > We have *always* been told memory allocations are not guaranteed to
> > succeed, ever, unless __GFP_NOFAIL is set, but that's deprecated and
> > nobody is allowed to use it any more.
> > 
> > Lots of code has dependencies on memory allocation making progress
> > or failing for the system to work in low memory situations. The page
> > cache is one of them, which means all filesystems have that
> > dependency. We don't explicitly ask memory allocations to fail, we
> > *expect* the memory allocation failures will occur in low memory
> > conditions. We've been designing and writing code with this in mind
> > for the past 15 years.
> > 
> > How did we get so far away from the message of "the memory allocator
> > never guarantees success" that it will never fail to allocate memory
> > even if it means we livelock the entire system?
> 
> I think this isn't as much an allocation guarantee as it is based on
> the thought that once we can't satisfy such low orders anymore the
> system is so entirely unusable that the only remaining thing to do is
> to kill processes one by one until the situation is resolved.
> 
> Hard to say, though, because this has been the behavior for longer
> than the initial git import of the tree, without any code comment.
> 
> And yes, it's flawed, because the allocating task looping might be
> what's holding up progress, as we can see here.

Worse, it can be the task that is consuming all the memory, as can be
seen by this failure on xfs/084 on my single CPU, 1GB RAM VM. This
test has been failing like this about 30% of the time since 3.18-rc1:

[ 4083.059309] Mem-Info:
[ 4083.059693] Node 0 DMA per-cpu:
[ 4083.060246] CPU    0: hi:    0, btch:   1 usd:   0
[ 4083.061041] Node 0 DMA32 per-cpu:
[ 4083.061612] CPU    0: hi:  186, btch:  31 usd:  50
[ 4083.062407] active_anon:119604 inactive_anon:119575 isolated_anon:0
[ 4083.062407]  active_file:29 inactive_file:58 isolated_file:0
[ 4083.062407]  unevictable:0 dirty:0 writeback:0 unstable:0
[ 4083.062407]  free:1953 slab_reclaimable:2881 slab_unreclaimable:2484
[ 4083.062407]  mapped:27 shmem:2 pagetables:928 bounce:0
[ 4083.062407]  free_cma:0
[ 4083.067475] Node 0 DMA free:3924kB min:60kB low:72kB high:88kB active_anon:5612kB inactive_anon:5792kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(as
[ 4083.073986] lowmem_reserve[]: 0 966 966 966
[ 4083.074808] Node 0 DMA32 free:3888kB min:3944kB low:4928kB high:5916kB active_anon:472804kB inactive_anon:472508kB active_file:116kB inactive_file:232kB unevictabls
[ 4083.081570] lowmem_reserve[]: 0 0 0 0
[ 4083.082268] Node 0 DMA: 7*4kB (U) 9*8kB (UM) 7*16kB (UM) 4*32kB (U) 4*64kB (U) 2*128kB (U) 2*256kB (UM) 1*512kB (M) 0*1024kB 1*2048kB (R) 0*4096kB = 3924kB
[ 4083.084829] Node 0 DMA32: 16*4kB (U) 0*8kB 1*16kB (R) 1*32kB (R) 1*64kB (R) 1*128kB (R) 0*256kB 1*512kB (R) 1*1024kB (R) 1*2048kB (R) 0*4096kB = 3888kB
[ 4083.087287] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 4083.088657] 47956 total pagecache pages
[ 4083.089275] 47858 pages in swap cache
[ 4083.089856] Swap cache stats: add 416328, delete 368470, find 818589/929518
[ 4083.090941] Free swap  = 0kB
[ 4083.091398] Total swap = 497976kB
[ 4083.091923] 262044 pages RAM
[ 4083.092405] 0 pages HighMem/MovableOnly
[ 4083.093016] 10167 pages reserved
[ 4083.093528] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[ 4083.094749] [ 1195]     0  1195     5992       24      16      152         -1000 udevd
[ 4083.095981] [ 1326]     0  1326     5991       50      15      128         -1000 udevd
[ 4083.097224] [ 3835]     0  3835     2529        0       6      573         -1000 dhclient
[ 4083.098497] [ 3886]     0  3886    13099        0      27      153         -1000 sshd
[ 4083.099716] [ 3892]     0  3892    25770        1      52      233         -1000 sshd
[ 4083.100939] [ 3970]  1000  3970    25770        8      50      227         -1000 sshd
[ 4083.102164] [ 3971]  1000  3971     5276        1      14      493         -1000 bash
[ 4083.103386] [ 4062]     0  4062    16887        1      36      118         -1000 sudo
[ 4083.104667] [ 4063]     0  4063     3044      192      10      162         -1000 check
[ 4083.105952] [ 6708]     0  6708     5991       35      15      143         -1000 udevd
[ 4083.107244] [18113]     0 18113     2584        1       9      288         -1000 084
[ 4083.108517] [18317]     0 18317   316605   191037     623   121971         -1000 resvtest
[ 4083.109852] [18318]     0 18318     2584        0       9      288         -1000 084
[ 4083.111117] [18319]     0 18319     2584        0       9      288         -1000 084
[ 4083.112431] [18320]     0 18320     3258        0      11       36         -1000 sed
[ 4083.113692] [18321]     0 18321     3258        0      11       36         -1000 sed
[ 4083.114950] Kernel panic - not syncing: Out of memory and no killable processes...
[ 4083.114950]
[ 4083.116420] CPU: 0 PID: 18317 Comm: resvtest Not tainted 3.19.0-rc1-dgc+ #650
[ 4083.116423] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 4083.116423]  ffffffff823357a0 ffff88003d98faa8 ffffffff81d87acb 0000000000008686
[ 4083.116423]  ffffffff8219b348 ffff88003d98fb28 ffffffff81d813c1 000000000000000b
[ 4083.116423]  0000000000000008 ffff88003d98fb38 ffff88003d98fad8 0000000000000000
[ 4083.116423] Call Trace:
[ 4083.116423]  [<ffffffff81d87acb>] dump_stack+0x45/0x57
[ 4083.116423]  [<ffffffff81d813c1>] panic+0xc1/0x1eb
[ 4083.116423]  [<ffffffff81174dea>] out_of_memory+0x4fa/0x500
[ 4083.116423]  [<ffffffff81179969>] __alloc_pages_nodemask+0x7a9/0x8a0
[ 4083.116423]  [<ffffffff811b8c77>] alloc_pages_vma+0x97/0x160
[ 4083.116423]  [<ffffffff8119b0c3>] handle_mm_fault+0x963/0xc20
[ 4083.116423]  [<ffffffff814ec802>] ? xfs_file_buffered_aio_write+0x1e2/0x240
[ 4083.116423]  [<ffffffff8108bf24>] __do_page_fault+0x1b4/0x570
[ 4083.116423]  [<ffffffff8119f5e1>] ? vma_merge+0x211/0x330
[ 4083.116423]  [<ffffffff811a0808>] ? do_brk+0x268/0x350
[ 4083.116423]  [<ffffffff8108c395>] trace_do_page_fault+0x45/0x100
[ 4083.116423]  [<ffffffff8108778e>] do_async_page_fault+0x1e/0xd0
[ 4083.116423]  [<ffffffff81d946f8>] async_page_fault+0x28/0x30
[ 4083.116423] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)

This needs to fail the allocation so that the process consuming all
the memory fails the page fault and SEGVs. Otherwise the OOM-killer
just runs wild killing everything else in the system until there's
nothing left to kill and the system panics.
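
Something like the following (a hedged sketch of the policy I'm asking
for, with a hypothetical function name, not the actual arch fault
handler):
----------
#include <linux/sched.h>
#include <asm/ptrace.h>

/*
 * Sketch: a user-mode fault whose allocation failed kills the
 * faulting task with SIGSEGV instead of invoking the OOM killer
 * on unrelated processes.
 */
static void example_handle_fault_oom(struct pt_regs *regs)
{
	if (user_mode(regs)) {
		force_sig(SIGSEGV, current);	/* the memory hog pays */
		return;
	}
	/* kernel-mode faults would still need the OOM/panic path */
}
----------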

> > > The default should be opposite IMO and only those who really
> > > require some guarantee should use a special flag for that purpose.
> > 
> > Yup, totally agree.
> 
> So how about something like the following change?  It restricts the
> allocator's endless OOM killing loop to __GFP_NOFAIL contexts, which
> are annotated in the callsite and thus easier to review for locks etc.
> Otherwise, the allocator tries only as long as page reclaim makes
> progress, the idea being that failures are handled gracefully in the
> callsites, and page faults restart automatically anyway.  The OOM
> killing in that case is deferred to the end of the exception handler.
> 
> Preliminary testing confirms that the system is indeed trying just as
> hard before OOM killing in the page fault case.  However, it doesn't
> look like all callsites are prepared for failing smaller allocations:

Then we need to fix those bugs.

> [   55.553822] Out of memory: Kill process 240 (anonstress) score 158 or sacrifice child
> [   55.561787] Killed process 240 (anonstress) total-vm:1540044kB, anon-rss:1284068kB, file-rss:468kB
> [   55.571083] BUG: unable to handle kernel paging request at 00000000004006bd
> [   55.578156] IP: [<00000000004006bd>] 0x4006bd

That's an offset of >4MB from a null pointer. Doesn't seem likely
that it's caused by a failure of an order-0 allocation. The lack of
a stack trace is worrying, though....

> Obvious bugs aside, though, the thought of failing order-0 allocations
> after such a long time is scary...

The reliance on the OOM-killer to save the system from memory
starvation when users put the page cache under pressure via write(2)
is even scarier, IMO.

> ---
> From 0b204ee379aa5502a1c4dce5df51de96448b5163 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Mon, 22 Dec 2014 17:16:43 -0500
> Subject: [patch] mm: page_alloc: avoid page allocation vs. OOM killing
>  deadlock

Remind me to test whatever you've come up with in a couple of weeks
after the xmas break, though it's more likely to be late January
before I'll get to it given LCA will be keeping me busy in the new
year...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: How to handle TIF_MEMDIE stalls?
  2014-12-24  1:06                             ` Dave Chinner
@ 2014-12-24  2:40                               ` Linus Torvalds
  0 siblings, 0 replies; 276+ messages in thread
From: Linus Torvalds @ 2014-12-24  2:40 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Johannes Weiner, Michal Hocko, Tetsuo Handa, Dave Chinner,
	linux-mm, David Rientjes, Oleg Nesterov, Andrew Morton

On Tue, Dec 23, 2014 at 5:06 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Worse, it can be the task that is consuming all the memory, as can be
> seen by this failure on xfs/084 on my single CPU, 1GB RAM VM. This
> test has been failing like this about 30% of the time since 3.18-rc1:

Quite frankly, if you can reliably handle memory allocation failures
and they won't cause problems for other processes, you should use
GFP_USER, not GFP_KERNEL.

GFP_KERNEL does mean "try really hard".  That has *always* been true.
We used to have a __GFP_HIGH set in GFP_KERNEL exactly for that
reason.

We seem to have lost that distinction between GFP_USER and GFP_KERNEL
long ago, and then re-grew it in a weaker form as GFP_HARDWALL. That
may be part of the problem: the kernel cannot easily distinguish
between "we should try really hard to satisfy this allocation" and "we
can easily fail it".
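
For reference, this is the distinction as I read the current gfp.h
(the only difference being __GFP_HARDWALL):
----------
#define GFP_KERNEL	(__GFP_WAIT | __GFP_IO | __GFP_FS)
#define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
----------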

Maybe we could just use that GFP_HARDWALL bit for it. Possibly rename
it, but for *testing* it somebody could try this trivial/minimal
test-patch.

    diff --git a/mm/page_alloc.c b/mm/page_alloc.c
    index 7633c503a116..7cacd45b47ce 100644
    --- a/mm/page_alloc.c
    +++ b/mm/page_alloc.c
    @@ -2307,6 +2307,10 @@ should_alloc_retry(gfp_t gfp_mask, unsigned
int order,
             if (!did_some_progress && pm_suspended_storage())
                     return 0;

    +        /* GFP_USER allocations don't re-try */
    +        if (gfp_mask & __GFP_HARDWALL)
    +                return 0;
    +
             /*
              * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
              * means __GFP_NOFAIL, but that may not be true in other

which is intentionally whitespace-damaged, because it really is meant
as a "this is a starting point for experimentation by VM people"
rather than as an "apply this patch and you're good to go" patch.

Hmm?

                            Linus


* [PATCH] mm: get rid of radix tree gfp mask for pagecache_get_page (was: Re: How to handle TIF_MEMDIE stalls?)
  2014-12-20 22:35                 ` Dave Chinner
  2014-12-21  8:45                   ` Tetsuo Handa
@ 2014-12-29 17:40                   ` Michal Hocko
  2014-12-29 18:45                     ` Linus Torvalds
  1 sibling, 1 reply; 276+ messages in thread
From: Michal Hocko @ 2014-12-29 17:40 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, dchinner, linux-mm, rientjes, oleg, Andrew Morton,
	Mel Gorman, Johannes Weiner, Linus Torvalds

On Sun 21-12-14 09:35:04, Dave Chinner wrote:
[...]
> Oh, boy.
> 
> struct page *grab_cache_page_write_begin(struct address_space *mapping,
>                                         pgoff_t index, unsigned flags)
> {
>         struct page *page;
>         int fgp_flags = FGP_LOCK|FGP_ACCESSED|FGP_WRITE|FGP_CREAT;
> 
>         if (flags & AOP_FLAG_NOFS)
>                 fgp_flags |= FGP_NOFS;
> 
>         page = pagecache_get_page(mapping, index, fgp_flags,
>                         mapping_gfp_mask(mapping),
>                         GFP_KERNEL);
>         if (page)
>                 wait_for_stable_page(page);
> 
>         return page;
> }
> 
> There are *3* different memory allocation controls passed to
> pagecache_get_page. The first is via AOP_FLAG_NOFS, where the caller
> explicitly says this allocation is in filesystem context with locks
> held, and so all allocations need to be done in GFP_NOFS context.
> This is used to override the second and third gfp parameters.
> 
> The second is mapping_gfp_mask(mapping), which is the *default
> allocation context* the filesystem wants the page cache to use for
> allocating pages to the mapping.
> 
> The third is a hard coded GFP_KERNEL, which is used for radix tree
> node allocation.
> 
> Why are there separate allocation contexts for the radix tree nodes
> and the page cache pages when they are done under *exactly the same
> caller context*? Either we are allowed to recurse into the
> filesystem or we aren't, and the inode mapping mask defines that
> context for all page cache allocations, not just the pages
> themselves.
> 
> And to point out how many filesystems this affects,
> the loop device, btrfs, f2fs, gfs2, jfs, logfs, nilfs2, reiserfs
> and XFS all use this mapping default to clear __GFP_FS from
> page cache allocations. Only ext4 and gfs2 use AOP_FLAG_NOFS in
> their ->write_begin callouts to prevent recursion.
> 
> IOWs, grab_cache_page_write_begin/pagecache_get_page multiple
> allocation contexts are just wrong.  It does not match the way
> filesystems are informing the page cache of allocation context to
> avoid recursion (for avoiding stack overflow and/or deadlock).
> AOP_FLAG_NOFS should go away, and all filesystems should modify the
> mapping gfp mask to set their allocation context. It should be used
> *everywhere* pages are allocated into the page cache, and for all
> allocations related to tracking those allocated pages.
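
For concreteness, a minimal sketch of what you describe (the function
below is hypothetical; mapping_set_gfp_mask() is the existing helper):
----------
#include <linux/pagemap.h>

/*
 * Sketch: the filesystem pins its page cache allocation context on
 * the mapping itself instead of relying on AOP_FLAG_NOFS at
 * individual call sites.
 */
static void example_setup_inode_mapping(struct inode *inode)
{
	gfp_t gfp = mapping_gfp_mask(inode->i_mapping);

	/* every page cache allocation for this mapping becomes NOFS */
	mapping_set_gfp_mask(inode->i_mapping, gfp & ~__GFP_FS);
}
----------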

I guess the following would be a first simple step to remove the bug you
are mentioning above. It would be simple enough to put into stable as
well. What do you think?
---


* Re: How to handle TIF_MEMDIE stalls?
  2014-12-21  8:45                   ` Tetsuo Handa
  2014-12-21 20:42                     ` Dave Chinner
@ 2014-12-29 18:19                     ` Michal Hocko
  2014-12-30  6:42                       ` Tetsuo Handa
  1 sibling, 1 reply; 276+ messages in thread
From: Michal Hocko @ 2014-12-29 18:19 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: david, dchinner, linux-mm, rientjes, oleg, Andrew Morton,
	Mel Gorman, Johannes Weiner, Linus Torvalds

On Sun 21-12-14 17:45:32, Tetsuo Handa wrote:
[...]
> Traces from uptime > 484 seconds of
> http://I-love.SAKURA.ne.jp/tmp/serial-20141221.txt.xz is a stalled case.
[  548.449780] Out of memory: Kill process 12718 (a.out) score 890 or sacrifice child
[...]
[  954.595576] a.out           D ffff8800764918a0     0 12718      1 0x00100084
[  954.597544]  ffff880077d7fca8 0000000000000086 ffff880076491470 ffff880077d7ffd8
[  954.599565]  0000000000013640 0000000000013640 ffff8800358c8210 ffff880076491470
[  954.601634]  0000000000000000 ffff88007c8a3e48 ffff88007c8a3e4c ffff880076491470
[  954.604091] Call Trace:
[  954.607766]  [<ffffffff81618669>] schedule_preempt_disabled+0x29/0x70
[  954.609792]  [<ffffffff8161a555>] __mutex_lock_slowpath+0xb5/0x120
[  954.611644]  [<ffffffff8161a5e3>] mutex_lock+0x23/0x37
[  954.613256]  [<ffffffffa025fb47>] xfs_file_buffered_aio_write.isra.9+0x77/0x270 [xfs]
[...]

and it seems that it is blocked by another allocator:
[  957.178207] a.out           R  running task        0 12804      1 0x00000084
[  957.180304] MemAlloc: 471962 jiffies on 0x10
[  957.181738]  ffff8800355df868 0000000000000086 ffff88007be98940 ffff8800355dffd8
[  957.183831]  0000000000013640 0000000000013640 ffff88007c4174b0 ffff88007be98940
[  957.185916]  0000000000000000 ffff8800355df940 0000000000000000 ffffffff81a621e8
[  957.188067] Call Trace:
[  957.189130]  [<ffffffff81618509>] _cond_resched+0x29/0x40
[  957.190790]  [<ffffffff8117752a>] shrink_slab+0x17a/0x1d0
[  957.192384]  [<ffffffff8117a330>] do_try_to_free_pages+0x280/0x450
[  957.194117]  [<ffffffff8117a5da>] try_to_free_pages+0xda/0x170
[  957.195800]  [<ffffffff8116db23>] __alloc_pages_nodemask+0x633/0xa50
[  957.197615]  [<ffffffff811b1ce7>] alloc_pages_current+0x97/0x110
[  957.199314]  [<ffffffff81164797>] __page_cache_alloc+0xa7/0xc0
[  957.201026]  [<ffffffff811652b0>] pagecache_get_page+0x70/0x1e0
[  957.202724]  [<ffffffff81165453>] grab_cache_page_write_begin+0x33/0x50
[  957.204546]  [<ffffffffa0252cb4>] xfs_vm_write_begin+0x34/0xe0 [xfs]

but this task managed to make some progress because we can clearly see
that pid 12718 (oom victim) managed to move on and get to the OOM killer
many times
[  961.062042] a.out(12718) the OOM killer was skipped for 1965000 times.
[...]
[  983.140589] a.out(12718) the OOM killer was skipped for 2059000 times.

This shouldn't happen for the xfs pagecache allocations because
they should all be !__GFP_FS and we do not trigger the OOM killer in
that case and fail instead. But as already pointed out by Dave
grab_cache_page_write_begin uses GFP_KERNEL for the radix tree
allocation and that would trigger the OOM killer. The rest is our
hopeless attempt to not fail the allocation. I believe that the patch
from http://marc.info/?l=linux-mm&m=141987483503279 should help in this
particular case. There are still other cases where we can livelock but
this seems to be a clear bug in grab_cache_page_write_begin.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: get rid of radix tree gfp mask for pagecache_get_page (was: Re: How to handle TIF_MEMDIE stalls?)
  2014-12-29 17:40                   ` [PATCH] mm: get rid of radix tree gfp mask for pagecache_get_page (was: Re: How to handle TIF_MEMDIE stalls?) Michal Hocko
@ 2014-12-29 18:45                     ` Linus Torvalds
  2014-12-29 19:33                       ` Michal Hocko
  0 siblings, 1 reply; 276+ messages in thread
From: Linus Torvalds @ 2014-12-29 18:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dave Chinner, Tetsuo Handa, Dave Chinner, linux-mm,
	David Rientjes, Oleg Nesterov, Andrew Morton, Mel Gorman,
	Johannes Weiner

So I think this patch is definitely going in the right direction, but
at least the __GFP_WRITE handling is insane:

(Patch edited to show the resulting code, without the old deleted lines)

On Mon, Dec 29, 2014 at 9:40 AM, Michal Hocko <mhocko@suse.cz> wrote:
> @@ -1105,13 +1102,11 @@ no_page:
>         if (!page && (fgp_flags & FGP_CREAT)) {
>                 int err;
>                 if ((fgp_flags & FGP_WRITE) && mapping_cap_account_dirty(mapping))
> +                       gfp_mask |= __GFP_WRITE;
> +               if (fgp_flags & FGP_NOFS)
> +                       gfp_mask &= ~__GFP_FS;
>
> +               page = __page_cache_alloc(gfp_mask);
>                 if (!page)
>                         return NULL;
>
> @@ -1122,7 +1117,7 @@ no_page:
>                 if (fgp_flags & FGP_ACCESSED)
>                         __SetPageReferenced(page);
>
> +               err = add_to_page_cache_lru(page, mapping, offset, gfp_mask);

Passing __GFP_WRITE into the radix tree allocation routines is not
sane. So you'd have to mask the bit out again here (unconditionally is
fine).
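
Something like this (a sketch; GFP_RECLAIM_MASK is the existing
mm-internal mask of the bits that make sense for reclaim):
----------
		err = add_to_page_cache_lru(page, mapping, offset,
					    gfp_mask & GFP_RECLAIM_MASK);
----------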

But other than that this seems to be a sane cleanup.

                            Linus


* Re: [PATCH] mm: get rid of radix tree gfp mask for pagecache_get_page (was: Re: How to handle TIF_MEMDIE stalls?)
  2014-12-29 18:45                     ` Linus Torvalds
@ 2014-12-29 19:33                       ` Michal Hocko
  2014-12-30 13:42                         ` Michal Hocko
  0 siblings, 1 reply; 276+ messages in thread
From: Michal Hocko @ 2014-12-29 19:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Tetsuo Handa, Dave Chinner, linux-mm,
	David Rientjes, Oleg Nesterov, Andrew Morton, Mel Gorman,
	Johannes Weiner

On Mon 29-12-14 10:45:22, Linus Torvalds wrote:
> So I think this patch is definitely going in the right direction, but
> at least the __GFP_WRITE handling is insane:
> 
> (Patch edited to show the resulting code, without the old deleted lines)
> 
> On Mon, Dec 29, 2014 at 9:40 AM, Michal Hocko <mhocko@suse.cz> wrote:
> > @@ -1105,13 +1102,11 @@ no_page:
> >         if (!page && (fgp_flags & FGP_CREAT)) {
> >                 int err;
> >                 if ((fgp_flags & FGP_WRITE) && mapping_cap_account_dirty(mapping))
> > +                       gfp_mask |= __GFP_WRITE;
> > +               if (fgp_flags & FGP_NOFS)
> > +                       gfp_mask &= ~__GFP_FS;
> >
> > +               page = __page_cache_alloc(gfp_mask);
> >                 if (!page)
> >                         return NULL;
> >
> > @@ -1122,7 +1117,7 @@ no_page:
> >                 if (fgp_flags & FGP_ACCESSED)
> >                         __SetPageReferenced(page);
> >
> > +               err = add_to_page_cache_lru(page, mapping, offset, gfp_mask);
> 
> Passing __GFP_WRITE into the radix tree allocation routines is not
> sane. So you'd have to mask the bit out again here (unconditionally is
> fine).

Good point!
--- 


* Re: How to handle TIF_MEMDIE stalls?
  2014-12-29 18:19                     ` Michal Hocko
@ 2014-12-30  6:42                       ` Tetsuo Handa
  2014-12-30 11:21                         ` Michal Hocko
  0 siblings, 1 reply; 276+ messages in thread
From: Tetsuo Handa @ 2014-12-30  6:42 UTC (permalink / raw)
  To: mhocko
  Cc: david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, hannes,
	torvalds

Michal Hocko wrote:
> but this task managed to make some progress because we can clearly see
> that pid 12718 (oom victim) managed to move on and get to the OOM killer
> many times
> [  961.062042] a.out(12718) the OOM killer was skipped for 1965000 times.
> [...]
> [  983.140589] a.out(12718) the OOM killer was skipped for 2059000 times.
> 
Excuse me for the confusing message. The a.out(12718) printed here is not
the caller of the OOM killer but the victim keeping the OOM killer disabled.
Thus, this task could not make any progress, which is why I called it
"a stalled case".

> There are still other cases where we can livelock but
> this seems to be a clear bug in grab_cache_page_write_begin.

We might want to discuss the case below as a separate topic, but it is a
TIF_MEMDIE stall anyway. I retested using 3.19-rc2 with the diff shown below.
If I start a.out and b.out (where b.out is a copy of a.out) with a slight
delay (a few deciseconds), I can observe that a.out is unable to die due to
b.out asking for memory or holding a lock.
http://I-love.SAKURA.ne.jp/tmp/serial-20141230-ab-1.txt.xz is a case
where I think a.out keeps the OOM killer disabled and
http://I-love.SAKURA.ne.jp/tmp/serial-20141230-ab-2.txt.xz is a case
where I think a.out cannot die within reasonable duration due to b.out .
I don't know whether cgroups can help or not, but I think we need to be
prepared for cases where sending SIGKILL to all threads sharing the same
memory does not help.

---------- diff start ----------
mm-get-rid-of-radix-tree-gfp-mask-for-pagecache_get_page-was-re-how-to-handle-tif_memdie-stalls.patch
oom-dont-count-on-mm-less-current-process.patch
oom-make-sure-that-tif_memdie-is-set-under-task_lock.patch
my patch for debug printk() on memory allocation stall
my patch for boot failure by bd809af16e3ab1f8 "x86: Enable PAT to use cache mode translation tables"

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index a97ee08..cab1578 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -718,9 +718,6 @@ void __init zone_sizes_init(void)
 
 void update_cache_mode_entry(unsigned entry, enum page_cache_mode cache)
 {
-	/* entry 0 MUST be WB (hardwired to speed up translations) */
-	BUG_ON(!entry && cache != _PAGE_CACHE_MODE_WB);
-
 	__cachemode2pte_tbl[cache] = __cm_idx2pte(entry);
 	__pte2cachemode_tbl[entry] = cache;
 }
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 7ea069c..4b3736f 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -251,7 +251,7 @@ pgoff_t page_cache_prev_hole(struct address_space *mapping,
 #define FGP_NOWAIT		0x00000020
 
 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
-		int fgp_flags, gfp_t cache_gfp_mask, gfp_t radix_gfp_mask);
+		int fgp_flags, gfp_t cache_gfp_mask);
 
 /**
  * find_get_page - find and get a page reference
@@ -266,13 +266,13 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
 static inline struct page *find_get_page(struct address_space *mapping,
 					pgoff_t offset)
 {
-	return pagecache_get_page(mapping, offset, 0, 0, 0);
+	return pagecache_get_page(mapping, offset, 0, 0);
 }
 
 static inline struct page *find_get_page_flags(struct address_space *mapping,
 					pgoff_t offset, int fgp_flags)
 {
-	return pagecache_get_page(mapping, offset, fgp_flags, 0, 0);
+	return pagecache_get_page(mapping, offset, fgp_flags, 0);
 }
 
 /**
@@ -292,7 +292,7 @@ static inline struct page *find_get_page_flags(struct address_space *mapping,
 static inline struct page *find_lock_page(struct address_space *mapping,
 					pgoff_t offset)
 {
-	return pagecache_get_page(mapping, offset, FGP_LOCK, 0, 0);
+	return pagecache_get_page(mapping, offset, FGP_LOCK, 0);
 }
 
 /**
@@ -319,7 +319,7 @@ static inline struct page *find_or_create_page(struct address_space *mapping,
 {
 	return pagecache_get_page(mapping, offset,
 					FGP_LOCK|FGP_ACCESSED|FGP_CREAT,
-					gfp_mask, gfp_mask & GFP_RECLAIM_MASK);
+					gfp_mask);
 }
 
 /**
@@ -340,8 +340,7 @@ static inline struct page *grab_cache_page_nowait(struct address_space *mapping,
 {
 	return pagecache_get_page(mapping, index,
 			FGP_LOCK|FGP_CREAT|FGP_NOFS|FGP_NOWAIT,
-			mapping_gfp_mask(mapping),
-			GFP_NOFS);
+			mapping_gfp_mask(mapping));
 }
 
 struct page *find_get_entry(struct address_space *mapping, pgoff_t offset);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8db31ef..69d367f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1701,6 +1701,14 @@ struct task_struct {
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 	unsigned long	task_state_change;
 #endif
+	/* Jiffies spent since the start of outermost memory allocation */
+	unsigned long gfp_start;
+	/* GFP flags passed to innermost memory allocation */
+	gfp_t gfp_flags;
+	/* # of shrink_slab() calls since outermost memory allocation. */
+	unsigned int shrink_slab_counter;
+	/* # of OOM-killer skipped. */
+	atomic_t oom_killer_skip_counter;
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b5797b7..e7fc702 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4502,6 +4502,22 @@ out_unlock:
 	return retval;
 }
 
+static void print_memalloc_info(const struct task_struct *p)
+{
+	const gfp_t gfp = p->gfp_flags & __GFP_WAIT;
+
+	/*
+	 * __alloc_pages_nodemask() doesn't use smp_wmb() between
+	 * updating ->gfp_start and ->gfp_flags. But reading stale
+	 * ->gfp_start value harms nothing but printing bogus duration.
+	 * Correct duration will be printed when this function is
+	 * called for the next time.
+	 */
+	if (unlikely(gfp))
+		printk(KERN_INFO "MemAlloc: %ld jiffies on 0x%x\n",
+		       jiffies - p->gfp_start, gfp);
+}
+
 static const char stat_nam[] = TASK_STATE_TO_CHAR_STR;
 
 void sched_show_task(struct task_struct *p)
@@ -4536,6 +4552,7 @@ void sched_show_task(struct task_struct *p)
 		task_pid_nr(p), ppid,
 		(unsigned long)task_thread_info(p)->flags);
 
+	print_memalloc_info(p);
 	print_worker_info(KERN_INFO, p);
 	show_stack(p, NULL);
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index bd8543c..673e458 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1046,8 +1046,7 @@ EXPORT_SYMBOL(find_lock_entry);
  * @mapping: the address_space to search
  * @offset: the page index
  * @fgp_flags: PCG flags
- * @cache_gfp_mask: gfp mask to use for the page cache data page allocation
- * @radix_gfp_mask: gfp mask to use for radix tree node allocation
+ * @gfp_mask: gfp mask to use for the page cache data page allocation
  *
  * Looks up the page cache slot at @mapping & @offset.
  *
@@ -1056,11 +1055,9 @@ EXPORT_SYMBOL(find_lock_entry);
  * FGP_ACCESSED: the page will be marked accessed
  * FGP_LOCK: Page is return locked
  * FGP_CREAT: If page is not present then a new page is allocated using
- *		@cache_gfp_mask and added to the page cache and the VM's LRU
- *		list. If radix tree nodes are allocated during page cache
- *		insertion then @radix_gfp_mask is used. The page is returned
- *		locked and with an increased refcount. Otherwise, %NULL is
- *		returned.
+ *		@gfp_mask and added to the page cache and the VM's LRU
+ *		list. The page is returned locked and with an increased
+ *		refcount. Otherwise, %NULL is returned.
  *
  * If FGP_LOCK or FGP_CREAT are specified then the function may sleep even
  * if the GFP flags specified for FGP_CREAT are atomic.
@@ -1068,7 +1065,7 @@ EXPORT_SYMBOL(find_lock_entry);
  * If there is a page cache page, it is returned with an increased refcount.
  */
 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
-	int fgp_flags, gfp_t cache_gfp_mask, gfp_t radix_gfp_mask)
+	int fgp_flags, gfp_t gfp_mask)
 {
 	struct page *page;
 
@@ -1105,13 +1102,11 @@ no_page:
 	if (!page && (fgp_flags & FGP_CREAT)) {
 		int err;
 		if ((fgp_flags & FGP_WRITE) && mapping_cap_account_dirty(mapping))
-			cache_gfp_mask |= __GFP_WRITE;
-		if (fgp_flags & FGP_NOFS) {
-			cache_gfp_mask &= ~__GFP_FS;
-			radix_gfp_mask &= ~__GFP_FS;
-		}
+			gfp_mask |= __GFP_WRITE;
+		if (fgp_flags & FGP_NOFS)
+			gfp_mask &= ~__GFP_FS;
 
-		page = __page_cache_alloc(cache_gfp_mask);
+		page = __page_cache_alloc(gfp_mask);
 		if (!page)
 			return NULL;
 
@@ -1122,7 +1117,8 @@ no_page:
 		if (fgp_flags & FGP_ACCESSED)
 			__SetPageReferenced(page);
 
-		err = add_to_page_cache_lru(page, mapping, offset, radix_gfp_mask);
+		err = add_to_page_cache_lru(page, mapping, offset,
+				gfp_mask & GFP_RECLAIM_MASK);
 		if (unlikely(err)) {
 			page_cache_release(page);
 			page = NULL;
@@ -2443,8 +2439,7 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
 		fgp_flags |= FGP_NOFS;
 
 	page = pagecache_get_page(mapping, index, fgp_flags,
-			mapping_gfp_mask(mapping),
-			GFP_KERNEL);
+			mapping_gfp_mask(mapping));
 	if (page)
 		wait_for_stable_page(page);
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d503e9c..2f3ece1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -304,6 +304,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 	rcu_read_lock();
 	for_each_process_thread(g, p) {
 		unsigned int points;
+		unsigned int count;
 
 		switch (oom_scan_process_thread(p, totalpages, nodemask,
 						force_kill)) {
@@ -314,6 +315,14 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 		case OOM_SCAN_CONTINUE:
 			continue;
 		case OOM_SCAN_ABORT:
+			count = atomic_inc_return(&p->oom_killer_skip_counter);
+			if (count % 1000 == 0)
+				printk(KERN_INFO "%s(pid=%d,flags=0x%x) "
+				       "waited for %s(pid=%d,flags=0x%x) for "
+				       "%u times at select_bad_process().\n",
+				       current->comm, current->pid,
+				       current->gfp_flags, p->comm, p->pid,
+				       p->gfp_flags, count);
 			rcu_read_unlock();
 			return (struct task_struct *)(-1UL);
 		case OOM_SCAN_OK:
@@ -438,11 +447,22 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
-	if (task_will_free_mem(p)) {
+	task_lock(p);
+	if (p->mm && task_will_free_mem(p)) {
+		unsigned int count =
+			atomic_inc_return(&p->oom_killer_skip_counter);
+		if (count % 1000 == 0)
+			printk(KERN_INFO "%s(pid=%d,flags=0x%x) waited for "
+			       "%s(pid=%d,flags=0x%x) for %u times at "
+			       "oom_kill_process().\n", current->comm,
+			       current->pid, current->gfp_flags, p->comm,
+			       p->pid, p->gfp_flags, count);
 		set_tsk_thread_flag(p, TIF_MEMDIE);
+		task_unlock(p);
 		put_task_struct(p);
 		return;
 	}
+	task_unlock(p);
 
 	if (__ratelimit(&oom_rs))
 		dump_header(p, gfp_mask, order, memcg, nodemask);
@@ -492,6 +512,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 
 	/* mm cannot safely be dereferenced after task_unlock(victim) */
 	mm = victim->mm;
+	set_tsk_thread_flag(victim, TIF_MEMDIE);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
 		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
@@ -522,7 +543,6 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		}
 	rcu_read_unlock();
 
-	set_tsk_thread_flag(victim, TIF_MEMDIE);
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 	put_task_struct(victim);
 }
@@ -643,8 +663,12 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 * If current has a pending SIGKILL or is exiting, then automatically
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
+	 *
+	 * But don't select if current has already released its mm and cleared
+	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
 	 */
-	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
+	if (current->mm &&
+	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
 		set_thread_flag(TIF_MEMDIE);
 		return;
 	}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7633c50..a3b0c5a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2877,6 +2877,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	unsigned int cpuset_mems_cookie;
 	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
 	int classzone_idx;
+	const gfp_t old_gfp_flags = current->gfp_flags;
+
+	if (!old_gfp_flags) {
+		current->gfp_start = jiffies;
+		current->shrink_slab_counter = 0;
+	}
+	current->gfp_flags = gfp_mask;
 
 	gfp_mask &= gfp_allowed_mask;
 
@@ -2885,7 +2892,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	might_sleep_if(gfp_mask & __GFP_WAIT);
 
 	if (should_fail_alloc_page(gfp_mask, order))
-		return NULL;
+		goto nopage;
 
 	/*
 	 * Check the zones suitable for the gfp_mask contain at least one
@@ -2893,7 +2900,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	 * of GFP_THISNODE and a memoryless node
 	 */
 	if (unlikely(!zonelist->_zonerefs->zone))
-		return NULL;
+		goto nopage;
 
 	if (IS_ENABLED(CONFIG_CMA) && migratetype == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;
@@ -2937,6 +2944,9 @@ out:
 	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
 		goto retry_cpuset;
 
+nopage:
+	current->gfp_flags = old_gfp_flags;
+
 	return page;
 }
 EXPORT_SYMBOL(__alloc_pages_nodemask);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bd9a72b..7d736d6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -368,6 +368,7 @@ unsigned long shrink_node_slabs(gfp_t gfp_mask, int nid,
 {
 	struct shrinker *shrinker;
 	unsigned long freed = 0;
+	const unsigned long start = jiffies;
 
 	if (nr_scanned == 0)
 		nr_scanned = SWAP_CLUSTER_MAX;
@@ -397,6 +398,13 @@ unsigned long shrink_node_slabs(gfp_t gfp_mask, int nid,
 
 	up_read(&shrinker_rwsem);
 out:
+	if (++current->shrink_slab_counter % 100000 == 0)
+		printk(KERN_INFO "%s(pid=%d,flags=0x%x) called "
+		       "shrink_slab() for %u times. This time freed "
+		       "%lu object and took %lu jiffies. Spent %lu "
+		       "jiffies till now.\n", current->comm, current->pid,
+		       current->gfp_flags, current->shrink_slab_counter, freed,
+		       jiffies - start, jiffies - current->gfp_start);
 	cond_resched();
 	return freed;
 }
---------- diff end ----------


* Re: How to handle TIF_MEMDIE stalls?
  2014-12-30  6:42                       ` Tetsuo Handa
@ 2014-12-30 11:21                         ` Michal Hocko
  2014-12-30 13:33                           ` Tetsuo Handa
                                             ` (2 more replies)
  0 siblings, 3 replies; 276+ messages in thread
From: Michal Hocko @ 2014-12-30 11:21 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, hannes,
	torvalds

On Tue 30-12-14 15:42:56, Tetsuo Handa wrote:
[...]
> We might want to discuss the case below as a separate topic, but it is a
> TIF_MEMDIE stall anyway. I retested using 3.19-rc2 with the diff shown below.
> If I start a.out and b.out (where b.out is a copy of a.out) with a slight
> delay (a few deciseconds), I can observe that a.out is unable to die due to
> b.out asking for memory or holding a lock.
> http://I-love.SAKURA.ne.jp/tmp/serial-20141230-ab-1.txt.xz is a case
> where I think a.out keeps the OOM killer disabled and

[   53.748454] b.out invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[...]
[   53.807397] active_anon:448903 inactive_anon:2082 isolated_anon:0
[   53.807397]  active_file:0 inactive_file:9 isolated_file:0
[   53.807397]  unevictable:0 dirty:3 writeback:0 unstable:0
[   53.807397]  free:13079 slab_reclaimable:1227 slab_unreclaimable:4520
[   53.807397]  mapped:380 shmem:2151 pagetables:2059 bounce:0
[   53.807397]  free_cma:0
[...]
[   53.856598] Free swap  = 0kB
[   53.857908] Total swap = 0kB
[   53.859218] 524157 pages RAM

This situation looks quite hopeless. We cannot swap, yet we have over 80%
of memory occupied by anon memory. On the other hand, there is still
around ~50M free and a few pages in the reclaimable slab, which should be
sufficient to help TIF_MEMDIE make some progress.

[   54.380517] Out of memory: Kill process 3596 (a.out) score 719 or sacrifice child
[   54.382091] Killed process 3596 (a.out) total-vm:2166864kB, anon-rss:1383880kB, file-rss:4kB
[...]
[  348.134718] a.out           D ffff880036fefcb8     0  3596      1 0x00100084
[  348.136616]  ffff880036fefcb8 ffff880036fefc88 ffff88007c204550 00000000000130c0
[  348.138645]  ffff880036feffd8 00000000000130c0 ffff88007c204550 ffff880036fefcb8
[  348.140657]  ffff88007ca45248 ffff88007ca4524c ffff88007c204550 00000000ffffffff
[  348.142672] Call Trace:
[  348.143662]  [<ffffffff815bddb4>] schedule_preempt_disabled+0x24/0x70
[  348.145379]  [<ffffffff815bfb65>] __mutex_lock_slowpath+0xb5/0x120
[  348.147153]  [<ffffffff815bfbee>] mutex_lock+0x1e/0x32
[  348.148644]  [<ffffffffa02463ca>] xfs_file_buffered_aio_write.isra.15+0x6a/0x200 [xfs]
[  348.150637]  [<ffffffff8100d62f>] ? __switch_to+0x15f/0x580
[  348.152209]  [<ffffffffa02465dd>] xfs_file_write_iter+0x7d/0x120 [xfs]
[  348.153961]  [<ffffffff81178009>] new_sync_write+0x89/0xd0
[  348.155506]  [<ffffffff811787f2>] vfs_write+0xb2/0x1f0
[  348.157004]  [<ffffffff8101b994>] ? do_audit_syscall_entry+0x64/0x70
[  348.158715]  [<ffffffff81179440>] SyS_write+0x50/0xc0
[  348.160188]  [<ffffffff810f9ffe>] ? __audit_syscall_exit+0x22e/0x2d0

and this is the case for most a.out and b.out threads basically because
all of them contend on a single file. The holder of the lock right now
seems to be:

[  355.559722] b.out           R  running task        0  3843   3724 0x00000080
[  355.561700] MemAlloc: 21916 jiffies on 0x10
[  355.563056]  ffff88007c3f3808 ffff88007c3f37d8 ffff88007c3e4d60 00000000000130c0
[  355.565346]  ffff88007c3f3fd8 00000000000130c0 ffff88007c3e4d60 ffff880036f02b48
[  355.567440]  ffffffff81848588 0000000000000400 0000000000000000 ffff88007c3f39c8
[  355.569517] Call Trace:
[  355.570557]  [<ffffffff815bdc72>] _cond_resched+0x22/0x40
[  355.572167]  [<ffffffff811249f2>] shrink_node_slabs+0x242/0x310
[  355.573846]  [<ffffffff81127155>] shrink_zone+0x175/0x1c0
[  355.575410]  [<ffffffff81127590>] do_try_to_free_pages+0x1d0/0x3e0
[  355.577339]  [<ffffffff81127834>] try_to_free_pages+0x94/0xc0
[  355.579015]  [<ffffffff8111d4c5>] __alloc_pages_nodemask+0x535/0xaa0
[  355.580759]  [<ffffffff8115cf9c>] alloc_pages_current+0x8c/0x100
[  355.582446]  [<ffffffff811148f7>] __page_cache_alloc+0xa7/0xc0
[  355.584092]  [<ffffffff81115364>] pagecache_get_page+0x54/0x1b0
[  355.585773]  [<ffffffffa025d11e>] ? xfs_trans_commit+0x13e/0x230 [xfs]
[  355.587553]  [<ffffffff811154e8>] grab_cache_page_write_begin+0x28/0x50
[  355.589349]  [<ffffffffa023b04f>] xfs_vm_write_begin+0x2f/0xe0 [xfs]
[  355.591096]  [<ffffffff8111465c>] generic_perform_write+0xbc/0x1c0
[  355.592816]  [<ffffffffa024634f>] ? xfs_file_aio_write_checks+0xdf/0xf0 [xfs]
[  355.594718]  [<ffffffffa024642f>] xfs_file_buffered_aio_write.isra.15+0xcf/0x200 [xfs]

So it is trying to reclaim at least something but it will take some time
for it to realize this will not fly. The allocation will fail eventually,
though, because this is a !__GFP_FS allocation and the same will apply to
a.out waiting for the lock as well.

$ grep "waited for.*select_bad_process" serial-20141230-ab-1.txt | sed 's@.*\((pid=.*waited for.*\) for.*@\1@' | sort | uniq -c
      1 (pid=2,flags=0x2000d0) waited for a.out(pid=3596,flags=0x0)
    809 (pid=3724,flags=0x280da) waited for a.out(pid=3596,flags=0x0)

[  351.915586] b.out           R  running task        0  3724   3572 0x00000080
[  351.917619] MemAlloc: 29906 jiffies on 0x10
[  351.919012]  ffff88007b8d7948 ffff88007fffc6c0 ffff88007c5751b0 00000000000130c0
[  351.921096]  ffff88007b8d7fd8 00000000000130c0 ffff88007c5751b0 0000000000000000
[  351.923228]  0000000000000000 00000000000280da 0000000000000002 0000000000000000
[  351.925374] Call Trace:
[  351.926466]  [<ffffffff815bdc72>] _cond_resched+0x22/0x40
[  351.928073]  [<ffffffff8111d477>] __alloc_pages_nodemask+0x4e7/0xaa0
[  351.929828]  [<ffffffff8115f302>] alloc_pages_vma+0x92/0x160
[  351.931502]  [<ffffffff8113fa11>] handle_mm_fault+0xbe1/0xed0
[  351.933171]  [<ffffffff815c2847>] ? native_iret+0x7/0x7
[  351.934719]  [<ffffffff8105502c>] __do_page_fault+0x1dc/0x5b0
[  351.936412]  [<ffffffff8111d125>] ? __alloc_pages_nodemask+0x195/0xaa0
[  351.938191]  [<ffffffff81055431>] do_page_fault+0x31/0x70
[  351.939769]  [<ffffffff815c3638>] page_fault+0x28/0x30
[  351.941322]  [<ffffffff812b1940>] ? __clear_user+0x20/0x50
[  351.942921]  [<ffffffff81139538>] iov_iter_zero+0x68/0x2f0
[  351.944503]  [<ffffffff8138a4e7>] read_iter_zero+0x47/0xb0
[  351.946135]  [<ffffffff81177f46>] new_sync_read+0x86/0xc0
[  351.947703]  [<ffffffff811791b3>] __vfs_read+0x13/0x50
[  351.949216]  [<ffffffff81179271>] vfs_read+0x81/0x140
[  351.950757]  [<ffffffff81179380>] SyS_read+0x50/0xc0
[  351.952277]  [<ffffffff810f9ffe>] ? __audit_syscall_exit+0x22e/0x2d0
[  351.953995]  [<ffffffff815c1c29>] system_call_fastpath+0x12/0x17

So the OOM blocked task is sitting in the page fault caused by clearing
the user buffer. According to your debugging patch this should be
GFP_HIGHUSER_MOVABLE | __GFP_ZERO allocation which is the case where we
retry without failing most of the time.
I am not very familiar with the VFS code but it seems we are not sitting
on any locks that would block the OOM victim later on (I am not entirely
sure about FDPUT_POS_UNLOCK from fdget_pos, but all tasks got past it
without blocking so it shouldn't matter). So even if the page fault
failed with ENOMEM it wouldn't help us much here.

That being said this doesn't look like a live lock or a lockup. System
should recover from this state but it might take a lot of time (there
are hundreds of tasks waiting on the i_mutex lock, each will try to
allocate and fail and OOM victims will have to get out of the kernel and
die). I am not sure we can do much about that from the allocator POV. A
possible way would be refraining from the reclaim efforts when it is
clear that nothing is really reclaimable. But I suspect this would be
tricky to get right.

> http://I-love.SAKURA.ne.jp/tmp/serial-20141230-ab-2.txt.xz is a case

[   44.588785] Out of memory: Kill process 3599 (a.out) score 773 or sacrifice child
[   44.590418] Killed process 3599 (a.out) total-vm:2166864kB, anon-rss:1488688kB, file-rss:4kB
[...]
[   44.640689] a.out: page allocation failure: order:0, mode:0x280da
[   44.640690] CPU: 2 PID: 3599 Comm: a.out Not tainted 3.19.0-rc2+ #20
[...]
[   44.641125] a.out: page allocation failure: order:0, mode:0x2015a
[   44.641126] CPU: 2 PID: 3599 Comm: a.out Not tainted 3.19.0-rc2+ #20

So the OOM victim is failing the allocation because we prevent endless
loops in the allocator for TIF_MEMDIE tasks and then it dies (it is not
among Sysrq+t output AFAICS). We still have to wait for all the tasks
sharing mm with it.

many of them are in:
[  402.300859] a.out           x ffff88007be53ce8     0  3601      1 0x00000086
[  402.303407]  ffff88007be53ce8 ffff88007c962450 ffff880078d10e60 00000000000130c0
[  402.305478]  ffff88007be53fd8 00000000000130c0 ffff880078d10e60 ffff880078d114a8
[  402.307519]  ffff880078d114a8 ffff880078d11170 ffff88007c0a9220 ffff880078d10e60
[  402.309547] Call Trace:
[  402.310551]  [<ffffffff815bd8c4>] schedule+0x24/0x70
[  402.312040]  [<ffffffff8106a4ea>] do_exit+0x6ba/0xb10
[  402.313531]  [<ffffffff8106b7da>] do_group_exit+0x3a/0xa0
[  402.315082]  [<ffffffff81075de8>] get_signal+0x188/0x690
[  402.316629]  [<ffffffff815bd43a>] ? __schedule+0x27a/0x6e0
[  402.318196]  [<ffffffff8100e4f2>] do_signal+0x32/0x750
[  402.319744]  [<ffffffffa02611c4>] ? _xfs_log_force_lsn+0xc4/0x2f0 [xfs]
[  402.321729]  [<ffffffffa0245489>] ? xfs_file_fsync+0x159/0x1b0 [xfs]
[  402.323461]  [<ffffffff8100ec5c>] do_notify_resume+0x4c/0x90
[  402.325135]  [<ffffffff815c1ec7>] int_signal+0x12/0x17

so they have already dropped their reference to the mm_struct but some of
them are still waiting in the write path to fail and exit:
[  402.271983] a.out           D ffff88007c047cb8     0  3600      1 0x00000084
[  402.273866]  ffff88007c047cb8 ffff88007c047c88 ffff8800793d8ba0 00000000000130c0
[  402.275872]  ffff88007c047fd8 00000000000130c0 ffff8800793d8ba0 ffff88007c047cb8
[  402.277878]  ffff88007ae56a48 ffff88007ae56a4c ffff8800793d8ba0 00000000ffffffff
[  402.279888] Call Trace:
[  402.280874]  [<ffffffff815bddb4>] schedule_preempt_disabled+0x24/0x70
[  402.282597]  [<ffffffff815bfb65>] __mutex_lock_slowpath+0xb5/0x120
[  402.284266]  [<ffffffff815bfbee>] mutex_lock+0x1e/0x32
[  402.285756]  [<ffffffffa02463ca>] xfs_file_buffered_aio_write.isra.15+0x6a/0x200 [xfs]
[  402.287741]  [<ffffffff8100d62f>] ? __switch_to+0x15f/0x580
[  402.289311]  [<ffffffffa02465dd>] xfs_file_write_iter+0x7d/0x120 [xfs]
[  402.291050]  [<ffffffff81178009>] new_sync_write+0x89/0xd0
[  402.292596]  [<ffffffff811787f2>] vfs_write+0xb2/0x1f0
[  402.294075]  [<ffffffff8101b994>] ? do_audit_syscall_entry+0x64/0x70
[  402.295774]  [<ffffffff81179440>] SyS_write+0x50/0xc0
[  402.297239]  [<ffffffff810f9ffe>] ? __audit_syscall_exit+0x22e/0x2d0
[  402.298947]  [<ffffffff815c1c29>] system_call_fastpath+0x12/0x17

while one of them is holding the lock:
[  402.736525] a.out           R  running task        0  3617      1 0x00000084
[  402.738452] MemAlloc: 358299 jiffies on 0x10
[  402.739812]  ffff88007ba63808 ffff88007ba637d8 ffff8800792f2510 00000000000130c0
[  402.741972]  ffff88007ba63fd8 00000000000130c0 ffff8800792f2510 ffff880078d1bb48
[  402.744029]  ffffffff81848588 0000000000000400 0000000000000000 ffff88007ba639c8
[  402.746135] Call Trace:
[  402.747153]  [<ffffffff815bdc72>] _cond_resched+0x22/0x40
[  402.748718]  [<ffffffff811249f2>] shrink_node_slabs+0x242/0x310
[  402.750432]  [<ffffffff81127155>] shrink_zone+0x175/0x1c0
[  402.751996]  [<ffffffff81127590>] do_try_to_free_pages+0x1d0/0x3e0
[  402.753686]  [<ffffffff81127834>] try_to_free_pages+0x94/0xc0
[  402.755325]  [<ffffffff8111d4c5>] __alloc_pages_nodemask+0x535/0xaa0
[  402.757057]  [<ffffffff8115cf9c>] alloc_pages_current+0x8c/0x100
[  402.758725]  [<ffffffff811148f7>] __page_cache_alloc+0xa7/0xc0
[  402.760362]  [<ffffffff81115364>] pagecache_get_page+0x54/0x1b0
[  402.762004]  [<ffffffff811154e8>] grab_cache_page_write_begin+0x28/0x50
[  402.763787]  [<ffffffffa023b04f>] xfs_vm_write_begin+0x2f/0xe0 [xfs]
[  402.765516]  [<ffffffff8111465c>] generic_perform_write+0xbc/0x1c0
[  402.767203]  [<ffffffffa024634f>] ? xfs_file_aio_write_checks+0xdf/0xf0 [xfs]
[  402.769078]  [<ffffffffa024642f>] xfs_file_buffered_aio_write.isra.15+0xcf/0x200 [xfs]

So this is basically the same as the previous one; we just see it in a
slightly better shape because many threads managed to exit already.

> where I think a.out cannot die within reasonable duration due to b.out .

I am not sure you can have any reasonable time expectation with such a
huge contention on a single file. Even killing the task manually would
take quite some time I suspect. Sure, memory pressure makes it all much
worse.

> I don't know whether cgroups can help or not,

Memory cgroups would help you to limit the amount of anon memory but you
would have to be really careful about the potential overcommit due to
other allocations from outside of the restricted group. Not having any
swap doesn't help here either. It just moves all the reclaim pressure to
the file pages and slabs which struggle already.

> but I think we need to be prepared for cases where sending SIGKILL to
> all threads sharing the same memory does not help.

Sure, unkillable tasks are a problem which we have to handle. Having
GFP_KERNEL allocations looping without a way out contributes to this,
which is sad, but your current data just show that sometimes it might
take ages to finish even without that going on.
-- 
Michal Hocko
SUSE Labs


* Re: How to handle TIF_MEMDIE stalls?
  2014-12-30 11:21                         ` Michal Hocko
@ 2014-12-30 13:33                           ` Tetsuo Handa
  2014-12-31 10:24                             ` Tetsuo Handa
  2015-02-09 11:44                           ` Tetsuo Handa
  2015-02-16 11:23                           ` Tetsuo Handa
  2 siblings, 1 reply; 276+ messages in thread
From: Tetsuo Handa @ 2014-12-30 13:33 UTC (permalink / raw)
  To: mhocko
  Cc: david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, hannes,
	torvalds

Michal Hocko wrote:
> So the OOM blocked task is sitting in the page fault caused by clearing
> the user buffer. According to your debugging patch this should be
> GFP_HIGHUSER_MOVABLE | __GFP_ZERO allocation which is the case where we
> retry without failing most of the time.

Oops, my debugging patch had a bug. I wanted to print p->gfp_flags but
was printing (p->gfp_flags & __GFP_WAIT). Retested with a fix and the result
is http://I-love.SAKURA.ne.jp/tmp/serial-20141230-ab-3.txt.xz .

  static void print_memalloc_info(const struct task_struct *p)
  {
          const gfp_t gfp = p->gfp_flags;
  
          /*
           * __alloc_pages_nodemask() doesn't use smp_wmb() between
           * updating ->gfp_start and ->gfp_flags. But reading stale
           * ->gfp_start value harms nothing but printing bogus duration.
           * Correct duration will be printed when this function is
           * called for the next time.
           */
          if (unlikely(gfp & __GFP_WAIT))
                  printk(KERN_INFO "MemAlloc: %ld jiffies on 0x%x\n",
                         jiffies - p->gfp_start, gfp);
  }

> That being said this doesn't look like a live lock or a lockup. System
> should recover from this state but it might take a lot of time (there
> are hundreds of tasks waiting on the i_mutex lock, each will try to
> allocate and fail and OOM victims will have to get out of the kernel and
> die). I am not sure we can do much about that from the allocator POV. A
> possible way would be refraining from the reclaim efforts when it is
> clear that nothing is really reclaimable. But I suspect this would be
> tricky to get right.

Indeed, this is not a livelock since the task holding the mutex is doing
a !__GFP_FS allocation and is making progress that is too slow to wait
for, and the "waited for" lines eventually go away.

[  121.017797] b.out           R  running task        0  9999   9982 0x00000088
[  121.019750] MemAlloc: 30542 jiffies on 0x102005a
[  223.486701] b.out           R  running task        0 10008   9982 0x00000080
[  223.488642] MemAlloc: 12242 jiffies on 0x102005a
[  415.695635] b.out           R  running task        0 10013   9982 0x00000080
[  415.697578] MemAlloc: 108210 jiffies on 0x102005a
[  960.228134] b.out           R  running task        0 10013   9982 0x00000080
[  960.230179] MemAlloc: 652090 jiffies on 0x102005a

> > where I think a.out cannot die within reasonable duration due to b.out .
> 
> I am not sure you can have any reasonable time expectation with such a
> huge contention on a single file. Even killing the task manually would
> take quite some time I suspect. Sure, memory pressure makes it all much
> worse.

This is not specific to the OOM-killer case, but I wish the stall would
end within 10 seconds, because my customers use a watchdog timeout of 11
seconds with a watchdog keep-alive interval of 2 seconds.

I also wish there were a way to record that the process which is supposed
to perform the watchdog keep-alive operation was unexpectedly blocked for
many seconds in a memory allocation. My gfp_start patch serves that
purpose.

> > but I think we need to be prepared for cases where sending SIGKILL to
> > all threads sharing the same memory does not help.
> 
> Sure, unkillable tasks are a problem which we have to handle. Having
> GFP_KERNEL allocations looping without a way out contributes to this,
> which is sad, but your current data just show that sometimes it might
> take ages to finish even without that going on.

Can't we replace mutex_lock() / wait_for_completion() with killable
versions where it is safe (in order to reduce the number of unkillable
wait sites)? I think replacing mutex_lock() in
xfs_file_buffered_aio_write() with the killable version is possible,
because data written by a buffered write is not guaranteed to be flushed
until sync() / fsync() / fdatasync() returns.
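
(A minimal sketch of such a conversion; xfs_do_buffered_write() is a
made-up placeholder for the real body, and the actual locking in
xfs_file_buffered_aio_write() is more involved than this:)

	if (mutex_lock_killable(&inode->i_mutex))
		return -EINTR;	/* a fatal signal arrived while waiting */
	ret = xfs_do_buffered_write(iocb, from);
	mutex_unlock(&inode->i_mutex);
	return ret;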

And can't we detect unkillable TIF_MEMDIE tasks (e.g. by checking the
task's ->state some time after TIF_MEMDIE was set)? My
sysctl_memdie_timeout_jiffies patch serves that purpose.

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: [PATCH] mm: get rid of radix tree gfp mask for pagecache_get_page (was: Re: How to handle TIF_MEMDIE stalls?)
  2014-12-29 19:33                       ` Michal Hocko
@ 2014-12-30 13:42                         ` Michal Hocko
  2014-12-30 21:45                           ` Linus Torvalds
  0 siblings, 1 reply; 276+ messages in thread
From: Michal Hocko @ 2014-12-30 13:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Tetsuo Handa, Dave Chinner, linux-mm,
	David Rientjes, Oleg Nesterov, Linus Torvalds, Mel Gorman,
	Johannes Weiner

Andrew,
I've noticed you have already taken the patch into the mm tree. I have
realized I didn't mark it for stable, which is worth doing IMO: debugging
nasty reclaim recursion bugs is definitely a pain, this patch might fix
one, and even if it doesn't, it is rather straightforward and shouldn't
break anything. So if nobody objects, I would mark this for stable 3.16+
AFAICS.

On Mon 29-12-14 20:33:12, Michal Hocko wrote:
> From 3242f56ae8886a3c605d93960e77176dfe1dff43 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Mon, 29 Dec 2014 20:30:35 +0100
> Subject: [PATCH] mm: get rid of radix tree gfp mask for pagecache_get_page
> 
> 2457aec63745 (mm: non-atomically mark page accessed during page cache
> allocation where possible) added a separate parameter for specifying the
> gfp mask for radix tree allocations.
> 
> Not only is this less than optimal from the API point of view because
> it is error prone, it is also currently buggy: grab_cache_page_write_begin
> uses GFP_KERNEL for the radix tree, and if fgp_flags doesn't contain
> FGP_NOFS (mostly controlled by the fs via the AOP_FLAG_NOFS flag) but
> the mapping_gfp_mask has __GFP_FS cleared, then the radix tree
> allocation won't obey the restriction and might recurse into the
> filesystem and cause deadlocks. This is unfortunately the case for most
> filesystems because only ext4 and gfs2 use AOP_FLAG_NOFS.
> 
> Let's simply remove the radix_gfp_mask parameter, because the allocation
> context is the same for both the page cache and the radix tree. Just make
> sure that the radix tree gets only the sane subset of the mask (e.g. do
> not pass __GFP_WRITE).
> 
> Long term, it is preferable to convert the remaining users of
> AOP_FLAG_NOFS to mapping_gfp_mask instead and simplify this interface
> even further.
> 
> Reported-by: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>  include/linux/pagemap.h | 13 ++++++-------
>  mm/filemap.c            | 29 ++++++++++++-----------------
>  2 files changed, 18 insertions(+), 24 deletions(-)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 7ea069cd3257..4b3736f7065c 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -251,7 +251,7 @@ pgoff_t page_cache_prev_hole(struct address_space *mapping,
>  #define FGP_NOWAIT		0x00000020
>  
>  struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
> -		int fgp_flags, gfp_t cache_gfp_mask, gfp_t radix_gfp_mask);
> +		int fgp_flags, gfp_t cache_gfp_mask);
>  
>  /**
>   * find_get_page - find and get a page reference
> @@ -266,13 +266,13 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
>  static inline struct page *find_get_page(struct address_space *mapping,
>  					pgoff_t offset)
>  {
> -	return pagecache_get_page(mapping, offset, 0, 0, 0);
> +	return pagecache_get_page(mapping, offset, 0, 0);
>  }
>  
>  static inline struct page *find_get_page_flags(struct address_space *mapping,
>  					pgoff_t offset, int fgp_flags)
>  {
> -	return pagecache_get_page(mapping, offset, fgp_flags, 0, 0);
> +	return pagecache_get_page(mapping, offset, fgp_flags, 0);
>  }
>  
>  /**
> @@ -292,7 +292,7 @@ static inline struct page *find_get_page_flags(struct address_space *mapping,
>  static inline struct page *find_lock_page(struct address_space *mapping,
>  					pgoff_t offset)
>  {
> -	return pagecache_get_page(mapping, offset, FGP_LOCK, 0, 0);
> +	return pagecache_get_page(mapping, offset, FGP_LOCK, 0);
>  }
>  
>  /**
> @@ -319,7 +319,7 @@ static inline struct page *find_or_create_page(struct address_space *mapping,
>  {
>  	return pagecache_get_page(mapping, offset,
>  					FGP_LOCK|FGP_ACCESSED|FGP_CREAT,
> -					gfp_mask, gfp_mask & GFP_RECLAIM_MASK);
> +					gfp_mask);
>  }
>  
>  /**
> @@ -340,8 +340,7 @@ static inline struct page *grab_cache_page_nowait(struct address_space *mapping,
>  {
>  	return pagecache_get_page(mapping, index,
>  			FGP_LOCK|FGP_CREAT|FGP_NOFS|FGP_NOWAIT,
> -			mapping_gfp_mask(mapping),
> -			GFP_NOFS);
> +			mapping_gfp_mask(mapping));
>  }
>  
>  struct page *find_get_entry(struct address_space *mapping, pgoff_t offset);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index e8905bc3cbd7..11477d3b7838 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1046,8 +1046,7 @@ EXPORT_SYMBOL(find_lock_entry);
>   * @mapping: the address_space to search
>   * @offset: the page index
>   * @fgp_flags: PCG flags
> - * @cache_gfp_mask: gfp mask to use for the page cache data page allocation
> - * @radix_gfp_mask: gfp mask to use for radix tree node allocation
> + * @gfp_mask: gfp mask to use for the page cache data page allocation
>   *
>   * Looks up the page cache slot at @mapping & @offset.
>   *
> @@ -1056,11 +1055,9 @@ EXPORT_SYMBOL(find_lock_entry);
>   * FGP_ACCESSED: the page will be marked accessed
>   * FGP_LOCK: Page is return locked
>   * FGP_CREAT: If page is not present then a new page is allocated using
> - *		@cache_gfp_mask and added to the page cache and the VM's LRU
> - *		list. If radix tree nodes are allocated during page cache
> - *		insertion then @radix_gfp_mask is used. The page is returned
> - *		locked and with an increased refcount. Otherwise, %NULL is
> - *		returned.
> + *		@gfp_mask and added to the page cache and the VM's LRU
> + *		list. The page is returned locked and with an increased
> + *		refcount. Otherwise, %NULL is returned.
>   *
>   * If FGP_LOCK or FGP_CREAT are specified then the function may sleep even
>   * if the GFP flags specified for FGP_CREAT are atomic.
> @@ -1068,7 +1065,7 @@ EXPORT_SYMBOL(find_lock_entry);
>   * If there is a page cache page, it is returned with an increased refcount.
>   */
>  struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
> -	int fgp_flags, gfp_t cache_gfp_mask, gfp_t radix_gfp_mask)
> +	int fgp_flags, gfp_t gfp_mask)
>  {
>  	struct page *page;
>  
> @@ -1105,13 +1102,11 @@ no_page:
>  	if (!page && (fgp_flags & FGP_CREAT)) {
>  		int err;
>  		if ((fgp_flags & FGP_WRITE) && mapping_cap_account_dirty(mapping))
> -			cache_gfp_mask |= __GFP_WRITE;
> -		if (fgp_flags & FGP_NOFS) {
> -			cache_gfp_mask &= ~__GFP_FS;
> -			radix_gfp_mask &= ~__GFP_FS;
> -		}
> +			gfp_mask |= __GFP_WRITE;
> +		if (fgp_flags & FGP_NOFS)
> +			gfp_mask &= ~__GFP_FS;
>  
> -		page = __page_cache_alloc(cache_gfp_mask);
> +		page = __page_cache_alloc(gfp_mask);
>  		if (!page)
>  			return NULL;
>  
> @@ -1122,7 +1117,8 @@ no_page:
>  		if (fgp_flags & FGP_ACCESSED)
>  			__SetPageReferenced(page);
>  
> -		err = add_to_page_cache_lru(page, mapping, offset, radix_gfp_mask);
> +		err = add_to_page_cache_lru(page, mapping, offset,
> +				gfp_mask & GFP_RECLAIM_MASK);
>  		if (unlikely(err)) {
>  			page_cache_release(page);
>  			page = NULL;
> @@ -2443,8 +2439,7 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
>  		fgp_flags |= FGP_NOFS;
>  
>  	page = pagecache_get_page(mapping, index, fgp_flags,
> -			mapping_gfp_mask(mapping),
> -			GFP_KERNEL);
> +			mapping_gfp_mask(mapping));
>  	if (page)
>  		wait_for_stable_page(page);
>  
> -- 
> 2.1.4
> 
> -- 
> Michal Hocko
> SUSE Labs
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: [PATCH] mm: get rid of radix tree gfp mask for pagecache_get_page (was: Re: How to handle TIF_MEMDIE stalls?)
  2014-12-30 13:42                         ` Michal Hocko
@ 2014-12-30 21:45                           ` Linus Torvalds
  0 siblings, 0 replies; 276+ messages in thread
From: Linus Torvalds @ 2014-12-30 21:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Dave Chinner, Tetsuo Handa, Dave Chinner,
	linux-mm, David Rientjes, Oleg Nesterov, Mel Gorman,
	Johannes Weiner

On Tue, Dec 30, 2014 at 5:42 AM, Michal Hocko <mhocko@suse.cz> wrote:
>
> I've noticed you have already taken the patch into the mm tree. I have
> realized I didn't mark it for stable, which is worth doing IMO: debugging
> nasty reclaim recursion bugs is definitely a pain, this patch might fix
> one, and even if it doesn't, it is rather straightforward and shouldn't
> break anything. So if nobody objects, I would mark this for stable 3.16+
> AFAICS.

I already applied it (as commit 45f87de57f8f), so if you think it's
stable material - and I agree that it looks that way - you should just
email stable@vger.kernel.org about it.

I think it might be a good idea to wait a week or two to make sure it
doesn't have any unexpected side effects.

                        Linus

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2014-12-30 13:33                           ` Tetsuo Handa
@ 2014-12-31 10:24                             ` Tetsuo Handa
  0 siblings, 0 replies; 276+ messages in thread
From: Tetsuo Handa @ 2014-12-31 10:24 UTC (permalink / raw)
  To: mhocko
  Cc: david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, hannes,
	torvalds

Tetsuo Handa wrote:
> > > where I think a.out cannot die within reasonable duration due to b.out .
> > 
> > I am not sure you can have any reasonable time expectation with such a
> > huge contention on a single file. Even killing the task manually would
> > take quite some time I suspect. Sure, memory pressure makes it all much
> > worse.
> 
> This is not specific to the OOM-killer case, but I wish the stall would
> end within 10 seconds, because my customers use a watchdog timeout of 11
> seconds with a watchdog keep-alive interval of 2 seconds.
> 
> I also wish there were a way to record that the process which is supposed
> to perform the watchdog keep-alive operation was unexpectedly blocked for
> many seconds in a memory allocation. My gfp_start patch serves that
> purpose.
> 
> > > but I think we need to be prepared for cases where sending SIGKILL to
> > > all threads sharing the same memory does not help.
> > 
> > Sure, unkillable tasks are a problem which we have to handle. Having
> > GFP_KERNEL allocations looping without a way out contributes to this,
> > which is sad, but your current data just show that sometimes it might
> > take ages to finish even without that going on.
> 
> Can't we replace mutex_lock() / wait_for_completion() with killable
> versions where it is safe (in order to reduce the number of unkillable
> wait sites)? I think replacing mutex_lock() in
> xfs_file_buffered_aio_write() with the killable version is possible,
> because data written by a buffered write is not guaranteed to be flushed
> until sync() / fsync() / fdatasync() returns.
> 
> And can't we detect unkillable TIF_MEMDIE tasks (e.g. by checking the
> task's ->state some time after TIF_MEMDIE was set)? My
> sysctl_memdie_timeout_jiffies patch serves that purpose.
> 

I was testing the patch below on the current linux.git tree. To my
surprise, I can no longer reproduce the "stall by a.out + b.out" case,
because setting TIF_MEMDIE on all threads sharing the same memory
(without granting access to memory reserves) made it possible to resolve
the stalled state immediately (console log is at
http://I-love.SAKURA.ne.jp/tmp/serial-20141231-ab.txt.xz ). Given that
low-order (<= PAGE_ALLOC_COSTLY_ORDER) allocations are allowed to fail
immediately upon OOM, maybe we can let ongoing memory allocations fail
without granting access to memory reserves?
----------------------------------------
>From 9212fb2bc96579c0dd0e1f121f5c089c683e12c0 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 31 Dec 2014 17:50:24 +0900
Subject: [RFC PATCH] oom: Introduce sysctl-tunable MEMDIE timeout.

When there is a thread with the TIF_MEMDIE flag set, the OOM killer is
disabled. However, a victim process containing that thread could get
stuck due to a dependency which is invisible to the OOM killer. As a
result, the system will stall for an unpredictable duration, because the
OOM killer is kept disabled while one of the threads in the victim
process is stuck. This situation is easily reproduced by multi-threaded
programs where thread1 tries to allocate memory while thread2 tries to
perform a file I/O operation. The OOM killer sets the TIF_MEMDIE flag on
thread1 only, but the thread which really needs the TIF_MEMDIE flag (the
one blocking thread2 via an unkillable wait, e.g. mutex_lock() on
"struct inode"->i_mutex) can be a thread3 doing a memory allocation. And
thread3 can be outside of the victim process containing thread1.

But in order to avoid depletion of memory reserves via the TIF_MEMDIE
flag, we don't want to set the TIF_MEMDIE flag on all threads which
might be preventing thread2 from terminating. Moreover, we can't know
which threads are holding the lock which thread2 depends on.

While converting unkillable waits (e.g. mutex_lock()) to killable waits
(e.g. mutex_lock_killable()) helps thread2 die quickly (upon SIGKILL not
only from the OOM killer but also from user operations), we can't afford
to convert all unkillable waits. So, we want to be prepared for
unkillable threads anyway.

This patch does the following things.

  (1) Let ongoing memory allocations fail without access to memory
      reserves via the TIF_MEMDIE flag.
  (2) Let the OOM killer set the TIF_MEMDIE flag on all threads sharing
      the same memory.
  (3) Let the OOM killer record the current time when setting the
      TIF_MEMDIE flag.
  (4) Let the OOM killer treat threads which did not die within a
      sysctl-tunable timeout as unkillable.

We can avoid depletion of memory reserves via the TIF_MEMDIE flag by (1).
While (1) might retard termination of thread1 in cases where allowing
access to memory reserves would help the victim process containing
thread1 die quickly, (4) prevents thread1 from being stuck forever, by
killing other threads after the timeout.

If the OOM killer cannot find threads to kill after the timeout,
something is absolutely wrong. Therefore, a kernel panic followed by an
automatic reboot (with kdump optionally taken for analyzing the cause)
should be OK.

(4) introduces the /proc/sys/vm/memdie_task_{skip|panic}_secs interfaces,
which control the timeout for waiting for threads with the TIF_MEMDIE
flag set. When the timeout expires, the former enables the OOM killer
again and the latter triggers a kernel panic.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/oom.h   |  3 ++
 include/linux/sched.h |  1 +
 kernel/cpuset.c       |  5 ++--
 kernel/exit.c         |  1 +
 kernel/sysctl.c       | 19 +++++++++++++
 mm/oom_kill.c         | 77 ++++++++++++++++++++++++++++++++++++++++++++-------
 mm/page_alloc.c       |  4 +--
 7 files changed, 95 insertions(+), 15 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 853698c..642e4ae 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -68,6 +68,7 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 		unsigned long totalpages, const nodemask_t *nodemask,
 		bool force_kill);
 
+extern bool is_killable_memdie_task(struct task_struct *p);
 extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *mask, bool force_kill);
 extern int register_oom_notifier(struct notifier_block *nb);
@@ -107,4 +108,6 @@ static inline bool task_will_free_mem(struct task_struct *task)
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
 extern int sysctl_panic_on_oom;
+extern unsigned long sysctl_memdie_task_skip_secs;
+extern unsigned long sysctl_memdie_task_panic_secs;
 #endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8db31ef..58ad56a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1701,6 +1701,7 @@ struct task_struct {
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 	unsigned long	task_state_change;
 #endif
+	unsigned long memdie_start;
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 64b257f..aea9d712 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -35,6 +35,7 @@
 #include <linux/kmod.h>
 #include <linux/list.h>
 #include <linux/mempolicy.h>
+#include <linux/oom.h>
 #include <linux/mm.h>
 #include <linux/memory.h>
 #include <linux/export.h>
@@ -1008,7 +1009,7 @@ static void cpuset_change_task_nodemask(struct task_struct *tsk,
 	 * Allow tasks that have access to memory reserves because they have
 	 * been OOM killed to get memory anywhere.
 	 */
-	if (unlikely(test_thread_flag(TIF_MEMDIE)))
+	if (unlikely(is_killable_memdie_task(current)))
 		return;
 	if (current->flags & PF_EXITING) /* Let dying task have memory */
 		return;
@@ -2515,7 +2516,7 @@ int __cpuset_node_allowed(int node, gfp_t gfp_mask)
 	 * Allow tasks that have access to memory reserves because they have
 	 * been OOM killed to get memory anywhere.
 	 */
-	if (unlikely(test_thread_flag(TIF_MEMDIE)))
+	if (unlikely(is_killable_memdie_task(current)))
 		return 1;
 	if (gfp_mask & __GFP_HARDWALL)	/* If hardwall request, stop here */
 		return 0;
diff --git a/kernel/exit.c b/kernel/exit.c
index 1ea4369..de5efe5 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -436,6 +436,7 @@ static void exit_mm(struct task_struct *tsk)
 	mm_update_next_owner(mm);
 	mmput(mm);
 	clear_thread_flag(TIF_MEMDIE);
+	current->memdie_start = 0;
 }
 
 static struct task_struct *find_alive_thread(struct task_struct *p)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 137c7f6..dab9b31 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -145,6 +145,9 @@ static const int cap_last_cap = CAP_LAST_CAP;
 static unsigned long hung_task_timeout_max = (LONG_MAX/HZ);
 #endif
 
+/* Used by proc_doulongvec_minmax of sysctl_memdie_task_*_secs */
+static unsigned long memdie_task_timeout_max = (LONG_MAX/HZ);
+
 #ifdef CONFIG_INOTIFY_USER
 #include <linux/inotify.h>
 #endif
@@ -1502,6 +1505,22 @@ static struct ctl_table vm_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_doulongvec_minmax,
 	},
+	{
+		.procname	= "memdie_task_skip_secs",
+		.data		= &sysctl_memdie_task_skip_secs,
+		.maxlen		= sizeof(sysctl_memdie_task_skip_secs),
+		.mode		= 0644,
+		.proc_handler	= proc_doulongvec_minmax,
+		.extra2		= &memdie_task_timeout_max,
+	},
+	{
+		.procname	= "memdie_task_panic_secs",
+		.data		= &sysctl_memdie_task_panic_secs,
+		.maxlen		= sizeof(sysctl_memdie_task_panic_secs),
+		.mode		= 0644,
+		.proc_handler	= proc_doulongvec_minmax,
+		.extra2		= &memdie_task_timeout_max,
+	},
 	{ }
 };
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d503e9c..dbff730 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -43,6 +43,8 @@ int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks = 1;
 static DEFINE_SPINLOCK(zone_scan_lock);
+unsigned long sysctl_memdie_task_skip_secs;
+unsigned long sysctl_memdie_task_panic_secs;
 
 #ifdef CONFIG_NUMA
 /**
@@ -117,6 +119,61 @@ found:
 	return t;
 }
 
+/**
+ * set_memdie_flag - set TIF_MEMDIE flag and record current time.
+ * @p: Pointer to "struct task_struct".
+ */
+static void set_memdie_flag(struct task_struct *p)
+{
+	/* For avoiding race condition, current time must not be 0. */
+	if (!p->memdie_start) {
+		const unsigned long start = jiffies;
+
+		p->memdie_start = start ? start : 1;
+	}
+	set_tsk_thread_flag(p, TIF_MEMDIE);
+}
+
+/**
+ * is_killable_memdie_task - check task is not stuck with TIF_MEMDIE flag set.
+ * @p: Pointer to "struct task_struct".
+ *
+ * Setting TIF_MEMDIE flag to @p disables the OOM killer. However, @p could get
+ * stuck due to dependency which is invisible to the OOM killer. When @p got
+ * stuck, the system will stall for unpredictable duration (presumably forever)
+ * because the OOM killer is kept disabled.
+ *
+ * If @p remained stuck for /proc/sys/vm/memdie_task_skip_secs seconds, this
+ * function returns false as if TIF_MEMDIE flag was not set to @p. As a result,
+ * the OOM killer will try to find other killable processes at the risk of
+ * kernel panic when there is no other killable processes.
+ * If @p remained stuck for /proc/sys/vm/memdie_task_panic_secs seconds, this
+ * function triggers kernel panic (for optionally taking vmcore for analysis).
+ * Setting 0 to these interfaces disables this check.
+ */
+bool is_killable_memdie_task(struct task_struct *p)
+{
+	unsigned long start, timeout;
+
+	/* If task does not have TIF_MEMDIE flag, there is nothing to do.*/
+	if (!test_tsk_thread_flag(p, TIF_MEMDIE))
+		return false;
+	/* Handle cases where TIF_MEMDIE was set outside of this file. */
+	start = p->memdie_start;
+	if (!start) {
+		set_memdie_flag(p);
+		return true;
+	}
+	/* Trigger kernel panic after timeout. */
+	timeout = sysctl_memdie_task_panic_secs;
+	if (timeout && time_after(jiffies, start + timeout * HZ))
+		panic("Out of memory: %s (%d) did not die within %lu seconds.\n",
+		      p->comm, p->pid, timeout);
+	/* Return true before timeout. */
+	timeout = sysctl_memdie_task_skip_secs;
+	return !timeout || time_before(jiffies, start + timeout * HZ);
+}
+
 /* return true if the task is not adequate as candidate victim task. */
 static bool oom_unkillable_task(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask)
@@ -134,7 +191,7 @@ static bool oom_unkillable_task(struct task_struct *p,
 	if (!has_intersects_mems_allowed(p, nodemask))
 		return true;
 
-	return false;
+	return is_killable_memdie_task(p);
 }
 
 /**
@@ -439,7 +496,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
 	if (task_will_free_mem(p)) {
-		set_tsk_thread_flag(p, TIF_MEMDIE);
+		set_memdie_flag(p);
 		put_task_struct(p);
 		return;
 	}
@@ -500,12 +557,11 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 
 	/*
 	 * Kill all user processes sharing victim->mm in other thread groups, if
-	 * any.  They don't get access to memory reserves, though, to avoid
-	 * depletion of all memory.  This prevents mm->mmap_sem livelock when an
-	 * oom killed thread cannot exit because it requires the semaphore and
-	 * its contended by another thread trying to allocate memory itself.
-	 * That thread will now get access to memory reserves since it has a
-	 * pending fatal signal.
+	 * any. This mitigates mm->mmap_sem livelock when an oom killed thread
+	 * cannot exit because it requires the semaphore and its contended by
+	 * another thread trying to allocate memory itself. Note that this does
+	 * not help if the contended process does not share victim->mm. In that
+	 * case, is_killable_memdie_task() will detect it and take actions.
 	 */
 	rcu_read_lock();
 	for_each_process(p)
@@ -518,11 +574,12 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			pr_err("Kill process %d (%s) sharing same memory\n",
 				task_pid_nr(p), p->comm);
 			task_unlock(p);
+			set_memdie_flag(p);
 			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 		}
 	rcu_read_unlock();
 
-	set_tsk_thread_flag(victim, TIF_MEMDIE);
+	set_memdie_flag(victim);
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 	put_task_struct(victim);
 }
@@ -645,7 +702,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 * quickly exit and free its memory.
 	 */
 	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
-		set_thread_flag(TIF_MEMDIE);
+		set_memdie_flag(current);
 		return;
 	}
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7633c50..3799139 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2603,9 +2603,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 		else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
-		else if (!in_interrupt() &&
-				((current->flags & PF_MEMALLOC) ||
-				 unlikely(test_thread_flag(TIF_MEMDIE))))
+		else if (!in_interrupt() && (current->flags & PF_MEMALLOC))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
 #ifdef CONFIG_CMA
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2014-12-30 11:21                         ` Michal Hocko
  2014-12-30 13:33                           ` Tetsuo Handa
@ 2015-02-09 11:44                           ` Tetsuo Handa
  2015-02-10 13:58                             ` Tetsuo Handa
  2015-02-17 14:37                             ` Michal Hocko
  2015-02-16 11:23                           ` Tetsuo Handa
  2 siblings, 2 replies; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-09 11:44 UTC (permalink / raw)
  To: mhocko
  Cc: david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, hannes,
	torvalds

Hello.

Today I tested Linux 3.19 and noticed the unexpected behaviors (A) and
(B) shown below.

(A) The order-0 __GFP_WAIT allocation fails immediately under OOM
    conditions even though we didn't remove the

        /*
         * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
         * means __GFP_NOFAIL, but that may not be true in other
         * implementations.
         */
        if (order <= PAGE_ALLOC_COSTLY_ORDER)
                return 1;

    check in should_alloc_retry(). Is this what you expected?

(B) When coredump to a pipe is configured, the system stalls under OOM
    conditions due to memory allocation on the coredump's reader side.
    How should we handle this "expected to terminate shortly but unable
    to terminate due to an invisible dependency" case? What approaches
    other than applying a timeout on the coredump's writer side are
    possible? (Running inside a memory cgroup is not the answer I want.)
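
    ("Coredump to pipe" here means a kernel.core_pattern beginning with
    '|', which makes the kernel feed the core image to a userspace
    helper; in the logs below that helper is abrt-hook-ccpp. A
    hypothetical example, with a made-up helper path:

        kernel.core_pattern = |/usr/local/bin/core-helper %p %s

    where %p expands to the PID and %s to the signal number.)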

Console log is at http://I-love.SAKURA.ne.jp/tmp/serial-20150209.txt.xz
and kernel config is at http://I-love.SAKURA.ne.jp/tmp/config-3.19 .

To reproduce these behaviors, you can run the reproducer program shown
below on a system with 4 CPUs / 2GB RAM / no swap. (A too-small stack is
passed to clone() because I did so by mistake when trying to reproduce
OOM-stall situations caused by memory allocations inside unkillable
down_write("struct mm_struct"->mmap_sem) calls.)

---------- reproducer program start ----------
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/mman.h>

static int file_mapper(void *unused)
{
	const int fd = open("/proc/self/exe", O_RDONLY);
	void *ptr[10000]; /* Will cause SIGSEGV due to stack overflow */
	int i;
	while (1) {
		for (i = 0; i < 10000; i++)
			ptr[i] = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd,
				      0);
		for (i = 0; i < 10000; i++)
			munmap(ptr[i], 4096);
	}
	return 0;
}

static void child(void)
{
	const int fd = open("/proc/self/oom_score_adj", O_WRONLY);
	int i;
	write(fd, "999", 3);
	close(fd);
	for (i = 0; i < 10; i++) {
		char *cp = malloc(4 * 1024);
		if (!cp || clone(file_mapper, cp + 4 * 1024,
				 CLONE_SIGHAND | CLONE_VM, NULL) == -1)
			break;
	}
	while (1)
		pause();
}

static void memory_consumer(void)
{
	const int fd = open("/dev/zero", O_RDONLY);
	unsigned long size;
	char *buf = NULL;
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	while (1)
		read(fd, buf, size); /* Will cause OOM due to overcommit */
}

int main(int argc, char *argv[])
{
	if (fork() == 0)
		child();
	memory_consumer();
	return 0;
}
---------- reproducer program end ----------
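
(For reference: the program is self-contained; building it with e.g.
"cc -o a.out reproducer.c" (file name arbitrary) and running ./a.out as
an unprivileged user on the machine described above should be enough,
since child() raises its own oom_score_adj to 999 and memory_consumer()
keeps faulting in an overcommitted anonymous buffer until the OOM killer
fires.)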

Logs for (A)

[   98.933472] kworker/1:2: page allocation failure: order:0, mode:0x10
[   98.935374] CPU: 1 PID: 363 Comm: kworker/1:2 Not tainted 3.19.0 #329
[   98.937271] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[   98.940026] Workqueue: events_freezable_power_ disk_events_workfn
[   98.942084]  0000000000000000 00000000f967a090 0000000000000000 ffffffff81576f4e
[   98.944511]  0000000000000010 ffffffff8110d26e ffff88007fffdb00 0000000000000000
[   98.946873]  0000000236945e30 0000000000000002 0000000000000000 00000000f967a090
[   98.949121] Call Trace:
[   98.950318]  [<ffffffff81576f4e>] ? dump_stack+0x40/0x50
[   98.952054]  [<ffffffff8110d26e>] ? warn_alloc_failed+0xee/0x150
[   98.953935]  [<ffffffff811108e2>] ? __alloc_pages_nodemask+0x6a2/0xa70
[   98.955912]  [<ffffffff811501d7>] ? alloc_pages_current+0x87/0x100
[   98.957812]  [<ffffffff812467c6>] ? bio_copy_user_iov+0x1c6/0x380
[   98.959709]  [<ffffffff81246a1a>] ? bio_copy_kern+0x4a/0xf0
[   98.961518]  [<ffffffff8125053a>] ? blk_rq_map_kern+0x6a/0x150
[   98.963346]  [<ffffffff8124a856>] ? blk_get_request+0x76/0x120
[   98.965208]  [<ffffffff8139d39c>] ? scsi_execute+0x12c/0x160
[   98.967093]  [<ffffffff8139d4ab>] ? scsi_execute_req_flags+0x8b/0x100
[   98.969088]  [<ffffffffa01fca20>] ? sr_check_events+0xc0/0x300 [sr_mod]
[   98.971076]  [<ffffffff81579152>] ? __schedule+0x272/0x760
[   98.972838]  [<ffffffffa01f017f>] ? cdrom_check_events+0xf/0x30 [cdrom]
[   98.974856]  [<ffffffff8125a5ba>] ? disk_check_events+0x5a/0x1e0
[   98.976753]  [<ffffffff8107b0b1>] ? process_one_work+0x131/0x360
[   98.978650]  [<ffffffff8107b863>] ? worker_thread+0x113/0x590
[   98.980489]  [<ffffffff8107b750>] ? rescuer_thread+0x470/0x470
[   98.982330]  [<ffffffff810804d1>] ? kthread+0xd1/0xf0
[   98.984068]  [<ffffffff81080400>] ? kthread_create_on_node+0x190/0x190
[   98.986049]  [<ffffffff8157d27c>] ? ret_from_fork+0x7c/0xb0
[   98.987845]  [<ffffffff81080400>] ? kthread_create_on_node+0x190/0x190

[  101.495212] kworker/1:2: page allocation failure: order:0, mode:0x10
[  101.497410] CPU: 1 PID: 363 Comm: kworker/1:2 Not tainted 3.19.0 #329
[  101.499581] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  101.502603] Workqueue: events_freezable_power_ disk_events_workfn
[  101.504775]  0000000000000000 00000000f967a090 0000000000000000 ffffffff81576f4e
[  101.507283]  0000000000000010 ffffffff8110d26e ffff88007fffdb00 0000000000000000
[  101.509800]  0000000236945e30 0000000000000002 0000000000000000 00000000f967a090
[  101.512324] Call Trace:
[  101.513767]  [<ffffffff81576f4e>] ? dump_stack+0x40/0x50
[  101.515748]  [<ffffffff8110d26e>] ? warn_alloc_failed+0xee/0x150
[  101.517897]  [<ffffffff811108e2>] ? __alloc_pages_nodemask+0x6a2/0xa70
[  101.520140]  [<ffffffff811501d7>] ? alloc_pages_current+0x87/0x100
[  101.522352]  [<ffffffff812467c6>] ? bio_copy_user_iov+0x1c6/0x380
[  101.524534]  [<ffffffff81246a1a>] ? bio_copy_kern+0x4a/0xf0
[  101.526619]  [<ffffffff8125053a>] ? blk_rq_map_kern+0x6a/0x150
[  101.528743]  [<ffffffff8124a856>] ? blk_get_request+0x76/0x120
[  101.530870]  [<ffffffff8139d39c>] ? scsi_execute+0x12c/0x160
[  101.532971]  [<ffffffff8139d4ab>] ? scsi_execute_req_flags+0x8b/0x100
[  101.535250]  [<ffffffffa01fca20>] ? sr_check_events+0xc0/0x300 [sr_mod]
[  101.537641]  [<ffffffff81579152>] ? __schedule+0x272/0x760
[  101.539713]  [<ffffffffa01f017f>] ? cdrom_check_events+0xf/0x30 [cdrom]
[  101.542015]  [<ffffffff8125a5ba>] ? disk_check_events+0x5a/0x1e0
[  101.544189]  [<ffffffff8107b0b1>] ? process_one_work+0x131/0x360
[  101.546370]  [<ffffffff8107b863>] ? worker_thread+0x113/0x590
[  101.548488]  [<ffffffff8107b750>] ? rescuer_thread+0x470/0x470
[  101.550575]  [<ffffffff810804d1>] ? kthread+0xd1/0xf0
[  101.552492]  [<ffffffff81080400>] ? kthread_create_on_node+0x190/0x190
[  101.554657]  [<ffffffff8157d27c>] ? ret_from_fork+0x7c/0xb0
[  101.556628]  [<ffffffff81080400>] ? kthread_create_on_node+0x190/0x190

[  104.052500] kworker/1:2: page allocation failure: order:0, mode:0x10
[  104.054694] CPU: 1 PID: 363 Comm: kworker/1:2 Not tainted 3.19.0 #329
[  104.056897] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  104.059887] Workqueue: events_freezable_power_ disk_events_workfn
[  104.062061]  0000000000000000 00000000f967a090 0000000000000000 ffffffff81576f4e
[  104.064611]  0000000000000010 ffffffff8110d26e ffff88007fffdb00 0000000000000000
[  104.067119]  0000000236945e30 0000000000000002 0000000000000000 00000000f967a090
[  104.069657] Call Trace:
[  104.071074]  [<ffffffff81576f4e>] ? dump_stack+0x40/0x50
[  104.073080]  [<ffffffff8110d26e>] ? warn_alloc_failed+0xee/0x150
[  104.075194]  [<ffffffff811108e2>] ? __alloc_pages_nodemask+0x6a2/0xa70
[  104.077424]  [<ffffffff811501d7>] ? alloc_pages_current+0x87/0x100
[  104.079626]  [<ffffffff812467c6>] ? bio_copy_user_iov+0x1c6/0x380
[  104.081800]  [<ffffffff81246a1a>] ? bio_copy_kern+0x4a/0xf0
[  104.083868]  [<ffffffff8125053a>] ? blk_rq_map_kern+0x6a/0x150
[  104.085988]  [<ffffffff8124a856>] ? blk_get_request+0x76/0x120
[  104.088119]  [<ffffffff8139d39c>] ? scsi_execute+0x12c/0x160
[  104.090206]  [<ffffffff8139d4ab>] ? scsi_execute_req_flags+0x8b/0x100
[  104.092497]  [<ffffffffa01fca20>] ? sr_check_events+0xc0/0x300 [sr_mod]
[  104.094781]  [<ffffffff81579152>] ? __schedule+0x272/0x760
[  104.096843]  [<ffffffffa01f017f>] ? cdrom_check_events+0xf/0x30 [cdrom]
[  104.099147]  [<ffffffff8125a5ba>] ? disk_check_events+0x5a/0x1e0
[  104.101306]  [<ffffffff8107b0b1>] ? process_one_work+0x131/0x360
[  104.103470]  [<ffffffff8107b863>] ? worker_thread+0x113/0x590
[  104.105600]  [<ffffffff8107b750>] ? rescuer_thread+0x470/0x470
[  104.107710]  [<ffffffff810804d1>] ? kthread+0xd1/0xf0
[  104.109607]  [<ffffffff81080400>] ? kthread_create_on_node+0x190/0x190
[  104.111781]  [<ffffffff8157d27c>] ? ret_from_fork+0x7c/0xb0
[  104.113733]  [<ffffffff81080400>] ? kthread_create_on_node+0x190/0x190

[  106.608783] kworker/1:2: page allocation failure: order:0, mode:0x10
[  106.610960] CPU: 1 PID: 363 Comm: kworker/1:2 Not tainted 3.19.0 #329
[  106.613123] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  106.616159] Workqueue: events_freezable_power_ disk_events_workfn
[  106.618337]  0000000000000000 00000000f967a090 0000000000000000 ffffffff81576f4e
[  106.621153]  0000000000000010 ffffffff8110d26e ffff88007fffdb00 0000000000000000
[  106.623823]  0000000236945e30 0000000000000002 0000000000000000 00000000f967a090
[  106.626386] Call Trace:
[  106.627810]  [<ffffffff81576f4e>] ? dump_stack+0x40/0x50
[  106.629800]  [<ffffffff8110d26e>] ? warn_alloc_failed+0xee/0x150
[  106.632128]  [<ffffffff811108e2>] ? __alloc_pages_nodemask+0x6a2/0xa70
[  106.634460]  [<ffffffff811501d7>] ? alloc_pages_current+0x87/0x100
[  106.636638]  [<ffffffff812467c6>] ? bio_copy_user_iov+0x1c6/0x380
[  106.638856]  [<ffffffff81246a1a>] ? bio_copy_kern+0x4a/0xf0
[  106.640929]  [<ffffffff8125053a>] ? blk_rq_map_kern+0x6a/0x150
[  106.643053]  [<ffffffff8124a856>] ? blk_get_request+0x76/0x120
[  106.645209]  [<ffffffff8139d39c>] ? scsi_execute+0x12c/0x160
[  106.647293]  [<ffffffff8139d4ab>] ? scsi_execute_req_flags+0x8b/0x100
[  106.649573]  [<ffffffffa01fca20>] ? sr_check_events+0xc0/0x300 [sr_mod]
[  106.651921]  [<ffffffff81579152>] ? __schedule+0x272/0x760
[  106.654008]  [<ffffffffa01f017f>] ? cdrom_check_events+0xf/0x30 [cdrom]
[  106.656297]  [<ffffffff8125a5ba>] ? disk_check_events+0x5a/0x1e0
[  106.658466]  [<ffffffff8107b0b1>] ? process_one_work+0x131/0x360
[  106.660610]  [<ffffffff8107b863>] ? worker_thread+0x113/0x590
[  106.662744]  [<ffffffff8107b750>] ? rescuer_thread+0x470/0x470
[  106.664849]  [<ffffffff810804d1>] ? kthread+0xd1/0xf0
[  106.666759]  [<ffffffff81080400>] ? kthread_create_on_node+0x190/0x190
[  106.668930]  [<ffffffff8157d27c>] ? ret_from_fork+0x7c/0xb0
[  106.670889]  [<ffffffff81080400>] ? kthread_create_on_node+0x190/0x190

Logs for (B)

[  145.078502] a.out           S ffff88007fc92d00     0  2643   2641 0x00000080
[  145.078503]  ffff88003681c480 0000000000012d00 ffff88007a51bfd8 0000000000012d00
[  145.078504]  ffff88003681c480 ffff88003681c480 000200d20000000f 0000000000000001
[  145.078504]  ffff88003681c480 ffff88003681c480 00007fb700000001 ffff88007adcc508
[  145.078505] Call Trace:
[  145.078506]  [<ffffffff8112af4e>] ? copy_from_iter+0x10e/0x2d0
[  145.078507]  [<ffffffff8112af4e>] ? copy_from_iter+0x10e/0x2d0
[  145.078508]  [<ffffffff8117ba17>] ? pipe_wait+0x67/0xb0
[  145.078509]  [<ffffffff8109ced0>] ? wait_woken+0x90/0x90
[  145.078510]  [<ffffffff8117bb48>] ? pipe_write+0x88/0x450
[  145.078511]  [<ffffffff811732a3>] ? new_sync_write+0x83/0xd0
[  145.078512]  [<ffffffff81173417>] ? __kernel_write+0x57/0x140
[  145.078513]  [<ffffffff811c615e>] ? dump_emit+0x8e/0xd0
[  145.078515]  [<ffffffff811c002f>] ? elf_core_dump+0x146f/0x15d0
[  145.078516]  [<ffffffff811c6a09>] ? do_coredump+0x769/0xe80
[  145.078517]  [<ffffffff8101634d>] ? native_sched_clock+0x2d/0x80
[  145.078518]  [<ffffffff8106fd2b>] ? __send_signal+0x16b/0x3a0
[  145.078520]  [<ffffffff810717f2>] ? get_signal+0x192/0x770
[  145.078521]  [<ffffffff8100d451>] ? do_signal+0x31/0x6d0
[  145.078522]  [<ffffffff8100db5c>] ? do_notify_resume+0x6c/0x90
[  145.078523]  [<ffffffff8157e022>] ? retint_signal+0x48/0x86

[  145.078625] abrt-hook-ccpp  D 0000000000000002     0  2650    347 0x00000080
[  145.078626]  ffff88007b364d10 0000000000012d00 ffff88007ae3ffd8 0000000000012d00
[  145.078627]  ffff88007b364d10 ffff88007fffc000 ffffffff8111a6a5 0000000000000000
[  145.078628]  0000000000000000 000088007ae3f9e8 ffff88007b364d10 ffffffff81015df5
[  145.078628] Call Trace:
[  145.078629]  [<ffffffff8111a6a5>] ? shrink_zone+0x105/0x2a0
[  145.078630]  [<ffffffff81015df5>] ? read_tsc+0x5/0x10
[  145.078631]  [<ffffffff810c0270>] ? ktime_get+0x30/0x90
[  145.078632]  [<ffffffff810f73b9>] ? delayacct_end+0x39/0x70
[  145.078633]  [<ffffffff8111ae45>] ? do_try_to_free_pages+0x3e5/0x480
[  145.078634]  [<ffffffff8157c013>] ? schedule_timeout+0x113/0x1b0
[  145.078635]  [<ffffffff810b9800>] ? migrate_timer_list+0x60/0x60
[  145.078636]  [<ffffffff811109ee>] ? __alloc_pages_nodemask+0x7ae/0xa70
[  145.078638]  [<ffffffff811501d7>] ? alloc_pages_current+0x87/0x100
[  145.078640]  [<ffffffff8110a240>] ? filemap_fault+0x1c0/0x400
[  145.078641]  [<ffffffff8112e7c6>] ? __do_fault+0x46/0xd0
[  145.078642]  [<ffffffff81131128>] ? do_read_fault.isra.62+0x228/0x310
[  145.078643]  [<ffffffff8113380e>] ? handle_mm_fault+0x7ae/0x10e0
[  145.078644]  [<ffffffff81138145>] ? vma_set_page_prot+0x35/0x60
[  145.078645]  [<ffffffff8105194e>] ? __do_page_fault+0x17e/0x540
[  145.078646]  [<ffffffff811399ac>] ? do_mmap_pgoff+0x33c/0x3f0
[  145.078647]  [<ffffffff8112180b>] ? vm_mmap_pgoff+0xbb/0xf0
[  145.078648]  [<ffffffff81051d40>] ? do_page_fault+0x30/0x70
[  145.078649]  [<ffffffff8157ed38>] ? page_fault+0x28/0x30

[  232.113394] a.out           S ffff88007fc92d00     0  2643   2641 0x00000080
[  232.115926]  ffff88003681c480 0000000000012d00 ffff88007a51bfd8 0000000000012d00
[  232.118630]  ffff88003681c480 ffff88003681c480 000200d20000000f 0000000000000001
[  232.121312]  ffff88003681c480 ffff88003681c480 00007fb700000001 ffff88007adcc508
[  232.124004] Call Trace:
[  232.125242]  [<ffffffff8112af4e>] ? copy_from_iter+0x10e/0x2d0
[  232.127506]  [<ffffffff8112af4e>] ? copy_from_iter+0x10e/0x2d0
[  232.129972]  [<ffffffff8117ba17>] ? pipe_wait+0x67/0xb0
[  232.131960]  [<ffffffff8109ced0>] ? wait_woken+0x90/0x90
[  232.133928]  [<ffffffff8117bb48>] ? pipe_write+0x88/0x450
[  232.135901]  [<ffffffff811732a3>] ? new_sync_write+0x83/0xd0
[  232.137956]  [<ffffffff81173417>] ? __kernel_write+0x57/0x140
[  232.140033]  [<ffffffff811c615e>] ? dump_emit+0x8e/0xd0
[  232.141958]  [<ffffffff811c002f>] ? elf_core_dump+0x146f/0x15d0
[  232.144161]  [<ffffffff811c6a09>] ? do_coredump+0x769/0xe80
[  232.146178]  [<ffffffff8101634d>] ? native_sched_clock+0x2d/0x80
[  232.148343]  [<ffffffff8106fd2b>] ? __send_signal+0x16b/0x3a0
[  232.150441]  [<ffffffff810717f2>] ? get_signal+0x192/0x770
[  232.152468]  [<ffffffff8100d451>] ? do_signal+0x31/0x6d0
[  232.154441]  [<ffffffff8100db5c>] ? do_notify_resume+0x6c/0x90
[  232.156552]  [<ffffffff8157e022>] ? retint_signal+0x48/0x86

[  232.340460] abrt-hook-ccpp  D 0000000000000002     0  2650    347 0x00000080
[  232.343038]  ffff88007b364d10 0000000000012d00 ffff88007ae3ffd8 0000000000012d00
[  232.345779]  ffff88007b364d10 ffff88007fffc000 ffffffff8111a6a5 0000000000000000
[  232.348626]  0000000000000000 000088007ae3f9e8 ffff88007b364d10 ffffffff81015df5
[  232.351400] Call Trace:
[  232.352798]  [<ffffffff8111a6a5>] ? shrink_zone+0x105/0x2a0
[  232.355177]  [<ffffffff81015df5>] ? read_tsc+0x5/0x10
[  232.357260]  [<ffffffff810c0270>] ? ktime_get+0x30/0x90
[  232.359321]  [<ffffffff810f73b9>] ? delayacct_end+0x39/0x70
[  232.361597]  [<ffffffff8111ae45>] ? do_try_to_free_pages+0x3e5/0x480
[  232.364151]  [<ffffffff81089ac1>] ? try_to_wake_up+0x221/0x2b0
[  232.366364]  [<ffffffff8110af07>] ? oom_badness+0x17/0x130
[  232.368410]  [<ffffffff8109ced9>] ? autoremove_wake_function+0x9/0x30
[  232.370694]  [<ffffffff8157992f>] ? _cond_resched+0x1f/0x40
[  232.372765]  [<ffffffff811106d0>] ? __alloc_pages_nodemask+0x490/0xa70
[  232.375082]  [<ffffffff811501d7>] ? alloc_pages_current+0x87/0x100
[  232.377416]  [<ffffffff8110a240>] ? filemap_fault+0x1c0/0x400
[  232.379542]  [<ffffffff8112e7c6>] ? __do_fault+0x46/0xd0
[  232.381624]  [<ffffffff81131128>] ? do_read_fault.isra.62+0x228/0x310
[  232.383984]  [<ffffffff8113380e>] ? handle_mm_fault+0x7ae/0x10e0
[  232.386198]  [<ffffffff81138145>] ? vma_set_page_prot+0x35/0x60
[  232.388386]  [<ffffffff8105194e>] ? __do_page_fault+0x17e/0x540
[  232.390592]  [<ffffffff811399ac>] ? do_mmap_pgoff+0x33c/0x3f0
[  232.392762]  [<ffffffff8112180b>] ? vm_mmap_pgoff+0xbb/0xf0
[  232.395259]  [<ffffffff81051d40>] ? do_page_fault+0x30/0x70
[  232.397472]  [<ffffffff8157ed38>] ? page_fault+0x28/0x30

[  328.225954] a.out           S ffff88007fc92d00     0  2643   2641 0x00000080
[  328.228262]  ffff88003681c480 0000000000012d00 ffff88007a51bfd8 0000000000012d00
[  328.230731]  ffff88003681c480 ffff88003681c480 000200d20000000f 0000000000000001
[  328.233188]  ffff88003681c480 ffff88003681c480 00007fb700000001 ffff88007adcc508
[  328.235701] Call Trace:
[  328.236851]  [<ffffffff8112af4e>] ? copy_from_iter+0x10e/0x2d0
[  328.238826]  [<ffffffff8112af4e>] ? copy_from_iter+0x10e/0x2d0
[  328.240792]  [<ffffffff8117ba17>] ? pipe_wait+0x67/0xb0
[  328.242598]  [<ffffffff8109ced0>] ? wait_woken+0x90/0x90
[  328.244426]  [<ffffffff8117bb48>] ? pipe_write+0x88/0x450
[  328.246284]  [<ffffffff811732a3>] ? new_sync_write+0x83/0xd0
[  328.248208]  [<ffffffff81173417>] ? __kernel_write+0x57/0x140
[  328.250159]  [<ffffffff811c615e>] ? dump_emit+0x8e/0xd0
[  328.251967]  [<ffffffff811c002f>] ? elf_core_dump+0x146f/0x15d0
[  328.253930]  [<ffffffff811c6a09>] ? do_coredump+0x769/0xe80
[  328.255811]  [<ffffffff8101634d>] ? native_sched_clock+0x2d/0x80
[  328.257806]  [<ffffffff8106fd2b>] ? __send_signal+0x16b/0x3a0
[  328.259714]  [<ffffffff810717f2>] ? get_signal+0x192/0x770
[  328.261552]  [<ffffffff8100d451>] ? do_signal+0x31/0x6d0
[  328.263369]  [<ffffffff8100db5c>] ? do_notify_resume+0x6c/0x90
[  328.265292]  [<ffffffff8157e022>] ? retint_signal+0x48/0x86

[  328.444215] abrt-hook-ccpp  D 0000000000000002     0  2650    347 0x00000080
[  328.446549]  ffff88007b364d10 0000000000012d00 ffff88007ae3ffd8 0000000000012d00
[  328.449029]  ffff88007b364d10 ffff88007fffc000 ffffffff8111a6a5 0000000000000000
[  328.451689]  0000000000000000 000088007ae3f9e8 ffff88007b364d10 ffffffff81015df5
[  328.454187] Call Trace:
[  328.455408]  [<ffffffff8111a6a5>] ? shrink_zone+0x105/0x2a0
[  328.457406]  [<ffffffff81015df5>] ? read_tsc+0x5/0x10
[  328.459289]  [<ffffffff810c0270>] ? ktime_get+0x30/0x90
[  328.461368]  [<ffffffff810f73b9>] ? delayacct_end+0x39/0x70
[  328.464191]  [<ffffffff8111ae45>] ? do_try_to_free_pages+0x3e5/0x480
[  328.466419]  [<ffffffff8157c013>] ? schedule_timeout+0x113/0x1b0
[  328.468506]  [<ffffffff810b9800>] ? migrate_timer_list+0x60/0x60
[  328.470672]  [<ffffffff811109ee>] ? __alloc_pages_nodemask+0x7ae/0xa70
[  328.472883]  [<ffffffff811501d7>] ? alloc_pages_current+0x87/0x100
[  328.475087]  [<ffffffff8110a240>] ? filemap_fault+0x1c0/0x400
[  328.477089]  [<ffffffff8112e7c6>] ? __do_fault+0x46/0xd0
[  328.478960]  [<ffffffff81131128>] ? do_read_fault.isra.62+0x228/0x310
[  328.481116]  [<ffffffff8113380e>] ? handle_mm_fault+0x7ae/0x10e0
[  328.483454]  [<ffffffff81138145>] ? vma_set_page_prot+0x35/0x60
[  328.485613]  [<ffffffff8105194e>] ? __do_page_fault+0x17e/0x540
[  328.487634]  [<ffffffff811399ac>] ? do_mmap_pgoff+0x33c/0x3f0
[  328.489611]  [<ffffffff8112180b>] ? vm_mmap_pgoff+0xbb/0xf0
[  328.491539]  [<ffffffff81051d40>] ? do_page_fault+0x30/0x70
[  328.493441]  [<ffffffff8157ed38>] ? page_fault+0x28/0x30

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-09 11:44                           ` Tetsuo Handa
@ 2015-02-10 13:58                             ` Tetsuo Handa
  2015-02-10 15:19                               ` Johannes Weiner
  2015-02-17 14:37                             ` Michal Hocko
  1 sibling, 1 reply; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-10 13:58 UTC (permalink / raw)
  To: hannes, mhocko
  Cc: david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, torvalds

(Michal is offline, asking Johannes instead.)

Tetsuo Handa wrote:
> (A) The order-0 __GFP_WAIT allocation fails immediately under OOM
>     conditions even though we didn't remove the
> 
>         /*
>          * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
>          * means __GFP_NOFAIL, but that may not be true in other
>          * implementations.
>          */
>         if (order <= PAGE_ALLOC_COSTLY_ORDER)
>                 return 1;
> 
>     check in should_alloc_retry(). Is this what you expected?

This behavior is caused by commit 9879de7373fcfb46 "mm: page_alloc:
embed OOM killing naturally into allocation slowpath". Did you apply
that commit with the agreement that GFP_NOIO / GFP_NOFS allocations may
fail under memory pressure, permitting filesystems to take fs error
actions?

	/* The OOM killer does not compensate for light reclaim */
	if (!(gfp_mask & __GFP_FS))
		goto out;
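
(To illustrate what failing instead of looping means for such callers;
this is a sketch with a made-up function, not code from any real
filesystem:)

	/*
	 * With the commit above, a GFP_NOFS allocation no longer loops
	 * until the OOM killer makes progress; it can return NULL
	 * quickly, and the caller's -ENOMEM path may escalate into fs
	 * error actions (e.g. aborting a journal or remounting
	 * read-only).
	 */
	static int fs_start_transaction(struct fs_buffer **bufp)
	{
		*bufp = kzalloc(sizeof(**bufp), GFP_NOFS);
		if (!*bufp)
			return -ENOMEM;
		return 0;
	}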

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-10 13:58                             ` Tetsuo Handa
@ 2015-02-10 15:19                               ` Johannes Weiner
  2015-02-11  2:23                                 ` Tetsuo Handa
  2015-02-17 14:50                                 ` Michal Hocko
  0 siblings, 2 replies; 276+ messages in thread
From: Johannes Weiner @ 2015-02-10 15:19 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman,
	torvalds

On Tue, Feb 10, 2015 at 10:58:46PM +0900, Tetsuo Handa wrote:
> (Michal is offline, asking Johannes instead.)
> 
> Tetsuo Handa wrote:
> > (A) The order-0 __GFP_WAIT allocation fails immediately under OOM
> >     conditions even though we didn't remove the
> > 
> >         /*
> >          * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
> >          * means __GFP_NOFAIL, but that may not be true in other
> >          * implementations.
> >          */
> >         if (order <= PAGE_ALLOC_COSTLY_ORDER)
> >                 return 1;
> > 
> >     check in should_alloc_retry(). Is this what you expected?
> 
> This behavior is caused by commit 9879de7373fcfb46 "mm: page_alloc:
> embed OOM killing naturally into allocation slowpath". Did you apply
> that commit with the agreement that GFP_NOIO / GFP_NOFS allocations may
> fail under memory pressure, permitting filesystems to take fs error
> actions?
> 
> 	/* The OOM killer does not compensate for light reclaim */
> 	if (!(gfp_mask & __GFP_FS))
> 		goto out;

The model behind the refactored code is to continue retrying the
allocation as long as the allocator has the ability to free memory,
i.e. if page reclaim makes progress, or the OOM killer can be used.

That being said, I missed that GFP_NOFS allocations were able to loop
endlessly even without page reclaim making progress or the OOM killer
working, and since it didn't fit the model I dropped it by accident.

Is this a real workload you are having trouble with or an artificial
stress test?  Because I'd certainly be willing to revert that part of
the patch and make GFP_NOFS looping explicit if it helps you.  But I
do think the new behavior makes more sense, so I'd prefer to keep it
if it's merely a stress test you use to test allocator performance.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8e20f9c2fa5a..f77c58ebbcfa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		if (high_zoneidx < ZONE_NORMAL)
 			goto out;
 		/* The OOM killer does not compensate for light reclaim */
-		if (!(gfp_mask & __GFP_FS))
+		if (!(gfp_mask & __GFP_FS)) {
+			/*
+			 * XXX: Page reclaim didn't yield anything,
+			 * and the OOM killer can't be invoked, but
+			 * keep looping as per should_alloc_retry().
+			 */
+			*did_some_progress = 1;
 			goto out;
+		}
 		/*
 		 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
 		 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.

^ permalink raw reply related	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-10 15:19                               ` Johannes Weiner
@ 2015-02-11  2:23                                 ` Tetsuo Handa
  2015-02-11 13:37                                   ` Tetsuo Handa
  2015-02-17 12:23                                   ` Tetsuo Handa
  2015-02-17 14:50                                 ` Michal Hocko
  1 sibling, 2 replies; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-11  2:23 UTC (permalink / raw)
  To: hannes
  Cc: mhocko, david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman,
	torvalds

Johannes Weiner wrote:
> On Tue, Feb 10, 2015 at 10:58:46PM +0900, Tetsuo Handa wrote:
> > (Michal is offline, asking Johannes instead.)
> > 
> > Tetsuo Handa wrote:
> > > (A) The order-0 __GFP_WAIT allocation fails immediately under OOM
> > >     conditions even though we didn't remove the
> > > 
> > >         /*
> > >          * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
> > >          * means __GFP_NOFAIL, but that may not be true in other
> > >          * implementations.
> > >          */
> > >         if (order <= PAGE_ALLOC_COSTLY_ORDER)
> > >                 return 1;
> > > 
> > >     check in should_alloc_retry(). Is this what you expected?
> > 
> > This behavior is caused by commit 9879de7373fcfb46 "mm: page_alloc:
> > embed OOM killing naturally into allocation slowpath". Did you apply
> > that commit with the agreement that GFP_NOIO / GFP_NOFS allocations
> > may fail under memory pressure, permitting filesystems to take fs
> > error actions?
> > 
> > 	/* The OOM killer does not compensate for light reclaim */
> > 	if (!(gfp_mask & __GFP_FS))
> > 		goto out;
> 
> The model behind the refactored code is to continue retrying the
> allocation as long as the allocator has the ability to free memory,
> i.e. if page reclaim makes progress, or the OOM killer can be used.
> 
> That being said, I missed that GFP_NOFS allocations were able to loop
> endlessly even without page reclaim making progress or the OOM killer
> working, and since it didn't fit the model I dropped it by accident.
> 
> Is this a real workload you are having trouble with or an artificial
> stress test?  Because I'd certainly be willing to revert that part of
> the patch and make GFP_NOFS looping explicit if it helps you.  But I
> do think the new behavior makes more sense, so I'd prefer to keep it
> if it's merely a stress test you use to test allocator performance.

I work on troubleshooting RHEL systems. This is an artificial stress
test which I developed to try to reproduce various low-memory troubles
that occurred on customers' systems.

> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8e20f9c2fa5a..f77c58ebbcfa 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  		if (high_zoneidx < ZONE_NORMAL)
>  			goto out;
>  		/* The OOM killer does not compensate for light reclaim */
> -		if (!(gfp_mask & __GFP_FS))
> +		if (!(gfp_mask & __GFP_FS)) {
> +			/*
> +			 * XXX: Page reclaim didn't yield anything,
> +			 * and the OOM killer can't be invoked, but
> +			 * keep looping as per should_alloc_retry().
> +			 */
> +			*did_some_progress = 1;
>  			goto out;
> +		}

Why do you omit the out_of_memory() call for GFP_NOIO / GFP_NOFS
allocations? Thread2, doing a GFP_FS / GFP_KERNEL allocation, might be
waiting for Thread1, doing a GFP_NOIO / GFP_NOFS allocation, to call
out_of_memory() on its behalf, as serialized by the

        /*
         * Acquire the per-zone oom lock for each zone.  If that
         * fails, somebody else is making progress for us.
         */
        if (!oom_zonelist_trylock(zonelist, gfp_mask)) {
                *did_some_progress = 1;
                schedule_timeout_uninterruptible(1);
                return NULL;
        }

lock. If Thread1 calls oom_zonelist_trylock() / oom_zonelist_unlock() without
sleeping while Thread2 calls oom_zonelist_trylock() / oom_zonelist_unlock()
with sleeping, Thread2 is unlikely to be able to call out_of_memory(), because
it will likely keep failing at oom_zonelist_trylock(), as illustrated below.
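
For illustration, a hypothetical interleaving (a sketch of the starvation,
not a trace from the reproducer):

  Thread1 (GFP_NOFS):   oom_zonelist_trylock() succeeds
                        -> skips out_of_memory() because !__GFP_FS
                        -> oom_zonelist_unlock(), retries without sleeping
  Thread2 (GFP_KERNEL): oom_zonelist_trylock() fails
                        -> schedule_timeout_uninterruptible(1)
                        -> retries, but Thread1 has re-taken the lock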

>  		/*
>  		 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
>  		 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
> 

Though, a more serious behavior with this reproducer is (B), where the system
stalls forever without kernel messages being saved to /var/log/messages .
out_of_memory() does not select victims until the coredump to pipe can make
progress, whereas the coredump to pipe can't make progress until memory
allocation succeeds or fails.


^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-11  2:23                                 ` Tetsuo Handa
@ 2015-02-11 13:37                                   ` Tetsuo Handa
  2015-02-11 18:50                                     ` Oleg Nesterov
  2015-02-17 12:23                                   ` Tetsuo Handa
  1 sibling, 1 reply; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-11 13:37 UTC (permalink / raw)
  To: oleg, mhocko
  Cc: hannes, david, dchinner, linux-mm, rientjes, akpm, mgorman, torvalds

(Asking Oleg this time.)

Tetsuo Handa wrote:
> Though, a more serious behavior with this reproducer is (B), where the system
> stalls forever without kernel messages being saved to /var/log/messages .
> out_of_memory() does not select victims until the coredump to pipe can make
> progress, whereas the coredump to pipe can't make progress until memory
> allocation succeeds or fails.

This behavior is related to commit d003f371b2701635 ("oom: don't assume
that a coredumping thread will exit soon"). That commit tried to take
SIGNAL_GROUP_COREDUMP into account, but it actually fails to do so.

I tested with debug printk() and got the result shown below.

----------
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d503e9c..1f684df 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -268,8 +268,12 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
        if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
                if (unlikely(frozen(task)))
                        __thaw_task(task);
-               if (!force_kill)
+               if (!force_kill) {
+                       printk_ratelimited(KERN_INFO "OOM: Waiting for %s(%u) "
+                                          ": TIF_MEMDIE\n", task->comm,
+                                          task->pid);
                        return OOM_SCAN_ABORT;
+               }
        }
        if (!task->mm)
                return OOM_SCAN_CONTINUE;
@@ -281,8 +285,12 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
        if (oom_task_origin(task))
                return OOM_SCAN_SELECT;

-       if (task_will_free_mem(task) && !force_kill)
+       if (task_will_free_mem(task) && !force_kill) {
+               printk_ratelimited(KERN_INFO "OOM: Waiting for %s(%u) "
+                                  ": will_free_mem\n", task->comm,
+                                  task->pid);
                return OOM_SCAN_ABORT;
+       }

        return OOM_SCAN_OK;
 }
@@ -439,6 +447,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
         * its children or threads, just set TIF_MEMDIE so it can die quickly
         */
        if (task_will_free_mem(p)) {
+               printk(KERN_INFO "OOM: Waiting for %s(%u) : WILL_FREE_MEM\n",
+                      p->comm, p->pid);
                set_tsk_thread_flag(p, TIF_MEMDIE);
                put_task_struct(p);
                return;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8e20f9c..4a2b19b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2381,9 +2381,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
                /* The OOM killer does not needlessly kill tasks for lowmem */
                if (high_zoneidx < ZONE_NORMAL)
                        goto out;
-               /* The OOM killer does not compensate for light reclaim */
-               if (!(gfp_mask & __GFP_FS))
-                       goto out;
                /*
                 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
                 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
----------

----------
[   66.374198] a.out[9918]: segfault at 2591768 ip 000000000040091e sp 0000000002591770 error 6[   66.374220] a.out[9919]: segfault at 2592778 ip 000000000040091e sp 0000000002592780 error 6 in a.out[400000+1000]

[   66.378705]  in a.out[400000+1000]
[   67.997279] OOM: Waiting for a.out(9917) : will_free_mem
(...snipped...)
[   90.952640] a.out           D 0000000000000002     0  9916   7303 0x00000080
[   90.954478]  ffff88007a4ca240 0000000000012f80 ffff88007bcc7fd8 0000000000012f80
[   90.956468]  ffff88007a4ca240 ffff88007fffc000 ffffffff8111a945 0000000000000000
[   90.958475]  0000000000000000 000088007bcc7908 ffff88007a4ca240 ffffffff81015df5
[   90.960471] Call Trace:
[   90.961420]  [<ffffffff8111a945>] ? shrink_zone+0x105/0x2a0
[   90.962939]  [<ffffffff81015df5>] ? read_tsc+0x5/0x10
[   90.964364]  [<ffffffff810c0270>] ? ktime_get+0x30/0x90
[   90.965816]  [<ffffffff810f73b9>] ? delayacct_end+0x39/0x70
[   90.967322]  [<ffffffff8111b0e5>] ? do_try_to_free_pages+0x3e5/0x480
[   90.969115]  [<ffffffff815f23f3>] ? schedule_timeout+0x113/0x1b0
[   90.970796]  [<ffffffff810b9800>] ? migrate_timer_list+0x60/0x60
[   90.972380]  [<ffffffff81110c9e>] ? __alloc_pages_nodemask+0x7ae/0xa60
[   90.974090]  [<ffffffff81151eb2>] ? alloc_pages_vma+0x92/0x1a0
[   90.975643]  [<ffffffff81134037>] ? handle_mm_fault+0xd37/0x10e0
[   90.977212]  [<ffffffff8105194e>] ? __do_page_fault+0x17e/0x540
[   90.978753]  [<ffffffff81092fac>] ? update_curr+0xac/0x100
[   90.980228]  [<ffffffff810946cb>] ? put_prev_entity+0x5b/0x2c0
[   90.981763]  [<ffffffff8108ef1d>] ? pick_next_entity+0x9d/0x170
[   90.983305]  [<ffffffff8109157e>] ? set_next_entity+0x4e/0x60
[   90.984824]  [<ffffffff81097953>] ? pick_next_task_fair+0x453/0x520
[   90.986446]  [<ffffffff8100c6e0>] ? __switch_to+0x240/0x570
[   90.987943]  [<ffffffff81051d40>] ? do_page_fault+0x30/0x70
[   90.989453]  [<ffffffff815f5138>] ? page_fault+0x28/0x30
[   90.990987]  [<ffffffff812ed0bc>] ? __clear_user+0x1c/0x40
[   90.992481]  [<ffffffff8112cb16>] ? iov_iter_zero+0x66/0x2d0
[   90.993991]  [<ffffffff813c09d7>] ? read_iter_zero+0x37/0xa0
[   90.995515]  [<ffffffff81173470>] ? new_sync_read+0x80/0xd0
[   90.997027]  [<ffffffff81174678>] ? vfs_read+0x78/0x130
[   90.998492]  [<ffffffff8117477d>] ? SyS_read+0x4d/0xc0
[   90.999913]  [<ffffffff815f3729>] ? system_call_fastpath+0x12/0x17
[   91.001616] a.out           D ffff88007fc52f80     0  9917   9916 0x00000080
[   91.003485]  ffff880020b10000 0000000000012f80 ffff8800786d7fd8 0000000000012f80
[   91.005443]  ffff880020b10000 000000000000000a 0000000000000400 0000000100000001
[   91.007427]  0000000100000000 0000000000000000 0000000000000000 ffff8800786d7cc8
[   91.009348] Call Trace:
[   91.010281]  [<ffffffff8108d98d>] ? task_cputime+0x3d/0x80
[   91.011759]  [<ffffffff81066d8c>] ? do_exit+0x1dc/0xb40
[   91.013176]  [<ffffffff8106776a>] ? do_group_exit+0x3a/0x100
[   91.014661]  [<ffffffff810717fb>] ? get_signal+0x19b/0x770
[   91.016128]  [<ffffffff8100d451>] ? do_signal+0x31/0x6d0
[   91.017551]  [<ffffffff815ef532>] ? __schedule+0x272/0x760
[   91.019007]  [<ffffffff81087408>] ? check_preempt_curr+0x78/0xa0
[   91.020569]  [<ffffffff81089c98>] ? wake_up_new_task+0xf8/0x140
[   91.022094]  [<ffffffff81063bd8>] ? do_fork+0x138/0x340
[   91.023526]  [<ffffffff810ede7c>] ? __audit_syscall_entry+0xac/0xf0
[   91.025171]  [<ffffffff8100db5c>] ? do_notify_resume+0x6c/0x90
[   91.026700]  [<ffffffff815f39c7>] ? int_signal+0x12/0x17
[   91.028136] a.out           D ffff88007c6a66c0     0  9918   9917 0x00000084
[   91.029945]  ffff88007c6a66c0 0000000000012f80 ffff88007c6cbfd8 0000000000012f80
[   91.031886]  ffff88007c6a66c0 0000000000000003 ffff88007c6a759a 0000000000000046
[   91.033830]  0000000000000046 ffff88007c6a6f50 ffffffff81089a55 ffff88007c6cbcc8
[   91.035913] Call Trace:
[   91.036848]  [<ffffffff81089a55>] ? try_to_wake_up+0x1b5/0x2b0
[   91.038382]  [<ffffffff8109c7ef>] ? __wake_up_common+0x4f/0x80
[   91.039944]  [<ffffffff81066d8c>] ? do_exit+0x1dc/0xb40
[   91.041420]  [<ffffffff8106776a>] ? do_group_exit+0x3a/0x100
[   91.042931]  [<ffffffff810717fb>] ? get_signal+0x19b/0x770
[   91.044416]  [<ffffffff8109157e>] ? set_next_entity+0x4e/0x60
[   91.045941]  [<ffffffff8100d451>] ? do_signal+0x31/0x6d0
[   91.047376]  [<ffffffff815ef532>] ? __schedule+0x272/0x760
[   91.048836]  [<ffffffff81067282>] ? do_exit+0x6d2/0xb40
[   91.050251]  [<ffffffff8100db5c>] ? do_notify_resume+0x6c/0x90
[   91.051786]  [<ffffffff815f4422>] ? retint_signal+0x48/0x86
[   91.053256] a.out           S ffff88007fcd2f80     0  9919   9917 0x00000080
[   91.055081]  ffff88007c6a6f50 0000000000012f80 ffff88007c04bfd8 0000000000012f80
[   91.057026]  ffff88007c6a6f50 ffff88007c6a6f50 000200d27fffc6c0 0000000000000001
[   91.059006]  ffff88007c6a6f50 ffff88007c6a6f50 0000014100000001 0000000000000000
[   91.060952] Call Trace:
[   91.061893]  [<ffffffff8112b1ee>] ? copy_from_iter+0x10e/0x2d0
[   91.063456]  [<ffffffff8112b1ee>] ? copy_from_iter+0x10e/0x2d0
[   91.065025]  [<ffffffff8117bcb7>] ? pipe_wait+0x67/0xb0
[   91.066491]  [<ffffffff8109ced0>] ? wait_woken+0x90/0x90
[   91.068160]  [<ffffffff8117bde8>] ? pipe_write+0x88/0x450
[   91.069787]  [<ffffffff81173543>] ? new_sync_write+0x83/0xd0
[   91.071302]  [<ffffffff811736b7>] ? __kernel_write+0x57/0x140
[   91.072813]  [<ffffffff811c63fe>] ? dump_emit+0x8e/0xd0
[   91.074293]  [<ffffffff811c02cf>] ? elf_core_dump+0x146f/0x15d0
[   91.075848]  [<ffffffff811c6ca9>] ? do_coredump+0x769/0xe80
[   91.077308]  [<ffffffff8101634d>] ? native_sched_clock+0x2d/0x80
[   91.078861]  [<ffffffff8106fd2b>] ? __send_signal+0x16b/0x3a0
[   91.080384]  [<ffffffff810717f2>] ? get_signal+0x192/0x770
[   91.081831]  [<ffffffff8100d451>] ? do_signal+0x31/0x6d0
[   91.083234]  [<ffffffff8100db5c>] ? do_notify_resume+0x6c/0x90
[   91.084747]  [<ffffffff815f4422>] ? retint_signal+0x48/0x86
[   91.086210] a.out           D ffff88007c6a0000     0  9920   9917 0x00000080
[   91.088001]  ffff88007c6a0000 0000000000012f80 ffff88007b7affd8 0000000000012f80
[   91.089996]  ffff88007c6a0000 ffffea0001df9780 ffffffff81a5ba00 0000000000000200
[   91.091953]  ffff880036d8c480 0000000000000000 0000000000000000 ffff88007b7afcc8
[   91.093899] Call Trace:
[   91.094823]  [<ffffffff8108d98d>] ? task_cputime+0x3d/0x80
[   91.096310]  [<ffffffff81066d8c>] ? do_exit+0x1dc/0xb40
[   91.097785]  [<ffffffff8106776a>] ? do_group_exit+0x3a/0x100
[   91.099291]  [<ffffffff810717fb>] ? get_signal+0x19b/0x770
[   91.100773]  [<ffffffff8100d451>] ? do_signal+0x31/0x6d0
[   91.102311]  [<ffffffff810faf95>] ? task_function_call+0x55/0x80
[   91.103978]  [<ffffffff81067282>] ? do_exit+0x6d2/0xb40
[   91.105413]  [<ffffffff810ede7c>] ? __audit_syscall_entry+0xac/0xf0
[   91.107047]  [<ffffffff8100db5c>] ? do_notify_resume+0x6c/0x90
[   91.108568]  [<ffffffff815f39c7>] ? int_signal+0x12/0x17
[   91.110069] a.out           D ffff88007c6a2240     0  9921   9917 0x00000080
[   91.111869]  ffff88007c6a2240 0000000000012f80 ffff88007b883fd8 0000000000012f80
[   91.113795]  ffff88007c6a2240 0000000000000001 ffffffff81a5ba00 0000000000000200
[   91.115708]  ffff880036d8d5a0 0000000000000000 0000000000000000 ffff88007b883cc8
[   91.117627] Call Trace:
[   91.118546]  [<ffffffff8108d98d>] ? task_cputime+0x3d/0x80
[   91.120012]  [<ffffffff81066d8c>] ? do_exit+0x1dc/0xb40
[   91.121432]  [<ffffffff8106776a>] ? do_group_exit+0x3a/0x100
[   91.122928]  [<ffffffff810717fb>] ? get_signal+0x19b/0x770
[   91.124469]  [<ffffffff8100d451>] ? do_signal+0x31/0x6d0
[   91.125915]  [<ffffffff810faf95>] ? task_function_call+0x55/0x80
[   91.127487]  [<ffffffff81067282>] ? do_exit+0x6d2/0xb40
[   91.128906]  [<ffffffff810ede7c>] ? __audit_syscall_entry+0xac/0xf0
[   91.130518]  [<ffffffff8100db5c>] ? do_notify_resume+0x6c/0x90
[   91.132053]  [<ffffffff815f39c7>] ? int_signal+0x12/0x17
[   91.133505] a.out           D ffff88007c6a3360     0  9922   9917 0x00000080
[   91.135450]  ffff88007c6a3360 0000000000012f80 ffff88007861bfd8 0000000000012f80
[   91.137395]  ffff88007c6a3360 0000000000000001 ffffffff81a5ba00 0000000000000200
[   91.139332]  ffff88007a4cbbf0 0000000000000000 0000000000000000 ffff88007861bcc8
[   91.141356] Call Trace:
[   91.142290]  [<ffffffff8108d98d>] ? task_cputime+0x3d/0x80
[   91.143781]  [<ffffffff81066d8c>] ? do_exit+0x1dc/0xb40
[   91.145212]  [<ffffffff8106776a>] ? do_group_exit+0x3a/0x100
[   91.146724]  [<ffffffff810717fb>] ? get_signal+0x19b/0x770
[   91.148204]  [<ffffffff8100d451>] ? do_signal+0x31/0x6d0
[   91.149657]  [<ffffffff810faf95>] ? task_function_call+0x55/0x80
[   91.151242]  [<ffffffff81067282>] ? do_exit+0x6d2/0xb40
[   91.152682]  [<ffffffff810ede7c>] ? __audit_syscall_entry+0xac/0xf0
[   91.154309]  [<ffffffff8100db5c>] ? do_notify_resume+0x6c/0x90
[   91.155855]  [<ffffffff815f39c7>] ? int_signal+0x12/0x17
[   91.157334] a.out           D ffff88007c6a0890     0  9923   9917 0x00000080
[   91.159214]  ffff88007c6a0890 0000000000012f80 ffff88007c62bfd8 0000000000012f80
[   91.161219]  ffff88007c6a0890 0000000000000400 ffffffff810969d2 0000000000000200
[   91.163193]  ffff88007f804a80 ffff88007fc12f80 0000000000000000 ffff88007c62bcc8
[   91.165161] Call Trace:
[   91.166115]  [<ffffffff810969d2>] ? load_balance+0x1d2/0x8a0
[   91.167678]  [<ffffffff8108d98d>] ? task_cputime+0x3d/0x80
[   91.169293]  [<ffffffff81066d8c>] ? do_exit+0x1dc/0xb40
[   91.170755]  [<ffffffff810163a5>] ? sched_clock+0x5/0x10
[   91.172208]  [<ffffffff8106776a>] ? do_group_exit+0x3a/0x100
[   91.173798]  [<ffffffff810717fb>] ? get_signal+0x19b/0x770
[   91.175282]  [<ffffffff8100d451>] ? do_signal+0x31/0x6d0
[   91.176736]  [<ffffffff81067282>] ? do_exit+0x6d2/0xb40
[   91.178167]  [<ffffffff810ede7c>] ? __audit_syscall_entry+0xac/0xf0
[   91.179789]  [<ffffffff8100db5c>] ? do_notify_resume+0x6c/0x90
[   91.181319]  [<ffffffff815f39c7>] ? int_signal+0x12/0x17
[   91.182769] a.out           D ffff88007c6a19b0     0  9924   9917 0x00000080
[   91.184597]  ffff88007c6a19b0 0000000000012f80 ffff88007bf27fd8 0000000000012f80
[   91.186552]  ffff88007c6a19b0 0000000000000001 ffffffff81a5ba00 0000000000000200
[   91.188483]  ffff880020b11120 0000000000000000 0000000000000000 ffff88007bf27cc8
[   91.190517] Call Trace:
[   91.191462]  [<ffffffff8108d98d>] ? task_cputime+0x3d/0x80
[   91.192961]  [<ffffffff81066d8c>] ? do_exit+0x1dc/0xb40
[   91.194409]  [<ffffffff8106776a>] ? do_group_exit+0x3a/0x100
[   91.195926]  [<ffffffff810717fb>] ? get_signal+0x19b/0x770
[   91.197418]  [<ffffffff8100d451>] ? do_signal+0x31/0x6d0
[   91.198884]  [<ffffffff810faf95>] ? task_function_call+0x55/0x80
[   91.200504]  [<ffffffff81067282>] ? do_exit+0x6d2/0xb40
[   91.202034]  [<ffffffff810ede7c>] ? __audit_syscall_entry+0xac/0xf0
[   91.203757]  [<ffffffff8100db5c>] ? do_notify_resume+0x6c/0x90
[   91.205293]  [<ffffffff815f39c7>] ? int_signal+0x12/0x17
[   91.206774] a.out           D ffff88007c6a2ad0     0  9925   9917 0x00000080
[   91.208641]  ffff88007c6a2ad0 0000000000012f80 ffff88007cb8bfd8 0000000000012f80
[   91.210592]  ffff88007c6a2ad0 0000000000000400 ffffffff810969d2 0000000000000200
[   91.212538]  ffff88007f804a80 ffff88007fc12f80 0000000000000000 ffff88007cb8bcc8
[   91.214486] Call Trace:
[   91.215428]  [<ffffffff810969d2>] ? load_balance+0x1d2/0x8a0
[   91.216949]  [<ffffffff8108d98d>] ? task_cputime+0x3d/0x80
[   91.218437]  [<ffffffff81066d8c>] ? do_exit+0x1dc/0xb40
[   91.219861]  [<ffffffff810163a5>] ? sched_clock+0x5/0x10
[   91.221301]  [<ffffffff8106776a>] ? do_group_exit+0x3a/0x100
[   91.222833]  [<ffffffff810717fb>] ? get_signal+0x19b/0x770
[   91.224362]  [<ffffffff8100d451>] ? do_signal+0x31/0x6d0
[   91.225860]  [<ffffffff810faf95>] ? task_function_call+0x55/0x80
[   91.227442]  [<ffffffff81067282>] ? do_exit+0x6d2/0xb40
[   91.228891]  [<ffffffff810ede7c>] ? __audit_syscall_entry+0xac/0xf0
[   91.230543]  [<ffffffff8100db5c>] ? do_notify_resume+0x6c/0x90
[   91.232107]  [<ffffffff815f39c7>] ? int_signal+0x12/0x17
[   91.233565] a.out           D ffff88007c6a4d10     0  9926   9917 0x00000080
[   91.235432]  ffff88007c6a4d10 0000000000012f80 ffff88007860bfd8 0000000000012f80
[   91.237477]  ffff88007c6a4d10 0000000000000001 ffffffff81a5ba00 0000000000000200
[   91.239430]  ffff880020b12240 0000000000000000 0000000000000000 ffff88007860bcc8
[   91.241388] Call Trace:
[   91.242322]  [<ffffffff8108d98d>] ? task_cputime+0x3d/0x80
[   91.243815]  [<ffffffff81066d8c>] ? do_exit+0x1dc/0xb40
[   91.245241]  [<ffffffff8106776a>] ? do_group_exit+0x3a/0x100
[   91.246753]  [<ffffffff810717fb>] ? get_signal+0x19b/0x770
[   91.248232]  [<ffffffff8100d451>] ? do_signal+0x31/0x6d0
[   91.249687]  [<ffffffff810faf95>] ? task_function_call+0x55/0x80
[   91.251271]  [<ffffffff81067282>] ? do_exit+0x6d2/0xb40
[   91.252709]  [<ffffffff810ede7c>] ? __audit_syscall_entry+0xac/0xf0
[   91.254334]  [<ffffffff8100db5c>] ? do_notify_resume+0x6c/0x90
[   91.255910]  [<ffffffff815f39c7>] ? int_signal+0x12/0x17
[   91.257441] a.out           D ffff88007fcd2f80     0  9927   9917 0x00000080
[   91.259308]  ffff88007c6a4480 0000000000012f80 ffff88007c67bfd8 0000000000012f80
[   91.261306]  ffff88007c6a4480 ffff88007c67bd40 ffff88007d119440 ffff88007c67bd18
[   91.263283]  000000001fe3d887 ffff88007c67bd18 ffffffff811fa4f4 ffff88007c67bcc8
[   91.265259] Call Trace:
[   91.266206]  [<ffffffff811fa4f4>] ? xfs_bmap_search_multi_extents+0x94/0x130
[   91.268011]  [<ffffffff8108d98d>] ? task_cputime+0x3d/0x80
[   91.269636]  [<ffffffff81066d8c>] ? do_exit+0x1dc/0xb40
[   91.271113]  [<ffffffff8106776a>] ? do_group_exit+0x3a/0x100
[   91.272636]  [<ffffffff810717fb>] ? get_signal+0x19b/0x770
[   91.274184]  [<ffffffff8100d451>] ? do_signal+0x31/0x6d0
[   91.275659]  [<ffffffff810faf95>] ? task_function_call+0x55/0x80
[   91.277250]  [<ffffffff81067282>] ? do_exit+0x6d2/0xb40
[   91.278699]  [<ffffffff810ede7c>] ? __audit_syscall_entry+0xac/0xf0
[   91.280342]  [<ffffffff8100db5c>] ? do_notify_resume+0x6c/0x90
[   91.281901]  [<ffffffff815f39c7>] ? int_signal+0x12/0x17
[   91.283368] abrt-hook-ccpp  D 0000000000000002     0  9928    345 0x00000080
[   91.285222]  ffff880020b10890 0000000000012f80 ffff88007c68bfd8 0000000000012f80
[   91.287200]  ffff880020b10890 ffff88007fffc000 ffffffff8111a945 0000000000000000
[   91.289187]  0000000000000000 000088007c68b9e8 ffff880020b10890 ffffffff81015df5
[   91.291215] Call Trace:
[   91.292155]  [<ffffffff8111a945>] ? shrink_zone+0x105/0x2a0
[   91.293682]  [<ffffffff81015df5>] ? read_tsc+0x5/0x10
[   91.295117]  [<ffffffff810c0270>] ? ktime_get+0x30/0x90
[   91.296574]  [<ffffffff810f73b9>] ? delayacct_end+0x39/0x70
[   91.298096]  [<ffffffff8111b0e5>] ? do_try_to_free_pages+0x3e5/0x480
[   91.299768]  [<ffffffff815f23f3>] ? schedule_timeout+0x113/0x1b0
[   91.301384]  [<ffffffff810b9800>] ? migrate_timer_list+0x60/0x60
[   91.303092]  [<ffffffff81110c9e>] ? __alloc_pages_nodemask+0x7ae/0xa60
[   91.304858]  [<ffffffff81150477>] ? alloc_pages_current+0x87/0x100
[   91.306497]  [<ffffffff8110a240>] ? filemap_fault+0x1c0/0x400
[   91.308054]  [<ffffffff8112ea66>] ? __do_fault+0x46/0xd0
[   91.309531]  [<ffffffff811313c8>] ? do_read_fault.isra.62+0x228/0x310
[   91.311204]  [<ffffffff81133aae>] ? handle_mm_fault+0x7ae/0x10e0
[   91.312800]  [<ffffffff81182762>] ? path_openat+0xa2/0x660
[   91.314298]  [<ffffffff8105194e>] ? __do_page_fault+0x17e/0x540
[   91.315884]  [<ffffffff81183c9e>] ? do_filp_open+0x3e/0xa0
[   91.317367]  [<ffffffff81051d40>] ? do_page_fault+0x30/0x70
[   91.318879]  [<ffffffff815f5138>] ? page_fault+0x28/0x30
(...snipped...)
[   93.038908] oom_scan_process_thread: 244092 callbacks suppressed
[   93.040655] OOM: Waiting for a.out(9917) : will_free_mem
----------

PID 9916 is the parent process doing read() from /dev/zero .
PID 9917 is the child process waiting at pause(). PIDs 9918 to 9927
are child threads of PID 9917 sharing its MM. PID 9919 is the thread
doing the coredump to the pipe, and PID 9928 is the process reading
from the pipe.

Since task_will_free_mem() for PID 9917 is true, oom_scan_process_thread()
does not choose a victim. PID 9917 is waiting for PID 9919 to complete
the coredump. PID 9919 is waiting for PID 9928 to read from the pipe.
PID 9928 is waiting for PID 9917 to release memory, closing the circle.
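
For reference, the reproducer might look like the sketch below (hypothetical;
the actual test program was not posted, and details such as the thread count
and allocation sizes are guesses based on the PID list above):

----------
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

static void *segv(void *unused)
{
	sleep(1);
	*(volatile char *) 0 = 0; /* SIGSEGV starts the coredump to the pipe helper */
	return NULL;
}

int main(void)
{
	if (fork() == 0) {
		/* child (like PID 9917): create threads sharing one MM */
		pthread_t th;
		int i;

		for (i = 0; i < 10; i++)
			pthread_create(&th, NULL, segv, NULL);
		pause(); /* wait while one of the threads dumps core */
	} else {
		/* parent (like PID 9916): deplete memory via read() from /dev/zero */
		int fd = open("/dev/zero", O_RDONLY);
		char *buf;

		while ((buf = malloc(1 << 20)) != NULL)
			read(fd, buf, 1 << 20); /* fault in the pages */
		pause();
	}
	return 0;
}
----------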

----------
static void exit_mm(struct task_struct *tsk)
{
(...snipped...)
        if (core_state) {
                struct core_thread self;

                up_read(&mm->mmap_sem);

                self.task = tsk;
                self.next = xchg(&core_state->dumper.next, &self);
                /*
                 * Implies mb(), the result of xchg() must be visible
                 * to core_state->dumper.
                 */
                if (atomic_dec_and_test(&core_state->nr_threads))
                        complete(&core_state->startup);

                for (;;) {
                        set_task_state(tsk, TASK_UNINTERRUPTIBLE);
                        if (!self.task) /* see coredump_finish() */
                                break;
                        freezable_schedule(); /* <ffffffff81066d8c> is here. */
                }
                __set_task_state(tsk, TASK_RUNNING);
                down_read(&mm->mmap_sem);
        }
(...snipped...)
}
----------


^ permalink raw reply related	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-11 13:37                                   ` Tetsuo Handa
@ 2015-02-11 18:50                                     ` Oleg Nesterov
  2015-02-11 18:59                                       ` Oleg Nesterov
  0 siblings, 1 reply; 276+ messages in thread
From: Oleg Nesterov @ 2015-02-11 18:50 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, hannes, david, dchinner, linux-mm, rientjes, akpm,
	mgorman, torvalds

On 02/11, Tetsuo Handa wrote:
>
> (Asking Oleg this time.)

Well, sorry, I ignored the previous discussion, not sure I understand you
correctly.

> > Though, a more serious behavior with this reproducer is (B), where the system
> > stalls forever without kernel messages being saved to /var/log/messages .
> > out_of_memory() does not select victims until the coredump to pipe can make
> > progress, whereas the coredump to pipe can't make progress until memory
> > allocation succeeds or fails.
>
> This behavior is related to commit d003f371b2701635 ("oom: don't assume
> that a coredumping thread will exit soon"). That commit tried to take
> SIGNAL_GROUP_COREDUMP into account, but it actually fails to do so.

Heh. Please see the changelog. This "fix" is obviously very limited; it does
not even try to solve all problems (even with coredump in particular).

Note also that SIGNAL_GROUP_COREDUMP is not even set if the process (not a
sub-thread) shares the memory with the coredumping task. It would be better
to check mm->core_state != NULL instead, but this needs locking. Plus
that process likely sleeps in D state in exit_mm(), so this can't help.

And that is why we set SIGNAL_GROUP_COREDUMP in zap_threads(), not in
zap_process(). We probably want to make that "wait for coredump_finish()"
sleep in exit_mm() killable, but this is not simple.
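
A naive sketch of that direction (not a real patch; the hard part, safely
removing ourselves from core_state->dumper so that coredump_finish() does
not touch an exited task, is only hinted at in the comment):

	for (;;) {
		set_task_state(tsk, TASK_KILLABLE);
		if (!self.task) /* see coredump_finish() */
			break;
		if (__fatal_signal_pending(tsk))
			break; /* XXX: must also unlink self from the dumper list */
		freezable_schedule();
	}
	__set_task_state(tsk, TASK_RUNNING);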

Sorry for noise if the above is not relevant.

Oleg.


^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-11 18:50                                     ` Oleg Nesterov
@ 2015-02-11 18:59                                       ` Oleg Nesterov
  2015-03-14 13:03                                         ` Tetsuo Handa
  0 siblings, 1 reply; 276+ messages in thread
From: Oleg Nesterov @ 2015-02-11 18:59 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, hannes, david, dchinner, linux-mm, rientjes, akpm,
	mgorman, torvalds

On 02/11, Oleg Nesterov wrote:
>
> On 02/11, Tetsuo Handa wrote:
> >
> > (Asking Oleg this time.)
>
> Well, sorry, I ignored the previous discussion, not sure I understand you
> correctly.
>
> > > Though, a more serious behavior with this reproducer is (B), where the system
> > > stalls forever without kernel messages being saved to /var/log/messages .
> > > out_of_memory() does not select victims until the coredump to pipe can make
> > > progress, whereas the coredump to pipe can't make progress until memory
> > > allocation succeeds or fails.
> >
> > This behavior is related to commit d003f371b2701635 ("oom: don't assume
> > that a coredumping thread will exit soon"). That commit tried to take
> > SIGNAL_GROUP_COREDUMP into account, but it actually fails to do so.
>
> Heh. Please see the changelog. This "fix" is obviously very limited; it does
> not even try to solve all problems (even with coredump in particular).
>
> Note also that SIGNAL_GROUP_COREDUMP is not even set if the process (not a
> sub-thread) shares the memory with the coredumping task. It would be better
> to check mm->core_state != NULL instead, but this needs locking. Plus
> that process likely sleeps in D state in exit_mm(), so this can't help.
>
> And that is why we set SIGNAL_GROUP_COREDUMP in zap_threads(), not in
> zap_process(). We probably want to make that "wait for coredump_finish()"
> sleep in exit_mm() killable, but this is not simple.

On second thought, perhaps it makes sense to set SIGNAL_GROUP_COREDUMP
anyway, even if a CLONE_VM process participating in the coredump is not
killable. I'll recheck tomorrow.

> Sorry for noise if the above is not relevant.
>
> Oleg.


^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2014-12-30 11:21                         ` Michal Hocko
  2014-12-30 13:33                           ` Tetsuo Handa
  2015-02-09 11:44                           ` Tetsuo Handa
@ 2015-02-16 11:23                           ` Tetsuo Handa
  2015-02-16 15:42                             ` Johannes Weiner
  2015-02-17 16:33                             ` Michal Hocko
  2 siblings, 2 replies; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-16 11:23 UTC (permalink / raw)
  To: mhocko
  Cc: david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, hannes,
	torvalds

Michal Hocko wrote:
> > but I think we need to be prepared for cases where sending SIGKILL to
> > all threads sharing the same memory does not help.
> 
> Sure, unkillable tasks are a problem which we have to handle. Having
> GFP_KERNEL allocations looping without way out contributes to this which
> is sad but your current data just show that sometimes it might take ages
> to finish even without that going on.

Hello. Can we resume the TIF_MEMDIE stall discussion?

I'd like to propose

  (1) Make several locks killable.

  (2) Implement TIF_MEMDIE timeout.

  (3) Replace kmalloc() with kmalloc_nofail() and kmalloc_noretry().

for handling TIF_MEMDIE stall problems.



(1) Make several locks killable.

  On Linux 3.19, running the command line below as an unprivileged user
  on a system with 4 CPUs / 2GB RAM / no swap can make the system unusable.

  $ for i in `seq 1 100`; do dd if=/dev/zero of=/tmp/file bs=104857600 count=100 & done

---------- An example with ext4 partition ----------
(...snipped...)
[  369.902616] dd              D ffff88007fc12d00     0  9113   6418 0x00000080
[  369.904867]  ffff88007b460890 0000000000012d00 ffff88007b28ffd8 0000000000012d00
[  369.907254]  ffff88007b460890 ffff88007fc12d80 ffff88007a6eb360 0000000000000001
[  369.909855]  ffffffff810946cb 00000000000025f6 ffffffff8108ef1d 0000000000000000
[  369.912054] Call Trace:
[  369.913175]  [<ffffffff810946cb>] ? put_prev_entity+0x5b/0x2c0
[  369.914960]  [<ffffffff8108ef1d>] ? pick_next_entity+0x9d/0x170
[  369.916778]  [<ffffffff8109157e>] ? set_next_entity+0x4e/0x60
[  369.918634]  [<ffffffff81097953>] ? pick_next_task_fair+0x453/0x520
[  369.920530]  [<ffffffff8100c6e0>] ? __switch_to+0x240/0x570
[  369.922263]  [<ffffffff815799f9>] ? schedule_preempt_disabled+0x9/0x10
[  369.924161]  [<ffffffff8157af25>] ? __mutex_lock_slowpath+0xb5/0x120
[  369.926106]  [<ffffffff8157afa6>] ? mutex_lock+0x16/0x25
[  369.927800]  [<ffffffffa01f3acc>] ? ext4_file_write_iter+0x7c/0x3a0 [ext4]
[  369.929778]  [<ffffffff81280fbc>] ? __clear_user+0x1c/0x40
[  369.931491]  [<ffffffff8112c876>] ? iov_iter_zero+0x66/0x2d0
[  369.933235]  [<ffffffff811732a3>] ? new_sync_write+0x83/0xd0
[  369.934977]  [<ffffffff8117397d>] ? vfs_write+0xad/0x1f0
[  369.936703]  [<ffffffff8101b57b>] ? syscall_trace_enter_phase1+0x19b/0x1b0
[  369.938674]  [<ffffffff8117459d>] ? SyS_write+0x4d/0xc0
[  369.940336]  [<ffffffff8157d329>] ? system_call_fastpath+0x12/0x17
(...snipped...)
[  498.421741] SysRq : Manual OOM execution
[  498.423627] kworker/3:3 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
(...snipped...)
[  498.952807] Out of memory: Kill process 9113 (dd) score 57 or sacrifice child
[  498.954450] Killed process 9113 (dd) total-vm:210340kB, anon-rss:102500kB, file-rss:0kB
(...snipped...)
[  502.068921] SysRq : Manual OOM execution
[  502.070825] kworker/3:3 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
(...snipped...)
[  502.618222] Out of memory: Kill process 9113 (dd) score 57 or sacrifice child
[  502.620016] Killed process 9113 (dd) total-vm:210340kB, anon-rss:102500kB, file-rss:0kB
(...snipped...)
[  503.900554] SysRq : Manual OOM execution
[  503.902387] kworker/3:3 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
(...snipped...)
[  504.410444] Out of memory: Kill process 9113 (dd) score 57 or sacrifice child
[  504.412221] Killed process 9113 (dd) total-vm:210340kB, anon-rss:102500kB, file-rss:0kB
(...snipped...)
---------- An example with ext4 partition ----------

---------- An example with xfs partition ----------
(...snipped...)
[  127.135041] Out of memory: Kill process 2505 (dd) score 59 or sacrifice child
[  127.136460] Killed process 2505 (dd) total-vm:210340kB, anon-rss:102464kB, file-rss:1728kB
(...snipped...)
[  243.672302] dd              D ffff88005bd27cb8 12776  2505   2386 0x00100084
[  243.674066]  ffff88005bd27cb8 ffff88005bd27c98 ffff88007850c740 0000000000014080
[  243.676005]  0000000000000000 ffff88005bd27fd8 0000000000014080 ffff88005835d740
[  243.677916]  ffff88007850c740 0000000000000014 ffff8800669bee50 ffff8800669bee54
[  243.679823] Call Trace:
[  243.680478]  [<ffffffff816b2799>] schedule_preempt_disabled+0x29/0x70
[  243.682047]  [<ffffffff816b43d5>] __mutex_lock_slowpath+0x95/0x100
[  243.683548]  [<ffffffff816b83e8>] ? page_fault+0x28/0x30
[  243.684875]  [<ffffffff816b4463>] mutex_lock+0x23/0x37
[  243.686146]  [<ffffffff8129df6c>] xfs_file_buffered_aio_write+0x6c/0x240
[  243.687791]  [<ffffffff813497b5>] ? __clear_user+0x25/0x50
[  243.689121]  [<ffffffff8117294d>] ? iov_iter_zero+0x6d/0x2e0
[  243.690511]  [<ffffffff8129e1b8>] xfs_file_write_iter+0x78/0x110
[  243.691990]  [<ffffffff811beb31>] new_sync_write+0x81/0xb0
[  243.693329]  [<ffffffff811bf2a7>] vfs_write+0xb7/0x1f0
[  243.694581]  [<ffffffff811bfeb6>] SyS_write+0x46/0xb0
[  243.695834]  [<ffffffff81109196>] ? __audit_syscall_exit+0x236/0x2e0
[  243.697376]  [<ffffffff816b64a9>] system_call_fastpath+0x12/0x17
(...snipped...)
[  291.433296] dd              D ffff88005bd27cb8 12776  2505   2386 0x00100084
[  291.433297]  ffff88005bd27cb8 ffff88005bd27c98 ffff88007850c740 0000000000014080
[  291.433298]  0000000000000000 ffff88005bd27fd8 0000000000014080 ffff88005835d740
[  291.433298]  ffff88007850c740 0000000000000014 ffff8800669bee50 ffff8800669bee54
[  291.433299] Call Trace:
[  291.433300]  [<ffffffff816b2799>] schedule_preempt_disabled+0x29/0x70
[  291.433301]  [<ffffffff816b43d5>] __mutex_lock_slowpath+0x95/0x100
[  291.433302]  [<ffffffff816b83e8>] ? page_fault+0x28/0x30
[  291.433303]  [<ffffffff816b4463>] mutex_lock+0x23/0x37
[  291.433304]  [<ffffffff8129df6c>] xfs_file_buffered_aio_write+0x6c/0x240
[  291.433306]  [<ffffffff813497b5>] ? __clear_user+0x25/0x50
[  291.433307]  [<ffffffff8117294d>] ? iov_iter_zero+0x6d/0x2e0
[  291.433308]  [<ffffffff8129e1b8>] xfs_file_write_iter+0x78/0x110
[  291.433309]  [<ffffffff811beb31>] new_sync_write+0x81/0xb0
[  291.433311]  [<ffffffff811bf2a7>] vfs_write+0xb7/0x1f0
[  291.433312]  [<ffffffff811bfeb6>] SyS_write+0x46/0xb0
[  291.433313]  [<ffffffff81109196>] ? __audit_syscall_exit+0x236/0x2e0
[  291.433314]  [<ffffffff816b64a9>] system_call_fastpath+0x12/0x17
(...snipped...)
---------- An example with xfs partition ----------

  This is because the OOM killer happily tries to kill a process which is
  blocked in an unkillable mutex_lock(). If the locks shown above were
  killable, we could reduce the possibility of getting stuck; see the
  sketch below.

  I didn't check whether it actually livelocked or not, but being this slow
  is not acceptable. Why does every thread trying to allocate memory have
  to repeat the loop, which might starve somebody who could make progress
  if given CPU time? I wish only somebody like kswapd repeated the loop on
  behalf of all threads waiting in the memory allocation slowpath...
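
  As a sketch of what (1) would mean (illustrative only, based on the
  xfs_file_buffered_aio_write() trace above; not a tested patch), the
  buffered write path could do:

	if (mutex_lock_killable(&inode->i_mutex))
		return -EINTR; /* an OOM-killed writer backs out instead of blocking */

  mutex_lock_killable() already exists; the work is auditing each caller
  for a sane error return.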

(2) Implement TIF_MEMDIE timeout.

  While the command line shown above is an artificial stress test, I'm
  seeing trouble on real KVM systems where the guests hang entirely with
  many processes blocked in jbd2_journal_commit_transaction() or
  jbd2_journal_get_write_access(). The root cause of the guests' stall is
  not yet identified but is at least independent of TIF_MEMDIE. However,
  cron jobs which get blocked in those functions after the I/O stall begins
  exhaust all of the system's memory and make the situation worse (e.g. the
  load average exceeded 7000 on a guest with 2 CPUs by the time the OOM
  killer livelocked).

  Unkillable locks in non-critical paths can be replaced with killable
  locks. But there are critical paths where failing on SIGKILL can lead to
  unwanted results (e.g. the filesystem's error action, such as remounting
  read-only or calling panic(), being taken), there are locks (e.g. the
  rw_semaphore used by mmap_sem) where no killable version exists, and
  there are wait_for_completion() calls where a killable version is not
  worth the added code complexity.

  If a TIF_MEMDIE timeout were implemented, we could cope with the OOM
  killer livelock problem by choosing more OOM victims (as a survival
  strategy) or by calling panic() (as a debug-and-reboot strategy); a
  sketch follows below.
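
  A sketch of what such a timeout could look like (the field and sysctl
  names below are hypothetical):

	/* in oom_scan_process_thread(), instead of unconditionally aborting: */
	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
		if (time_after(jiffies, task->memdie_start +
				msecs_to_jiffies(sysctl_memdie_timeout_ms)))
			return OOM_SCAN_CONTINUE; /* victim looks stuck; pick another */
		return OOM_SCAN_ABORT; /* keep waiting for the victim to exit */
	}

  where task->memdie_start would be recorded at the moment TIF_MEMDIE is set.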

(3) Replace kmalloc() with kmalloc_nofail() and kmalloc_noretry().

  Currently, small allocations are implicitly treated like __GFP_NOFAIL
  unless TIF_MEMDIE is set. But silently changing small allocations to
  behave like __GFP_NORETRY would cause obscure bugs. If the TIF_MEMDIE
  timeout is implemented, we no longer need to worry about unkillable tasks
  retrying forever at memory allocation; instead we kill more OOM victims
  and satisfy the request. Therefore, we could introduce
  kmalloc_nofail(size, gfp), which does kmalloc(size, gfp | __GFP_NOFAIL)
  (i.e. may invoke the OOM killer), and kmalloc_noretry(size, gfp), which
  does kmalloc(size, gfp | __GFP_NORETRY) (i.e. does not invoke the OOM
  killer), and switch from kmalloc() to one of them, as sketched below.
  Those doing allocations smaller than PAGE_SIZE would wish to switch from
  kmalloc() to kmalloc_nofail() and eliminate untested memory allocation
  failure paths. Those who are well prepared for memory allocation failures
  would wish to switch from kmalloc() to kmalloc_noretry(). Eventually,
  kmalloc(), which implicitly treats small allocations like __GFP_NOFAIL
  and invokes the OOM killer, would be abolished.
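
  A sketch of the proposed helpers (they do not exist in the kernel; the
  names are the ones proposed above):

	static inline void *kmalloc_nofail(size_t size, gfp_t gfp)
	{
		/* loops until success; may invoke the OOM killer */
		return kmalloc(size, gfp | __GFP_NOFAIL);
	}

	static inline void *kmalloc_noretry(size_t size, gfp_t gfp)
	{
		/* fails instead of looping; never invokes the OOM killer */
		return kmalloc(size, gfp | __GFP_NORETRY);
	}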


^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-16 11:23                           ` Tetsuo Handa
@ 2015-02-16 15:42                             ` Johannes Weiner
  2015-02-17 11:57                               ` Tetsuo Handa
  2015-02-17 16:33                             ` Michal Hocko
  1 sibling, 1 reply; 276+ messages in thread
From: Johannes Weiner @ 2015-02-16 15:42 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman,
	torvalds

On Mon, Feb 16, 2015 at 08:23:16PM +0900, Tetsuo Handa wrote:
>   (2) Implement TIF_MEMDIE timeout.

How about something like this?  This should solve the deadlock problem
in the page allocator, but it would also simplify the memcg OOM killer
and allow its use by in-kernel faults again.

--

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-16 15:42                             ` Johannes Weiner
@ 2015-02-17 11:57                               ` Tetsuo Handa
  2015-02-17 13:16                                 ` Johannes Weiner
  2015-02-23 22:08                                 ` David Rientjes
  0 siblings, 2 replies; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-17 11:57 UTC (permalink / raw)
  To: hannes
  Cc: mhocko, david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman,
	torvalds

Johannes Weiner wrote:
> On Mon, Feb 16, 2015 at 08:23:16PM +0900, Tetsuo Handa wrote:
> >   (2) Implement TIF_MEMDIE timeout.
> 
> How about something like this?  This should solve the deadlock problem
> in the page allocator, but it would also simplify the memcg OOM killer
> and allow its use by in-kernel faults again.

Yes, the basic idea would be the same as
http://marc.info/?l=linux-mm&m=142002495532320&w=2 .

But Michal and David do not like the timeout approach.
http://marc.info/?l=linux-mm&m=141684783713564&w=2
http://marc.info/?l=linux-mm&m=141686814824684&w=2

Unless they change their opinion in response to the discovery explained at
http://lwn.net/Articles/627419/ , timeout patches will not be accepted.


^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-11  2:23                                 ` Tetsuo Handa
  2015-02-11 13:37                                   ` Tetsuo Handa
@ 2015-02-17 12:23                                   ` Tetsuo Handa
  2015-02-17 12:53                                     ` Johannes Weiner
  2015-02-17 14:59                                     ` Michal Hocko
  1 sibling, 2 replies; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-17 12:23 UTC (permalink / raw)
  To: hannes
  Cc: mhocko, david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman,
	torvalds

Tetsuo Handa wrote:
> Johannes Weiner wrote:
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 8e20f9c2fa5a..f77c58ebbcfa 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> >  		if (high_zoneidx < ZONE_NORMAL)
> >  			goto out;
> >  		/* The OOM killer does not compensate for light reclaim */
> > -		if (!(gfp_mask & __GFP_FS))
> > +		if (!(gfp_mask & __GFP_FS)) {
> > +			/*
> > +			 * XXX: Page reclaim didn't yield anything,
> > +			 * and the OOM killer can't be invoked, but
> > +			 * keep looping as per should_alloc_retry().
> > +			 */
> > +			*did_some_progress = 1;
> >  			goto out;
> > +		}
> 
> Why do you omit the out_of_memory() call for GFP_NOIO / GFP_NOFS allocations?

I can see "possible memory allocation deadlock in %s (mode:0x%x)" warnings
at kmem_alloc() in fs/xfs/kmem.c . I think commit 9879de7373fcfb46 "mm:
page_alloc: embed OOM killing naturally into allocation slowpath" introduced
a regression, and the patch below is the fix.

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2381,9 +2381,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
                /* The OOM killer does not needlessly kill tasks for lowmem */
                if (high_zoneidx < ZONE_NORMAL)
                        goto out;
-               /* The OOM killer does not compensate for light reclaim */
-               if (!(gfp_mask & __GFP_FS))
-                       goto out;
                /*
                 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
                 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.

BTW, I think commit c32b3cbe0d067a9c "oom, PM: make OOM detection in the freezer
path raceless" opened a race window for __alloc_pages_may_oom(__GFP_NOFAIL)
allocation to fail while the OOM killer is disabled. I think something like

--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -789,7 +789,7 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	bool ret = false;
 
 	down_read(&oom_sem);
-	if (!oom_killer_disabled) {
+	if (!oom_killer_disabled || (gfp_mask & __GFP_NOFAIL)) {
 		__out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill);
 		ret = true;
 	}

is needed. But such a change can race with up_write() and wait_event() in
oom_killer_disable(). While the comment on oom_killer_disable() says
"The function cannot be called when there are runnable user tasks because
the userspace would see unexpected allocation failures as a result.",
aren't there still kernel threads which might do __GFP_NOFAIL allocations?


^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-17 12:23                                   ` Tetsuo Handa
@ 2015-02-17 12:53                                     ` Johannes Weiner
  2015-02-17 15:38                                       ` Michal Hocko
  2015-02-17 22:54                                         ` Dave Chinner
  2015-02-17 14:59                                     ` Michal Hocko
  1 sibling, 2 replies; 276+ messages in thread
From: Johannes Weiner @ 2015-02-17 12:53 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman,
	torvalds

On Tue, Feb 17, 2015 at 09:23:26PM +0900, Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > Johannes Weiner wrote:
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 8e20f9c2fa5a..f77c58ebbcfa 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> > >  		if (high_zoneidx < ZONE_NORMAL)
> > >  			goto out;
> > >  		/* The OOM killer does not compensate for light reclaim */
> > > -		if (!(gfp_mask & __GFP_FS))
> > > +		if (!(gfp_mask & __GFP_FS)) {
> > > +			/*
> > > +			 * XXX: Page reclaim didn't yield anything,
> > > +			 * and the OOM killer can't be invoked, but
> > > +			 * keep looping as per should_alloc_retry().
> > > +			 */
> > > +			*did_some_progress = 1;
> > >  			goto out;
> > > +		}
> > 
> > Why do you omit the out_of_memory() call for GFP_NOIO / GFP_NOFS allocations?
> 
> I can see "possible memory allocation deadlock in %s (mode:0x%x)" warnings
> at kmem_alloc() in fs/xfs/kmem.c . I think commit 9879de7373fcfb46 "mm:
> page_alloc: embed OOM killing naturally into allocation slowpath" introduced
> a regression, and the patch below is the fix.
> 
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2381,9 +2381,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>                 /* The OOM killer does not needlessly kill tasks for lowmem */
>                 if (high_zoneidx < ZONE_NORMAL)
>                         goto out;
> -               /* The OOM killer does not compensate for light reclaim */
> -               if (!(gfp_mask & __GFP_FS))
> -                       goto out;
>                 /*
>                  * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
>                  * Sanity check for bare calls of __GFP_THISNODE, not real OOM.

Again, we don't want to OOM kill on behalf of allocations that can't
initiate IO, or even actively prevent others from doing it.  Not by
default anyway, because most callers can deal with the failure without
having to resort to killing tasks, and NOFS reclaim *can* easily fail.
It's the exceptions that should be annotated instead:

void *
kmem_alloc(size_t size, xfs_km_flags_t flags)
{
	int	retries = 0;
	gfp_t	lflags = kmem_flags_convert(flags);
	void	*ptr;

	do {
		ptr = kmalloc(size, lflags);
		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
			return ptr;
		if (!(++retries % 100))
			xfs_err(NULL,
		"possible memory allocation deadlock in %s (mode:0x%x)",
					__func__, lflags);
		congestion_wait(BLK_RW_ASYNC, HZ/50);
	} while (1);
}

This should use __GFP_NOFAIL, which is not only designed to annotate
broken code like this, but also recognizes that endless looping on a
GFP_NOFS allocation needs the OOM killer after all to make progress.

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index a7a3a63bb360..17ced1805d3a 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -45,20 +45,12 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
 void *
 kmem_alloc(size_t size, xfs_km_flags_t flags)
 {
-	int	retries = 0;
 	gfp_t	lflags = kmem_flags_convert(flags);
-	void	*ptr;
 
-	do {
-		ptr = kmalloc(size, lflags);
-		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
-			return ptr;
-		if (!(++retries % 100))
-			xfs_err(NULL,
-		"possible memory allocation deadlock in %s (mode:0x%x)",
-					__func__, lflags);
-		congestion_wait(BLK_RW_ASYNC, HZ/50);
-	} while (1);
+	if (!(flags & (KM_MAYFAIL | KM_NOSLEEP)))
+		lflags |= __GFP_NOFAIL;
+
+	return kmalloc(size, lflags);
 }
 
 void *


^ permalink raw reply related	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-17 11:57                               ` Tetsuo Handa
@ 2015-02-17 13:16                                 ` Johannes Weiner
  2015-02-17 16:50                                   ` Michal Hocko
  2015-02-23 22:08                                 ` David Rientjes
  1 sibling, 1 reply; 276+ messages in thread
From: Johannes Weiner @ 2015-02-17 13:16 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman,
	torvalds

On Tue, Feb 17, 2015 at 08:57:05PM +0900, Tetsuo Handa wrote:
> Johannes Weiner wrote:
> > On Mon, Feb 16, 2015 at 08:23:16PM +0900, Tetsuo Handa wrote:
> > >   (2) Implement TIF_MEMDIE timeout.
> > 
> > How about something like this?  This should solve the deadlock problem
> > in the page allocator, but it would also simplify the memcg OOM killer
> > and allow its use by in-kernel faults again.
> 
> Yes, the basic idea would be the same as
> http://marc.info/?l=linux-mm&m=142002495532320&w=2 .
> 
> But Michal and David do not like the timeout approach.
> http://marc.info/?l=linux-mm&m=141684783713564&w=2
> http://marc.info/?l=linux-mm&m=141686814824684&w=2

I'm open to suggestions, but we can't just stick our heads in the sand
and pretend that these are just unrelated bugs.  They're not.  As long
as it's legal to enter the allocator with *anything* that can prevent
another random task in the system from making progress, we have this
deadlock potential.  One side has to give up, and it can't be the page
allocator because it has to support __GFP_NOFAIL allocations, which
are usually exactly the allocations that are buried in hard-to-unwind
state that is likely to trip up exiting OOM victims.

The alternative would be lock dependency tracking, but I'm not sure it
can be realistically done for production environments.


^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-09 11:44                           ` Tetsuo Handa
  2015-02-10 13:58                             ` Tetsuo Handa
@ 2015-02-17 14:37                             ` Michal Hocko
  2015-02-17 14:44                               ` Michal Hocko
  1 sibling, 1 reply; 276+ messages in thread
From: Michal Hocko @ 2015-02-17 14:37 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, hannes,
	torvalds

On Mon 09-02-15 20:44:16, Tetsuo Handa wrote:
> Hello.
> 
> Today I tested Linux 3.19 and noticed unexpected behavior (A) (B)
> shown below.
> 
> (A) The order-0 __GFP_WAIT allocation fails immediately upon OOM condition
>     even though we didn't remove the
> 
>         /*
>          * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
>          * means __GFP_NOFAIL, but that may not be true in other
>          * implementations.
>          */
>         if (order <= PAGE_ALLOC_COSTLY_ORDER)
>                 return 1;
>
>     check in should_alloc_retry(). Is this what you expected?

The code before 9879de7373fc (mm: page_alloc: embed OOM killing
naturally into allocation slowpath) was looping on this kind of
allocation even though GFP_NOFS didn't trigger the OOM killer. This change
was not intentional, I guess, but it makes sense on its own. We shouldn't
simply loop in the hope that something happens and we finally make
progress.

Failing a __GFP_WAIT allocation is perfectly fine IMO. Why do you think
this is a problem?

Btw. this has nothing to do with TIF_MEMDIE and it would be much better
to discuss it in a separate thread...

> (B) When coredump to pipe is configured, the system stalls under OOM
>     condition due to memory allocation by coredump's reader side.
>     How should we handle this "expected to terminate shortly but unable
>     to terminate due to invisible dependency" case? What approaches
>     other than applying timeout on coredump's writer side are possible?
>     (Running inside memory cgroup is not an answer which I want.)

This is really nasty and we have discussed that with Oleg some time
ago.  We have SIGNAL_GROUP_COREDUMP which prevents the OOM killer
from selecting the task. The issue seems to be that OOM killer might
inherently race with setting the flag.  I have no idea what to do about
this, unfortunately.
Oleg?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-17 14:37                             ` Michal Hocko
@ 2015-02-17 14:44                               ` Michal Hocko
  0 siblings, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2015-02-17 14:44 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, hannes,
	torvalds

Oops, sorry, I missed the follow-up emails in this thread. My filters
went crazy and the rest got sorted into a different mailbox.
Reading the rest now...

On Tue 17-02-15 15:37:20, Michal Hocko wrote:
> On Mon 09-02-15 20:44:16, Tetsuo Handa wrote:
> > Hello.
> > 
> > Today I tested Linux 3.19 and noticed unexpected behavior (A) (B)
> > shown below.
> > 
> > (A) The order-0 __GFP_WAIT allocation fails immediately upon OOM condition
> >     even though we didn't remove the
> > 
> >         /*
> >          * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
> >          * means __GFP_NOFAIL, but that may not be true in other
> >          * implementations.
> >          */
> >         if (order <= PAGE_ALLOC_COSTLY_ORDER)
> >                 return 1;
> >
> >     check in should_alloc_retry(). Is this what you expected?
> 
> The code before 9879de7373fc (mm: page_alloc: embed OOM killing
> naturally into allocation slowpath) was looping on this kind of
> allocation even though GFP_NOFS didn't trigger the OOM killer. This change
> was not intentional, I guess, but it makes sense on its own. We shouldn't
> simply loop in the hope that something happens and we finally make
> progress.
> 
> Failing a __GFP_WAIT allocation is perfectly fine IMO. Why do you think
> this is a problem?
> 
> Btw. this has nothing to do with TIF_MEMDIE and it would be much better
> to discuss it in a separate thread...
> 
> > (B) When coredump to pipe is configured, the system stalls under OOM
> >     condition due to memory allocation by coredump's reader side.
> >     How should we handle this "expected to terminate shortly but unable
> >     to terminate due to invisible dependency" case? What approaches
> >     other than applying timeout on coredump's writer side are possible?
> >     (Running inside memory cgroup is not an answer which I want.)
> 
> This is really nasty and we have discussed that with Oleg some time
> ago.  We have SIGNAL_GROUP_COREDUMP which prevents the OOM killer
> from selecting the task. The issue seems to be that OOM killer might
> inherently race with setting the flag.  I have no idea what to do about
> this, unfortunately.
> Oleg?
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-10 15:19                               ` Johannes Weiner
  2015-02-11  2:23                                 ` Tetsuo Handa
@ 2015-02-17 14:50                                 ` Michal Hocko
  1 sibling, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2015-02-17 14:50 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, david, dchinner, linux-mm, rientjes, oleg, akpm,
	mgorman, torvalds

On Tue 10-02-15 10:19:34, Johannes Weiner wrote:
[...]
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8e20f9c2fa5a..f77c58ebbcfa 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  		if (high_zoneidx < ZONE_NORMAL)
>  			goto out;
>  		/* The OOM killer does not compensate for light reclaim */
> -		if (!(gfp_mask & __GFP_FS))
> +		if (!(gfp_mask & __GFP_FS)) {
> +			/*
> +			 * XXX: Page reclaim didn't yield anything,
> +			 * and the OOM killer can't be invoked, but
> +			 * keep looping as per should_alloc_retry().
> +			 */
> +			*did_some_progress = 1;
>  			goto out;
> +		}
>  		/*
>  		 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
>  		 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.

Although the side effect of 9879de7373fc (mm: page_alloc: embed OOM
killing naturally into allocation slowpath) is subtle, and it would have
been much better if it had been documented in the changelog (I missed
that too during review, otherwise I would have asked for it), I do not
think this is a change in a good direction. Hopelessly retrying when
reclaim didn't help and the OOM killer is not available is simply a
bad(tm) choice.

Besides that, __GFP_WAIT callers should be prepared for allocation
failure and had better cope with it. So no, I really hate something
like the above.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-17 12:23                                   ` Tetsuo Handa
  2015-02-17 12:53                                     ` Johannes Weiner
@ 2015-02-17 14:59                                     ` Michal Hocko
  1 sibling, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2015-02-17 14:59 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: hannes, david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman,
	torvalds

On Tue 17-02-15 21:23:26, Tetsuo Handa wrote:
[...]
> > Why do you omit out_of_memory() call for GFP_NOIO / GFP_NOFS allocations?

Because they cannot perform any IO/FS transactions and that would lead
to premature OOM conditions way too easily. The OOM killer is a _last
resort_ reclaim opportunity, not something that should happen just
because you are unable to flush dirty pages.

> I can see "possible memory allocation deadlock in %s (mode:0x%x)" warnings
> at kmem_alloc() in fs/xfs/kmem.c .

> I think commit 9879de7373fcfb46 "mm:
> page_alloc: embed OOM killing naturally into allocation slowpath" introduced
> a regression and the one below is the fix.
> 
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2381,9 +2381,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>                 /* The OOM killer does not needlessly kill tasks for lowmem */
>                 if (high_zoneidx < ZONE_NORMAL)
>                         goto out;
> -               /* The OOM killer does not compensate for light reclaim */
> -               if (!(gfp_mask & __GFP_FS))
> -                       goto out;
>                 /*
>                  * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
>                  * Sanity check for bare calls of __GFP_THISNODE, not real OOM.

So NAK to this.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-17 12:53                                     ` Johannes Weiner
@ 2015-02-17 15:38                                       ` Michal Hocko
  2015-02-17 22:54                                         ` Dave Chinner
  1 sibling, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2015-02-17 15:38 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, david, dchinner, linux-mm, rientjes, oleg, akpm,
	mgorman, torvalds

On Tue 17-02-15 07:53:15, Johannes Weiner wrote:
[...]
> diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> index a7a3a63bb360..17ced1805d3a 100644
> --- a/fs/xfs/kmem.c
> +++ b/fs/xfs/kmem.c
> @@ -45,20 +45,12 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
>  void *
>  kmem_alloc(size_t size, xfs_km_flags_t flags)
>  {
> -	int	retries = 0;
>  	gfp_t	lflags = kmem_flags_convert(flags);
> -	void	*ptr;
>  
> -	do {
> -		ptr = kmalloc(size, lflags);
> -		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
> -			return ptr;
> -		if (!(++retries % 100))
> -			xfs_err(NULL,
> -		"possible memory allocation deadlock in %s (mode:0x%x)",
> -					__func__, lflags);
> -		congestion_wait(BLK_RW_ASYNC, HZ/50);
> -	} while (1);
> +	if (!(flags & (KM_MAYFAIL | KM_NOSLEEP)))
> +		lflags |= __GFP_NOFAIL;
> +
> +	return kmalloc(size, lflags);
>  }
>  
>  void *

Yes, I think this is the right thing to do (care to send a patch with
the full changelog?).
We really want to have __GFP_NOFAIL explicit. If for nothing else, I
hope we can get lockdep checks for this flag. I am hopelessly
unfamiliar with lockdep, but even a warning from __lockdep_trace_alloc
for this flag with any lock held in the current task's context might be
helpful to identify those places and try to fix them.
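
Something along these lines, as an untested sketch (the helper name is
invented; the real check would presumably live in __lockdep_trace_alloc):

	/* Untested sketch: warn when a __GFP_NOFAIL allocation is
	 * attempted while the current task holds any lock.  Relies on
	 * current->lockdep_depth, so CONFIG_LOCKDEP only. */
	static void check_nofail_under_locks(gfp_t gfp_mask)
	{
		if ((gfp_mask & __GFP_NOFAIL) && current->lockdep_depth)
			WARN_ONCE(1, "__GFP_NOFAIL allocation with %d lock(s) held\n",
				  current->lockdep_depth);
	}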

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-16 11:23                           ` Tetsuo Handa
  2015-02-16 15:42                             ` Johannes Weiner
@ 2015-02-17 16:33                             ` Michal Hocko
  1 sibling, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2015-02-17 16:33 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman, hannes,
	torvalds

On Mon 16-02-15 20:23:16, Tetsuo Handa wrote:
[...]
> (1) Make several locks killable.
> 
>   On Linux 3.19, running the command line below as an unprivileged user
>   on a system with 4 CPUs / 2GB RAM / no swap can make the system unusable.
> 
>   $ for i in `seq 1 100`; do dd if=/dev/zero of=/tmp/file bs=104857600 count=100 & done
> 
[...]
>   This is because the OOM killer happily tries to kill a process which is
>   blocked at unkillable mutex_lock(). If the locks shown above were killable,
>   we could reduce the possibility of getting stuck.
> 
>   I didn't check whether it has livelocked or not. But being too slow
>   to be worth waiting for is not acceptable.

Well, you are beating your machine to death so you can hardly get any
time guarantee. It would be nice to have a better feedback mechanism to
know when to back off and fail the allocation attempt which might be
blocking an OOM victim from passing away. This is extremely tricky
because we shouldn't be too eager to fail just because of sudden memory
pressure.

>   Oh, why does every thread trying to allocate memory have to repeat
>   the loop, which might defer somebody who could make progress if CPU
>   time were given?

I guess you are talking about direct reclaim and the whole priority
loop? Well, this is what I was talking about above. Sometimes we really
have to go down to low priorities and basically scan the world in order
to find something reclaimable. If we bail out too early we might see
premature allocation failures, which could lead to reduced QoS.

>   I wish only somebody like kswapd repeated the loop on behalf of all
>   threads waiting in the memory allocation slowpath...

This is the case when kswapd is _able_ to cope with the memory
pressure.

[...]
> (3) Replace kmalloc() with kmalloc_nofail() and kmalloc_noretry().
> 
>   Currently small allocations are implicitly treated like __GFP_NOFAIL
>   unless TIF_MEMDIE is set. But silently changing small allocations to
>   behave like __GFP_NORETRY will cause obscure bugs. If a TIF_MEMDIE
>   timeout is implemented, we will no longer worry about unkillable tasks
>   which are retrying forever at memory allocation; instead we kill more
>   OOM victims and satisfy the request.

I think this is a bad approach. GFP_KERNEL != __GFP_NORETRY and we
should treat it like that. Killing more victims is a bad solution
because it doesn't guarantee any progress (just look at your example of
hundreds of processes with large RSS hammering the same file - you
would have to kill all of them at once).
Besides that, any timeout solution is prone to unexpected delays due to
reasons which are not related to the allocation latency.

>   Therefore, we could introduce kmalloc_nofail(size, gfp) which does
>   kmalloc(size, gfp | __GFP_NOFAIL) (i.e. invoke the OOM killer) and
>   kmalloc_noretry(size, gfp) which does kmalloc(size, gfp | __GFP_NORETRY)
>   (i.e. do not invoke the OOM killer), and switch from kmalloc() to either
>   kmalloc_noretry() or kmalloc_nofail().

This sounds like major overkill. We already have gfp flags for that.
What would this buy us?
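
For concreteness, the proposed helpers would amount to trivial wrappers
along these lines (a sketch based on the description above, not an
existing API):

	static inline void *kmalloc_nofail(size_t size, gfp_t gfp)
	{
		/* never fails; may end up invoking the OOM killer */
		return kmalloc(size, gfp | __GFP_NOFAIL);
	}

	static inline void *kmalloc_noretry(size_t size, gfp_t gfp)
	{
		/* fails quickly instead of looping or OOM killing */
		return kmalloc(size, gfp | __GFP_NORETRY);
	}

i.e. just a different spelling of the existing gfp flags.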

>   Those who are doing smaller-than-PAGE_SIZE allocations would wish to
>   switch from kmalloc() to kmalloc_nofail() and eliminate untested
>   memory allocation failure paths.

nofail allocations should be discouraged and used only when no other
measure would work.

>   Those who are well prepared for memory allocation failures would wish to
>   switch from kmalloc() to kmalloc_noretry(). Eventually, kmalloc(), which
>   implicitly treats small allocations like __GFP_NOFAIL and invokes the
>   OOM killer, will be abolished.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-17 13:16                                 ` Johannes Weiner
@ 2015-02-17 16:50                                   ` Michal Hocko
  2015-02-17 23:25                                     ` Dave Chinner
  0 siblings, 1 reply; 276+ messages in thread
From: Michal Hocko @ 2015-02-17 16:50 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, david, dchinner, linux-mm, rientjes, oleg, akpm,
	mgorman, torvalds

On Tue 17-02-15 08:16:18, Johannes Weiner wrote:
> On Tue, Feb 17, 2015 at 08:57:05PM +0900, Tetsuo Handa wrote:
> > Johannes Weiner wrote:
> > > On Mon, Feb 16, 2015 at 08:23:16PM +0900, Tetsuo Handa wrote:
> > > >   (2) Implement TIF_MEMDIE timeout.
> > > 
> > > How about something like this?  This should solve the deadlock problem
> > > in the page allocator, but it would also simplify the memcg OOM killer
> > > and allow its use by in-kernel faults again.
> > 
> > Yes, basic idea would be same with
> > http://marc.info/?l=linux-mm&m=142002495532320&w=2 .
> > 
> > But Michal and David do not like the timeout approach.
> > http://marc.info/?l=linux-mm&m=141684783713564&w=2
> > http://marc.info/?l=linux-mm&m=141686814824684&w=2

Yes I really hate time based solutions for reasons already explained in
the referenced links.
 
> I'm open to suggestions, but we can't just stick our heads in the sand
> and pretend that these are just unrelated bugs.  They're not. 

Requesting a GFP_NOFAIL allocation with locks held is IMHO a bug and
should be fixed.
Hopelessly looping in the page allocator without GFP_NOFAIL is too risky
as well and we should get rid of this. Why should we still try to loop
when the previous 1000 attempts failed even with OOM killer invocation?
Can we simply fail after a configurable number of attempts? This is
prone to reveal unchecked allocation failures, but those are bugs as
well and we shouldn't pretend otherwise.
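
As a caller-side illustration of the idea (a sketch only; the
max_attempts knob is invented and the real change would belong in the
allocator slowpath):

	static struct page *alloc_pages_bounded(gfp_t gfp, unsigned int order,
						unsigned int max_attempts)
	{
		unsigned int i;

		for (i = 0; i < max_attempts; i++) {
			/* __GFP_NORETRY keeps a single attempt from
			 * looping endlessly inside the allocator */
			struct page *page = alloc_pages(gfp | __GFP_NORETRY,
							order);

			if (page)
				return page;
		}
		return NULL;	/* the caller must handle the failure */
	}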

> As long
> as it's legal to enter the allocator with *anything* that can prevent
> another random task in the system from making progress, we have this
> deadlock potential.  One side has to give up, and it can't be the page
> allocator because it has to support __GFP_NOFAIL allocations, which
> are usually exactly the allocations that are buried in hard-to-unwind
> state that is likely to trip up exiting OOM victims.

I am not convinced that GFP_NOFAIL is the biggest problem. Most of the
OOM livelocks I have seen were either due to GFP_KERNEL treated as
GFP_NOFAIL or an incorrect gfp mask (e.g. __GFP_FS added where not
appropriate). I think we should focus on this part before we start
adding heuristics into the OOM killer.
 
> The alternative would be lock dependency tracking, but I'm not sure it
> can be realistically done for production environments.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-17 12:53                                     ` Johannes Weiner
@ 2015-02-17 22:54                                         ` Dave Chinner
  0 siblings, 0 replies; 276+ messages in thread
From: Dave Chinner @ 2015-02-17 22:54 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, akpm, torvalds

[ cc xfs list - experienced kernel devs should not have to be
reminded to do this ]

On Tue, Feb 17, 2015 at 07:53:15AM -0500, Johannes Weiner wrote:
> On Tue, Feb 17, 2015 at 09:23:26PM +0900, Tetsuo Handa wrote:
> > Tetsuo Handa wrote:
> > > Johannes Weiner wrote:
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index 8e20f9c2fa5a..f77c58ebbcfa 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> > > >  		if (high_zoneidx < ZONE_NORMAL)
> > > >  			goto out;
> > > >  		/* The OOM killer does not compensate for light reclaim */
> > > > -		if (!(gfp_mask & __GFP_FS))
> > > > +		if (!(gfp_mask & __GFP_FS)) {
> > > > +			/*
> > > > +			 * XXX: Page reclaim didn't yield anything,
> > > > +			 * and the OOM killer can't be invoked, but
> > > > +			 * keep looping as per should_alloc_retry().
> > > > +			 */
> > > > +			*did_some_progress = 1;
> > > >  			goto out;
> > > > +		}
> > > 
> > > Why do you omit out_of_memory() call for GFP_NOIO / GFP_NOFS allocations?
> > 
> > I can see "possible memory allocation deadlock in %s (mode:0x%x)" warnings
> > at kmem_alloc() in fs/xfs/kmem.c . I think commit 9879de7373fcfb46 "mm:
> > page_alloc: embed OOM killing naturally into allocation slowpath" introduced
> > a regression and the one below is the fix.
> > 
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2381,9 +2381,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> >                 /* The OOM killer does not needlessly kill tasks for lowmem */
> >                 if (high_zoneidx < ZONE_NORMAL)
> >                         goto out;
> > -               /* The OOM killer does not compensate for light reclaim */
> > -               if (!(gfp_mask & __GFP_FS))
> > -                       goto out;
> >                 /*
> >                  * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
> >                  * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
> 
> Again, we don't want to OOM kill on behalf of allocations that can't
> initiate IO, or even actively prevent others from doing it.  Not per
> default anyway, because most callers can deal with the failure without
> having to resort to killing tasks, and NOFS reclaim *can* easily fail.
> It's the exceptions that should be annotated instead:
> 
> void *
> kmem_alloc(size_t size, xfs_km_flags_t flags)
> {
> 	int	retries = 0;
> 	gfp_t	lflags = kmem_flags_convert(flags);
> 	void	*ptr;
> 
> 	do {
> 		ptr = kmalloc(size, lflags);
> 		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
> 			return ptr;
> 		if (!(++retries % 100))
> 			xfs_err(NULL,
> 		"possible memory allocation deadlock in %s (mode:0x%x)",
> 					__func__, lflags);
> 		congestion_wait(BLK_RW_ASYNC, HZ/50);
> 	} while (1);
> }
> 
> This should use __GFP_NOFAIL, which is not only designed to annotate
> broken code like this, but also recognizes that endless looping on a
> GFP_NOFS allocation needs the OOM killer after all to make progress.
> 
> diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> index a7a3a63bb360..17ced1805d3a 100644
> --- a/fs/xfs/kmem.c
> +++ b/fs/xfs/kmem.c
> @@ -45,20 +45,12 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
>  void *
>  kmem_alloc(size_t size, xfs_km_flags_t flags)
>  {
> -	int	retries = 0;
>  	gfp_t	lflags = kmem_flags_convert(flags);
> -	void	*ptr;
>  
> -	do {
> -		ptr = kmalloc(size, lflags);
> -		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
> -			return ptr;
> -		if (!(++retries % 100))
> -			xfs_err(NULL,
> -		"possible memory allocation deadlock in %s (mode:0x%x)",
> -					__func__, lflags);
> -		congestion_wait(BLK_RW_ASYNC, HZ/50);
> -	} while (1);
> +	if (!(flags & (KM_MAYFAIL | KM_NOSLEEP)))
> +		lflags |= __GFP_NOFAIL;
> +
> +	return kmalloc(size, lflags);
>  }

Hmmm - the only reason there is a focus on this loop is that it
emits warnings about allocations failing. It's obvious that the
problem being dealt with here is a fundamental design issue w.r.t.
locking and the OOM killer, but the proposed special casing
hack^H^H^H^Hband aid^W^Wsolution is not "working" because some code
in XFS started emitting warnings about allocations failing more
often.

So the answer is to remove the warning?  That's like killing the
canary to stop the methane leak in the coal mine. No canary? No
problems!

Right now, the oom killer is a liability. Over the past 6 months
I've slowly had to exclude filesystem regression tests from running
on small memory machines because the OOM killer is now so unreliable
that it kills the test harness regularly rather than the process
generating memory pressure. That's a big red flag to me that all
this hacking around the edges is not solving the underlying problem,
but instead is breaking things that did once work.

And, well, then there's this (gfp.h):

 * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
 * cannot handle allocation failures.  This modifier is deprecated and no new
 * users should be added.

So, is this another policy revelation from the mm developers about
the kmalloc API? i.e. that __GFP_NOFAIL is no longer deprecated?
Or just another symptom of frantic thrashing because nobody actually
understands the problem or those that do are unwilling to throw out
the broken crap and redesign it?

If you are changing allocator behaviour and constraints, then you
better damn well think those changes through fully, then document
those changes, change all the relevant code to use the new API (not
just those that throw warnings in your face) and make sure
*everyone* knows about it. e.g. a LWN article explaining the changes
and how memory allocation is going to work into the future would be
a good start.

Otherwise, this just looks like another knee-jerk band aid for an
architectural problem that needs more than special case hacks to
solve.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-17 16:50                                   ` Michal Hocko
@ 2015-02-17 23:25                                     ` Dave Chinner
  2015-02-18  8:48                                       ` Michal Hocko
  0 siblings, 1 reply; 276+ messages in thread
From: Dave Chinner @ 2015-02-17 23:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Tetsuo Handa, dchinner, linux-mm, rientjes,
	oleg, akpm, mgorman, torvalds

On Tue, Feb 17, 2015 at 05:50:24PM +0100, Michal Hocko wrote:
> On Tue 17-02-15 08:16:18, Johannes Weiner wrote:
> > On Tue, Feb 17, 2015 at 08:57:05PM +0900, Tetsuo Handa wrote:
> > > Johannes Weiner wrote:
> > > > On Mon, Feb 16, 2015 at 08:23:16PM +0900, Tetsuo Handa wrote:
> > > > >   (2) Implement TIF_MEMDIE timeout.
> > > > 
> > > > How about something like this?  This should solve the deadlock problem
> > > > in the page allocator, but it would also simplify the memcg OOM killer
> > > > and allow its use by in-kernel faults again.
> > > 
> > > Yes, basic idea would be same with
> > > http://marc.info/?l=linux-mm&m=142002495532320&w=2 .
> > > 
> > > But Michal and David do not like the timeout approach.
> > > http://marc.info/?l=linux-mm&m=141684783713564&w=2
> > > http://marc.info/?l=linux-mm&m=141686814824684&w=2
> 
> Yes I really hate time based solutions for reasons already explained in
> the referenced links.
>  
> > I'm open to suggestions, but we can't just stick our heads in the sand
> > and pretend that these are just unrelated bugs.  They're not. 
> 
> Requesting a GFP_NOFAIL allocation with locks held is IMHO a bug and
> should be fixed.

That's rather naive.

Filesystems do demand paging of metadata within transactions, which
means we are guaranteed to be holding locks when doing memory
allocation. Indeed, this is what the GFP_NOFS allocation context is
supposed to convey - we currently *hold locks* and so reclaim needs
to be careful about recursion. I'll also argue that it means the OOM
killer cannot kill the process attempting memory allocation for the
same reason.

We are also guaranteed to be in a state where memory allocation
failure *cannot be tolerated* because failure to complete the
modification leaves the filesystem in a "corrupt in memory" state.
We don't use GFP_NOFAIL because it's deprecated, but the reality is
that we need to ensure memory allocation eventually succeeds because
we *cannot go backwards*.
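
Schematically, the pattern is (an illustrative sketch, not actual XFS
code):

	/* Allocation happens with transaction locks held -- which is
	 * exactly what GFP_NOFS annotates -- and once metadata has been
	 * dirtied a failure cannot simply be unwound. */
	static void example_transaction_step(struct mutex *tx_lock)
	{
		void *buf;

		mutex_lock(tx_lock);
		do {
			/* GFP_NOFS: reclaim must not recurse into the fs */
			buf = kmalloc(256, GFP_NOFS);
		} while (!buf);		/* cannot go backwards, keep trying */
		/* ... modify metadata ... */
		kfree(buf);
		mutex_unlock(tx_lock);
	}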

The choice is simple: memory allocation fails, we shut down the
filesystem and guarantee that we DOS the entire machine because the
filesystems have gone AWOL; or we keep trying memory allocation
until it succeeds.

So, memory allocation generally succeeds eventually, and we have
these loops around kmalloc(), kmem_cache_alloc() and alloc_page()
that ensure allocation succeeds. Those loops also guarantee we get
warnings when allocation is repeatedly failing and we might have
actually hit an OOM deadlock situation.

> Hopelessly looping in the page allocator without GFP_NOFAIL is too risky
> as well and we should get rid of this.

Yet the exact situation we need GFP_NOFAIL is the situation that you
are calling a bug.

> Why should we still try to loop
> when the previous 1000 attempts failed even with OOM killer invocation? Can we
> simply fail after a configurable number of attempts?

OTOH, why should the memory allocator care what failure policy the
callers have?

> This is prone to
> reveal unchecked allocation failures but those are bugs as well and we
> shouldn't pretend otherwise.
> 
> > As long
> > as it's legal to enter the allocator with *anything* that can prevent
> > another random task in the system from making progress, we have this
> > deadlock potential.  One side has to give up, and it can't be the page
> > allocator because it has to support __GFP_NOFAIL allocations, which
> > are usually exactly the allocations that are buried in hard-to-unwind
> > state that is likely to trip up exiting OOM victims.
> 
> I am not convinced that GFP_NOFAIL is the biggest problem. Most of the
> OOM livelocks I have seen were either due to GFP_KERNEL treated as
> GFP_NOFAIL or an incorrect gfp mask (e.g. __GFP_FS added where not
> appropriate). I think we should focus on this part before we start
> adding heuristics into the OOM killer.

Having the OOM killer be able to kill the process that triggered
it would be a good start. More often than not, that is the process
that needs killing, and the oom killer implementation currently
cannot do anything about that process. Make the OOM killer only be
invoked by kswapd or some other independent kernel thread so that it
is independent of the allocation context that needs to invoke it,
and have the invoker wait to be told what to do.

That way it can kill the invoking process if that's the one that
needs to be killed, and then all "can't kill processes because the
invoker holds locks they depend on" go away.
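
In outline (a rough sketch of the idea; every name below that is not a
standard kernel call is invented for illustration):

	struct oom_request {
		struct task_struct	*invoker;
		struct completion	done;
	};

	/* invented names for the missing request-queue plumbing */
	extern struct oom_request *dequeue_oom_request(void);
	extern struct task_struct *select_oom_victim(struct task_struct *invoker);

	static int oom_killer_thread(void *unused)
	{
		while (!kthread_should_stop()) {
			struct oom_request *req = dequeue_oom_request();
			/* the invoker is a victim candidate like any other */
			struct task_struct *victim = select_oom_victim(req->invoker);

			send_sig(SIGKILL, victim, 1);
			complete(&req->done);
		}
		return 0;
	}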

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-17 22:54                                         ` Dave Chinner
@ 2015-02-17 23:32                                           ` Dave Chinner
  0 siblings, 0 replies; 276+ messages in thread
From: Dave Chinner @ 2015-02-17 23:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, rientjes, oleg, xfs, mhocko, linux-mm, mgorman,
	dchinner, akpm, torvalds

On Wed, Feb 18, 2015 at 09:54:30AM +1100, Dave Chinner wrote:
> On Tue, Feb 17, 2015 at 07:53:15AM -0500, Johannes Weiner wrote:
> > On Tue, Feb 17, 2015 at 09:23:26PM +0900, Tetsuo Handa wrote:
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -2381,9 +2381,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> > >                 /* The OOM killer does not needlessly kill tasks for lowmem */
> > >                 if (high_zoneidx < ZONE_NORMAL)
> > >                         goto out;
> > > -               /* The OOM killer does not compensate for light reclaim */
> > > -               if (!(gfp_mask & __GFP_FS))
> > > -                       goto out;
> > >                 /*
> > >                  * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
> > >                  * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
> > 
> > Again, we don't want to OOM kill on behalf of allocations that can't
> > initiate IO, or even actively prevent others from doing it.  Not per
> > default anyway, because most callers can deal with the failure without
> > having to resort to killing tasks, and NOFS reclaim *can* easily fail.
> > It's the exceptions that should be annotated instead:
> > 
> > void *
> > kmem_alloc(size_t size, xfs_km_flags_t flags)
> > {
> > 	int	retries = 0;
> > 	gfp_t	lflags = kmem_flags_convert(flags);
> > 	void	*ptr;
> > 
> > 	do {
> > 		ptr = kmalloc(size, lflags);
> > 		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
> > 			return ptr;
> > 		if (!(++retries % 100))
> > 			xfs_err(NULL,
> > 		"possible memory allocation deadlock in %s (mode:0x%x)",
> > 					__func__, lflags);
> > 		congestion_wait(BLK_RW_ASYNC, HZ/50);
> > 	} while (1);
> > }
> > 
> > This should use __GFP_NOFAIL, which is not only designed to annotate
> > broken code like this, but also recognizes that endless looping on a
> > GFP_NOFS allocation needs the OOM killer after all to make progress.
> > 
> > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > index a7a3a63bb360..17ced1805d3a 100644
> > --- a/fs/xfs/kmem.c
> > +++ b/fs/xfs/kmem.c
> > @@ -45,20 +45,12 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> >  void *
> >  kmem_alloc(size_t size, xfs_km_flags_t flags)
> >  {
> > -	int	retries = 0;
> >  	gfp_t	lflags = kmem_flags_convert(flags);
> > -	void	*ptr;
> >  
> > -	do {
> > -		ptr = kmalloc(size, lflags);
> > -		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
> > -			return ptr;
> > -		if (!(++retries % 100))
> > -			xfs_err(NULL,
> > -		"possible memory allocation deadlock in %s (mode:0x%x)",
> > -					__func__, lflags);
> > -		congestion_wait(BLK_RW_ASYNC, HZ/50);
> > -	} while (1);
> > +	if (!(flags & (KM_MAYFAIL | KM_NOSLEEP)))
> > +		lflags |= __GFP_NOFAIL;
> > +
> > +	return kmalloc(size, lflags);
> >  }
> 
> Hmmm - the only reason there is a focus on this loop is that it
> emits warnings about allocations failing. It's obvious that the
> problem being dealt with here is a fundamental design issue w.r.t.
> to locking and the OOM killer, but the proposed special casing
> hack^H^H^H^Hband aid^W^Wsolution is not "working" because some code
> in XFS started emitting warnings about allocations failing more
> often.
>
> So the answer is to remove the warning?  That's like killing the
> canary to stop the methane leak in the coal mine. No canary? No
> problems!

I'll also point out that there are two other identical allocation
loops in XFS, one of which is only 30 lines below this one. That's
further indication that this is a "silence the warning" patch rather
than something that actually fixes a problem....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-17 22:54                                         ` Dave Chinner
@ 2015-02-18  8:25                                           ` Michal Hocko
  0 siblings, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2015-02-18  8:25 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm,
	mgorman, rientjes, akpm, torvalds

On Wed 18-02-15 09:54:30, Dave Chinner wrote:
> [ cc xfs list - experienced kernel devs should not have to be
> reminded to do this ]
> 
> On Tue, Feb 17, 2015 at 07:53:15AM -0500, Johannes Weiner wrote:
[...]
> > void *
> > kmem_alloc(size_t size, xfs_km_flags_t flags)
> > {
> > 	int	retries = 0;
> > 	gfp_t	lflags = kmem_flags_convert(flags);
> > 	void	*ptr;
> > 
> > 	do {
> > 		ptr = kmalloc(size, lflags);
> > 		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
> > 			return ptr;
> > 		if (!(++retries % 100))
> > 			xfs_err(NULL,
> > 		"possible memory allocation deadlock in %s (mode:0x%x)",
> > 					__func__, lflags);
> > 		congestion_wait(BLK_RW_ASYNC, HZ/50);
> > 	} while (1);
> > }
> > 
> > This should use __GFP_NOFAIL, which is not only designed to annotate
> > broken code like this, but also recognizes that endless looping on a
> > GFP_NOFS allocation needs the OOM killer after all to make progress.
> > 
> > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > index a7a3a63bb360..17ced1805d3a 100644
> > --- a/fs/xfs/kmem.c
> > +++ b/fs/xfs/kmem.c
> > @@ -45,20 +45,12 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> >  void *
> >  kmem_alloc(size_t size, xfs_km_flags_t flags)
> >  {
> > -	int	retries = 0;
> >  	gfp_t	lflags = kmem_flags_convert(flags);
> > -	void	*ptr;
> >  
> > -	do {
> > -		ptr = kmalloc(size, lflags);
> > -		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
> > -			return ptr;
> > -		if (!(++retries % 100))
> > -			xfs_err(NULL,
> > -		"possible memory allocation deadlock in %s (mode:0x%x)",
> > -					__func__, lflags);
> > -		congestion_wait(BLK_RW_ASYNC, HZ/50);
> > -	} while (1);
> > +	if (!(flags & (KM_MAYFAIL | KM_NOSLEEP)))
> > +		lflags |= __GFP_NOFAIL;
> > +
> > +	return kmalloc(size, lflags);
> >  }
> 
> Hmmm - the only reason there is a focus on this loop is that it
> emits warnings about allocations failing.

Such a warning should be part of the allocator, and the whole point of
why I like the patch is that we would then warn in a single place. I
was thinking about a simple warning (e.g. like the above) and having
something more sophisticated when lockdep is enabled.
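
Something as simple as this would do for the non-lockdep case (an
untested sketch; the helper name is invented and it would be called
from the allocator's retry loop):

	static void warn_alloc_stall(gfp_t gfp_mask, unsigned int order,
				     unsigned long attempts)
	{
		if (attempts && !(attempts % 100))
			pr_warn("%s: possible memory allocation deadlock (order:%u, mode:0x%x, %lu attempts)\n",
				current->comm, order, gfp_mask, attempts);
	}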

> It's obvious that the
> problem being dealt with here is a fundamental design issue w.r.t.
> locking and the OOM killer, but the proposed special casing
> hack^H^H^H^Hband aid^W^Wsolution is not "working" because some code
> in XFS started emitting warnings about allocations failing more
> often.
> 
> So the answer is to remove the warning?  That's like killing the
> canary to stop the methane leak in the coal mine. No canary? No
> problems!

Not at all. I cannot speak for Johannes but I am pretty sure his
motivation wasn't to simply silence the warning. The thing is that no
kernel code path except for the page allocator should emulate behavior
for which we have a gfp flag.

> Right now, the oom killer is a liability. Over the past 6 months
> I've slowly had to exclude filesystem regression tests from running
> on small memory machines because the OOM killer is now so unreliable
> that it kills the test harness regularly rather than the process
> generating memory pressure.

It would be great to get bug reports.

> That's a big red flag to me that all
> this hacking around the edges is not solving the underlying problem,
> but instead is breaking things that did once work.

I am heavily trying to discourage people from adding random hacks to
the already complicated and subtle OOM code.

> And, well, then there's this (gfp.h):
> 
>  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
>  * cannot handle allocation failures.  This modifier is deprecated and no new
>  * users should be added.
> 
> So, is this another policy revelation from the mm developers about
> the kmalloc API? i.e. that __GFP_NOFAIL is no longer deprecated?

It is deprecated and shouldn't be used. But that doesn't mean that users
should work around this by developing their own alternative. I agree the
wording could be clearer and mention that if the allocation failure is
absolutely unacceptable then the flag can be used rather than
open-coding the loop around the allocator. What do you think about the
following?

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index b840e3b2770d..ee6440ccb75d 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -57,8 +57,12 @@ struct vm_area_struct;
  * _might_ fail.  This depends upon the particular VM implementation.
  *
  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
- * cannot handle allocation failures.  This modifier is deprecated and no new
- * users should be added.
+ * cannot handle allocation failures.  This modifier is deprecated for
+ * allocations with order > 1. Besides that, this modifier is very dangerous
+ * when the allocation happens under a lock because it creates a lock
+ * dependency invisible to the OOM killer, so it can livelock. If the
+ * allocation failure is _absolutely_ unacceptable then the flag has to be
+ * used rather than looping around the allocator.
  *
  * __GFP_NORETRY: The VM implementation must not retry indefinitely.
  *

> Or just another symptom of frantic thrashing because nobody actually
> understands the problem or those that do are unwilling to throw out
> the broken crap and redesign it?
> 
> If you are changing allocator behaviour and constraints, then you
> better damn well think those changes through fully, then document
> those changes, change all the relevant code to use the new API (not
> just those that throw warnings in your face) and make sure
> *everyone* knows about it. e.g. a LWN article explaining the changes
> and how memory allocation is going to work into the future would be
> a good start.

Well, I think the first step is to change the users of the allocator
to not lie about gfp flags. So if the code is retrying infinitely then
it really should use the GFP_NOFAIL flag. In the meantime the page
allocator should develop proper diagnostics to help identify all the
potential dependencies. Next we should start thinking about whether all
the existing GFP_NOFAIL paths are really necessary or whether the code
can be refactored/reimplemented to accept allocation failures.

> Otherwise, this just looks like another knee-jerk band aid for an
> architectural problem that needs more than special case hacks to
> solve.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-17 23:25                                     ` Dave Chinner
@ 2015-02-18  8:48                                       ` Michal Hocko
  2015-02-18 11:23                                           ` Tetsuo Handa
  0 siblings, 1 reply; 276+ messages in thread
From: Michal Hocko @ 2015-02-18  8:48 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Johannes Weiner, Tetsuo Handa, dchinner, linux-mm, rientjes,
	oleg, akpm, mgorman, torvalds

On Wed 18-02-15 10:25:52, Dave Chinner wrote:
> On Tue, Feb 17, 2015 at 05:50:24PM +0100, Michal Hocko wrote:
> > On Tue 17-02-15 08:16:18, Johannes Weiner wrote:
> > > On Tue, Feb 17, 2015 at 08:57:05PM +0900, Tetsuo Handa wrote:
> > > > Johannes Weiner wrote:
> > > > > On Mon, Feb 16, 2015 at 08:23:16PM +0900, Tetsuo Handa wrote:
> > > > > >   (2) Implement TIF_MEMDIE timeout.
> > > > > 
> > > > > How about something like this?  This should solve the deadlock problem
> > > > > in the page allocator, but it would also simplify the memcg OOM killer
> > > > > and allow its use by in-kernel faults again.
> > > > 
> > > > Yes, basic idea would be same with
> > > > http://marc.info/?l=linux-mm&m=142002495532320&w=2 .
> > > > 
> > > > But Michal and David do not like the timeout approach.
> > > > http://marc.info/?l=linux-mm&m=141684783713564&w=2
> > > > http://marc.info/?l=linux-mm&m=141686814824684&w=2
> > 
> > Yes I really hate time based solutions for reasons already explained in
> > the referenced links.
> >  
> > > I'm open to suggestions, but we can't just stick our heads in the sand
> > > and pretend that these are just unrelated bugs.  They're not. 
> > 
> > Requesting a GFP_NOFAIL allocation with locks held is IMHO a bug and
> > should be fixed.
> 
> That's rather naive.
> 
> Filesystems do demand paging of metadata within transactions, which
> means we are guaranteed to be holding locks when doing memory
> allocation. Indeed, this is what the GFP_NOFS allocation context is
> supposed to convey - we currently *hold locks* and so reclaim needs
> to be careful about recursion. I'll also argue that it means the OOM
> killer cannot kill the process attempting memory allocation for the
> same reason.

I am not sure I understand. Do you mean that the OOM killer should
attempt to select a victim which is doing a GFP_NOFS allocation, or an
allocation in general?

> We are also guaranteed to be in a state where memory allocation
> failure *cannot be tolerated* because failure to complete the
> modification leaves the filesystem in a "corrupt in memory" state.
> We don't use GFP_NOFAIL because it's deprecated, but the reality is
> that we need to ensure memory allocation eventually succeeds because
> we *cannot go backwards*.
> 
> The choice is simple: memory allocation fails, we shut down the
> filesystem and guarantee that we DOS the entire machine because the
> filesystems have gone AWOL; or we keep trying memory allocation
> until it succeeds.

Would it be possible to drop the locks and retry the allocations?
Is the context which is doing this transaction a killable context?

> So, memory allocation generally succeeds eventually, and we have
> these loops around kmalloc(), kmem_cache_alloc() and alloc_page()
> that ensure allocation succeeds. Those loops also guarantee we get
> warnings when allocation is repeatedly failing and we might have
> actually hit a OOM deadlock situation.

As pointed out in another email, this should be done in the page
allocator IMO.

> > Hopelessly looping in the page allocator without GFP_NOFAIL is too risky
> > as well and we should get rid of this.
> 
> Yet the exact situation we need GFP_NOFAIL is the situation that you
> are calling a bug.
> 
> > Why should we still try to loop
> > when previous 1000 attempts failed with OOM killer invocation? Can we
> > simply fail after a configurable number of attempts?
> 
> OTOH, why should the memory allocator care what failure policy the
> callers have?

It is not about the failure policy of the caller. It is about how long
the allocator tries before it gives up. A good allocator tries hard,
but not too hard if the caller is able to handle the failure, because
it is the caller who defines the fallback policy.
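
To illustrate (just a sketch, not code from any particular subsystem),
a caller-defined fallback policy can look like this:

#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

/*
 * Ask for a physically contiguous buffer, but do not push the
 * allocator into retries and the OOM killer when a vmalloc'ed
 * buffer would do. The fallback is the caller's decision.
 */
static void *alloc_with_fallback(size_t size)
{
	void *p = kmalloc(size, GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);

	if (!p)
		p = vmalloc(size);
	return p;	/* free with is_vmalloc_addr(p) ? vfree(p) : kfree(p) */
}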

> > This is prone to
> > reveal unchecked allocation failures but those are bugs as well and we
> > shouldn't pretend otherwise.
> > 
> > > As long
> > > as it's legal to enter the allocator with *anything* that can prevent
> > > another random task in the system from making progress, we have this
> > > deadlock potential.  One side has to give up, and it can't be the page
> > > allocator because it has to support __GFP_NOFAIL allocations, which
> > > are usually exactly the allocations that are buried in hard-to-unwind
> > > state that is likely to trip up exiting OOM victims.
> > 
> > I am not convinced that GFP_NOFAIL is the biggest problem. Most of
> > the OOM livelocks I have seen were either due to GFP_KERNEL treated
> > as GFP_NOFAIL or an incorrect gfp mask (e.g. GFP_FS added where not
> > appropriate). I think we should focus on this part before we start
> > adding heuristics into the OOM killer.
> 
> Having the OOM killer being able to kill the process that triggered
> it would be a good start.

Not sure I understand. Do you mean sysctl_oom_kill_allocating_task?

> More often than not, that is the process
> that needs killing, and the oom killer implementation currently
> cannot do anything about that process.

Can you elaborate? AFAICS the process which has triggered the OOM is
the easiest victim to kill. It is not blocked on any locks, so it just
needs to get out of the kernel.

> Make the OOM killer only be
> invoked by kswapd or some other independent kernel thread so that it
> is independent of the allocation context that needs to invoke it,
> and have the invoker wait to be told what to do.

Again, I am not sure I understand. The OOM killer doesn't block the
context which has triggered the OOM condition. The allocation is
retried after the OOM killer invocation, and if the current context is
the victim, the allocation failure is expedited.

> That way it can kill the invoking process if that's the one that
> needs to be killed, and then all "can't kill processes because the
> invoker holds locks they depend on" go away.

Except that killing the messenger is not the best strategy...
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-18  8:25                                           ` Michal Hocko
@ 2015-02-18 10:48                                             ` Dave Chinner
  0 siblings, 0 replies; 276+ messages in thread
From: Dave Chinner @ 2015-02-18 10:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm,
	mgorman, rientjes, akpm, torvalds

On Wed, Feb 18, 2015 at 09:25:02AM +0100, Michal Hocko wrote:
> On Wed 18-02-15 09:54:30, Dave Chinner wrote:
> > [ cc xfs list - experienced kernel devs should not have to be
> > reminded to do this ]
> > 
> > On Tue, Feb 17, 2015 at 07:53:15AM -0500, Johannes Weiner wrote:
> [...]
> > > void *
> > > kmem_alloc(size_t size, xfs_km_flags_t flags)
> > > {
> > > 	int	retries = 0;
> > > 	gfp_t	lflags = kmem_flags_convert(flags);
> > > 	void	*ptr;
> > > 
> > > 	do {
> > > 		ptr = kmalloc(size, lflags);
> > > 		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
> > > 			return ptr;
> > > 		if (!(++retries % 100))
> > > 			xfs_err(NULL,
> > > 		"possible memory allocation deadlock in %s (mode:0x%x)",
> > > 					__func__, lflags);
> > > 		congestion_wait(BLK_RW_ASYNC, HZ/50);
> > > 	} while (1);
> > > }
> > > 
> > > This should use __GFP_NOFAIL, which is not only designed to annotate
> > > broken code like this, but also recognizes that endless looping on a
> > > GFP_NOFS allocation needs the OOM killer after all to make progress.
> > > 
> > > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > > index a7a3a63bb360..17ced1805d3a 100644
> > > --- a/fs/xfs/kmem.c
> > > +++ b/fs/xfs/kmem.c
> > > @@ -45,20 +45,12 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> > >  void *
> > >  kmem_alloc(size_t size, xfs_km_flags_t flags)
> > >  {
> > > -	int	retries = 0;
> > >  	gfp_t	lflags = kmem_flags_convert(flags);
> > > -	void	*ptr;
> > >  
> > > -	do {
> > > -		ptr = kmalloc(size, lflags);
> > > -		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
> > > -			return ptr;
> > > -		if (!(++retries % 100))
> > > -			xfs_err(NULL,
> > > -		"possible memory allocation deadlock in %s (mode:0x%x)",
> > > -					__func__, lflags);
> > > -		congestion_wait(BLK_RW_ASYNC, HZ/50);
> > > -	} while (1);
> > > +	if (!(flags & (KM_MAYFAIL | KM_NOSLEEP)))
> > > +		lflags |= __GFP_NOFAIL;
> > > +
> > > +	return kmalloc(size, lflags);
> > >  }
> > 
> > Hmmm - the only reason there is a focus on this loop is that it
> > emits warnings about allocations failing.
> 
> Such a warning should be part of the allocator, and the whole point
> why I like the patch is that we should really warn in a single place.
> I was thinking of a simple warning (e.g. like the above) and having
> something more sophisticated when lockdep is enabled.
> 
> > It's obvious that the
> > problem being dealt with here is a fundamental design issue w.r.t.
> > to locking and the OOM killer, but the proposed special casing
> > hack^H^H^H^Hband aid^W^Wsolution is not "working" because some code
> > in XFS started emitting warnings about allocations failing more
> > often.
> > 
> > So the answer is to remove the warning?  That's like killing the
> > canary to stop the methane leak in the coal mine. No canary? No
> > problems!
> 
> Not at all. I cannot speak for Johannes but I am pretty sure his
> motivation wasn't to simply silence the warning. The thing is that
> kernel code paths other than the page allocator shouldn't emulate
> behavior for which we have a gfp flag.
> 
> > Right now, the oom killer is a liability. Over the past 6 months
> > I've slowly had to exclude filesystem regression tests from running
> > on small memory machines because the OOM killer is now so unreliable
> > that it kills the test harness regularly rather than the process
> > generating memory pressure.
> 
> It would be great to get bug reports.

I thought we were talking about a manifestation of the problems I've
been seeing....

> > That's a big red flag to me that all
> > this hacking around the edges is not solving the underlying problem,
> > but instead is breaking things that did once work.
> 
> I am heavily trying to discourage people from adding random hacks to
> the already complicated and subtle OOM code.
> 
> > And, well, then there's this (gfp.h):
> > 
> >  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
> >  * cannot handle allocation failures.  This modifier is deprecated and no new
> >  * users should be added.
> > 
> > So, is this another policy revelation from the mm developers about
> > the kmalloc API? i.e. that __GFP_NOFAIL is no longer deprecated?
> 
> It is deprecated and shouldn't be used. But that doesn't mean that
> users should work around this by developing their own alternative.

I'm kinda sick of hearing that, as if saying it enough times will
make reality change. We have a *hard requirement* for memory
allocation to make forward progress, otherwise we *fail
catastrophically*.

History lesson - June 2004:

http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=commitdiff;h=b30a2f7bf90593b12dbc912e4390b1b8ee133ea9

So, we're hardly working around the deprecation of GFP_NOFAIL when
the code existed 5 years before GFP_NOFAIL was deprecated. Indeed,
GFP_NOFAIL was shiny and new back then, having been introduced by
Andrew Morton back in 2003.

> I agree the
> wording could be clearer and mention that if the allocation failure
> is absolutely unacceptable then the flag can be used rather than
> creating a loop around the allocator. What do you think about the following?
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index b840e3b2770d..ee6440ccb75d 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -57,8 +57,12 @@ struct vm_area_struct;
>   * _might_ fail.  This depends upon the particular VM implementation.
>   *
>   * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
> - * cannot handle allocation failures.  This modifier is deprecated and no new
> - * users should be added.
> + * cannot handle allocation failures.  This modifier is deprecated for allocations
> + * with order > 1. Besides that, this modifier is very dangerous when the allocation
> + * happens under a lock because it creates a lock dependency invisible to the
> + * OOM killer, so it can livelock. If the allocation failure is _absolutely_
> + * unacceptable then the flag has to be used rather than looping around the
> + * allocator.

Doesn't change anything from an XFS point of view. We do order >1
allocations through the kmem_alloc() wrapper, and so we are still
doing something that is "not supported" even if we use GFP_NOFAIL
rather than our own loop.

Also, this reads as an excuse for the OOM killer being broken and for
not fixing it.  Keep in mind that we tell the memory alloc/reclaim
subsystem that *we hold locks* when we call into it. That's what
GFP_NOFS originally meant, and it's what it still means today in an
XFS context.

If the OOM killer is not obeying GFP_NOFS and deadlocking on locks
that the invoking context holds, then that is an OOM killer bug, not
a bug in the subsystem calling kmalloc(GFP_NOFS).

>   *
>   * __GFP_NORETRY: The VM implementation must not retry indefinitely.
>   *
> 
> > Or just another symptom of frantic thrashing because nobody actually
> > understands the problem or those that do are unwilling to throw out
> > the broken crap and redesign it?
> > 
> > If you are changing allocator behaviour and constraints, then you
> > better damn well think through that changes fully, then document
> > those changes, change all the relevant code to use the new API (not
> > just those that throw warnings in your face) and make sure
> > *everyone* knows about it. e.g. a LWN article explaining the changes
> > and how memory allocation is going to work into the future would be
> > a good start.
> 
> Well, I think the first step is to change the users of the allocator
> to not lie about gfp flags. So if the code is infinitely trying then
> it really should use GFP_NOFAIL flag.

That's a complete non-issue when it comes to deciding whether it is
safe to invoke the OOM killer or not!

> In the meantime page allocator
> should develop a proper diagnostic to help identify all the potential
> dependencies. Next we should start thinking whether all the existing
> GFP_NOFAIL paths are really necessary or the code can be
> refactored/reimplemented to accept allocation failures.

Last time the "just make filesystems handle memory allocation
failures" I pointed out what that meant for XFS: dirty transaction
rollback is required. That's freakin' complex, will double the
memory footprint of transactions, roughly double the CPU cost, and
greatly increase the complexity of the transaction subsystem. It's a
*major* rework of a significant amount of the XFS codebase and will
take at least a couple of years design, test and stabilise before
it could be rolled out to production.

I'm not about to spend a couple of years rewriting XFS just so the
VM can get rid of a GFP_NOFAIL user. Especially as we already
tell the Hammer of Last Resort the context in which it can work.

Move the OOM killer to kswapd - get it out of the direct reclaim
path altogether. If the system is so backed up on locks that it
cannot free any memory and has no reserves to satisfy the allocation
that kicked the OOM killer, then the OOM killer was not invoked soon
enough.

Hell, if you want a better way to proceed, then how about you allow
us to tell the MM subsystem how much memory reserve a specific set
of operations is going to require to complete? That's something that
we can do rough calculations for, and it integrates straight into
the existing transaction reservation system we already use for log
space and disk space, and we can tell the mm subsystem when the
reserve is no longer needed (i.e. last thing in transaction commit).

That way we don't start a transaction until the mm subsystem has
reserved enough pages for us to work with, and the reserve only
needs to be used when normal allocation has already failed, i.e.
rather than looping we get a page allocated from the reserve pool.

The reservations wouldn't be perfect, but the majority of the time
we'd be able to make progress and not need the OOM killer. And best
of all, there's no responsibility on the MM subsystem for preventing
OOM - getting the reservations right is the responsibility of the
subsystem using them.
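
In rough pseudo-API form (purely illustrative names - nothing like
this exists in the tree today), I mean something like:

#include <linux/gfp.h>

struct page;
struct mem_reserve;

/* Reserve nr_pages before the transaction starts; may sleep/reclaim. */
struct mem_reserve *mem_reserve_create(unsigned long nr_pages);

/* Dip into the reserve only after a normal allocation has failed. */
struct page *mem_reserve_alloc_page(struct mem_reserve *res, gfp_t gfp_mask);

/* Last thing in transaction commit: give the reserve back to the VM. */
void mem_reserve_destroy(struct mem_reserve *res);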

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-18  8:48                                       ` Michal Hocko
@ 2015-02-18 11:23                                           ` Tetsuo Handa
  0 siblings, 0 replies; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-18 11:23 UTC (permalink / raw)
  To: mhocko
  Cc: david, hannes, dchinner, linux-mm, rientjes, oleg, akpm, mgorman,
	torvalds, linux-fsdevel, fernando_b1

[ cc fsdevel list - watch out for side effect of 9879de7373fc (mm: page_alloc:
embed OOM killing naturally into allocation slowpath) which was merged between
3.19-rc6 and 3.19-rc7 , started from
http://marc.info/?l=linux-mm&m=142348457310066&w=2 ]

Replying in this post to points picked up from several posts in this thread.

Michal Hocko wrote:
> Besides that __GFP_WAIT callers should be prepared for the allocation
> failure and should better cope with it. So no, I really hate something
> like the above.

Those who do not want to retry with invoking the OOM killer are using
__GFP_WAIT + __GFP_NORETRY allocations.

Those who want to retry with invoking the OOM killer are using
__GFP_WAIT allocations.

Those who must retry forever with invoking the OOM killer, no matter how
many processes the OOM killer kills, are using __GFP_WAIT + __GFP_NOFAIL
allocations.
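
Expressed as code (an illustration using 3.19-era flags, not a quote
from any real caller):

#include <linux/slab.h>

static void gfp_retry_policy_examples(size_t size)
{
	/* Fail fast: no retry loop, no OOM killer; caller copes with NULL. */
	void *a = kmalloc(size, GFP_KERNEL | __GFP_NORETRY);

	/* Default: retry, possibly invoking the OOM killer (GFP_KERNEL
	 * includes __GFP_WAIT); may still return NULL. */
	void *b = kmalloc(size, GFP_KERNEL);

	/* Retry forever, no matter how many processes get killed. */
	void *c = kmalloc(size, GFP_KERNEL | __GFP_NOFAIL);

	kfree(a);
	kfree(b);
	kfree(c);	/* kfree(NULL) is a no-op */
}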

However, since use of __GFP_NOFAIL is prohibited, I think many of the
__GFP_WAIT users are expecting that the allocation fails only when
"the OOM killer set the TIF_MEMDIE flag on the caller but the caller
failed to allocate from memory reserves". Also, the implementation
before 9879de7373fc (mm: page_alloc: embed OOM killing naturally
into allocation slowpath) effectively supported __GFP_WAIT users
with such an expectation.

Michal Hocko wrote:
> Because they cannot perform any IO/FS transactions and that would lead
> to a premature OOM conditions way too easily. OOM killer is a _last
> resort_ reclaim opportunity not something that would happen just because
> you happen to be not able to flush dirty pages. 

But you should not have applied such a change without making the
necessary changes to GFP_NOFS / GFP_NOIO users with such an
expectation and testing them in linux-next.git . Applying such a
change after 3.19-rc6 is a sucker punch.

Michal Hocko wrote:
> Well, you are beating your machine to death so you can hardly get any
> time guarantee. It would be nice to have a better feedback mechanism to
> know when to back off and fail the allocation attempt which might be
> blocking OOM victim to pass away. This is extremely tricky because we
> shouldn't be too eager to fail just because of a sudden memory pressure.

Michal Hocko wrote:
> >   I wish only somebody like kswapd repeats the loop on behalf of all
> >   threads waiting at memory allocation slowpath...
> 
> This is the case when the kswapd is _able_ to cope with the memory
> pressure.

It looks wasteful to me that so many threads (more than the number of
available CPUs) are sleeping at cond_resched() in shrink_slab() when
I check SysRq-t. Imagine 1000 threads sleeping at cond_resched() in
shrink_slab() on a machine with only 1 CPU. Each thread gets a chance
to try calling the reclaim function only when all other threads have
given that thread a chance at cond_resched(). In such a situation the
threads almost mutually prevent each other from making progress. I
wish for the following mechanism.

  Prepare a kernel thread (to avoid being OOM-killed) and let
  __GFP_WAIT and __GFP_WAIT + __GFP_NOFAIL users wake up the kernel
  thread when they fail to allocate from the free list. The kernel
  thread calls shrink_slab() etc. (and also out_of_memory() as needed)
  and wakes up the users sleeping at wait_event().
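
A very rough sketch of what I mean (illustration only; the name
reclaimd and all details are made up):

#include <linux/kthread.h>
#include <linux/wait.h>
#include <linux/atomic.h>

static DECLARE_WAIT_QUEUE_HEAD(reclaimd_wq);
static DECLARE_WAIT_QUEUE_HEAD(reclaimd_done_wq);
static atomic_t reclaimd_requests = ATOMIC_INIT(0);
static atomic_t reclaimd_seq = ATOMIC_INIT(0);

static int reclaimd(void *unused)
{
	while (!kthread_should_stop()) {
		wait_event(reclaimd_wq, atomic_read(&reclaimd_requests));
		atomic_set(&reclaimd_requests, 0);
		/*
		 * Reclaim on behalf of everybody: this is where
		 * shrink_slab() etc. (and out_of_memory() as needed)
		 * would be called, by one thread instead of 1000.
		 */
		atomic_inc(&reclaimd_seq);
		wake_up_all(&reclaimd_done_wq);
	}
	return 0;
}

/* Called from the allocation slowpath instead of direct reclaim. */
static void reclaimd_wait_for_progress(void)
{
	int seq = atomic_read(&reclaimd_seq);

	atomic_inc(&reclaimd_requests);
	wake_up(&reclaimd_wq);
	wait_event(reclaimd_done_wq, atomic_read(&reclaimd_seq) != seq);
}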

Failing to allocate from the free list is a rare case. Therefore, the
context switches for asking somebody else to reclaim memory would be
an acceptable overhead. If such a mechanism is implemented, the 1000
threads other than that somebody can save CPU time by sleeping.
Avoiding the "almost mutually preventing each other from making
progress" situation would drastically shorten the stall time even if
I beat my machine to death. Such a mechanism might be similar to Dave
Chinner's

  Make the OOM killer only be invoked by kswapd or some other
  independent kernel thread so that it is independent of the
  allocation context that needs to invoke it, and have the
  invoker wait to be told what to do.

suggestion.

Dave Chinner wrote:
> Filesystems do demand paging of metadata within transactions, which
> means we are guaranteed to be holding locks when doing memory
> allocation. Indeed, this is what the GFP_NOFS allocation context is
> supposed to convey - we currently *hold locks* and so reclaim needs
> to be careful about recursion. I'll also argue that it means the OOM
> killer cannot kill the process attempting memory allocation for the
> same reason.

I agree with Dave Chinner about this.

I tested on an ext4 filesystem with two kernels: one stock Linux 3.19
and the other Linux 3.19 with

   -               /* The OOM killer does not compensate for light reclaim */
   -               if (!(gfp_mask & __GFP_FS))
   -                       goto out;

applied. I ran the Java-like stress program shown below (which is
multi-threaded and likely to be chosen by the OOM killer due to its
huge memory usage), with the ext4 filesystem set to remount read-only
upon filesystem error.

   # mount -o remount,errors=remount-ro /

---------- Testing program start ----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>

static int file_writer(void *unused)
{
        char buffer[128] = { };
        int fd;
        snprintf(buffer, sizeof(buffer) - 1, "/tmp/file.%u", getpid());
        fd = open(buffer, O_WRONLY | O_CREAT, 0600);
        unlink(buffer);
        while (write(fd, buffer, 1) == 1 && fsync(fd) == 0);
        return 0;
}

static void memory_consumer(void)
{
        const int fd = open("/dev/zero", O_RDONLY);
        unsigned long size;
        char *buf = NULL;
        for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
                char *cp = realloc(buf, size);
                if (!cp) {
                        size >>= 1;
                        break;
                }
                buf = cp;
        }
        read(fd, buf, size); /* Will cause OOM due to overcommit */
}

int main(int argc, char *argv[])
{
        int i;
        for (i = 0; i < 100; i++) {
                char *cp = malloc(4 * 1024);
                if (!cp || clone(file_writer, cp + 4 * 1024,
                                 CLONE_SIGHAND | CLONE_VM, NULL) == -1)
                        break;
        }
        memory_consumer();
        while (1)
                pause();
        return 0;
}
---------- Testing program end ----------

The former showed that the ext4 filesystem gets remounted read-only due
to filesystem errors with 50%+ reproducibility.

----------
[   72.440013] do_get_write_access: OOM for frozen_buffer
[   72.440014] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[   72.440015] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4735: Out of memory
(...snipped....)
[   72.495559] do_get_write_access: OOM for frozen_buffer
[   72.495560] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[   72.496839] do_get_write_access: OOM for frozen_buffer
[   72.496841] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[   72.505766] Aborting journal on device sda1-8.
[   72.505851] EXT4-fs (sda1): Remounting filesystem read-only
[   72.505853] EXT4-fs warning (device sda1): ext4_evict_inode:260: couldn't mark inode dirty (err -12)
[   72.507995] EXT4-fs warning (device sda1): ext4_evict_inode:260: couldn't mark inode dirty (err -12)
[   72.508773] EXT4-fs (sda1): Remounting filesystem read-only
[   72.508775] EXT4-fs warning (device sda1): ext4_evict_inode:260: couldn't mark inode dirty (err -12)
[   72.547799] do_get_write_access: OOM for frozen_buffer
[   72.706692] EXT4-fs warning (device sda1): ext4_evict_inode:260: couldn't mark inode dirty (err -12)
[   73.035416] EXT4-fs warning (device sda1): ext4_evict_inode:260: couldn't mark inode dirty (err -12)
[   73.291732] EXT4-fs warning (device sda1): ext4_evict_inode:260: couldn't mark inode dirty (err -12)
[   73.422171] EXT4-fs warning (device sda1): ext4_evict_inode:260: couldn't mark inode dirty (err -12)
[   73.511862] EXT4-fs warning (device sda1): ext4_evict_inode:260: couldn't mark inode dirty (err -12)
[   73.589174] EXT4-fs warning (device sda1): ext4_evict_inode:260: couldn't mark inode dirty (err -12)
[   73.665302] EXT4-fs warning (device sda1): ext4_evict_inode:260: couldn't mark inode dirty (err -12)
----------

On the other hand, the latter showed that the ext4 filesystem was never
remounted read-only because filesystem errors did not occur, though several
TIF_MEMDIE stalls which the timeout patch would handle were observed as with
the former.

As this is an ext4 filesystem, it would use GFP_NOFS. But does using
GFP_NOFS + __GFP_NOFAIL in the ext4 filesystem solve the problem? I
don't think so. The underlying block layer which the ext4 filesystem
calls would use GFP_NOIO. And memory allocation failures at the block
layer will result in I/O errors which are observed by users as
filesystem errors. Does passing __GFP_NOFAIL down to the block layer
solve the problem? I don't think so. There is no means to teach the
block layer that the filesystem layer is doing critical operations
where failure results in serious problems. Then, does using GFP_NOIO +
__GFP_NOFAIL at the block layer solve the problem? I don't think so.
It is nothing but bypassing the

   /* The OOM killer does not compensate for light reclaim */
   if (!(gfp_mask & __GFP_FS))
           goto out;

check by passing the __GFP_NOFAIL flag.

Michal Hocko wrote:
> Failing __GFP_WAIT allocation is perfectly fine IMO. Why do you think
> this is a problem?

Which choice is less painful for users: killing a user-space process,
or taking filesystem error actions (e.g. remount-ro or kernel panic)?
I believe that the !(gfp_mask & __GFP_FS) check is a bug and should be
removed.

Rather, shouldn't allocations without __GFP_FS get more of a chance to
succeed than allocations with __GFP_FS? If I were the author, I might
have added the check below instead.

   /* This is not a critical allocation. Don't invoke the OOM killer. */
   if (gfp_mask & __GFP_FS)
           goto out;

Falling into the retry loop with the same watermark might prevent
rescuer threads from doing the memory allocations needed to free
memory. Maybe we should use a lower watermark for GFP_NOIO and below, a
middle watermark for GFP_NOFS, and a high watermark for GFP_KERNEL and
above.
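
As a sketch (illustration only; the ALLOC_WMARK_* levels already exist
in mm/internal.h, so this would be page-allocator-internal code):

/* Choose the allocation watermark from the gfp mask. */
static int gfp_to_alloc_watermark(gfp_t gfp_mask)
{
	if (!(gfp_mask & __GFP_IO))
		return ALLOC_WMARK_MIN;		/* GFP_NOIO and below */
	if (!(gfp_mask & __GFP_FS))
		return ALLOC_WMARK_LOW;		/* GFP_NOFS */
	return ALLOC_WMARK_HIGH;		/* GFP_KERNEL and above */
}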

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-18 10:48                                             ` Dave Chinner
@ 2015-02-18 12:16                                               ` Michal Hocko
  0 siblings, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2015-02-18 12:16 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm,
	mgorman, rientjes, akpm, torvalds

On Wed 18-02-15 21:48:59, Dave Chinner wrote:
> On Wed, Feb 18, 2015 at 09:25:02AM +0100, Michal Hocko wrote:
> > On Wed 18-02-15 09:54:30, Dave Chinner wrote:
[...]
> Also, this reads as an excuse for the OOM killer being broken and for
> not fixing it.  Keep in mind that we tell the memory alloc/reclaim
> subsystem that *we hold locks* when we call into it. That's what
> GFP_NOFS originally meant, and it's what it still means today in an
> XFS context.

Sure, and the OOM killer will not be invoked in NOFS context. See
__alloc_pages_may_oom and the __GFP_FS check in there. So I do not see
where the OOM killer is broken.

The crucial problem we are dealing with is not GFP_NOFAIL triggering the
OOM killer but a lock dependency introduced by the following sequence:

	taskA			taskB			taskC
lock(A)							alloc()
alloc(gfp | __GFP_NOFAIL)	lock(A)			  out_of_memory
# looping for ever if we				    select_bad_process
# cannot make any progress				      victim = taskB

There is no way the OOM killer can tell that taskB is blocked and that
there is a dependency between A and B (without lockdep). That is why I
call NOFAIL under a lock dangerous and a bug.
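
Spelled out as code (my illustration of the sequence above, not a real
report):

#include <linux/mutex.h>
#include <linux/slab.h>

static DEFINE_MUTEX(A);

/* taskA: loops inside the allocator while holding A. */
static void *taskA_path(size_t size)
{
	void *p;

	mutex_lock(&A);
	p = kmalloc(size, GFP_KERNEL | __GFP_NOFAIL);	/* may loop forever */
	mutex_unlock(&A);
	return p;
}

/*
 * taskB: blocks on A. If taskC's out_of_memory() selects taskB as the
 * victim, taskB cannot exit and free its memory until taskA releases
 * A, and taskA never will while its allocation cannot make progress.
 */
static void taskB_path(void)
{
	mutex_lock(&A);
	mutex_unlock(&A);
}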

> If the OOM killer is not obeying GFP_NOFS and deadlocking on locks
> that the invoking context holds, then that is an OOM killer bug, not
> a bug in the subsystem calling kmalloc(GFP_NOFS).

I guess we are talking about different things here or what am I missing?
 
[...]
> > In the meantime page allocator
> > should develop a proper diagnostic to help identify all the potential
> > dependencies. Next we should start thinking whether all the existing
> > GFP_NOFAIL paths are really necessary or the code can be
> > refactored/reimplemented to accept allocation failures.
> 
> Last time the "just make filesystems handle memory allocation
> failures" I pointed out what that meant for XFS: dirty transaction
> rollback is required. That's freakin' complex, will double the
> memory footprint of transactions, roughly double the CPU cost, and
> greatly increase the complexity of the transaction subsystem. It's a
> *major* rework of a significant amount of the XFS codebase and will
> take at least a couple of years design, test and stabilise before
> it could be rolled out to production.
> 
> I'm not about to spend a couple of years rewriting XFS just so the
> VM can get rid of a GFP_NOFAIL user. Especially as we already
> tell the Hammer of Last Resort the context in which it can work.
> 
> Move the OOM killer to kswapd - get it out of the direct reclaim
> path altogether.

This doesn't change anything, as explained in the other email. The
triggering path doesn't wait for the victim to die.

> If the system is so backed up on locks that it
> cannot free any memory and has no reserves to satisfy the allocation
> that kicked the OOM killer, then the OOM killer was not invoked soon
> enough.
> 
> Hell, if you want a better way to proceed, then how about you allow
> us to tell the MM subsystem how much memory reserve a specific set
> of operations is going to require to complete? That's something that
> we can do rough calculations for, and it integrates straight into
> the existing transaction reservation system we already use for log
> space and disk space, and we can tell the mm subsystem when the
> reserve is no longer needed (i.e. last thing in transaction commit).
> 
> That way we don't start a transaction until the mm subsystem has
> reserved enough pages for us to work with, and the reserve only
> needs to be used when normal allocation has already failed, i.e.
> rather than looping we get a page allocated from the reserve pool.

I am not sure I understand the above, but aren't mempools a tool for
this purpose?
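
For reference, the standard mempool API provides exactly such a
pre-reserved pool for a fixed object type:

#include <linux/mempool.h>
#include <linux/slab.h>

static struct kmem_cache *obj_cache;	/* assumed created elsewhere */
static mempool_t *obj_pool;

static int obj_pool_init(void)
{
	/* Pre-allocate 16 objects that only this pool may hand out. */
	obj_pool = mempool_create_slab_pool(16, obj_cache);
	return obj_pool ? 0 : -ENOMEM;
}

/*
 * mempool_alloc(obj_pool, GFP_NOIO) first tries the slab allocator,
 * then falls back to the reserved elements; with __GFP_WAIT it sleeps
 * until an element is mempool_free()d rather than returning NULL.
 */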
 
> The reservations wouldn't be perfect, but the majority of the time
> we'd be able to make progress and not need the OOM killer. And best
> of all, there's no responsibility on the MM subsystem for preventing
> OOM - getting the reservations right is the responsibility of the
> subsystem using them.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-18 11:23                                           ` Tetsuo Handa
@ 2015-02-18 12:29                                             ` Michal Hocko
  0 siblings, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2015-02-18 12:29 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: david, hannes, dchinner, linux-mm, rientjes, oleg, akpm, mgorman,
	torvalds, linux-fsdevel, fernando_b1

On Wed 18-02-15 20:23:19, Tetsuo Handa wrote:
> [ cc fsdevel list - watch out for side effect of 9879de7373fc (mm: page_alloc:
> embed OOM killing naturally into allocation slowpath) which was merged between
> 3.19-rc6 and 3.19-rc7 , started from
> http://marc.info/?l=linux-mm&m=142348457310066&w=2 ]
> 
> Replying in this post picked up from several posts in this thread.
> 
> Michal Hocko wrote:
> > Besides that __GFP_WAIT callers should be prepared for the allocation
> > failure and should better cope with it. So no, I really hate something
> > like the above.
> 
> Those who do not want to retry with invoking the OOM killer are using
> __GFP_WAIT + __GFP_NORETRY allocations.
> 
> Those who want to retry with invoking the OOM killer are using
> __GFP_WAIT allocations.
> 
> Those who must retry forever with invoking the OOM killer, no matter how
> many processes the OOM killer kills, are using __GFP_WAIT + __GFP_NOFAIL
> allocations.
> 
> However, since use of __GFP_NOFAIL is prohibited,

IT IS NOT PROHIBITED. It is highly discouraged because GFP_NOFAIL is a
strong requirement and the caller should be really aware of the
consequences, especially when the allocation is done in a locked
context.

> I think many
> __GFP_WAIT users expect that the allocation fails only when
> "the OOM killer set TIF_MEMDIE flag to the caller but the caller
> failed to allocate from memory reserves".

This is not what __GFP_WAIT is defined for. It says that the allocator
might sleep.

> Also, the implementation
> before 9879de7373fc (mm: page_alloc: embed OOM killing naturally
> into allocation slowpath) effectively supported __GFP_WAIT users
> with such expectation.

The same goes for GFP_KERNEL == GFP_NOFAIL for small allocations
currently, which causes a lot of trouble that was not anticipated at
the time this was introduced. And we _should_ move away from that
model, because GFP_NOFAIL should be really explicit rather than
implicit.

> Michal Hocko wrote:
> > Because they cannot perform any IO/FS transactions and that would lead
> > to premature OOM conditions way too easily. The OOM killer is a _last
> > resort_ reclaim opportunity, not something that would happen just because
> > you happen to be unable to flush dirty pages.
> 
> But you should not have applied such a change without making the necessary
> changes to GFP_NOFS / GFP_NOIO users with such expectations and testing
> in linux-next.git. Applying such a change after 3.19-rc6 is a sucker punch.

This is nonsense. OOM was disabled for !__GFP_FS for ages (since
before the git era).
 
> Michal Hocko wrote:
> > Well, you are beating your machine to death so you can hardly get any
> > time guarantee. It would be nice to have a better feedback mechanism to
> > know when to back off and fail the allocation attempt which might be
> > blocking an OOM victim from passing away. This is extremely tricky because we
> > shouldn't be too eager to fail just because of a sudden memory pressure.
> 
> Michal Hocko wrote:
> > >   I wish only somebody like kswapd would repeat the loop on behalf of all
> > >   threads waiting in the memory allocation slowpath...
> > 
> > This is the case when the kswapd is _able_ to cope with the memory
> > pressure.
> 
> It looks wasteful to me that so many threads (more than the number of
> available CPUs) are sleeping at cond_resched() in shrink_slab() when
> checking SysRq-t. Imagine 1000 threads sleeping at cond_resched() in
> shrink_slab() on a machine with only 1 CPU. Each thread gets a chance
> to call a reclaim function only when all the other threads have given
> it a chance at cond_resched(). In such a situation the threads are almost
> mutually preventing each other from making progress. I wish for the
> following mechanism.

Feel free to send patches which are not breaking other loads...
[...]

> Michal Hocko wrote:
> > Failing __GFP_WAIT allocation is perfectly fine IMO. Why do you think
> > this is a problem?
> 
> Between killing a user space process and taking filesystem error actions
> (e.g. remount-ro or kernel panic), which choice is less painful for users?
> I believe the !(gfp_mask & __GFP_FS) check is a bug and should be removed.

A premature OOM kill just because the current allocator context doesn't
allow for real reclaim is even worse.

> Rather, shouldn't allocations without __GFP_FS get more chance to succeed
> than allocations with __GFP_FS? If I were the author, I might have added
> the check below instead.
> 
>    /* This is not a critical allocation. Don't invoke the OOM killer. */
>    if (gfp_mask & __GFP_FS)
>            goto out;

This doesn't make any sense whatsoever. Regular GFP_KERNEL|USER
allocations wouldn't invoke the OOM killer, and that includes page
faults and basically most allocations.

> Falling into the retry loop with the same watermark might prevent rescuer
> threads from doing the memory allocation that is needed to make memory
> free. Maybe we should use a lower watermark for GFP_NOIO and below, a
> middle watermark for GFP_NOFS, and a high watermark for GFP_KERNEL and
> above.
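
A rough sketch of what such a gfp-dependent watermark selection could
look like (illustrative only; this is not what the allocator does
today):

	/* Pick a zone watermark based on how constrained the context is. */
	static enum zone_watermarks gfp_to_watermark(gfp_t gfp_mask)
	{
		if (!(gfp_mask & __GFP_IO))
			return WMARK_MIN;	/* GFP_NOIO and below */
		if (!(gfp_mask & __GFP_FS))
			return WMARK_LOW;	/* GFP_NOFS */
		return WMARK_HIGH;		/* GFP_KERNEL and above */
	}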

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-18 12:29                                             ` Michal Hocko
@ 2015-02-18 14:06                                               ` Tetsuo Handa
  -1 siblings, 0 replies; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-18 14:06 UTC (permalink / raw)
  To: mhocko
  Cc: david, hannes, dchinner, linux-mm, rientjes, oleg, akpm, mgorman,
	torvalds, linux-fsdevel, fernando_b1

Michal Hocko wrote:
> Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > Because they cannot perform any IO/FS transactions and that would lead
> > > to premature OOM conditions way too easily. The OOM killer is a _last
> > > resort_ reclaim opportunity, not something that would happen just because
> > > you happen to be unable to flush dirty pages.
> > 
> > But you should not have applied such a change without making the necessary
> > changes to GFP_NOFS / GFP_NOIO users with such expectations and testing
> > in linux-next.git. Applying such a change after 3.19-rc6 is a sucker punch.
> 
> This is nonsense. OOM was disabled for !__GFP_FS for ages (since
> before the git era).
>  
Then I at least expect that filesystem error actions will not be taken so
trivially. Can we apply http://marc.info/?l=linux-mm&m=142418465615672&w=2 for
Linux 3.19-stable?

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-18 14:06                                               ` Tetsuo Handa
  (?)
@ 2015-02-18 14:25                                               ` Michal Hocko
  2015-02-19 10:48                                                   ` Tetsuo Handa
  -1 siblings, 1 reply; 276+ messages in thread
From: Michal Hocko @ 2015-02-18 14:25 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: david, hannes, dchinner, linux-mm, rientjes, oleg, akpm, mgorman,
	torvalds, linux-fsdevel, fernando_b1

On Wed 18-02-15 23:06:17, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > Because they cannot perform any IO/FS transactions and that would lead
> > > > to premature OOM conditions way too easily. The OOM killer is a _last
> > > > resort_ reclaim opportunity, not something that would happen just because
> > > > you happen to be unable to flush dirty pages.
> > > 
> > > But you should not have applied such a change without making the necessary
> > > changes to GFP_NOFS / GFP_NOIO users with such expectations and testing
> > > in linux-next.git. Applying such a change after 3.19-rc6 is a sucker punch.
> > 
> > This is nonsense. OOM was disabled for !__GFP_FS for ages (since
> > before the git era).
> >  
> Then I at least expect that filesystem error actions will not be taken so
> trivially. Can we apply http://marc.info/?l=linux-mm&m=142418465615672&w=2 for
> Linux 3.19-stable?

I do not understand. What kind of bug would be fixed by that change?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-18 12:16                                               ` Michal Hocko
@ 2015-02-18 21:31                                                 ` Dave Chinner
  -1 siblings, 0 replies; 276+ messages in thread
From: Dave Chinner @ 2015-02-18 21:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm,
	mgorman, rientjes, akpm, torvalds

On Wed, Feb 18, 2015 at 01:16:02PM +0100, Michal Hocko wrote:
> On Wed 18-02-15 21:48:59, Dave Chinner wrote:
> > On Wed, Feb 18, 2015 at 09:25:02AM +0100, Michal Hocko wrote:
> > > On Wed 18-02-15 09:54:30, Dave Chinner wrote:
> [...]
> > Also, this reads as an excuse for the OOM killer being broken and
> > not fixing it.  Keep in mind that we tell the memory alloc/reclaim
> > subsystem that *we hold locks* when we call into it. That's what
> > GFP_NOFS originally meant, and it's what it still means today in an
> > XFS context.
> 
> Sure, and the OOM killer will not be invoked in NOFS context. See
> __alloc_pages_may_oom and the __GFP_FS check in there. So I do not see
> where the OOM killer is broken.

I suspect that the page cache missing the correct GFP_NOFS was one
of the sources of the problems I've been seeing.

However, the oom killer exceptions are not checked if __GFP_NOFAIL
is present and so if we start using __GFP_NOFAIL then it will be
called in GFP_NOFS contexts...

> The crucial problem we are dealing with is not GFP_NOFAIL triggering the
> OOM killer but a lock dependency introduced by the following sequence:
> 
> 	taskA			taskB			taskC
> lock(A)							alloc()
> alloc(gfp | __GFP_NOFAIL)	lock(A)			  out_of_memory
> # looping for ever if we				    select_bad_process
> # cannot make any progress				      victim = taskB
> 
> There is no way the OOM killer can tell taskB is blocked and that there is
> a dependency between A and B (without lockdep). That is why I call NOFAIL
> under a lock dangerous and a bug.

Sure. However, eventually the OOM killer will select task A to be
killed because nothing else is working. That, at least, marks
taskA with TIF_MEMDIE and gives us a potential way to break the
deadlock.

But the bigger problem is this:

	taskA			taskB
lock(A)
alloc(GFP_NOFS|GFP_NOFAIL)		lock(A)
  out_of_memory
    select_bad_process
      victim = taskB

There is no way to *ever* resolve that dependency, because
taskA never leaves the allocator. Even if the oom killer selects
taskA and sets TIF_MEMDIE on it, the allocator ignores TIF_MEMDIE
because GFP_NOFAIL is set and continues to loop.
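
In code form, the same deadlock looks like this (sketch only):

	/* taskA */
	xfs_ilock(ip, XFS_ILOCK_EXCL);			/* takes lock A */
	ptr = kmalloc(size, GFP_NOFS | __GFP_NOFAIL);	/* loops forever */
	/*
	 * taskB, the chosen OOM victim, is blocked in xfs_ilock() on
	 * the same lock, so it can never exit and free memory, and
	 * __GFP_NOFAIL means taskA never gives up and never unlocks.
	 */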

This is why GFP_NOFAIL is not a solution to the "never fail"
allocation problem. The caller doing the "no fail" allocation _must
be able to set failure policy_. i.e. the choice of aborting and
shutting down because progress cannot be made, or continuing and
hoping for forward progress, is owned by the allocating context, not
the allocator.  The memory allocation subsystem cannot make that
choice for us as it has no concept of the failure characteristics of
the allocating context.

The situations in which this actually matters are extremely *rare* -
we've had these allocation loops in XFS for > 13 years, and we might
get one or two reports a year of these "possible allocation
deadlock" messages occurring. Changing *everything* for such a rare,
unusual event is not an efficient use of time or resources.

> > If the OOM killer is not obeying GFP_NOFS and deadlocking on locks
> > that the invoking context holds, then that is an OOM killer bug, not
> > a bug in the subsystem calling kmalloc(GFP_NOFS).
> 
> I guess we are talking about different things here or what am I missing?

From my perspective, you are tightly focussed on one aspect of the
problem and hence are not seeing the bigger picture: this is a
corner case of behaviour in a "last hope", brute force memory
reclaim technique that no production machine relies on for correct
or performant operation.

> [...]
> > > In the meantime page allocator
> > > should develop a proper diagnostic to help identify all the potential
> > > dependencies. Next we should start thinking whether all the existing
> > > GFP_NOFAIL paths are really necessary or the code can be
> > > refactored/reimplemented to accept allocation failures.
> > 
> > Last time the "just make filesystems handle memory allocation
> > failures" idea came up, I pointed out what that meant for XFS: dirty
> > transaction rollback is required. That's freakin' complex, will double the
> > memory footprint of transactions, roughly double the CPU cost, and
> > greatly increase the complexity of the transaction subsystem. It's a
> > *major* rework of a significant amount of the XFS codebase and will
> > take at least a couple of years to design, test and stabilise before
> > it could be rolled out to production.
> > 
> > I'm not about to spend a couple of years rewriting XFS just so the
> > VM can get rid of a GFP_NOFAIL user. Especially as we already
> > tell the Hammer of Last Resort the context in which it can work.
> > 
> > Move the OOM killer to kswapd - get it out of the direct reclaim
> > path altogether.
> 
> This doesn't change anything, as explained in the other email. The
> triggering path doesn't wait for the victim to die.

But it does - we wouldn't be talking about deadlocks if there were
no blocking dependencies. In this case, allocation keeps retrying
until the memory freed by the killed tasks enables it to make
forward progress. That's a side effect of the last revelation that
was made in this thread that low order allocations never fail...

> > If the system is that backed up on locks that it
> > cannot free any memory and has no reserves to satisfy the allocation
> > that kicked the OOM killer, then the OOM killer was not invoked soon
> > enough.
> > 
> > Hell, if you want a better way to proceed, then how about you allow
> > us to tell the MM subsystem how much memory reserve a specific set
> > of operations is going to require to complete? That's something that
> > we can do rough calculations for, and it integrates straight into
> > the existing transaction reservation system we already use for log
> > space and disk space, and we can tell the mm subsystem when the
> > reserve is no longer needed (i.e. last thing in transaction commit).
> > 
> > That way we don't start a transaction until the mm subsystem has
> > reserved enough pages for us to work with, and the reserve only
> > needs to be used when normal allocation has already failed. i.e
> > rather than looping we get a page allocated from the reserve pool.
> 
> I am not sure I understand the above, but aren't mempools a tool for
> this purpose?

I knew this question would be the next one - I even deleted a one
line comment from my last email that said "And no, mempools are not
a solution" because that needs a more thorough explanation than a
dismissive one-liner.

As you know, mempools require a forward progress guarantee on a
single type of object and the objects must be slab based.

In transaction context we allocate from inode slabs, xfs_buf slabs,
log item slabs (6 different ones, IIRC), btree cursor slabs, etc,
but then we also have direct page allocations for buffers, vm_map_ram()
for mapping multi-page buffers, uncounted heap allocations, etc.
We cannot make all of these mempools, nor can we meet the forward
progress requirements of a mempool because other allocations can
block and prevent progress.

Further, the objects have lifetimes that don't correspond to the
transaction life cycles, and hence even if we complete the
transaction there is no guarantee that the objects allocated within
a transaction are going to be returned to the mempool at its
completion.

IOWs, we need forward allocation progress guarantees on
(potentially) several megabytes of allocations from slab caches, the
heap and the page allocator, with all allocations in
unpredictable order, with objects of different lifetimes and life
cycles, which may, at any time, get stuck behind
objects locked in other transactions and hence can randomly block
until some other thread makes forward progress and completes a
transaction and unlocks the object.

The reservation would only need to cover the memory we need to
allocate and hold in the transaction (i.e. dirtied objects). There
are potentially unbounded amounts of memory required through demand
paging of buffers to find the metadata we need to modify, but demand
paged metadata that is read and then released is recoverable. i.e.
the shrinkers will free it as other memory demand requires, so it's
not included in reservation pools because it doesn't deplete the
amount of free memory.
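
Roughly - and purely hypothetical, the mm_* calls and
xfs_trans_worst_case_pages() below are invented for illustration:

	/*
	 * Reserve worst-case pages up front, analogous to the existing
	 * log and disk space grants, before dirtying anything.
	 */
	error = mm_reserve_pages(&tp->t_mm_res, xfs_trans_worst_case_pages(tp));
	if (error)
		return error;

	/* ... run the transaction; an allocation that fails the normal
	 * path is satisfied from the reserve instead of looping ... */

	xfs_trans_commit(tp, 0);
	mm_unreserve_pages(&tp->t_mm_res);	/* last thing in commit */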

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-18 21:31                                                 ` Dave Chinner
@ 2015-02-19  9:40                                                   ` Michal Hocko
  -1 siblings, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2015-02-19  9:40 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm,
	mgorman, rientjes, akpm, torvalds

On Thu 19-02-15 08:31:18, Dave Chinner wrote:
> On Wed, Feb 18, 2015 at 01:16:02PM +0100, Michal Hocko wrote:
> > On Wed 18-02-15 21:48:59, Dave Chinner wrote:
> > > On Wed, Feb 18, 2015 at 09:25:02AM +0100, Michal Hocko wrote:
> > > > On Wed 18-02-15 09:54:30, Dave Chinner wrote:
> > [...]
> > > Also, this reads as an excuse for the OOM killer being broken and
> > > not fixing it.  Keep in mind that we tell the memory alloc/reclaim
> > > subsystem that *we hold locks* when we call into it. That's what
> > > GFP_NOFS originally meant, and it's what it still means today in an
> > > XFS context.
> > 
> > Sure, and the OOM killer will not be invoked in NOFS context. See
> > __alloc_pages_may_oom and the __GFP_FS check in there. So I do not see
> > where the OOM killer is broken.
> 
> I suspect that the page cache missing the correct GFP_NOFS was one
> of the sources of the problems I've been seeing.
> 
> However, the oom killer exceptions are not checked if __GFP_NOFAIL

Yes this is true. This is an effect of 9879de7373fc (mm: page_alloc:
embed OOM killing naturally into allocation slowpath) and IMO a
desirable one. Requiring infinite retrying with a seriously restricted
reclaim context calls for trouble (e.g. a livelock with no way out
because regular reclaim cannot make any progress and the OOM killer as
the last resort will not happen).

> is present and so if we start using __GFP_NOFAIL then it will be
> called in GFP_NOFS contexts...
> 
> > The crucial problem we are dealing with is not GFP_NOFAIL triggering the
> > OOM killer but a lock dependency introduced by the following sequence:
> > 
> > 	taskA			taskB			taskC
> > lock(A)							alloc()
> > alloc(gfp | __GFP_NOFAIL)	lock(A)			  out_of_memory
> > # looping for ever if we				    select_bad_process
> > # cannot make any progress				      victim = taskB
> > 
> > There is no way the OOM killer can tell taskB is blocked and that there is
> > a dependency between A and B (without lockdep). That is why I call NOFAIL
> > under a lock dangerous and a bug.
> 
> Sure. However, eventually the OOM killer will select task A to be
> killed because nothing else is working.

That would require the OOM killer to be able to select another victim
while the current one is still alive. There were time-based heuristics
suggested to do this but I do not think they are the right way to handle
the problem, and they should be considered only if all other options fail.

One potential way would be giving GFP_NOFAIL contexts access to memory
reserves when the allocation domain (global/memcg/cpuset) is OOM.
Andrea was suggesting something like that IIRC.
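
Something along these lines in the allocator slow path, say (a sketch
only; oom_in_progress() is a made-up predicate for "this allocation
domain is currently OOM"):

	if ((gfp_mask & __GFP_NOFAIL) && oom_in_progress())
		alloc_flags |= ALLOC_NO_WATERMARKS;	/* dip into reserves */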

> That, at least, marks
> taskA with TIF_MEMDIE and gives us a potential way to break the
> deadlock.
> 
> But the bigger problem is this:
> 
> 	taskA			taskB
> lock(A)
> alloc(GFP_NOFS|GFP_NOFAIL)		lock(A)
>   out_of_memory
>     select_bad_process
>       victim = taskB
> 
> There is no way to *ever* resolve that dependency, because
> taskA never leaves the allocator. Even if the oom killer selects
> taskA and sets TIF_MEMDIE on it, the allocator ignores TIF_MEMDIE
> because GFP_NOFAIL is set and continues to loop.

TIF_MEMDIE will at least give the task access to memory reserves. Anyway
this is essentially the same category of livelock as above.

> This is why GFP_NOFAIL is not a solution to the "never fail"
> allocation problem. The caller doing the "no fail" allocation _must
> be able to set failure policy_. i.e. the choice of aborting and
> shutting down because progress cannot be made, or continuing and
> hoping for forward progress, is owned by the allocating context, not
> the allocator.

I completely agree that the failure policy is the caller's responsibility
and I would have no objections to something like:

	do {
		ptr = kmalloc(size, GFP_NOFS);
		if (ptr)
			return ptr;
		if (fatal_signal_pending(current))
			break;
		if (looping_too_long())
			break;
	} while (1);

	fallback_solution();

But this is not the case in kmem_alloc, which is essentially a GFP_NOFAIL
allocation with a warning and congestion_wait. There is no failure
policy defined there. The warning should be part of the allocator and
the NOFAIL policy should be explicit. So why exactly do you oppose
changing kmem_alloc (and others which do essentially the same)?
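
For reference, the loop in question looks roughly like this (quoting
fs/xfs/kmem.c from memory, slightly abbreviated):

	void *
	kmem_alloc(size_t size, xfs_km_flags_t flags)
	{
		int	retries = 0;
		gfp_t	lflags = kmem_flags_convert(flags);
		void	*ptr;

		do {
			ptr = kmalloc(size, lflags);
			if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
				return ptr;
			/* warn, wait a bit, and retry unconditionally */
			if (!(++retries % 100))
				xfs_err(NULL,
		"possible memory allocation deadlock in %s (mode:0x%x)",
						__func__, lflags);
			congestion_wait(BLK_RW_ASYNC, HZ/50);
		} while (1);
	}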

> The memory allocation subsystem cannot make that
> choice for us as it has no concept of the failure characteristics of
> the allocating context.

Of course. I wasn't arguing we should change allocation loops which have
a fallback policy as well. That is an entirely different thing. My point
was that we want to turn implicit GFP_NOFAIL equivalents into explicit
GFP_NOFAIL users so that the allocator can prevent livelocks if possible.

> The situations in which this actually matters are extremely *rare* -
> we've had these allocation loops in XFS for > 13 years, and we might
> get one or two reports a year of these "possible allocation
> deadlock" messages occurring. Changing *everything* for such a rare,
> unusual event is not an efficient use of time or resources.
> 
> > > If the OOM killer is not obeying GFP_NOFS and deadlocking on locks
> > > that the invoking context holds, then that is an OOM killer bug, not
> > > a bug in the subsystem calling kmalloc(GFP_NOFS).
> > 
> > I guess we are talking about different things here or what am I missing?
> 
> From my perspective, you are tightly focussed on one aspect of the
> problem and hence are not seeing the bigger picture: this is a
> corner case of behaviour in a "last hope", brute force memory
> reclaim technique that no production machine relies on for correct
> or performant operation.

Of course this is a corner case. And I am trying to prevent heuristics
which would optimize for such a corner case (multiple of them were
suggested in this thread).

The reason I care about GFP_NOFAIL is that there are apparently code
paths which do not tell the allocator that they are basically GFP_NOFAIL
without any fallback. This leads to two main problems: 1) we do not have
a good overview of how many code paths have such strong requirements, and
so cannot estimate e.g. how big memory reserves should be, and 2) the
allocator cannot help those paths (e.g. by giving them access to reserves
to break out of the livelock).
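
The kind of annotation I mean (sketch):

	/* before: an open-coded retry loop hiding the NOFAIL semantics
	 * from the allocator */
	do {
		ptr = kmalloc(size, GFP_NOFS);
	} while (!ptr);

	/* after: the hard requirement is visible to the allocator */
	ptr = kmalloc(size, GFP_NOFS | __GFP_NOFAIL);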

> > [...]
> > > > In the meantime page allocator
> > > > should develop a proper diagnostic to help identify all the potential
> > > > dependencies. Next we should start thinking whether all the existing
> > > > GFP_NOFAIL paths are really necessary or the code can be
> > > > refactored/reimplemented to accept allocation failures.
> > > 
> > > Last time the "just make filesystems handle memory allocation
> > > failures" idea came up, I pointed out what that meant for XFS: dirty
> > > transaction rollback is required. That's freakin' complex, will double the
> > > memory footprint of transactions, roughly double the CPU cost, and
> > > greatly increase the complexity of the transaction subsystem. It's a
> > > *major* rework of a significant amount of the XFS codebase and will
> > > take at least a couple of years to design, test and stabilise before
> > > it could be rolled out to production.
> > > 
> > > I'm not about to spend a couple of years rewriting XFS just so the
> > > VM can get rid of a GFP_NOFAIL user. Especially as we already
> > > tell the Hammer of Last Resort the context in which it can work.
> > > 
> > > Move the OOM killer to kswapd - get it out of the direct reclaim
> > > path altogether.
> > 
> > This doesn't change anything, as explained in the other email. The
> > triggering path doesn't wait for the victim to die.
> 
> But it does - we wouldn't be talking about deadlocks if there were
> no blocking dependencies. In this case, allocation keeps retrying
> until the memory freed by the killed tasks enables it to make
> forward progress. That's a side effect of the last revelation that
> was made in this thread that low order allocations never fail...

Sure, low order allocations being almost GFP_NOFAIL makes things much
worse of course. And this should be changed. We just have to think about
how to do it without breaking the universe. I hope we can discuss this
at LSF.

But even then I do not see how triggering the OOM killer from kswapd
would help here. Victims would be looping in the allocator whether the
actual killing happens from their own context or any other.

> > > If the system is that backed up on locks that it
> > > cannot free any memory and has no reserves to satisfy the allocation
> > > that kicked the OOM killer, then the OOM killer was not invoked soon
> > > enough.
> > > 
> > > Hell, if you want a better way to proceed, then how about you allow
> > > us to tell the MM subsystem how much memory reserve a specific set
> > > of operations is going to require to complete? That's something that
> > > we can do rough calculations for, and it integrates straight into
> > > the existing transaction reservation system we already use for log
> > > space and disk space, and we can tell the mm subsystem when the
> > > reserve is no longer needed (i.e. last thing in transaction commit).
> > > 
> > > That way we don't start a transaction until the mm subsystem has
> > > reserved enough pages for us to work with, and the reserve only
> > > needs to be used when normal allocation has already failed. i.e
> > > rather than looping we get a page allocated from the reserve pool.
> > 
> > I am not sure I understand the above, but aren't mempools a tool for
> > this purpose?
> 
> I knew this question would be the next one - I even deleted a one
> line comment from my last email that said "And no, mempools are not
> a solution" because that needs a more thorough explanation than a
> dismissive one-liner.
> 
> As you know, mempools require a forward progress guarantee on a
> single type of object and the objects must be slab based.
> 
> In transaction context we allocate from inode slabs, xfs_buf slabs,
> log item slabs (6 different ones, IIRC), btree cursor slabs, etc,
> but then we also have direct page allocations for buffers, vm_map_ram()
> for mapping multi-page buffers, uncounted heap allocations, etc.
> We cannot make all of these mempools, nor can we meet the forward
> progress requirements of a mempool because other allocations can
> block and prevent progress.
> 
> Further, the objects have lifetimes that don't correspond to the
> transaction life cycles, and hence even if we complete the
> transaction there is no guarantee that the objects allocated within
> a transaction are going to be returned to the mempool at its
> completion.
> 
> IOWs, we need forward allocation progress guarantees on
> (potentially) several megabytes of allocations from slab caches, the
> heap and the page allocator, with all allocations in
> unpredictable order, with objects of different lifetimes and life
> cycles, which may, at any time, get stuck behind
> objects locked in other transactions and hence can randomly block
> until some other thread makes forward progress and completes a
> transaction and unlocks the object.

Thanks for the clarification, I have to think about it some more,
though. My thinking was that mempools could be used for an emergency
pool with a pre-allocated memory which would be used in the non failing
contexts.

> The reservation would only need to cover the memory we need to
> allocate and hold in the transaction (i.e. dirtied objects). There
> are potentially unbounded amounts of memory required through demand
> paging of buffers to find the metadata we need to modify, but demand
> paged metadata that is read and then released is recoverable. i.e.
> the shrinkers will free it as other memory demand requires, so it's
> not included in reservation pools because it doesn't deplete the
> amount of free memory.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
@ 2015-02-19  9:40                                                   ` Michal Hocko
  0 siblings, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2015-02-19  9:40 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Johannes Weiner, Tetsuo Handa, dchinner, linux-mm, rientjes,
	oleg, akpm, mgorman, torvalds, xfs

On Thu 19-02-15 08:31:18, Dave Chinner wrote:
> On Wed, Feb 18, 2015 at 01:16:02PM +0100, Michal Hocko wrote:
> > On Wed 18-02-15 21:48:59, Dave Chinner wrote:
> > > On Wed, Feb 18, 2015 at 09:25:02AM +0100, Michal Hocko wrote:
> > > > On Wed 18-02-15 09:54:30, Dave Chinner wrote:
> > [...]
> > > Also, this reads as an excuse for the OOM killer being broken and
> > > not fixing it.  Keep in mind that we tell the memory alloc/reclaim
> > > subsystem that *we hold locks* when we call into it. That's what
> > > GFP_NOFS originally meant, and it's what it still means today in an
> > > XFS context.
> > 
> > Sure, and OOM killer will not be invoked in NOFS context. See
> > __alloc_pages_may_oom and __GFP_FS check in there. So I do not see where
> > is the OOM killer broken.
> 
> I suspect that the page cache missing the correct GFP_NOFS was one
> of the sources of the problems I've been seeing.
> 
> However, the oom killer exceptions are not checked if __GFP_NOFAIL

Yes this is true. This is an effect of 9879de7373fc (mm: page_alloc:
embed OOM killing naturally into allocation slowpath) and IMO a
desirable one. Requiring infinite retrying with a seriously restricted
reclaim context calls for troubles (e.g. livelock without no way out
because regular reclaim cannot make any progress and OOM killer as the
last resort will not happen).

> is present and so if we start using __GFP_NOFAIL then it will be
> called in GFP_NOFS contexts...
> 
> > The crucial problem we are dealing with is not GFP_NOFAIL triggering the
> > OOM killer but a lock dependency introduced by the following sequence:
> > 
> > 	taskA			taskB			taskC
> > lock(A)							alloc()
> > alloc(gfp | __GFP_NOFAIL)	lock(A)			  out_of_memory
> > # looping for ever if we				    select_bad_process
> > # cannot make any progress				      victim = taskB
> > 
> > There is no way OOM killer can tell taskB is blocked and that there is
> > dependency between A and B (without lockdep). That is why I call NOFAIL
> > under a lock as dangerous and a bug.
> 
> Sure. However, eventually the OOM killer with select task A to be
> killed because nothing else is working.

That would require OOM killer to be able to select another victim while
the current one is still alive. There were time based heuristics
suggested to do this but I do not think they are the right way to handle
the problem and should be considered only if all other options fail.

One potential way would be giving access to give GFP_NOFAIL context
access to memory reserves when the allocation domain
(global/memcg/cpuset) is OOM. Andrea was suggesting something like that
IIRC.

> That, at least, marks
> taskA with TIF_MEMDIE and gives us a potential way to break the
> deadlock.
> 
> But the bigger problem is this:
> 
> 	taskA			taskB
> lock(A)
> alloc(GFP_NOFS|GFP_NOFAIL)		lock(A)
>   out_of_memory
>     select_bad_process
>       victim = taskB
> 
> Because there is no way to *ever* resolve that dependency because
> taskA never leaves the allocator. Even if the oom killer selects
> taskA and set TIF_MEMDIE on it, the allocator ignores TIF_MEMDIE
> because GFP_NOFAIL is set and continues to loop.

TIF_MEMDIE will at least give the task access to memory reserves. Anyway
this is essentially the same category of livelock as above.

> This is why GFP_NOFAIL is not a solution to the "never fail"
> alloation problem. The caller doing the "no fail" allocation _must
> be able to set failure policy_. i.e. the choice of aborting and
> shutting down because progress cannot be made, or continuing and
> hoping for forwards progress is owned by the allocating context, no
> the allocator.

I completely agree that the failure policy is the caller responsibility
and I would have no objections to something like:

	do {
		ptr = kmalloc(size, GFP_NOFS);
		if (ptr)
			return ptr;
		if (fatal_signal_pending(current))
			break;
		if (looping_too_long())
			break;
	} while (1);

	fallback_solution();

But this is not the case in kmem_alloc which is essentially GFP_NOFAIL
allocation with a warning and congestion_wait. There is no failure
policy defined there. The warning should be part of the allocator and
the NOFAIL policy should be explicit. So why exactly do you oppose to
changing kmem_alloc (and others which are doing essentially the same)?

> The memory allocation subsystem cannot make that
> choice for us as it has no concept of the failure characteristics of
> the allocating context.

Of course. I wasn't arguing we should change allocation loops which have
a fallback policy as well. That is an entirely different thing. My point
was we want to turn GFP_NOFAIL equivalents to use GFP_NOFAIL so that the
allocator can prevent from livelocks if possible.

> The situations in which this actually matters are extremely *rare* -
> we've had these allocaiton loops in XFS for > 13 years, and we might
> get a one or two reports a year of these "possible allocation
> deadlock" messages occurring. Changing *everything* for such a rare,
> unusual event is not an efficient use of time or resources.
> 
> > > If the OOM killer is not obeying GFP_NOFS and deadlocking on locks
> > > that the invoking context holds, then that is a OOM killer bug, not
> > > a bug in the subsystem calling kmalloc(GFP_NOFS).
> > 
> > I guess we are talking about different things here or what am I missing?
> 
> From my perspective, you are tightly focussed on one aspect of the
> problem and hence are not seeing the bigger picture: this is a
> corner case of behaviour in a "last hope", brute force memory
> reclaim technique that no production machine relies on for correct
> or performant operation.

Of course this is a corner case. And I am trying to prevent heuristics
which would optimize for such a corner case (there were multiple of
them suggested in this thread).

The reason I care about GFP_NOFAIL is that there are apparently code
paths which do not tell allocator they are basically GFP_NOFAIL without
any fallback. This leads to two main problems 1) we do not have a good
overview how many code paths have such a strong requirements and so
cannot estimate e.g. how big memory reserves should be and 2) allocator
cannot help those paths (e.g. by giving them access to reserves to break
out of the livelock).

> > [...]
> > > > In the meantime page allocator
> > > > should develop a proper diagnostic to help identify all the potential
> > > > dependencies. Next we should start thinking whether all the existing
> > > > GFP_NOFAIL paths are really necessary or the code can be
> > > > refactored/reimplemented to accept allocation failures.
> > > 
> > > Last time the "just make filesystems handle memory allocation
> > > failures" I pointed out what that meant for XFS: dirty transaction
> > > rollback is required. That's freakin' complex, will double the
> > > memory footprint of transactions, roughly double the CPU cost, and
> > > greatly increase the complexity of the transaction subsystem. It's a
> > > *major* rework of a significant amount of the XFS codebase and will
> > > take at least a couple of years design, test and stabilise before
> > > it could be rolled out to production.
> > > 
> > > I'm not about to spend a couple of years rewriting XFS just so the
> > > VM can get rid of a GFP_NOFAIL user. Especially as the we already
> > > tell the Hammer of Last Resort the context in which it can work.
> > > 
> > > Move the OOM killer to kswapd - get it out of the direct reclaim
> > > path altogether.
> > 
> > This doesn't change anything as explained in other email. The triggering
> > path doesn't wait for the victim to die.
> 
> But it does - we wouldn't be talking about deadlocks if there were
> no blocking dependencies. In this case, allocation keeps retrying
> until the memory freed by the killed tasks enables it to make
> forward progress. That's a side effect of the last relevation that
> was made in this thread that low order allocations never fail...

Sure, low order allocations being almost GFP_NOFAIL makes things much
worse of course. And this should be changed. We just have to think about
the way how to do it without breaking the universe. I hope we can
discuss this at LSF.

But even then I do not see how triggering the OOM killer from kswapd
would help here. Victims would be looping in the allocator whether the
actual killing happens from their or any other context.

> > > If the system is that backed up on locks that it
> > > cannot free any memory and has no reserves to satisfy the allocation
> > > that kicked the OOM killer, then the OOM killer was not invoked soon
> > > enough.
> > > 
> > > Hell, if you want a better way to proceed, then how about you allow
> > > us to tell the MM subsystem how much memory reserve a specific set
> > > of operations is going to require to complete? That's something that
> > > we can do rough calculations for, and it integrates straight into
> > > the existing transaction reservation system we already use for log
> > > space and disk space, and we can tell the mm subsystem when the
> > > reserve is no longer needed (i.e. last thing in transaction commit).
> > > 
> > > That way we don't start a transaction until the mm subsystem has
> > > reserved enough pages for us to work with, and the reserve only
> > > needs to be used when normal allocation has already failed. i.e
> > > rather than looping we get a page allocated from the reserve pool.
> > 
> > I am not sure I understand the above but isn't the mempools a tool for
> > this purpose?
> 
> I knew this question would be the next one - I even deleted a one
> line comment from my last email that said "And no, mempools are not
> a solution" because that needs a more thorough explanation than a
> dismissive one-liner.
> 
> As you know, mempools require a forward progress guarantee on a
> single type of object and the objects must be slab based.
> 
> In transaction context we allocate from inode slabs, xfs_buf slabs,
> log item slabs (6 different ones, IIRC), btree cursor slabs, etc,
> but then we also have direct page allocations for buffers, vm_map_ram()
> for mapping multi-page buffers, uncounted heap allocations, etc.
> We cannot make all of these mempools, nor can me meet the forwards
> progress requirements of a mempool because other allocations can
> block and prevent progress.
> 
> Further, the object have lifetimes that don't correspond to the
> transaction life cycles, and hence even if we complete the
> transaction there is no guarantee that the objects allocated within
> a transaction are going to be returned to the mempool at it's
> completion.
> 
> IOWs, we have need for forward allocation progress guarantees on
> (potentially) several megabytes of allocations from slab caches, the
> heap and the page allocator, with all allocations all in
> unpredictable order, with objects of different life times and life
> cycles, and at which may, at any time, get stuck behind
> objects locked in other transactions and hence can randomly block
> until some other thread makes forward progress and completes a
> transaction and unlocks the object.

Thanks for the clarification, I have to think about it some more,
though. My thinking was that mempools could be used for an emergency
pool with a pre-allocated memory which would be used in the non failing
contexts.

> The reservation would only need to cover the memory we need to
> allocate and hold in the transaction (i.e. dirtied objects). There
> are potentially unbounded amounts of memory required through demand
> paging of buffers to find the metadata we need to modify, but demand
> paged metadata that is read and then released is recoverable, i.e.
> the shrinkers will free it as other memory demand requires, so it's
> not included in reservation pools because it doesn't deplete the
> amount of free memory.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-17 22:54                                         ` Dave Chinner
@ 2015-02-19 10:24                                           ` Johannes Weiner
  -1 siblings, 0 replies; 276+ messages in thread
From: Johannes Weiner @ 2015-02-19 10:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, akpm, torvalds

On Wed, Feb 18, 2015 at 09:54:30AM +1100, Dave Chinner wrote:
> [ cc xfs list - experienced kernel devs should not have to be
> reminded to do this ]
> 
> On Tue, Feb 17, 2015 at 07:53:15AM -0500, Johannes Weiner wrote:
> > On Tue, Feb 17, 2015 at 09:23:26PM +0900, Tetsuo Handa wrote:
> > > Tetsuo Handa wrote:
> > > > Johannes Weiner wrote:
> > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > > index 8e20f9c2fa5a..f77c58ebbcfa 100644
> > > > > --- a/mm/page_alloc.c
> > > > > +++ b/mm/page_alloc.c
> > > > > @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> > > > >  		if (high_zoneidx < ZONE_NORMAL)
> > > > >  			goto out;
> > > > >  		/* The OOM killer does not compensate for light reclaim */
> > > > > -		if (!(gfp_mask & __GFP_FS))
> > > > > +		if (!(gfp_mask & __GFP_FS)) {
> > > > > +			/*
> > > > > +			 * XXX: Page reclaim didn't yield anything,
> > > > > +			 * and the OOM killer can't be invoked, but
> > > > > +			 * keep looping as per should_alloc_retry().
> > > > > +			 */
> > > > > +			*did_some_progress = 1;
> > > > >  			goto out;
> > > > > +		}
> > > > 
> > > > Why do you omit out_of_memory() call for GFP_NOIO / GFP_NOFS allocations?
> > > 
> > > I can see "possible memory allocation deadlock in %s (mode:0x%x)" warnings
> > > at kmem_alloc() in fs/xfs/kmem.c . I think commit 9879de7373fcfb46 "mm:
> > > page_alloc: embed OOM killing naturally into allocation slowpath" introduced
> > > a regression, and the one below is the fix.
> > > 
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -2381,9 +2381,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> > >                 /* The OOM killer does not needlessly kill tasks for lowmem */
> > >                 if (high_zoneidx < ZONE_NORMAL)
> > >                         goto out;
> > > -               /* The OOM killer does not compensate for light reclaim */
> > > -               if (!(gfp_mask & __GFP_FS))
> > > -                       goto out;
> > >                 /*
> > >                  * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
> > >                  * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
> > 
> > Again, we don't want to OOM kill on behalf of allocations that can't
> > initiate IO, or even actively prevent others from doing it.  Not per
> > default anyway, because most callers can deal with the failure without
> > having to resort to killing tasks, and NOFS reclaim *can* easily fail.
> > It's the exceptions that should be annotated instead:
> > 
> > void *
> > kmem_alloc(size_t size, xfs_km_flags_t flags)
> > {
> > 	int	retries = 0;
> > 	gfp_t	lflags = kmem_flags_convert(flags);
> > 	void	*ptr;
> > 
> > 	do {
> > 		ptr = kmalloc(size, lflags);
> > 		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
> > 			return ptr;
> > 		if (!(++retries % 100))
> > 			xfs_err(NULL,
> > 		"possible memory allocation deadlock in %s (mode:0x%x)",
> > 					__func__, lflags);
> > 		congestion_wait(BLK_RW_ASYNC, HZ/50);
> > 	} while (1);
> > }
> > 
> > This should use __GFP_NOFAIL, which is not only designed to annotate
> > broken code like this, but also recognizes that endless looping on a
> > GFP_NOFS allocation needs the OOM killer after all to make progress.
> > 
> > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > index a7a3a63bb360..17ced1805d3a 100644
> > --- a/fs/xfs/kmem.c
> > +++ b/fs/xfs/kmem.c
> > @@ -45,20 +45,12 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> >  void *
> >  kmem_alloc(size_t size, xfs_km_flags_t flags)
> >  {
> > -	int	retries = 0;
> >  	gfp_t	lflags = kmem_flags_convert(flags);
> > -	void	*ptr;
> >  
> > -	do {
> > -		ptr = kmalloc(size, lflags);
> > -		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
> > -			return ptr;
> > -		if (!(++retries % 100))
> > -			xfs_err(NULL,
> > -		"possible memory allocation deadlock in %s (mode:0x%x)",
> > -					__func__, lflags);
> > -		congestion_wait(BLK_RW_ASYNC, HZ/50);
> > -	} while (1);
> > +	if (!(flags & (KM_MAYFAIL | KM_NOSLEEP)))
> > +		lflags |= __GFP_NOFAIL;
> > +
> > +	return kmalloc(size, lflags);
> >  }
> 
> Hmmm - the only reason there is a focus on this loop is that it
> emits warnings about allocations failing. It's obvious that the
> problem being dealt with here is a fundamental design issue w.r.t.
> locking and the OOM killer, but the proposed special casing
> hack^H^H^H^Hband aid^W^Wsolution is not "working" because some code
> in XFS started emitting warnings about allocations failing more
> often.
> 
> So the answer is to remove the warning?  That's like killing the
> canary to stop the methane leak in the coal mine. No canary? No
> problems!

That's not what happened.  The patch that affected behavior here
transformed code that was an incoherent collection of conditions into
something that has an actual model.  That model is that we don't loop
in the allocator if there are no means of making forward progress.  In
this case, it was GFP_NOFS triggering an early exit from the allocator
because it's not allowed to invoke the OOM killer per default, and
there is little point in looping and waiting for things to get better
on their own.

So these deadlock warnings happen, ironically, because the page
allocator now bails out of a locked-up state in which it's not making
forward progress.  They don't strike me as a very useful canary in this
case.

> Right now, the oom killer is a liability. Over the past 6 months
> I've slowly had to exclude filesystem regression tests from running
> on small memory machines because the OOM killer is now so unreliable
> that it kills the test harness regularly rather than the process
> generating memory pressure. That's a big red flag to me that all
> this hacking around the edges is not solving the underlying problem,
> but instead is breaking things that did once work.
> 
> And, well, then there's this (gfp.h):
> 
>  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
>  * cannot handle allocation failures.  This modifier is deprecated and no new
>  * users should be added.
> 
> So, is this another policy revelation from the mm developers about
> the kmalloc API? i.e. that __GFP_NOFAIL is no longer deprecated?
> Or just another symptom of frantic thrashing because nobody actually
> understands the problem or those that do are unwilling to throw out
> the broken crap and redesign it?

Well, understand our dilemma here.  __GFP_NOFAIL is a liability
because it can trap tasks with unknown state and locks in a
potentially never-ending loop, and we don't want people to start using
it as a convenient way out of having a fallback strategy.

However, if your entire architecture around a particular allocation is
that failure is not an option at this point, and you can't reasonably
preallocate - although that would always be preferable - then please
do not open code an endless loop around the call to the allocator but
use __GFP_NOFAIL instead so that these callsites are annotated and can
be reviewed.  By giving the allocator this information, it can then
also adjust its behavior, as is the case right here: we don't
usually want to OOM kill for regular GFP_NOFS allocations because
their reclaim powers are weak and we don't want to kill tasks
prematurely.  But if your NOFS allocation cannot fail under any
circumstances, then the OOM killer should very much be employed to
make any kind of forward progress at all for this allocation.  It's
just that the allocator needs to be made aware of this requirement.

So yes, we are wary of __GFP_NOFAIL allocations, but this is an
instance where it's the right way to communicate with the allocator:
it was introduced to replace such open-coded endless loops and to put
the liability of making progress on the allocator, not the caller.

And please understand that this callsite blowing up is a chance to
better the code and behavior here.  Where previously it would just
endlessly loop in the allocator without any means to make progress,
converting it to a __GFP_NOFAIL allocation tells the allocator that
it's fine to use the OOM killer in such an instance, improving the
chances that this caller will actually make headway under heavy load.

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-18 14:25                                               ` Michal Hocko
@ 2015-02-19 10:48                                                   ` Tetsuo Handa
  0 siblings, 0 replies; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-19 10:48 UTC (permalink / raw)
  To: mhocko
  Cc: david, hannes, dchinner, linux-mm, rientjes, oleg, akpm, mgorman,
	torvalds, linux-fsdevel, fernando_b1

Michal Hocko wrote:
> Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > Because they cannot perform any IO/FS transactions and that would lead
> > > > > to premature OOM conditions way too easily. The OOM killer is a _last
> > > > > resort_ reclaim opportunity, not something that should happen just because
> > > > > you happen to be unable to flush dirty pages. 
> > > > 
> > > > But you should not have applied such a change without making the
> > > > necessary changes to GFP_NOFS / GFP_NOIO users with such expectations,
> > > > and testing at linux-next.git . Applying such a change after 3.19-rc6
> > > > is a sucker punch.
> > > 
> > > This is nonsense. OOM has been disabled for !__GFP_FS for ages (since
> > > before the git era).
> > >  
> > Then, at least I expect that filesystem error actions will not be taken so
> > trivially. Can we apply http://marc.info/?l=linux-mm&m=142418465615672&w=2 for
> > Linux 3.19-stable?
> 
> I do not understand. What kind of bug would be fixed by that change?

That change fixes a significant loss of file I/O reliability under
extreme memory pressure.

Today I tested how frequently filesystem errors occur using a scripted
environment.
( Source code of a.out is http://marc.info/?l=linux-fsdevel&m=142425860904849&w=2 )

----------
#!/bin/sh
: > ~/trial.log
for i in `seq 1 100`
do
    mkfs.ext4 -q /dev/sdb1 || exit 1
    mount -o errors=remount-ro /dev/sdb1 /tmp || exit 2
    chmod 1777 /tmp
    su - demo -c ~demo/a.out
    if [ -w /tmp/ ]
    then
        echo -n "S" >> ~/trial.log
    else
        echo -n "F" >> ~/trial.log
    fi
    umount /tmp
done
----------

We can see that filesystem errors occur frequently if GFP_NOFS / GFP_NOIO
allocations give up without retrying. On the other hand, as far as these
trials go, no TIF_MEMDIE stall was observed if GFP_NOFS / GFP_NOIO
allocations give up without retrying. Maybe giving up without retrying
keeps us away from hitting stalls for this test case?

  Linux 3.19-rc6 (Console log is http://I-love.SAKURA.ne.jp/tmp/serial-20150219-3.19-rc6.txt.xz )

    0 filesystem errors out of 100 trials. 2 stalls.
    SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS

  Linux 3.19 (Console log is http://I-love.SAKURA.ne.jp/tmp/serial-20150219-3.19.txt.xz )

    44 filesystem errors out of 100 trials. 0 stalls.
    SSFFSSSFSSSFSFFFFSSFSSFSSSSSSFFFSFSFFSSSSSSFFFFSFSSFFFSSSSFSSFFFFFSSSSSFSSFSFSSFSFFFSFFFFFFFSSSSSSSS

  Linux 3.19 with http://marc.info/?l=linux-mm&m=142418465615672&w=2 applied.
  (Console log is http://I-love.SAKURA.ne.jp/tmp/serial-20150219-3.19-patched.txt.xz )

    0 filesystem errors out of 100 trials. 2 stalls.
    SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS

If the result of Linux 3.19 is what you wanted, we should call on fs
developers for immediate action. (But the __GFP_NOFAIL discussion between
you and Dave is in progress. I don't know whether ext4 and the underlying
subsystems should start using __GFP_NOFAIL.)

P.S. Just for experimental purposes, Linux 3.19 with the change below
applied gave a better result than retrying GFP_NOFS / GFP_NOIO
allocations without invoking the OOM killer. Can short-lived small
GFP_NOFS / GFP_NOIO allocations use GFP_ATOMIC instead? How many bytes
does blk_rq_map_kern() want?

  --- a/mm/page_alloc.c
  +++ b/mm/page_alloc.c
  @@ -2867,6 +2867,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
           int classzone_idx;

           gfp_mask &= gfp_allowed_mask;
  +        if (gfp_mask == GFP_NOFS || gfp_mask == GFP_NOIO)
  +                gfp_mask = GFP_ATOMIC;

           lockdep_trace_alloc(gfp_mask);

    0 filesystem errors out of 100 trials. 0 stalls.
    SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-18 12:16                                               ` Michal Hocko
@ 2015-02-19 11:01                                                 ` Johannes Weiner
  -1 siblings, 0 replies; 276+ messages in thread
From: Johannes Weiner @ 2015-02-19 11:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, dchinner, oleg, xfs, linux-mm, mgorman, rientjes,
	akpm, torvalds

On Wed, Feb 18, 2015 at 01:16:02PM +0100, Michal Hocko wrote:
> On Wed 18-02-15 21:48:59, Dave Chinner wrote:
> > On Wed, Feb 18, 2015 at 09:25:02AM +0100, Michal Hocko wrote:
> > > On Wed 18-02-15 09:54:30, Dave Chinner wrote:
> [...]
> > Also, this reads as an excuse for the OOM killer being broken and
> > not fixing it.  Keep in mind that we tell the memory alloc/reclaim
> > subsystem that *we hold locks* when we call into it. That's what
> > GFP_NOFS originally meant, and it's what it still means today in an
> > XFS context.
> 
> Sure, and the OOM killer will not be invoked in NOFS context. See
> __alloc_pages_may_oom and the __GFP_FS check in there. So I do not see
> where the OOM killer is broken.
> 
> The crucial problem we are dealing with is not GFP_NOFAIL triggering the
> OOM killer but a lock dependency introduced by the following sequence:
> 
> 	taskA			taskB			taskC
> lock(A)							alloc()
> alloc(gfp | __GFP_NOFAIL)	lock(A)			  out_of_memory
> # looping for ever if we				    select_bad_process
> # cannot make any progress				      victim = taskB

You don't even need taskC here.  taskA could invoke the OOM killer
with lock(A) held, and taskB could get selected as the victim while
trying to acquire lock(A).  It'll get the signal and TIF_MEMDIE and
then wait for lock(A) while taskA is waiting for it to exit.

But it doesn't matter who is doing the OOM killing - if the allocating
task with the lock/state is waiting for the OOM victim to free memory,
and the victim is waiting for the same lock/state, we have a deadlock.

> There is no way the OOM killer can tell taskB is blocked and that there
> is a dependency between A and B (without lockdep). That is why I call
> NOFAIL under a lock dangerous and a bug.

You keep ignoring that it's also one of the main usecases of this
flag.  The caller has state that it can't unwind and thus needs the
allocation to succeed.  Chances are somebody else can get blocked up
on that same state.  And when that somebody else is the first choice
of the OOM killer, we're screwed.

This is exactly why I'm proposing that the OOM killer should not wait
indefinitely for its first choice to exit, but ultimately move on and
try other tasks.  There is no other way to resolve this deadlock.
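
To sketch the idea (purely illustrative - the memdie_start timestamp
below is made up for this example and does not exist in the kernel):

#include <linux/jiffies.h>
#include <linux/sched.h>

/*
 * If the current victim has held TIF_MEMDIE for too long, assume it is
 * stuck on state somebody else holds and let the OOM killer move on.
 * memdie_start is a hypothetical field recording when the victim got
 * TIF_MEMDIE.
 */
static bool oom_victim_seems_stuck(struct task_struct *victim)
{
	return test_tsk_thread_flag(victim, TIF_MEMDIE) &&
	       time_after(jiffies, victim->memdie_start + 10 * HZ);
}

	/* in select_bad_process(), while iterating the candidates: */
	if (oom_victim_seems_stuck(task))
		continue;	/* don't wait for it forever */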

Preferably, we'd get rid of all nofail allocations and replace them
with preallocated reserves.  But this is not going to happen anytime
soon, so what other option do we have than resolving this on the OOM
killer side?

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19 11:01                                                 ` Johannes Weiner
@ 2015-02-19 12:29                                                   ` Michal Hocko
  -1 siblings, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2015-02-19 12:29 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, linux-mm, mgorman, rientjes,
	akpm, torvalds

On Thu 19-02-15 06:01:24, Johannes Weiner wrote:
[...]
> Preferrably, we'd get rid of all nofail allocations and replace them
> with preallocated reserves.  But this is not going to happen anytime
> soon, so what other option do we have than resolving this on the OOM
> killer side?

As I've mentioned in another email, we might give __GFP_NOFAIL
allocations access to memory reserves (by giving them __GFP_HIGH). This
is still not a 100% solution because the reserves could get depleted,
but that risk is there even with multiple OOM victims. I would still
argue that this would be a better approach because selecting more
victims might hit a pathological case more easily (other victims might
be blocked on the very same lock, for example).

Something like the following:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8d52ab18fe0d..4b5cf28a13f4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2599,6 +2599,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
 	bool deferred_compaction = false;
 	int contended_compaction = COMPACT_CONTENDED_NONE;
+	int oom = 0;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -2628,6 +2629,15 @@ retry:
 		wake_all_kswapds(order, ac);
 
 	/*
+	 * __GFP_NOFAIL allocations cannot fail but yet the current context
+	 * might be blocking resources needed by the OOM victim to terminate.
+	 * Allow the caller to dive into memory reserves to succeed the
+	 * allocation and break out from a potential deadlock.
+	 */
+	if (oom > 10 && (gfp_mask & __GFP_NOFAIL))
+		gfp_mask |= __GFP_HIGH;
+
+	/*
 	 * OK, we're below the kswapd watermark and have kicked background
 	 * reclaim. Now things get more complex, so set up alloc_flags according
 	 * to how we want to proceed.
@@ -2759,6 +2769,8 @@ retry:
 				goto got_pg;
 			if (!did_some_progress)
 				goto nopage;
+
+			oom++;
 		}
 		/* Wait for some write requests to complete then retry */
 		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19 12:29                                                   ` Michal Hocko
@ 2015-02-19 12:58                                                     ` Michal Hocko
  -1 siblings, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2015-02-19 12:58 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, linux-mm, mgorman, rientjes,
	akpm, torvalds

On Thu 19-02-15 13:29:14, Michal Hocko wrote:
[...]
> Something like the following.
__GFP_HIGH doesn't seem to be sufficient, so we would need something
slightly different, but the idea is still the same:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8d52ab18fe0d..2d224bbdf8e8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2599,6 +2599,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
 	bool deferred_compaction = false;
 	int contended_compaction = COMPACT_CONTENDED_NONE;
+	int oom = 0;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -2635,6 +2636,15 @@ retry:
 	alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
 	/*
+	 * __GFP_NOFAIL allocations cannot fail but yet the current context
+	 * might be blocking resources needed by the OOM victim to terminate.
+	 * Allow the caller to dive into memory reserves to succeed the
+	 * allocation and break out from a potential deadlock.
+	 */
+	if (oom > 10 && (gfp_mask & __GFP_NOFAIL))
+		alloc_flags |= ALLOC_NO_WATERMARKS;
+
+	/*
 	 * Find the true preferred zone if the allocation is unconstrained by
 	 * cpusets.
 	 */
@@ -2759,6 +2769,8 @@ retry:
 				goto got_pg;
 			if (!did_some_progress)
 				goto nopage;
+
+			oom++;
 		}
 		/* Wait for some write requests to complete then retry */
 		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19 12:29                                                   ` Michal Hocko
@ 2015-02-19 13:29                                                     ` Tetsuo Handa
  -1 siblings, 0 replies; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-19 13:29 UTC (permalink / raw)
  To: mhocko, hannes
  Cc: david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman,
	torvalds, xfs, linux-fsdevel, fernando_b1

Michal Hocko wrote:
> On Thu 19-02-15 06:01:24, Johannes Weiner wrote:
> [...]
> > Preferrably, we'd get rid of all nofail allocations and replace them
> > with preallocated reserves.  But this is not going to happen anytime
> > soon, so what other option do we have than resolving this on the OOM
> > killer side?
> 
> As I've mentioned in other email, we might give GFP_NOFAIL allocator
> access to memory reserves (by giving it __GFP_HIGH). This is still not a
> 100% solution because reserves could get depleted but this risk is there
> even with multiple oom victims. I would still argue that this would be a
> better approach because selecting more victims might hit pathological
> case more easily (other victims might be blocked on the very same lock
> e.g.).
> 
Does "multiple OOM victims" mean "select next if first does not die"?
Then, I think my timeout patch http://marc.info/?l=linux-mm&m=142002495532320&w=2
does not deplete memory reserves. ;-)

If we change to permit invocation of the OOM killer for GFP_NOFS / GFP_NOIO,
those who do not want to fail (e.g. journal transaction) will start passing
__GFP_NOFAIL?
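
For instance (illustrative only), instead of an open-coded retry loop a
journal path that must not fail would do something like:

	ptr = kmalloc(size, GFP_NOFS | __GFP_NOFAIL);

so that the allocator knows it has to make forward progress on its own.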

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19 12:58                                                     ` Michal Hocko
@ 2015-02-19 15:29                                                       ` Tetsuo Handa
  -1 siblings, 0 replies; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-19 15:29 UTC (permalink / raw)
  To: mhocko, hannes
  Cc: david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman,
	torvalds, xfs, linux-fsdevel, fernando_b1

Michal Hocko wrote:
> On Thu 19-02-15 13:29:14, Michal Hocko wrote:
> [...]
> > Something like the following.
> __GFP_HIGH doesn't seem to be sufficient so we would need something
> slightly else but the idea is still the same:
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8d52ab18fe0d..2d224bbdf8e8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2599,6 +2599,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	enum migrate_mode migration_mode = MIGRATE_ASYNC;
>  	bool deferred_compaction = false;
>  	int contended_compaction = COMPACT_CONTENDED_NONE;
> +	int oom = 0;
>  
>  	/*
>  	 * In the slowpath, we sanity check order to avoid ever trying to
> @@ -2635,6 +2636,15 @@ retry:
>  	alloc_flags = gfp_to_alloc_flags(gfp_mask);
>  
>  	/*
> +	 * __GFP_NOFAIL allocations cannot fail but yet the current context
> +	 * might be blocking resources needed by the OOM victim to terminate.
> +	 * Allow the caller to dive into memory reserves to succeed the
> +	 * allocation and break out from a potential deadlock.
> +	 */

We don't know how many callers will pass __GFP_NOFAIL. But if 1000
threads are doing the same operation which requires a __GFP_NOFAIL
allocation with a lock held, wouldn't the memory reserves be depleted?

This heuristic can't help once the memory reserves are depleted or
contiguous pages of the requested order cannot be found.

> +	if (oom > 10 && (gfp_mask & __GFP_NOFAIL))
> +		alloc_flags |= ALLOC_NO_WATERMARKS;
> +
> +	/*
>  	 * Find the true preferred zone if the allocation is unconstrained by
>  	 * cpusets.
>  	 */
> @@ -2759,6 +2769,8 @@ retry:
>  				goto got_pg;
>  			if (!did_some_progress)
>  				goto nopage;
> +
> +			oom++;
>  		}
>  		/* Wait for some write requests to complete then retry */
>  		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
> -- 
> Michal Hocko
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19 12:29                                                   ` Michal Hocko
@ 2015-02-19 21:43                                                     ` Dave Chinner
  -1 siblings, 0 replies; 276+ messages in thread
From: Dave Chinner @ 2015-02-19 21:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm,
	mgorman, rientjes, akpm, torvalds

On Thu, Feb 19, 2015 at 01:29:14PM +0100, Michal Hocko wrote:
> On Thu 19-02-15 06:01:24, Johannes Weiner wrote:
> [...]
> > Preferrably, we'd get rid of all nofail allocations and replace them
> > with preallocated reserves.  But this is not going to happen anytime
> > soon, so what other option do we have than resolving this on the OOM
> > killer side?
> 
> As I've mentioned in other email, we might give GFP_NOFAIL allocator
> access to memory reserves (by giving it __GFP_HIGH).

Won't work when you have thousands of concurrent transactions
running in XFS and they are all doing GFP_NOFAIL allocations. That's
why I suggested the per-transaction reserve pool - we can use that
to throttle the number of concurrent contexts demanding memory for
forward progress, just the same way we throttle the number of
concurrent processes based on the maximum log space requirements of
the transactions and the amount of unreserved log space available.

No log space, transaction reservations wait on an ordered queue for
space to become available. No memory available, transaction
reservations wait on an ordered queue for memory to become
available.
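
Roughly like this - a sketch only, every name below is invented, and a
real implementation would need cascading wakeups and error handling:

#include <linux/list.h>
#include <linux/sched.h>
#include <linux/spinlock.h>

struct mem_resv_pool {
	spinlock_t		lock;
	long			free_pages;
	struct list_head	waiters;	/* FIFO, preserves ordering */
};

struct mem_resv_waiter {
	struct list_head	list;
	long			need;
	struct task_struct	*task;
};

/* Called before the transaction starts; blocks until the reservation
 * can be granted, in strict FIFO order like log space grants. */
static void mem_resv_take(struct mem_resv_pool *p, long pages)
{
	struct mem_resv_waiter w = { .need = pages, .task = current };

	spin_lock(&p->lock);
	if (list_empty(&p->waiters) && p->free_pages >= pages) {
		p->free_pages -= pages;
		spin_unlock(&p->lock);
		return;
	}
	list_add_tail(&w.list, &p->waiters);
	while (list_first_entry(&p->waiters, struct mem_resv_waiter,
				list) != &w || p->free_pages < pages) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		spin_unlock(&p->lock);
		schedule();
		spin_lock(&p->lock);
	}
	list_del(&w.list);
	p->free_pages -= pages;
	__set_current_state(TASK_RUNNING);
	spin_unlock(&p->lock);
}

/* Called at transaction commit; returns the reservation and wakes the
 * head waiter if its request can now be satisfied. */
static void mem_resv_give(struct mem_resv_pool *p, long pages)
{
	struct mem_resv_waiter *w;

	spin_lock(&p->lock);
	p->free_pages += pages;
	if (!list_empty(&p->waiters)) {
		w = list_first_entry(&p->waiters, struct mem_resv_waiter, list);
		if (p->free_pages >= w->need)
			wake_up_process(w->task);
	}
	spin_unlock(&p->lock);
}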

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19 15:29                                                       ` Tetsuo Handa
  (?)
@ 2015-02-19 21:53                                                         ` Tetsuo Handa
  -1 siblings, 0 replies; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-19 21:53 UTC (permalink / raw)
  To: mhocko, hannes
  Cc: david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman,
	torvalds, xfs, linux-fsdevel, fernando_b1

Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 19-02-15 13:29:14, Michal Hocko wrote:
> > [...]
> > > Something like the following.
> > __GFP_HIGH doesn't seem to be sufficient so we would need something
> > slightly different but the idea is still the same:
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 8d52ab18fe0d..2d224bbdf8e8 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2599,6 +2599,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  	enum migrate_mode migration_mode = MIGRATE_ASYNC;
> >  	bool deferred_compaction = false;
> >  	int contended_compaction = COMPACT_CONTENDED_NONE;
> > +	int oom = 0;
> >  
> >  	/*
> >  	 * In the slowpath, we sanity check order to avoid ever trying to
> > @@ -2635,6 +2636,15 @@ retry:
> >  	alloc_flags = gfp_to_alloc_flags(gfp_mask);
> >  
> >  	/*
> > +	 * __GFP_NOFAIL allocations cannot fail but yet the current context
> > +	 * might be blocking resources needed by the OOM victim to terminate.
> > +	 * Allow the caller to dive into memory reserves to succeed the
> > +	 * allocation and break out from a potential deadlock.
> > +	 */
> 
> We don't know how many callers will pass __GFP_NOFAIL. But if 1000
> threads are doing the same operation which requires a __GFP_NOFAIL
> allocation with a lock held, wouldn't the memory reserves be depleted?
> 
> This heuristic can't continue if the memory reserves are depleted or
> contiguous pages of the requested order cannot be found.
> 

Even if the system seems to be stalled, a deadlock may not have occurred.
If the cause is, for example, a virtio disk being stuck for an unknown
reason rather than a deadlock, nobody should start consuming the memory
reserves merely because some time has passed.

The memory reserves are something like a balloon. To guarantee forward
progress, the balloon must not become empty. Therefore, I think that
throttling heuristics on the memory requester side (the deflator of the
balloon, i.e. processes which received SIGKILL) should be avoided, and
throttling heuristics on the memory releaser side (the inflator of the
balloon, i.e. the OOM killer which sends SIGKILL) should be used instead.
If a heuristic is used on the deflator side, the memory allocator may
deliver a final blow via ALLOC_NO_WATERMARKS. If a heuristic is used on
the inflator side, the OOM killer can act as a watchdog when nobody has
volunteered memory within a reasonable period.
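
[Sketch only, to make the inflator-side idea concrete; none of these
names exist in the kernel and the timeout value is an assumption. The
watchdog gives the victim a grace period and then asks the OOM killer
to select another victim, instead of letting allocating tasks empty
the reserves.]

#include <stdbool.h>
#include <time.h>

/* Hypothetical per-victim watchdog state. */
struct oom_watchdog {
	time_t	victim_kill_time;	/* when TIF_MEMDIE was set */
	bool	victim_pending;		/* victim has not exited yet */
};

enum { OOM_VICTIM_GRACE_SECS = 10 };	/* assumed tunable */

/* Inflator-side heuristic: if the victim did not exit in time,
 * report that the OOM killer should pick the next victim. */
static bool oom_watchdog_should_kill_next(const struct oom_watchdog *w)
{
	return w->victim_pending &&
	       time(NULL) - w->victim_kill_time > OOM_VICTIM_GRACE_SECS;
}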

> > +	if (oom > 10 && (gfp_mask & __GFP_NOFAIL))
> > +		alloc_flags |= ALLOC_NO_WATERMARKS;
> > +
> > +	/*
> >  	 * Find the true preferred zone if the allocation is unconstrained by
> >  	 * cpusets.
> >  	 */
> > @@ -2759,6 +2769,8 @@ retry:
> >  				goto got_pg;
> >  			if (!did_some_progress)
> >  				goto nopage;
> > +
> > +			oom++;
> >  		}
> >  		/* Wait for some write requests to complete then retry */
> >  		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
> > -- 
> > Michal Hocko
> > SUSE Labs
> > 
> 

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19  9:40                                                   ` Michal Hocko
@ 2015-02-19 22:03                                                     ` Dave Chinner
  -1 siblings, 0 replies; 276+ messages in thread
From: Dave Chinner @ 2015-02-19 22:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm,
	mgorman, rientjes, akpm, torvalds

On Thu, Feb 19, 2015 at 10:40:20AM +0100, Michal Hocko wrote:
> On Thu 19-02-15 08:31:18, Dave Chinner wrote:
> > On Wed, Feb 18, 2015 at 01:16:02PM +0100, Michal Hocko wrote:
> > > On Wed 18-02-15 21:48:59, Dave Chinner wrote:
> > > > On Wed, Feb 18, 2015 at 09:25:02AM +0100, Michal Hocko wrote:
> > This is why GFP_NOFAIL is not a solution to the "never fail"
> > allocation problem. The caller doing the "no fail" allocation _must
> > be able to set failure policy_. i.e. the choice of aborting and
> > shutting down because progress cannot be made, or continuing and
> > hoping for forward progress is owned by the allocating context, not
> > the allocator.
> 
> I completely agree that the failure policy is the caller responsibility
> and I would have no objections to something like:
> 
> 	do {
> 		ptr = kmalloc(size, GFP_NOFS);
> 		if (ptr)
> 			return ptr;
> 		if (fatal_signal_pending(current))
> 			break;
> 		if (looping_too_long())
> 			break;
> 	} while (1);
> 
> 	fallback_solution();
> 
> But this is not the case in kmem_alloc which is essentially GFP_NOFAIL
> allocation with a warning and congestion_wait. There is no failure
> policy defined there. The warning should be part of the allocator and
> the NOFAIL policy should be explicit. So why exactly do you oppose
> changing kmem_alloc (and others which are doing essentially the same)?

I'm opposing changing kmem_alloc() to GFP_NOFAIL precisely because
doing so is *broken*, *and* it removes the policy decision from the
calling context where it belongs.

We are in the process of discussing - at an XFS level - how to
handle errors in a configurable manner. See, for example, this
discussion:

http://oss.sgi.com/archives/xfs/2015-02/msg00343.html

Where we are trying to decide how to expose failure policy to admins
to make decisions about error handling behaviour:

http://oss.sgi.com/archives/xfs/2015-02/msg00346.html

There is little doubt in my mind that this stretches to ENOMEM
handling; it is another case where we consider ENOMEM to be a
transient error and hence retry forever until it succeeds. But some
people are going to want to configure that behaviour, and the API
above allows people to configure exactly how many times we'd retry a
failed memory allocation before considering the situation hopeless,
failing, and risking a filesystem shutdown....
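
[A sketch of what such a configurable policy could look like, not
existing XFS code: max_retries stands in for an admin-tunable knob,
and a negative value keeps today's retry-forever behaviour.]

void *
kmem_alloc_policy(size_t size, gfp_t lflags, int max_retries)
{
	void	*ptr;
	int	tries = 0;

	do {
		ptr = kmalloc(size, lflags);
		if (ptr)
			return ptr;
		if (fatal_signal_pending(current))
			break;
		/* Give up after the configured number of attempts and
		 * let the caller decide about failing or shutdown. */
		if (max_retries >= 0 && ++tries > max_retries)
			break;
		congestion_wait(BLK_RW_ASYNC, HZ/50);
	} while (1);

	return NULL;
}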

Converting the code to use GFP_NOFAIL takes us in exactly the
opposite direction to our current line of development w.r.t. to
filesystem error handling.

> The reason I care about GFP_NOFAIL is that there are apparently code
> paths which do not tell the allocator they are basically GFP_NOFAIL without
> any fallback. This leads to two main problems 1) we do not have a good
> overview how many code paths have such a strong requirements and so
> cannot estimate e.g. how big memory reserves should be and

Right, when GFP_NOFAIL got deprecated we lost the ability to document
such behaviour and find it easily. People just put retry loops in
instead of using GFP_NOFAIL. Good luck finding them all :/

> 2) allocator
> cannot help those paths (e.g. by giving them access to reserves to break
> out of the livelock).

Allocator should not help. Global reserves are unreliable - make the
allocation context reserve the amount it needs before it enters the
context where it can't back out....

> > IOWs, we have need for forward allocation progress guarantees on
> > (potentially) several megabytes of allocations from slab caches, the
> > heap and the page allocator, with all allocations all in
> > unpredictable order, with objects of different life times and life
> > cycles, and at which may, at any time, get stuck behind
> > objects locked in other transactions and hence can randomly block
> > until some other thread makes forward progress and completes a
> > transaction and unlocks the object.
> 
> Thanks for the clarification, I have to think about it some more,
> though. My thinking was that mempools could be used for an emergency
> pool with pre-allocated memory which would be used in the non-failing
> contexts.

The other problem with mempools is that they aren't exclusive to the
context that needs the reservation. i.e. we can preallocate to the
mempool, but then when the preallocating context goes to allocate,
that preallocation may have already been drained by other contexts.

The memory reservation needs to follow the transaction - we can
pass transactions between tasks, and they need to persist across
sleeping locks, IO, etc, and mempools are simply too constrained to
be usable in this environment.
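
[Shape of such a reservation, with invented names - nothing like this
exists today: the reservation is a field of the transaction itself, so
it travels with the transaction across task handoffs, sleeping locks
and IO, which a shared mempool cannot do.]

/* Hypothetical memory reservation embedded in the transaction. */
struct trans_mem_resv {
	size_t	resv_bytes;	/* reserved up front for this transaction */
	size_t	used_bytes;	/* consumed so far */
};

struct transaction_sketch {
	/* ... log space reservation, log items, etc ... */
	struct trans_mem_resv	t_mresv;	/* follows the transaction */
};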

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19 10:24                                           ` Johannes Weiner
@ 2015-02-19 22:52                                             ` Dave Chinner
  -1 siblings, 0 replies; 276+ messages in thread
From: Dave Chinner @ 2015-02-19 22:52 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, akpm, torvalds

On Thu, Feb 19, 2015 at 05:24:31AM -0500, Johannes Weiner wrote:
> On Wed, Feb 18, 2015 at 09:54:30AM +1100, Dave Chinner wrote:
> > [ cc xfs list - experienced kernel devs should not have to be
> > reminded to do this ]
> > 
> > On Tue, Feb 17, 2015 at 07:53:15AM -0500, Johannes Weiner wrote:
> > > -	do {
> > > -		ptr = kmalloc(size, lflags);
> > > -		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
> > > -			return ptr;
> > > -		if (!(++retries % 100))
> > > -			xfs_err(NULL,
> > > -		"possible memory allocation deadlock in %s (mode:0x%x)",
> > > -					__func__, lflags);
> > > -		congestion_wait(BLK_RW_ASYNC, HZ/50);
> > > -	} while (1);
> > > +	if (!(flags & (KM_MAYFAIL | KM_NOSLEEP)))
> > > +		lflags |= __GFP_NOFAIL;
> > > +
> > > +	return kmalloc(size, lflags);
> > >  }
> > 
> > Hmmm - the only reason there is a focus on this loop is that it
> > emits warnings about allocations failing. It's obvious that the
> > problem being dealt with here is a fundamental design issue w.r.t.
> > to locking and the OOM killer, but the proposed special casing
> > hack^H^H^H^Hband aid^W^Wsolution is not "working" because some code
> > in XFS started emitting warnings about allocations failing more
> > often.
> > 
> > So the answer is to remove the warning?  That's like killing the
> > canary to stop the methane leak in the coal mine. No canary? No
> > problems!
> 
> That's not what happened.  The patch that affected behavior here
> transformed code that was an incoherent collection of conditions into
> something that has an actual model.

Which is entirely undocumented. If you have a model, the first thing
to do is document it and communicate that model to everyone who
needs to know about that new model. I have no idea what that model
is. Keeping it in your head and changing code that other people
maintain without giving them any means of understanding WTF you are
doing is a really bad engineering practice.


And yes, I have had a bit to say about this in public recently.
Go watch my recent LCA talk, for example....

And, FWIW, email discussions on a list is no substitute for a
properly documented design that people can take their time to
understand and digest.

> That model is that we don't loop
> in the allocator if there are no means of making forward progress.  In
> this case, it was GFP_NOFS triggering an early exit from the allocator
> because it's not allowed to invoke the OOM killer by default, and
> there is little point in looping and waiting for things to get better
> on their own.

So you keep saying....

> So these deadlock warnings happen, ironically, by the page allocator
> now bailing out of a locked-up state in which it's not making forward
> progress.  They don't strike me as a very useful canary in this case.

... yet we *rarely* see the canary warnings we emit when we do too
many allocation retries, and the code has been that way for 13-odd
years.  Hence, despite your protestations that your way is *better*,
we have code that is tried, tested and proven in rugged production
environments. That's far more convincing evidence that the *code
should not change* than your assertions that it is broken and needs
to be fixed.

> > Right now, the oom killer is a liability. Over the past 6 months
> > I've slowly had to exclude filesystem regression tests from running
> > on small memory machines because the OOM killer is now so unreliable
> > that it kills the test harness regularly rather than the process
> > generating memory pressure. That's a big red flag to me that all
> > this hacking around the edges is not solving the underlying problem,
> > but instead is breaking things that did once work.
> > 
> > And, well, then there's this (gfp.h):
> > 
> >  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
> >  * cannot handle allocation failures.  This modifier is deprecated and no new
> >  * users should be added.
> > 
> > So, is this another policy revelation from the mm developers about
> > the kmalloc API? i.e. that __GFP_NOFAIL is no longer deprecated?
> > Or just another symptom of frantic thrashing because nobody actually
> > understands the problem or those that do are unwilling to throw out
> > the broken crap and redesign it?
> 
> Well, understand our dilemma here.  __GFP_NOFAIL is a liability
> because it can trap tasks with unknown state and locks in a
> potentially never ending loop, and we don't want people to start using
> it as a convenient solution to get out of having a fallback strategy.
> 
> However, if your entire architecture around a particular allocation is
> that failure is not an option at this point, and you can't reasonably
> preallocate - although that would always be preferable - then please
> do not open code an endless loop around the call to the allocator but
> use __GFP_NOFAIL instead so that these callsites are annotated and can
> be reviewed. 

I will actively work around anything that causes filesystem memory
pressure to increase the chance of oom killer invocations. The OOM
killer is not a solution - it is, by definition, a loose cannon and
so we should be reducing dependencies on it.

I really don't care about the OOM Killer corner cases - it's
completely the wrong line of development to be spending time on
and you aren't going to convince me otherwise. The OOM killer is a
crutch used to justify having a memory allocation subsystem that
can't provide forward progress guarantee mechanisms to callers that
need it.

I've proposed a method of providing this forward progress guarantee
for subsystems of arbitrary complexity, and this removes the
dependency on the OOM killer for forward allocation progress in such
contexts (e.g. filesystems). We should be discussing how to
implement that, not what bandaids we need to apply to the OOM
killer. I want to fix the underlying problems, not push them under
the OOM-killer bus...

> And please understand that this callsite blowing up is a chance to
> better the code and behavior here.  Where previously it would just
> endlessly loop in the allocator without any means to make progress,

Again, this statement ignores the fact we have *no credible
evidence* that this is actually a problem in production
environments.

And, besides, even if you do force through changing the XFS code to
GFP_NOFAIL, it'll get changed back to a retry loop in the near
future when we add admin configurable error handling behaviour to
XFS, as I pointed Michal to....
(http://oss.sgi.com/archives/xfs/2015-02/msg00346.html)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19 10:48                                                   ` Tetsuo Handa
@ 2015-02-20  8:26                                                     ` Michal Hocko
  -1 siblings, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2015-02-20  8:26 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: david, hannes, dchinner, linux-mm, rientjes, oleg, akpm, mgorman,
	torvalds, linux-fsdevel, fernando_b1

On Thu 19-02-15 19:48:16, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > I do not understand. What kind of bug would be fixed by that change?
> 
> That change fixes significant loss of file I/O reliability under extreme
> memory pressure.
> 
> Today I tested how frequently filesystem errors occur using a scripted environment.
> ( Source code of a.out is http://marc.info/?l=linux-fsdevel&m=142425860904849&w=2 )
> 
> ----------
> #!/bin/sh
> : > ~/trial.log
> for i in `seq 1 100`
> do
>     mkfs.ext4 -q /dev/sdb1 || exit 1
>     mount -o errors=remount-ro /dev/sdb1 /tmp || exit 2
>     chmod 1777 /tmp
>     su - demo -c ~demo/a.out
>     if [ -w /tmp/ ]
>     then
>         echo -n "S" >> ~/trial.log
>     else
>         echo -n "F" >> ~/trial.log
>     fi
>     umount /tmp
> done
> ----------
> 
> We can see that filesystem errors are occurring frequently if GFP_NOFS / GFP_NOIO
> allocations give up without retrying.

I would suggest reporting this to ext people (in a separate thread
please) and see what is the proper fix.

> On the other hand, as far as these trials go, a TIF_MEMDIE stall was
> not observed if GFP_NOFS / GFP_NOIO allocations give up without
> retrying. Maybe giving up without retrying keeps us away from hitting
> stalls for this test case?

This is expected because those allocations are done with locks held and
so the chances of releasing the lock are higher.

[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19 13:29                                                     ` Tetsuo Handa
@ 2015-02-20  9:10                                                       ` Michal Hocko
  -1 siblings, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2015-02-20  9:10 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: hannes, david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman,
	torvalds, xfs, linux-fsdevel, fernando_b1

On Thu 19-02-15 22:29:37, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 19-02-15 06:01:24, Johannes Weiner wrote:
> > [...]
> > > Preferably, we'd get rid of all nofail allocations and replace them
> > > with preallocated reserves.  But this is not going to happen anytime
> > > soon, so what other option do we have than resolving this on the OOM
> > > killer side?
> > 
> > As I've mentioned in another email, we might give the GFP_NOFAIL allocator
> > access to memory reserves (by giving it __GFP_HIGH). This is still not a
> > 100% solution because reserves could get depleted but this risk is there
> > even with multiple oom victims. I would still argue that this would be a
> > better approach because selecting more victims might hit pathological
> > case more easily (other victims might be blocked on the very same lock
> > e.g.).
> > 
> Does "multiple OOM victims" mean "select next if first does not die"?
> Then, I think my timeout patch http://marc.info/?l=linux-mm&m=142002495532320&w=2
> does not deplete memory reserves. ;-)

It doesn't because
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2603,9 +2603,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 		else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
-		else if (!in_interrupt() &&
-				((current->flags & PF_MEMALLOC) ||
-				 unlikely(test_thread_flag(TIF_MEMDIE))))
+		else if (!in_interrupt() && (current->flags & PF_MEMALLOC))
 			alloc_flags |= ALLOC_NO_WATERMARKS;

you disabled the TIF_MEMDIE heuristic and use the flag only for OOM
exclusion and for breaking out of the allocator. An exiting task might
need memory to do so, and you basically make all those allocations
fail. How do you know this is not going to blow up?

> If we change to permit invocation of the OOM killer for GFP_NOFS / GFP_NOIO,
> those who do not want to fail (e.g. journal transaction) will start passing
> __GFP_NOFAIL?
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19 15:29                                                       ` Tetsuo Handa
@ 2015-02-20  9:13                                                         ` Michal Hocko
  -1 siblings, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2015-02-20  9:13 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: dchinner, oleg, xfs, hannes, linux-mm, mgorman, rientjes,
	linux-fsdevel, akpm, fernando_b1, torvalds

On Fri 20-02-15 00:29:29, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 19-02-15 13:29:14, Michal Hocko wrote:
> > [...]
> > > Something like the following.
> > __GFP_HIGH doesn't seem to be sufficient so we would need something
> > slightly different but the idea is still the same:
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 8d52ab18fe0d..2d224bbdf8e8 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2599,6 +2599,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  	enum migrate_mode migration_mode = MIGRATE_ASYNC;
> >  	bool deferred_compaction = false;
> >  	int contended_compaction = COMPACT_CONTENDED_NONE;
> > +	int oom = 0;
> >  
> >  	/*
> >  	 * In the slowpath, we sanity check order to avoid ever trying to
> > @@ -2635,6 +2636,15 @@ retry:
> >  	alloc_flags = gfp_to_alloc_flags(gfp_mask);
> >  
> >  	/*
> > +	 * __GFP_NOFAIL allocations cannot fail but yet the current context
> > +	 * might be blocking resources needed by the OOM victim to terminate.
> > +	 * Allow the caller to dive into memory reserves to succeed the
> > +	 * allocation and break out from a potential deadlock.
> > +	 */
> 
> We don't know how many callers will pass __GFP_NOFAIL. But if 1000
> threads are doing the same operation which requires a __GFP_NOFAIL
> allocation with a lock held, wouldn't the memory reserves be depleted?

We shouldn't have an unbounded number of GFP_NOFAIL allocations at the
same time. This would be even more broken. If a load is known to use
such allocations excessively then the administrator can enlarge the
memory reserves.

> This heuristic can't continue if the memory reserves are depleted or
> contiguous pages of the requested order cannot be found.

Once memory reserves are depleted we are screwed anyway and we might
panic.
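
[Back-of-the-envelope illustration with assumed numbers, 4 KB pages
and roughly 64 MB of reserves:

    1000 threads *  1 page  (order-0) * 4 KB =  ~4 MB  -> reserves survive
    1000 threads * 16 pages (order-4) * 4 KB = ~64 MB  -> reserves gone

so whether "1000 threads with __GFP_NOFAIL" depletes the reserves
depends heavily on the requested order and on how large the
administrator made the reserves.]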

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19 22:03                                                     ` Dave Chinner
@ 2015-02-20  9:27                                                       ` Michal Hocko
  -1 siblings, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2015-02-20  9:27 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm,
	mgorman, rientjes, akpm, torvalds

On Fri 20-02-15 09:03:55, Dave Chinner wrote:
[...]
> Converting the code to use GFP_NOFAIL takes us in exactly the
> opposite direction to our current line of development w.r.t. to
> filesystem error handling.

Fair enough. If there are plans to have a failure policy rather than
GFP_NOFAIL-like behavior then I have, of course, no objections. Quite
the opposite: this is exactly what I would like to see. GFP_NOFAIL
should be rarely used, really.

The whole point of this discussion, and I am sorry if I didn't make it
clear, is that _if_ there is really a GFP_NOFAIL requirement hidden
from the allocator then it should be changed to use GFP_NOFAIL so that
allocator knows about this requirement.

> > The reason I care about GFP_NOFAIL is that there are apparently code
> > paths which do not tell allocator they are basically GFP_NOFAIL without
> > any fallback. This leads to two main problems 1) we do not have a good
> > overview how many code paths have such a strong requirements and so
> > cannot estimate e.g. how big memory reserves should be and
> 
> Right, when GFP_NOFAIL got deprecated we lost the ability to document
> such behaviour and find it easily. People just put retry loops in
> instead of using GFP_NOFAIL. Good luck finding them all :/

That will be a PITA, all right, but I guess the deprecation was a mistake
and we should stop this tendency.

> > 2) allocator
> > cannot help those paths (e.g. by giving them access to reserves to break
> > out of the livelock).
> 
> Allocator should not help. Global reserves are unreliable - make the
> allocation context reserve the amount it needs before it enters the
> context where it can't back out....

Sure pre-allocation is preferable. But once somebody asks for GFP_NOFAIL
then it is too late and the allocator only has memory reclaim and
potentially reserves.

[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19 22:52                                             ` Dave Chinner
@ 2015-02-20 10:36                                               ` Tetsuo Handa
  -1 siblings, 0 replies; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-20 10:36 UTC (permalink / raw)
  To: david, hannes
  Cc: dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds

Dave Chinner wrote:
> I really don't care about the OOM Killer corner cases - it's
> completely the wrong line of development to be spending time on
> and you aren't going to convince me otherwise. The OOM killer is a
> crutch used to justify having a memory allocation subsystem that
> can't provide forward progress guarantee mechanisms to callers that
> need it.

I really care about the OOM Killer corner cases, for I'm

  (1) seeing trouble cases which occurred in enterprise systems
      under OOM conditions

  (2) trying to downgrade OOM "Deadlock or Genocide" attacks (which
      an unprivileged user with a login shell can trivially trigger
      since Linux 2.0) to OOM "Genocide" attacks in order to allow
      OOM-unkillable daemons to restart OOM-killed processes

  (3) waiting for a bandaid for (2) in order to propose changes for
      mitigating OOM "Genocide" attacks (as bad guys will find how to
      trigger OOM "Deadlock or Genocide" attacks from changes for
      mitigating OOM "Genocide" attacks)

I started posting to linux-mm ML in order to make forward progress
about (1) and (2). I don't want the memory allocation subsystem to
lock up an entire system by indefinitely disabling memory releasing
mechanism provided by the OOM killer.

> I've proposed a method of providing this forward progress guarantee
> for subsystems of arbitrary complexity, and this removes the
> dependency on the OOM killer for forward allocation progress in such
> contexts (e.g. filesystems). We should be discussing how to
> implement that, not what bandaids we need to apply to the OOM
> killer. I want to fix the underlying problems, not push them under
> the OOM-killer bus...

I'm fine with that direction for new kernels, provided that a simple
bandaid is implemented which can be backported to distributor kernels
and makes OOM "Deadlock" attacks impossible. Therefore, I'm discussing
what bandaids we need to apply to the OOM killer.

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-20  9:10                                                       ` Michal Hocko
@ 2015-02-20 12:20                                                         ` Tetsuo Handa
  -1 siblings, 0 replies; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-20 12:20 UTC (permalink / raw)
  To: mhocko
  Cc: hannes, david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman,
	torvalds, xfs, linux-fsdevel, fernando_b1

Michal Hocko wrote:
> On Thu 19-02-15 22:29:37, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Thu 19-02-15 06:01:24, Johannes Weiner wrote:
> > > [...]
> > > > Preferably, we'd get rid of all nofail allocations and replace them
> > > > with preallocated reserves.  But this is not going to happen anytime
> > > > soon, so what other option do we have than resolving this on the OOM
> > > > killer side?
> > > 
> > > As I've mentioned in other email, we might give GFP_NOFAIL allocator
> > > access to memory reserves (by giving it __GFP_HIGH). This is still not a
> > > 100% solution because reserves could get depleted but this risk is there
> > > even with multiple oom victims. I would still argue that this would be a
> > > better approach because selecting more victims might hit pathological
> > > case more easily (other victims might be blocked on the very same lock
> > > e.g.).
> > > 
> > Does "multiple OOM victims" mean "select next if first does not die"?
> > Then, I think my timeout patch http://marc.info/?l=linux-mm&m=142002495532320&w=2
> > does not deplete memory reserves. ;-)
> 
> It doesn't because
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2603,9 +2603,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  			alloc_flags |= ALLOC_NO_WATERMARKS;
>  		else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
>  			alloc_flags |= ALLOC_NO_WATERMARKS;
> -		else if (!in_interrupt() &&
> -				((current->flags & PF_MEMALLOC) ||
> -				 unlikely(test_thread_flag(TIF_MEMDIE))))
> +		else if (!in_interrupt() && (current->flags & PF_MEMALLOC))
>  			alloc_flags |= ALLOC_NO_WATERMARKS;
> 
> you disabled the TIF_MEMDIE heuristic and use it only for OOM exclusion
> and for breaking out of the allocator. An exiting task might need memory
> to do so, and you basically make all those allocations fail. How do you
> know this is not going to blow up?
> 

Well, should exiting tasks be treated as implying __GFP_NOFAIL for cleanup?

We cannot determine the correct task to kill + allow access to memory
reserves based on lock dependencies. Therefore, this patch uniformly
denies all tasks access to memory reserves.

An exiting task might need some memory to exit, and not allowing access to
memory reserves can delay that task's exit. But that task will eventually
get memory released by other tasks killed by the timeout-based kill-more
mechanism. If there are no more killable tasks, or the panic timeout
expires, the result is the same as depleting the memory reserves.

I think that this situation (automatically making forward progress as if
the administrator were periodically doing SysRq-f until the OOM condition
is solved, or doing SysRq-c if there are no more killable tasks or the
stall lasts too long) is better than the current situation (not making
forward progress because the exiting task cannot exit due to a lock
dependency, caused by failing to determine the correct task to kill +
allow access to memory reserves).
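
For illustration, the core of the timeout-based kill-more idea is
roughly the following (a sketch only; it assumes a memdie_start jiffies
timestamp recorded when TIF_MEMDIE is set, which is not a field that
exists today):

#include <linux/sched.h>
#include <linux/jiffies.h>

/* Sketch: has the victim been stuck with TIF_MEMDIE for too long? */
static bool oom_victim_timed_out(struct task_struct *victim,
                                 unsigned long timeout)
{
        return test_tsk_thread_flag(victim, TIF_MEMDIE) &&
               time_after(jiffies, victim->memdie_start + timeout);
}

/* If it has, the OOM killer would select the next victim (as if the
 * administrator pressed SysRq-f) instead of waiting forever. */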

> > If we change to permit invocation of the OOM killer for GFP_NOFS / GFP_NOIO,
> > those who do not want to fail (e.g. journal transaction) will start passing
> > __GFP_NOFAIL?
> > 

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-20 12:20                                                         ` Tetsuo Handa
@ 2015-02-20 12:38                                                           ` Michal Hocko
  -1 siblings, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2015-02-20 12:38 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: hannes, david, dchinner, linux-mm, rientjes, oleg, akpm, mgorman,
	torvalds, xfs, linux-fsdevel, fernando_b1

On Fri 20-02-15 21:20:58, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 19-02-15 22:29:37, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Thu 19-02-15 06:01:24, Johannes Weiner wrote:
> > > > [...]
> > > > > Preferably, we'd get rid of all nofail allocations and replace them
> > > > > with preallocated reserves.  But this is not going to happen anytime
> > > > > soon, so what other option do we have than resolving this on the OOM
> > > > > killer side?
> > > > 
> > > > As I've mentioned in other email, we might give GFP_NOFAIL allocator
> > > > access to memory reserves (by giving it __GFP_HIGH). This is still not a
> > > > 100% solution because reserves could get depleted but this risk is there
> > > > even with multiple oom victims. I would still argue that this would be a
> > > > better approach because selecting more victims might hit pathological
> > > > case more easily (other victims might be blocked on the very same lock
> > > > e.g.).
> > > > 
> > > Does "multiple OOM victims" mean "select next if first does not die"?
> > > Then, I think my timeout patch http://marc.info/?l=linux-mm&m=142002495532320&w=2
> > > does not deplete memory reserves. ;-)
> > 
> > It doesn't because
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2603,9 +2603,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> >  			alloc_flags |= ALLOC_NO_WATERMARKS;
> >  		else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
> >  			alloc_flags |= ALLOC_NO_WATERMARKS;
> > -		else if (!in_interrupt() &&
> > -				((current->flags & PF_MEMALLOC) ||
> > -				 unlikely(test_thread_flag(TIF_MEMDIE))))
> > +		else if (!in_interrupt() && (current->flags & PF_MEMALLOC))
> >  			alloc_flags |= ALLOC_NO_WATERMARKS;
> > 
> > you disabled the TIF_MEMDIE heuristic and use it only for OOM exclusion
> > and for breaking out of the allocator. An exiting task might need memory
> > to do so, and you basically make all those allocations fail. How do you
> > know this is not going to blow up?
> > 
> 
> Well, should exiting tasks be treated as implying __GFP_NOFAIL for cleanup?
> 
> We cannot determine the correct task to kill + allow access to memory
> reserves based on lock dependencies. Therefore, this patch uniformly
> denies all tasks access to memory reserves.
> 
> An exiting task might need some memory to exit, and not allowing access to
> memory reserves can delay that task's exit. But that task will eventually
> get memory released by other tasks killed by the timeout-based kill-more
> mechanism. If there are no more killable tasks, or the panic timeout
> expires, the result is the same as depleting the memory reserves.
> 
> I think that this situation (automatically making forward progress as if
> the administrator were periodically doing SysRq-f until the OOM condition
> is solved, or doing SysRq-c if there are no more killable tasks or the
> stall lasts too long) is better than the current situation (not making
> forward progress because the exiting task cannot exit due to a lock
> dependency, caused by failing to determine the correct task to kill +
> allow access to memory reserves).

If you really believe this is an improvement then send a proper patch
with justification. But I am _really_ skeptical about such a change to
be honest.
-- 
Michal Hocko
SUSE Labs

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19 21:43                                                     ` Dave Chinner
@ 2015-02-20 12:48                                                       ` Michal Hocko
  -1 siblings, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2015-02-20 12:48 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm,
	mgorman, rientjes, akpm, torvalds

On Fri 20-02-15 08:43:56, Dave Chinner wrote:
> On Thu, Feb 19, 2015 at 01:29:14PM +0100, Michal Hocko wrote:
> > On Thu 19-02-15 06:01:24, Johannes Weiner wrote:
> > [...]
> > > Preferably, we'd get rid of all nofail allocations and replace them
> > > with preallocated reserves.  But this is not going to happen anytime
> > > soon, so what other option do we have than resolving this on the OOM
> > > killer side?
> > 
> > As I've mentioned in other email, we might give GFP_NOFAIL allocator
> > access to memory reserves (by giving it __GFP_HIGH).
> 
> Won't work when you have thousands of concurrent transactions
> running in XFS and they are all doing GFP_NOFAIL allocations.

Is there any bound on how many transactions can run at the same time?

> That's why I suggested the per-transaction reserve pool - we can use
> that

I am still not sure what you mean by reserve pool (API wise). How
does it differ from pre-allocating memory before the "may not fail
context"? Could you elaborate on it, please?

> to throttle the number of concurrent contexts demanding memory for
> forwards progress, just the same way we throttle the number of
> concurrent processes based on maximum log space requirements of the
> transactions and the amount of unreserved log space available.
> 
> No log space, transaction reservations wait on an ordered queue for
> space to become available. No memory available, transaction
> reservation waits on an ordered queue for memory to become
> available.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

-- 
Michal Hocko
SUSE Labs

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-20  9:13                                                         ` Michal Hocko
@ 2015-02-20 13:37                                                           ` Stefan Ring
  -1 siblings, 0 replies; 276+ messages in thread
From: Stefan Ring @ 2015-02-20 13:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, dchinner, oleg, Linux fs XFS, hannes, linux-mm,
	mgorman, rientjes, linux-fsdevel, akpm, fernando_b1, torvalds

>> We don't know how many callers will pass __GFP_NOFAIL. But if 1000
>> threads are doing the same operation which requires __GFP_NOFAIL
>> allocation with a lock held, wouldn't memory reserves deplete?
>
> We shouldn't have an unbounded number of GFP_NOFAIL allocations at the
> same time. This would be even more broken. If a load is known to use
> such allocations excessively then the administrator can enlarge the
> memory reserves.
>
>> This heuristic can't continue if memory reserves depleted or
>> continuous pages of requested order cannot be found.
>
> Once memory reserves are depleted we are screwed anyway and we might
> panic.

This discussion reminds me of a situation I've seen somewhat
regularly, which I have described here:
http://oss.sgi.com/pipermail/xfs/2014-April/035793.html

I've actually seen it more often on another box with OpenVZ and
VirtualBox installed, where it would almost always happen during
startup of a VirtualBox guest machine. This other machine is also
running XFS. I blamed it on OpenVZ or VirtualBox originally, but
having seen the same thing happen on the other machine with neither of
them, the next candidate to blame is XFS.

Is this behavior something that can be attributed to these memory
allocation retry loops?

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-20 12:48                                                       ` Michal Hocko
@ 2015-02-20 23:09                                                         ` Dave Chinner
  -1 siblings, 0 replies; 276+ messages in thread
From: Dave Chinner @ 2015-02-20 23:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm,
	mgorman, rientjes, akpm, torvalds

On Fri, Feb 20, 2015 at 01:48:49PM +0100, Michal Hocko wrote:
> On Fri 20-02-15 08:43:56, Dave Chinner wrote:
> > On Thu, Feb 19, 2015 at 01:29:14PM +0100, Michal Hocko wrote:
> > > On Thu 19-02-15 06:01:24, Johannes Weiner wrote:
> > > [...]
> > > > Preferably, we'd get rid of all nofail allocations and replace them
> > > > with preallocated reserves.  But this is not going to happen anytime
> > > > soon, so what other option do we have than resolving this on the OOM
> > > > killer side?
> > > 
> > > As I've mentioned in other email, we might give GFP_NOFAIL allocator
> > > access to memory reserves (by giving it __GFP_HIGH).
> > 
> > Won't work when you have thousands of concurrent transactions
> > running in XFS and they are all doing GFP_NOFAIL allocations.
> 
> Is there any bound on how many transactions can run at the same time?

Yes: as many reservations as can fit in the available log space.

The log can be sized up to 2GB, and for filesystems larger than 4TB
it will default to 2GB. Log space reservations depend on the operation
being done - an inode timestamp update requires about 5kB of
reservation, and a rename requires about 200kB. Hence we can easily
have thousands of active transactions, even in the worst case
log space reservation cases.
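
(Rough arithmetic: a 2GB log at ~200kB per rename reservation allows
~10,000 concurrent renames, and at ~5kB per timestamp update around
400,000 of those, in the worst case.)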

You're saying it would be insane to have hundreds or thousands of
threads doing GFP_NOFAIL allocations concurrently. Reality check:
XFS has been operating successfully under such workload conditions
in production systems for many years.

> > That's why I suggested the per-transaction reserve pool - we can use
> > that
> 
> I am still not sure what you mean by reserve pool (API wise). How
> does it differ from pre-allocating memory before the "may not fail
> context"? Could you elaborate on it, please?

It is preallocating memory: into a reserve pool associated with the
transaction, done as part of the transaction reservation mechanism
we already have in XFS. The allocator then uses that reserve pool
to allocate from if an allocation would otherwise fail.

There is no way we can preallocate specific objects before the
transaction - that's just insane, especially for the unbounded
demand-paged object requirement. Hence the need for a "preallocated
reserve pool" that the allocator can dip into, covering the memory
we need to *allocate and can't reclaim* during the course of the
transaction.
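
In API terms, something like the following interface sketch (all names
are invented here purely for illustration; nothing like this exists in
any tree), hooked into the existing transaction reservation path:

struct mem_reserve;

/* Taken together with the log space reservation, where we can still
 * block or back out on failure. */
struct mem_reserve *mem_reserve_create(size_t worst_case_bytes);

/* Inside the transaction: falls back to the reserve when a normal
 * allocation would fail, so it does not return NULL. */
void *mem_reserve_alloc(struct mem_reserve *res, size_t size, gfp_t gfp);

/* Unused reserve is returned at transaction commit or cancel. */
void mem_reserve_destroy(struct mem_reserve *res);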

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-20 10:36                                               ` Tetsuo Handa
@ 2015-02-20 23:15                                                 ` Dave Chinner
  -1 siblings, 0 replies; 276+ messages in thread
From: Dave Chinner @ 2015-02-20 23:15 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: hannes, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes,
	akpm, torvalds

On Fri, Feb 20, 2015 at 07:36:33PM +0900, Tetsuo Handa wrote:
> Dave Chinner wrote:
> > I really don't care about the OOM Killer corner cases - it's
> > completely the wrong line of development to be spending time on
> > and you aren't going to convince me otherwise. The OOM killer is a
> > crutch used to justify having a memory allocation subsystem that
> > can't provide forward progress guarantee mechanisms to callers that
> > need it.
> 
> I really care about the OOM Killer corner cases, for I'm
> 
>   (1) seeing trouble cases which occurred in enterprise systems
>       under OOM conditions

You reach OOM, then your SLAs are dead and buried. Reboot the
box - it's a much more reliable way of returning to a working system
than playing Russian Roulette with the OOM killer.

>   (2) trying to downgrade OOM "Deadlock or Genocide" attacks (which
>       an unprivileged user with a login shell can trivially trigger
>       since Linux 2.0) to OOM "Genocide" attacks in order to allow
>       OOM-unkillable daemons to restart OOM-killed processes
> 
>   (3) waiting for a bandaid for (2) in order to propose changes for
>       mitigating OOM "Genocide" attacks (as bad guys will find how to
>       trigger OOM "Deadlock or Genocide" attacks from changes for
>       mitigating OOM "Genocide" attacks)

Which is yet another indication that the OOM killer is the wrong
solution to the "lack of forward progress" problem. Anyone can
generate enough memory pressure to trigger the OOM killer; we can't
prevent that from occurring when the OOM killer can be invoked by
user processes.

> I started posting to linux-mm ML in order to make forward progress
> about (1) and (2). I don't want the memory allocation subsystem to
> lock up an entire system by indefinitely disabling memory releasing
> mechanism provided by the OOM killer.
> 
> > I've proposed a method of providing this forward progress guarantee
> > for subsystems of arbitrary complexity, and this removes the
> > dependency on the OOM killer for forward allocation progress in such
> > contexts (e.g. filesystems). We should be discussing how to
> > implement that, not what bandaids we need to apply to the OOM
> > killer. I want to fix the underlying problems, not push them under
> > the OOM-killer bus...
> 
> I'm fine with that direction for new kernels, provided that a simple
> bandaid is implemented which can be backported to distributor kernels
> and makes OOM "Deadlock" attacks impossible. Therefore, I'm discussing
> what bandaids we need to apply to the OOM killer.

The band-aids being proposed are worse than the problem they are
intended to cover up. In which case, the band-aids should not be
applied.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-20 23:15                                                 ` Dave Chinner
@ 2015-02-21  3:20                                                   ` Theodore Ts'o
  -1 siblings, 0 replies; 276+ messages in thread
From: Theodore Ts'o @ 2015-02-21  3:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, hannes, mhocko, dchinner, linux-mm, rientjes, oleg,
	akpm, mgorman, torvalds, xfs, linux-ext4

+akpm

So I'm arriving late to this discussion since I've been in conference
mode for the past week, and I'm only now catching up on this thread.

I'll note that this whole question of whether or not file systems
should use GFP_NOFAIL is one where the mm developers are not of one
mind.

In fact, search for the subject line "fs/reiserfs/journal.c: Remove
obsolete __GFP_NOFAIL", where we recapitulated many of these arguments;
Andrew Morton said that it was better to use GFP_NOFAIL over the
alternatives of (a) panic'ing the kernel because the file system has
no way to move forward other than leaving the file system corrupted,
or (b) looping in the file system to retry the memory allocation to
avoid the unfortunate effects of (a).

So based on akpm's sage advice and wisdom, I added back GFP_NOFAIL to
ext4/jbd2.

It sounds like 9879de7373fc is causing massive file system
errors, and it seems **really** unfortunate it was added so late in
the day (between -rc6 and -rc7).

So at this point, it seems we have two choices.  We can either revert
9879de7373fc, or I can add a whole lot more __GFP_NOFAIL flags to ext4's
memory allocations and submit them as stable bug fixes.

Linux MM developers, this is your call.  I will liberally be adding
GFP_NOFAIL to ext4 if you won't revert the commit, because that's the
only way I can fix things with minimal risk of adding additional,
potentially more serious regressions.

						- Ted

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-21  3:20                                                   ` Theodore Ts'o
@ 2015-02-21  9:19                                                     ` Andrew Morton
  -1 siblings, 0 replies; 276+ messages in thread
From: Andrew Morton @ 2015-02-21  9:19 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Dave Chinner, Tetsuo Handa, hannes, mhocko, dchinner, linux-mm,
	rientjes, oleg, mgorman, torvalds, xfs, linux-ext4

On Fri, 20 Feb 2015 22:20:00 -0500 "Theodore Ts'o" <tytso@mit.edu> wrote:

> +akpm

I was hoping not to have to read this thread ;)

afaict there are two (main) issues:

a) whether to oom-kill when __GFP_FS is not set.  The kernel hasn't
   been doing this for ages and nothing has changed recently.

b) whether to keep looping when __GFP_NOFAIL is not set and __GFP_FS
   is not set and we can't oom-kill anything (which goes without
   saying, because __GFP_FS isn't set!).

   And 9879de7373fc ("mm: page_alloc: embed OOM killing naturally
   into allocation slowpath") somewhat inadvertently changed this policy
   - the allocation attempt will now promptly return ENOMEM if
   !__GFP_NOFAIL and !__GFP_FS.

Correct enough?

Question a) seems a bit of a red herring and we can park it for now.


What I'm not really understanding is why the pre-3.19 implementation
actually worked.  We've exhausted the free pages, we're not succeeding
at reclaiming anything, we aren't able to oom-kill anyone.  Yet it
*does* work - we eventually find that memory and everything proceeds.

How come?  Where did that memory come from?


Short term, we need to fix 3.19.x and 3.20, and the fix appears to be
applying Johannes's akpm-doesnt-know-why-it-works patch:

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		if (high_zoneidx < ZONE_NORMAL)
 			goto out;
 		/* The OOM killer does not compensate for light reclaim */
-		if (!(gfp_mask & __GFP_FS))
+		if (!(gfp_mask & __GFP_FS)) {
+			/*
+			 * XXX: Page reclaim didn't yield anything,
+			 * and the OOM killer can't be invoked, but
+			 * keep looping as per should_alloc_retry().
+			 */
+			*did_some_progress = 1;
 			goto out;
+		}
 		/*
 		 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
 		 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.

Have people adequately confirmed that this gets us out of trouble?


And yes, I agree that sites such as xfs's kmem_alloc() should be
passing __GFP_NOFAIL to tell the page allocator what's going on.  I
don't think it matters a lot whether kmem_alloc() retains its retry
loop.  If __GFP_NOFAIL is working correctly then it will never loop
anyway...
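
For context, the pattern in question looks roughly like this (a
simplified sketch, not the actual fs/xfs/kmem.c code):

#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>

static void *kmem_alloc_sketch(size_t size, gfp_t gfp)
{
        unsigned int tries = 0;
        void *p;

        do {
                p = kmalloc(size, gfp);
                if (p)
                        return p;
                /* Warn occasionally, then back off and retry forever. */
                if (!(++tries % 100))
                        pr_err("possible memory allocation deadlock (size %zu)\n",
                               size);
                congestion_wait(BLK_RW_ASYNC, HZ / 50);
        } while (1);
}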


Also, this:

On Wed, 18 Feb 2015 09:54:30 +1100 Dave Chinner <david@fromorbit.com> wrote:

> Right now, the oom killer is a liability. Over the past 6 months
> I've slowly had to exclude filesystem regression tests from running
> on small memory machines because the OOM killer is now so unreliable
> that it kills the test harness regularly rather than the process
> generating memory pressure.

David, I did not know this!  If you've been telling us about this then
perhaps it wasn't loud enough.

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-20 23:15                                                 ` Dave Chinner
@ 2015-02-21 11:12                                                   ` Tetsuo Handa
  -1 siblings, 0 replies; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-21 11:12 UTC (permalink / raw)
  To: david
  Cc: hannes, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes,
	akpm, torvalds

My main issue is

  c) whether to oom-kill more processes when the OOM victim cannot be
     terminated, presumably due to an OOM killer deadlock.

Dave Chinner wrote:
> On Fri, Feb 20, 2015 at 07:36:33PM +0900, Tetsuo Handa wrote:
> > Dave Chinner wrote:
> > > I really don't care about the OOM Killer corner cases - it's
> > > completely the wrong line of development to be spending time on
> > > and you aren't going to convince me otherwise. The OOM killer is a
> > > crutch used to justify having a memory allocation subsystem that
> > > can't provide forward progress guarantee mechanisms to callers that
> > > need it.
> > 
> > I really care about the OOM Killer corner cases, for I'm
> > 
> >   (1) seeing trouble cases which occurred in enterprise systems
> >       under OOM conditions
> 
> You reach OOM, then your SLAs are dead and buried. Reboot the
> box - it's a much more reliable way of returning to a working system
> than playing Russian Roulette with the OOM killer.

What Service Level Agreements? Such troubles are occurring on RHEL systems
where users are not sitting in front of the console. Unless somebody is
sitting in front of the console, ready to issue SysRq-b when trouble
occurs, the system's downtime becomes significantly longer.

What mechanisms are available for minimizing system downtime when
trouble occurs under OOM conditions? Software/hardware watchdogs?
Indeed they may help, but they may be triggered prematurely while the
system has not yet entered the OOM condition. Only the OOM killer knows.

> 
> >   (2) trying to downgrade OOM "Deadlock or Genocide" attacks (which
> >       an unprivileged user with a login shell can trivially trigger
> >       since Linux 2.0) to OOM "Genocide" attacks in order to allow
> >       OOM-unkillable daemons to restart OOM-killed processes
> > 
> >   (3) waiting for a bandaid for (2) in order to propose changes for
> >       mitigating OOM "Genocide" attacks (as bad guys will find how to
> >       trigger OOM "Deadlock or Genocide" attacks from changes for
> >       mitigating OOM "Genocide" attacks)
> 
> Which is yet another indication that the OOM killer is the wrong
> solution to the "lack of forward progress" problem. Any one can
> generate enough memory pressure to trigger the OOM killer; we can't
> prevent that from occurring when the OOM killer can be invoked by
> user processes.
> 

We have memory cgroups to reduce the possibility of triggering the OOM
killer, though several bugs remaining in RHEL kernels make
administrators hesitate to use them.

> > I started posting to linux-mm ML in order to make forward progress
> > about (1) and (2). I don't want the memory allocation subsystem to
> > lock up an entire system by indefinitely disabling memory releasing
> > mechanism provided by the OOM killer.
> > 
> > > I've proposed a method of providing this forward progress guarantee
> > > for subsystems of arbitrary complexity, and this removes the
> > > dependency on the OOM killer for forward allocation progress in such
> > > contexts (e.g. filesystems). We should be discussing how to
> > > implement that, not what bandaids we need to apply to the OOM
> > > killer. I want to fix the underlying problems, not push them under
> > > the OOM-killer bus...
> > 
> > I'm fine with that direction for new kernels provided that a simple
> > bandaid which can be backported to distributor kernels for making
> > OOM "Deadlock" attacks impossible is implemented. Therefore, I'm
> > discussing what bandaids we need to apply to the OOM killer.
> 
> The band-aids being proposed are worse than the problem they are
> intended to cover up. In which case, the band-aids should not be
> applied.
> 

The problem is simple. The /proc/sys/vm/panic_on_oom == 0 setting does not
help if the OOM killer fails to determine the correct task to kill and
grant it access to memory reserves. Under the OOM deadlock condition, the
OOM killer waits forever rather than triggering a kernel panic.

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/2/html/Realtime_Tuning_Guide/sect-Realtime_Tuning_Guide-General_System_Tuning-Swapping_and_Out_Of_Memory_Tips.html
says that "Usually, oom_killer can kill rogue processes and the system
will survive." but says nothing about what to do when we hit the OOM
killer deadlock condition.

My band-aids allow the OOM killer to trigger a kernel panic (optionally
followed by kdump and automatic reboot) for people who want to reboot
the box when the default /proc/sys/vm/panic_on_oom == 0 setting fails to
kill rogue processes, and allow choosing further victims for people who
want the system to survive when the OOM killer fails to determine the
correct task to kill and grant it access to memory reserves.
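
For illustration, here is a hypothetical sketch of such a band-aid (all
names are invented; this is not a posted patch): panic once a TIF_MEMDIE
task has been stuck longer than a sysctl-tunable timeout, so that kdump
and an automatic reboot can take over.

#include <linux/sched.h>
#include <linux/jiffies.h>
#include <linux/kernel.h>

static unsigned long memdie_start;	/* jiffies when TIF_MEMDIE was set */
static unsigned int sysctl_memdie_timeout_secs;	/* 0 == current behaviour */

static void check_memdie_timeout(struct task_struct *victim)
{
	if (!sysctl_memdie_timeout_secs)
		return;
	if (time_after(jiffies,
		       memdie_start + sysctl_memdie_timeout_secs * HZ))
		panic("OOM victim %d (%s) stuck for %u seconds\n",
		      task_pid_nr(victim), victim->comm,
		      sysctl_memdie_timeout_secs);
}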

Not only can we not expect the OOM killer messages to be saved to
/var/log/messages under the OOM killer deadlock condition, we do not
even emit the OOM killer messages if we hit

    void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
                          unsigned int points, unsigned long totalpages,
                          struct mem_cgroup *memcg, nodemask_t *nodemask,
                          const char *message)
    {
            struct task_struct *victim = p;
            struct task_struct *child;
            struct task_struct *t;
            struct mm_struct *mm;
            unsigned int victim_points = 0;
            static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
                                                  DEFAULT_RATELIMIT_BURST);
    
            /*
             * If the task is already exiting, don't alarm the sysadmin or kill
             * its children or threads, just set TIF_MEMDIE so it can die quickly
             */
            if (task_will_free_mem(p)) { /***** _THIS_ _CONDITION_ *****/
                    set_tsk_thread_flag(p, TIF_MEMDIE);
                    put_task_struct(p);
                    return;
            }
    
            if (__ratelimit(&oom_rs))
                    dump_header(p, gfp_mask, order, memcg, nodemask);
    
            task_lock(p);
            pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
                    message, task_pid_nr(p), p->comm, points);
            task_unlock(p);

followed by entering the OOM killer deadlock condition. This is
annoying for me because neither serial console nor netconsole helps
me find out that the system has entered the OOM condition.

If you want to stop people from playing Russian Roulette with the OOM
killer, please remove the OOM killer code entirely from RHEL kernels so that
people must run their systems with a hardcoded /proc/sys/vm/panic_on_oom == 1
setting. Can you do that?

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

^ permalink raw reply	[flat|nested] 276+ messages in thread


* Re: How to handle TIF_MEMDIE stalls?
  2015-02-21  3:20                                                   ` Theodore Ts'o
  (?)
@ 2015-02-21 12:00                                                     ` Tetsuo Handa
  -1 siblings, 0 replies; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-21 12:00 UTC (permalink / raw)
  To: tytso
  Cc: david, hannes, mhocko, dchinner, linux-mm, rientjes, oleg, akpm,
	mgorman, torvalds, xfs, linux-ext4

Theodore Ts'o wrote:
> So at this point, it seems we have two choices.  We can either revert
> 9879de7373fc, or I can add __GFP_NOFAIL to a whole lot more of ext4's
> memory allocations and submit them as stable bug fixes.

Can you absorb this side effect by adding __GFP_NOFAIL to only ext4's
memory allocations? Don't you also depend on lower layers which use
GFP_NOIO?

BTW, while you are using an open-coded __GFP_NOFAIL-style retry loop for
the GFP_NOFS allocation in jbd2, you are already using __GFP_NOFAIL for
the GFP_NOFS allocation in jbd. The failure check there seems redundant,
since a __GFP_NOFAIL allocation never returns NULL.

---------- linux-3.19/fs/jbd2/transaction.c ----------
static int start_this_handle(journal_t *journal, handle_t *handle,
			     gfp_t gfp_mask)
{
	transaction_t	*transaction, *new_transaction = NULL;
	int		blocks = handle->h_buffer_credits;
	int		rsv_blocks = 0;
	unsigned long ts = jiffies;

	/*
	 * 1/2 of transaction can be reserved so we can practically handle
	 * only 1/2 of maximum transaction size per operation
	 */
	if (WARN_ON(blocks > journal->j_max_transaction_buffers / 2)) {
		printk(KERN_ERR "JBD2: %s wants too many credits (%d > %d)\n",
		       current->comm, blocks,
		       journal->j_max_transaction_buffers / 2);
		return -ENOSPC;
	}

	if (handle->h_rsv_handle)
		rsv_blocks = handle->h_rsv_handle->h_buffer_credits;

alloc_transaction:
	if (!journal->j_running_transaction) {
		new_transaction = kmem_cache_zalloc(transaction_cache,
						    gfp_mask);
		if (!new_transaction) {
			/*
			 * If __GFP_FS is not present, then we may be
			 * being called from inside the fs writeback
			 * layer, so we MUST NOT fail.  Since
			 * __GFP_NOFAIL is going away, we will arrange
			 * to retry the allocation ourselves.
			 */
			if ((gfp_mask & __GFP_FS) == 0) {
				congestion_wait(BLK_RW_ASYNC, HZ/50);
				goto alloc_transaction;
			}
			return -ENOMEM;
		}
	}

	jbd_debug(3, "New handle %p going live.\n", handle);
---------- linux-3.19/fs/jbd2/transaction.c ----------

---------- linux-3.19/fs/jbd/transaction.c ----------
static int start_this_handle(journal_t *journal, handle_t *handle)
{
	transaction_t *transaction;
	int needed;
	int nblocks = handle->h_buffer_credits;
	transaction_t *new_transaction = NULL;
	int ret = 0;

	if (nblocks > journal->j_max_transaction_buffers) {
		printk(KERN_ERR "JBD: %s wants too many credits (%d > %d)\n",
		       current->comm, nblocks,
		       journal->j_max_transaction_buffers);
		ret = -ENOSPC;
		goto out;
	}

alloc_transaction:
	if (!journal->j_running_transaction) {
		new_transaction = kzalloc(sizeof(*new_transaction),
					  GFP_NOFS|__GFP_NOFAIL);
		if (!new_transaction) {
			ret = -ENOMEM;
			goto out;
		}
	}

	jbd_debug(3, "New handle %p going live.\n", handle);
---------- linux-3.19/fs/jbd/transaction.c ----------
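
To illustrate the point, a hypothetical rewrite (not a posted patch) of
the jbd2 hunk above: passing __GFP_NOFAIL for the !__GFP_FS case lets
the allocator do the retrying, the open-coded congestion_wait() loop
disappears, and the NULL check remains meaningful only for the failable
__GFP_FS case.

	if (!journal->j_running_transaction) {
		/* Hypothetical: let the allocator retry the must-not-fail case. */
		if ((gfp_mask & __GFP_FS) == 0)
			gfp_mask |= __GFP_NOFAIL;
		new_transaction = kmem_cache_zalloc(transaction_cache,
						    gfp_mask);
		if (!new_transaction)	/* only reachable with __GFP_FS set */
			return -ENOMEM;
	}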

^ permalink raw reply	[flat|nested] 276+ messages in thread


* Re: How to handle TIF_MEMDIE stalls?
  2015-02-21  9:19                                                     ` Andrew Morton
  (?)
@ 2015-02-21 13:48                                                       ` Tetsuo Handa
  -1 siblings, 0 replies; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-21 13:48 UTC (permalink / raw)
  To: akpm
  Cc: tytso, david, hannes, mhocko, dchinner, linux-mm, rientjes, oleg,
	mgorman, torvalds, xfs, linux-ext4

Andrew Morton wrote:
> On Fri, 20 Feb 2015 22:20:00 -0500 "Theodore Ts'o" <tytso@mit.edu> wrote:
> 
> > +akpm
> 
> I was hoping not to have to read this thread ;)

Sorry that this thread has gotten so complicated.

> What I'm not really understanding is why the pre-3.19 implementation
> actually worked.  We've exhausted the free pages, we're not succeeding
> at reclaiming anything, we aren't able to oom-kill anyone.  Yet it
> *does* work - we eventually find that memory and everything proceeds.
> 
> How come?  Where did that memory come from?
> 

Even without __GFP_NOFAIL, GFP_NOFS / GFP_NOIO allocations retried forever
(without invoking the OOM killer) if order <= PAGE_ALLOC_COSTLY_ORDER and
TIF_MEMDIE was not set. The memory came from somebody else releasing it
while we retried. This implies a silent hang-up forever if nobody
volunteers memory.
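
A simplified sketch of the retry decision I am describing, condensed
from the 3.18-era should_alloc_retry() (not the verbatim kernel code;
the suspend-related checks that use did_some_progress are elided):

static inline int should_alloc_retry(gfp_t gfp_mask, unsigned int order,
				     unsigned long did_some_progress,
				     unsigned long pages_reclaimed)
{
	/* Do not loop if specifically requested. */
	if (gfp_mask & __GFP_NORETRY)
		return 0;

	/* Always retry if specifically requested. */
	if (gfp_mask & __GFP_NOFAIL)
		return 1;

	/*
	 * Small orders were retried forever: this is the branch that
	 * GFP_NOFS / GFP_NOIO allocations took before 3.19, whether or
	 * not reclaim made progress or the OOM killer could run.
	 */
	if (order <= PAGE_ALLOC_COSTLY_ORDER)
		return 1;

	/* Costly orders retry only while reclaim keeps making progress. */
	if ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))
		return 1;

	return 0;
}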

> And yes, I agree that sites such as xfs's kmem_alloc() should be
> passing __GFP_NOFAIL to tell the page allocator what's going on.  I
> don't think it matters a lot whether kmem_alloc() retains its retry
> loop.  If __GFP_NOFAIL is working correctly then it will never loop
> anyway...

Commit 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into
allocation slowpath") inadvertently changed GFP_NOFS / GFP_NOIO allocations
not to retry unless __GFP_NOFAIL is specified. Therefore, either applying
Johannes's akpm-doesnt-know-why-it-works patch or passing __GFP_NOFAIL
will restore the pre-3.19 behavior (with the possibility of silent hang-up).

^ permalink raw reply	[flat|nested] 276+ messages in thread


* Re: How to handle TIF_MEMDIE stalls?
  2015-02-21  9:19                                                     ` Andrew Morton
  (?)
@ 2015-02-21 21:38                                                       ` Dave Chinner
  -1 siblings, 0 replies; 276+ messages in thread
From: Dave Chinner @ 2015-02-21 21:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Theodore Ts'o, Tetsuo Handa, hannes, mhocko, dchinner,
	linux-mm, rientjes, oleg, mgorman, torvalds, xfs, linux-ext4

On Sat, Feb 21, 2015 at 01:19:07AM -0800, Andrew Morton wrote:
> On Fri, 20 Feb 2015 22:20:00 -0500 "Theodore Ts'o" <tytso@mit.edu> wrote:
> 
> > +akpm
> 
> I was hoping not to have to read this thread ;)

ditto....

> And yes, I agree that sites such as xfs's kmem_alloc() should be
> passing __GFP_NOFAIL to tell the page allocator what's going on.  I
> don't think it matters a lot whether kmem_alloc() retains its retry
> loop.  If __GFP_NOFAIL is working correctly then it will never loop
> anyway...

I'm not about to change behaviour "just because". Any sort of change
like this requires a *lot* of low memory regression testing, because
we'd be replacing long standing known behaviour with behaviour that
changes without warning, e.g. the ext4 low memory failures that started
because of oom-killer behaviour changes made in 3.19-rc6.
Those changes *did not affect XFS* and that's the way I'd like
things to remain.

Put simply: right now I don't trust the mm subsystem to get low memory
behaviour right, and this thread has done nothing to convince me
that it's going to improve any time soon.

> Also, this:
> 
> On Wed, 18 Feb 2015 09:54:30 +1100 Dave Chinner <david@fromorbit.com> wrote:
> 
> > Right now, the oom killer is a liability. Over the past 6 months
> > I've slowly had to exclude filesystem regression tests from running
> > on small memory machines because the OOM killer is now so unreliable
> > that it kills the test harness regularly rather than the process
> > generating memory pressure.
> 
> David, I did not know this!  If you've been telling us about this then
> perhaps it wasn't loud enough.

IME, such bug reports get ignored.

Instead, over the past few months I have been pointing out bugs and
problems in the oom-killer in threads like this because it seems to
be the only way to get any attention to the issues I'm seeing. Bug
reports simply get ignored.  From this process, I've managed to
learn that low order memory allocation now never fails (contrary to
documentation and long standing behavioural expectations) and
pointed out bugs that cause the oom killer to get invoked when the
filesystem is saying "I can handle ENOMEM!" (commit 45f87de ("mm:
get rid of radix tree gfp mask for pagecache_get_page").

And yes, I've definitely mentioned in these discussions that, for
example, xfstests::generic/224 is triggering the oom killer far more
often than it used to on my 1GB RAM vm. The only fix that has been
made recently that's made any difference is 45f87de, so it's a slow
process of raising awareness and trying to ensure things don't get
worse before they get better....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 276+ messages in thread


* Re: How to handle TIF_MEMDIE stalls?
  2015-02-21 11:12                                                   ` Tetsuo Handa
@ 2015-02-21 21:48                                                     ` Dave Chinner
  -1 siblings, 0 replies; 276+ messages in thread
From: Dave Chinner @ 2015-02-21 21:48 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: hannes, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes,
	akpm, torvalds

On Sat, Feb 21, 2015 at 08:12:08PM +0900, Tetsuo Handa wrote:
> My main issue is
> 
>   c) whether to oom-kill more processes when the OOM victim cannot be
>      terminated presumably due to the OOM killer deadlock.
> 
> Dave Chinner wrote:
> > On Fri, Feb 20, 2015 at 07:36:33PM +0900, Tetsuo Handa wrote:
> > > Dave Chinner wrote:
> > > > I really don't care about the OOM Killer corner cases - it's
> > > > completely the wrong line of development to be spending time on
> > > > and you aren't going to convince me otherwise. The OOM killer is
> > > > a crutch used to justify having a memory allocation subsystem that
> > > > can't provide forward progress guarantee mechanisms to callers that
> > > > need it.
> > > 
> > > I really care about the OOM Killer corner cases, for I'm
> > > 
> > >   (1) seeing trouble cases which occurred in enterprise systems
> > >       under OOM conditions
> > 
> > You reach OOM, then your SLAs are dead and buried. Reboot the
> > box - it's a much more reliable way of returning to a working system
> > than playing Russian Roulette with the OOM killer.
> 
> What Service Level Agreements? Such troubles are occurring on RHEL systems
> where users are not sitting in front of the console. Unless somebody is
> sitting in front of the console, ready to issue SysRq-b when trouble
> occurs, the system's downtime becomes significantly longer.
>
> What mechanisms are available for minimizing system downtime when
> trouble occurs under OOM conditions? Software/hardware watchdogs?
> Indeed they may help, but they may be triggered prematurely while the
> system has not yet entered the OOM condition. Only the OOM killer knows.

# echo 1 > /proc/sys/vm/panic_on_oom

....

> We have memory cgroups to reduce the possibility of triggering the OOM
> killer, though several bugs remaining in RHEL kernels make
> administrators hesitate to use them.

Fix upstream first, then worry about vendor kernels.

....

> Not only can we not expect the OOM killer messages to be saved to
> /var/log/messages under the OOM killer deadlock condition, but also

CONFIG_PSTORE=y and configure appropriately from there.
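
For example, a hypothetical minimal selection (which backend is right
depends on the platform):

CONFIG_PSTORE=y
CONFIG_PSTORE_CONSOLE=y
CONFIG_PSTORE_RAM=y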

> we do not even emit the OOM killer messages if we hit

So add a warning.

> If you want to stop people from playing Russian Roulette with the OOM
> killer, please remove the OOM killer code entirely from RHEL kernels so that
> people must run their systems with a hardcoded /proc/sys/vm/panic_on_oom == 1
> setting. Can you do that?

No. You need to go through vendor channels to get a vendor kernel
config change made.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 276+ messages in thread


* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19 22:52                                             ` Dave Chinner
@ 2015-02-21 23:52                                               ` Johannes Weiner
  -1 siblings, 0 replies; 276+ messages in thread
From: Johannes Weiner @ 2015-02-21 23:52 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, akpm, torvalds

On Fri, Feb 20, 2015 at 09:52:17AM +1100, Dave Chinner wrote:
> I will actively work around anything that causes filesystem memory
> pressure to increase the chance of oom killer invocations. The OOM
> killer is not a solution - it is, by definition, a loose cannon and
> so we should be reducing dependencies on it.

Once we have a better-working alternative, sure.

> I really don't care about the OOM Killer corner cases - it's
> completely the wrong line of development to be spending time on
> and you aren't going to convince me otherwise. The OOM killer is
> a crutch used to justify having a memory allocation subsystem that
> can't provide forward progress guarantee mechanisms to callers that
> need it.

We can provide this.  Are all these callers able to preallocate?

---

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 51bd1e72a917..af81b8a67651 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -380,6 +380,10 @@ extern void free_kmem_pages(unsigned long addr, unsigned int order);
 #define __free_page(page) __free_pages((page), 0)
 #define free_page(addr) free_pages((addr), 0)
 
+void register_private_page(struct page *page, unsigned int order);
+int alloc_private_pages(gfp_t gfp_mask, unsigned int order, unsigned int nr);
+void free_private_pages(void);
+
 void page_alloc_init(void);
 void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
 void drain_all_pages(struct zone *zone);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d77432e14ff..1fe390779f23 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1545,6 +1545,8 @@ struct task_struct {
 #endif
 
 /* VM state */
+	struct list_head private_pages;
+
 	struct reclaim_state *reclaim_state;
 
 	struct backing_dev_info *backing_dev_info;
diff --git a/kernel/fork.c b/kernel/fork.c
index cf65139615a0..b6349b0e5da2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1308,6 +1308,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	memset(&p->rss_stat, 0, sizeof(p->rss_stat));
 #endif
 
+	INIT_LIST_HEAD(&p->private_pages);
+
 	p->default_timer_slack_ns = current->timer_slack_ns;
 
 	task_io_accounting_init(&p->ioac);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a47f0b229a1a..546db4e0da75 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -490,12 +490,10 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
 static inline void set_page_order(struct page *page, unsigned int order)
 {
 	set_page_private(page, order);
-	__SetPageBuddy(page);
 }
 
 static inline void rmv_page_order(struct page *page)
 {
-	__ClearPageBuddy(page);
 	set_page_private(page, 0);
 }
 
@@ -617,6 +615,7 @@ static inline void __free_one_page(struct page *page,
 			list_del(&buddy->lru);
 			zone->free_area[order].nr_free--;
 			rmv_page_order(buddy);
+			__ClearPageBuddy(buddy);
 		}
 		combined_idx = buddy_idx & page_idx;
 		page = page + (combined_idx - page_idx);
@@ -624,6 +623,7 @@ static inline void __free_one_page(struct page *page,
 		order++;
 	}
 	set_page_order(page, order);
+	__SetPageBuddy(page);
 
 	/*
 	 * If this is not the largest possible page, check if the buddy
@@ -924,6 +924,7 @@ static inline void expand(struct zone *zone, struct page *page,
 		list_add(&page[size].lru, &area->free_list[migratetype]);
 		area->nr_free++;
 		set_page_order(&page[size], high);
+		__SetPageBuddy(&page[size]);
 	}
 }
 
@@ -1015,6 +1016,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 							struct page, lru);
 		list_del(&page->lru);
 		rmv_page_order(page);
+		__ClearPageBuddy(page);
 		area->nr_free--;
 		expand(zone, page, order, current_order, area, migratetype);
 		set_freepage_migratetype(page, migratetype);
@@ -1212,6 +1214,7 @@ __rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
 			/* Remove the page from the freelists */
 			list_del(&page->lru);
 			rmv_page_order(page);
+			__ClearPageBuddy(page);
 
 			expand(zone, page, order, current_order, area,
 					buddy_type);
@@ -1598,6 +1601,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
 	list_del(&page->lru);
 	zone->free_area[order].nr_free--;
 	rmv_page_order(page);
+	__ClearPageBuddy(page);
 
 	/* Set the pageblock if the isolated page is at least a pageblock */
 	if (order >= pageblock_order - 1) {
@@ -2504,6 +2508,40 @@ retry:
 	return page;
 }
 
+/* Try to allocate from the caller's private memory reserves */
+static inline struct page *
+__alloc_pages_private(gfp_t gfp_mask, unsigned int order,
+		      const struct alloc_context *ac)
+{
+	unsigned int uninitialized_var(alloc_order);
+	struct page *page = NULL;
+	struct page *p;
+
+	/* Dopy, but this is a slowpath right before OOM */
+	list_for_each_entry(p, &current->private_pages, lru) {
+		int o = page_order(p);
+
+		if (o >= order && (!page || o < alloc_order)) {
+			page = p;
+			alloc_order = o;
+		}
+	}
+	if (!page)
+		return NULL;
+
+	list_del(&page->lru);
+	rmv_page_order(page);
+
+	/* Give back the remainder */
+	while (alloc_order > order) {
+		alloc_order--;
+		set_page_order(&page[1 << alloc_order], alloc_order);
+		list_add(&page[1 << alloc_order].lru, &current->private_pages);
+	}
+
+	return page;
+}
+
 /*
  * This is called in the allocator slow-path if the allocation request is of
  * sufficient urgency to ignore watermarks and take other desperate measures
@@ -2753,9 +2791,13 @@ retry:
 		/*
 		 * If we fail to make progress by freeing individual
 		 * pages, but the allocation wants us to keep going,
-		 * start OOM killing tasks.
+		 * dip into private reserves, or start OOM killing.
 		 */
 		if (!did_some_progress) {
+			page = __alloc_pages_private(gfp_mask, order, ac);
+			if (page)
+				goto got_pg;
+
 			page = __alloc_pages_may_oom(gfp_mask, order, ac,
 							&did_some_progress);
 			if (page)
@@ -3046,6 +3088,82 @@ void free_pages_exact(void *virt, size_t size)
 EXPORT_SYMBOL(free_pages_exact);
 
 /**
+ * alloc_private_pages - allocate private memory reserve pages
+ * @gfp_mask: gfp flags for the allocations
+ * @order: order of pages to allocate
+ * @nr: number of pages to allocate
+ *
+ * This allocates @nr pages of order @order as an emergency reserve of
+ * the calling task, to be used by the page allocator if an allocation
+ * would otherwise fail.
+ *
+ * The caller is responsible for calling free_private_pages() once the
+ * reserves are no longer required.
+ */
+int alloc_private_pages(gfp_t gfp_mask, unsigned int order, unsigned int nr)
+{
+	struct page *page, *page2;
+	LIST_HEAD(pages);
+	unsigned int i;
+
+	for (i = 0; i < nr; i++) {
+		page = alloc_pages(gfp_mask, order);
+		if (!page)
+			goto error;
+		set_page_order(page, order);
+		list_add(&page->lru, &pages);
+	}
+
+	list_splice(&pages, &current->private_pages);
+	return 0;
+
+error:
+	list_for_each_entry_safe(page, page2, &pages, lru) {
+		list_del(&page->lru);
+		rmv_page_order(page);
+		__free_pages(page, order);
+	}
+	return -ENOMEM;
+}
+
+/**
+ * register_private_page - register a private memory reserve page
+ * @page: pre-allocated page
+ * @order: @page's order
+ *
+ * This registers @page as an emergency reserve of the calling task,
+ * to be used by the page allocator if an allocation would otherwise
+ * fail.
+ *
+ * The caller is responsible for calling free_private_pages() once the
+ * reserves are no longer required.
+ */
+void register_private_page(struct page *page, unsigned int order)
+{
+	set_page_order(page, order);
+	list_add(&page->lru, &current->private_pages);
+}
+
+/**
+ * free_private_pages - free all private memory reserve pages
+ *
+ * Frees all (remaining) pages of the calling task's memory reserves
+ * established by alloc_private_pages() and register_private_page().
+ */
+void free_private_pages(void)
+{
+	struct page *page, *page2;
+
+	list_for_each_entry_safe(page, page2, &current->private_pages, lru) {
+		int order = page_order(page);
+
+		list_del(&page->lru);
+		rmv_page_order(page);
+		__free_pages(page, order);
+	}
+}
+
+/**
  * nr_free_zone_pages - count number of pages beyond high watermark
  * @offset: The zone index of the highest zone
  *
@@ -6551,6 +6669,7 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 #endif
 		list_del(&page->lru);
 		rmv_page_order(page);
+		__ClearPageBuddy(page);
 		zone->free_area[order].nr_free--;
 		for (i = 0; i < (1 << order); i++)
 			SetPageReserved((page+i));
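
A minimal caller sketch of the interface above; do_fs_transaction()
and its reserve size are hypothetical, not part of the patch:

	/* Pre-arm a per-task reserve before a must-not-fail section. */
	int do_fs_transaction(void)
	{
		int error;

		/* Assumed worst case: four order-0 pages pinned. */
		error = alloc_private_pages(GFP_KERNEL, 0, 4);
		if (error)
			return error;

		/*
		 * Allocations here may fall back to current->private_pages
		 * via __alloc_pages_private() instead of invoking the OOM
		 * killer.
		 */

		free_private_pages();	/* release whatever was not consumed */
		return 0;
	}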

^ permalink raw reply related	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-21  9:19                                                     ` Andrew Morton
@ 2015-02-22  0:20                                                       ` Johannes Weiner
  -1 siblings, 0 replies; 276+ messages in thread
From: Johannes Weiner @ 2015-02-22  0:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Theodore Ts'o, Dave Chinner, Tetsuo Handa, mhocko, dchinner,
	linux-mm, rientjes, oleg, mgorman, torvalds, xfs, linux-ext4

On Sat, Feb 21, 2015 at 01:19:07AM -0800, Andrew Morton wrote:
> Short term, we need to fix 3.19.x and 3.20 and that appears to be by
> applying Johannes's akpm-doesnt-know-why-it-works patch:
> 
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  		if (high_zoneidx < ZONE_NORMAL)
>  			goto out;
>  		/* The OOM killer does not compensate for light reclaim */
> -		if (!(gfp_mask & __GFP_FS))
> +		if (!(gfp_mask & __GFP_FS)) {
> +			/*
> +			 * XXX: Page reclaim didn't yield anything,
> +			 * and the OOM killer can't be invoked, but
> +			 * keep looping as per should_alloc_retry().
> +			 */
> +			*did_some_progress = 1;
>  			goto out;
> +		}
>  		/*
>  		 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
>  		 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
> 
> Have people adequately confirmed that this gets us out of trouble?

I'd be interested in this too.  Who is seeing these failures?

Andrew, can you please use the following changelog for this patch?

---
From: Johannes Weiner <hannes@cmpxchg.org>

mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change

Historically, !__GFP_FS allocations were not allowed to invoke the OOM
killer once reclaim had failed, but nevertheless kept looping in the
allocator.  9879de7373fc ("mm: page_alloc: embed OOM killing naturally
into allocation slowpath"), which should have been a simple cleanup
patch, accidentally changed the behavior to aborting the allocation at
that point.  This creates problems with filesystem callers (?) that
currently rely on the allocator waiting for other tasks to intervene.

Revert the behavior as it shouldn't have been changed as part of a
cleanup patch.

Fixes: 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into allocation slowpath")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---

^ permalink raw reply	[flat|nested] 276+ messages in thread

* __GFP_NOFAIL and oom_killer_disabled?
  2015-02-21  9:19                                                     ` Andrew Morton
                                                                       ` (3 preceding siblings ...)
  (?)
@ 2015-02-22 14:48                                                     ` Tetsuo Handa
  2015-02-23 10:21                                                       ` Michal Hocko
  -1 siblings, 1 reply; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-22 14:48 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, tytso, david, hannes, dchinner, linux-mm, rientjes, oleg,
	mgorman, torvalds

Andrew Morton wrote:
> And yes, I agree that sites such as xfs's kmem_alloc() should be
> passing __GFP_NOFAIL to tell the page allocator what's going on.  I
> don't think it matters a lot whether kmem_alloc() retains its retry
> loop.  If __GFP_NOFAIL is working correctly then it will never loop
> anyway...

__GFP_NOFAIL fails to work correctly if oom_killer_disabled == true.
I'm wondering how oom_killer_disable() interferes with __GFP_NOFAIL
allocation. We had a race check after setting oom_killer_disabled to true
in 3.19.

---------- linux-3.19/kernel/power/process.c ----------
int freeze_processes(void)
{
(...snipped...)
        pm_wakeup_clear();
        printk("Freezing user space processes ... ");
        pm_freezing = true;
        oom_kills_saved = oom_kills_count();
        error = try_to_freeze_tasks(true);
        if (!error) {
                __usermodehelper_set_disable_depth(UMH_DISABLED);
                oom_killer_disable();

                /*
                 * There might have been an OOM kill while we were
                 * freezing tasks and the killed task might be still
                 * on the way out so we have to double check for race.
                 */
                if (oom_kills_count() != oom_kills_saved &&
                    !check_frozen_processes()) {
                        __usermodehelper_set_disable_depth(UMH_ENABLED);
                        printk("OOM in progress.");
                        error = -EBUSY;
                } else {
                        printk("done.");
                }
        }
(...snipped...)
}
---------- linux-3.19/kernel/power/process.c ----------

I worry that commit c32b3cbe0d067a9c "oom, PM: make OOM detection in
the freezer path raceless" might have opened a race window for
__alloc_pages_may_oom(__GFP_NOFAIL) allocations to fail when the OOM killer
is disabled. I think something like

--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -789,7 +789,7 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	bool ret = false;
 
 	down_read(&oom_sem);
-	if (!oom_killer_disabled) {
+	if (!oom_killer_disabled || (gfp_mask & __GFP_NOFAIL)) {
 		__out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill);
 		ret = true;
 	}

is needed. But such a change can race with up_write() and wait_event() in
oom_killer_disable(). While the comment of oom_killer_disable() says
"The function cannot be called when there are runnable user tasks because
the userspace would see unexpected allocation failures as a result.",
aren't there still kernel threads which might do __GFP_NOFAIL allocations?
After all, don't we need to recheck after setting oom_killer_disabled to true?

---------- linux.git/kernel/power/process.c ----------
int freeze_processes(void)
{
(...snipped...)
        pm_wakeup_clear();
        pr_info("Freezing user space processes ... ");
        pm_freezing = true;
        error = try_to_freeze_tasks(true);
        if (!error) {
                __usermodehelper_set_disable_depth(UMH_DISABLED);
                pr_cont("done.");
        }
        pr_cont("\n");
        BUG_ON(in_atomic());

        /*
         * Now that the whole userspace is frozen we need to disbale
         * the OOM killer to disallow any further interference with
         * killable tasks.
         */
        if (!error && !oom_killer_disable())
                error = -EBUSY;
(...snipped...)
}
---------- linux.git/kernel/power/process.c ----------

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-21 23:52                                               ` Johannes Weiner
@ 2015-02-23  0:45                                                 ` Dave Chinner
  -1 siblings, 0 replies; 276+ messages in thread
From: Dave Chinner @ 2015-02-23  0:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, akpm, torvalds

On Sat, Feb 21, 2015 at 06:52:27PM -0500, Johannes Weiner wrote:
> On Fri, Feb 20, 2015 at 09:52:17AM +1100, Dave Chinner wrote:
> > I will actively work around anything that causes filesystem memory
> > pressure to increase the chance of oom killer invocations. The OOM
> > killer is not a solution - it is, by definition, a loose cannon and
> > so we should be reducing dependencies on it.
> 
> Once we have a better-working alternative, sure.

Great, but first a simple request: please stop writing code and
instead start architecting a solution to the problem. i.e. we need a
design and have that documented before code gets written. If you
watched my recent LCA talk, then you'll understand what I mean
when I say: stop programming and start engineering.

> > I really don't care about the OOM Killer corner cases - it's
> > completely the wrong line of development to be spending time on
> > and you aren't going to convince me otherwise. The OOM killer is a
> > crutch used to justify having a memory allocation subsystem that
> > can't provide forward progress guarantee mechanisms to callers that
> > need it.
> 
> We can provide this.  Are all these callers able to preallocate?

Anything that allocates in transaction context (and therefore is
GFP_NOFS by definition) can preallocate at transaction reservation
time. However, preallocation is dumb, complex, CPU and memory
intensive and will have a *massive* impact on performance.
Allocating 10-100 pages to a reserve which we will almost *never
use* and then freeing them again *on every single transaction* is a lot
of unnecessary additional fast path overhead.  Hence a "preallocate
for every context" reserve pool is not a viable solution.

And, really, "reservation" != "preallocation".

Maybe it's my filesystem background, but those two things are vastly
different.

Reservations are simply an *accounting* of the maximum amount of a
reserve required by an operation to guarantee forwards progress. In
filesystems, we do this for log space (transactions) and some do it
for filesystem space (e.g. delayed allocation needs correct ENOSPC
detection so we don't overcommit disk space).  The VM already has
such concepts (e.g. watermarks and things like min_free_kbytes) that
it uses to ensure that there are sufficient reserves for certain
types of allocations to succeed.

A reserve memory pool is no different - every time a memory reserve
occurs, a watermark is lifted to accommodate it, and the transaction
is not allowed to proceed until the amount of free memory exceeds
that watermark. The memory allocation subsystem then only allows
*allocations* marked correctly to allocate pages from the
reserve that watermark protects. e.g. only allocations using
__GFP_RESERVE are allowed to dip into the reserve pool.

By using watermarks, freeing of memory will automatically top
up the reserve pool which means that we guarantee that reclaimable
memory allocated for demand paging during transactions doesn't
deplete the reserve pool permanently.  As a result, when there is
plenty of free and/or reclaimable memory, the reserve pool
watermarks will have almost zero impact on performance and
behaviour.
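
A rough model of that accounting, with hypothetical names
(reserve_ctl, mem_reserve, __GFP_RESERVE); a sketch of the proposal,
not existing kernel code:

	struct reserve_ctl {
		long free_pages;	/* currently free */
		long min_watermark;	/* normal allocation floor */
		long reserved;		/* sum of active reservations */
	};

	/*
	 * Reserving lifts the watermark; the transaction may not proceed
	 * until free memory exceeds the lifted floor.
	 */
	static int mem_reserve(struct reserve_ctl *rc, long nr)
	{
		rc->reserved += nr;
		return rc->free_pages >= rc->min_watermark + rc->reserved ?
			0 : -EAGAIN;	/* caller reclaims and retries */
	}

	/* Only __GFP_RESERVE allocations may dip below the lifted floor. */
	static bool may_allocate(struct reserve_ctl *rc, bool gfp_reserve)
	{
		long floor = rc->min_watermark +
			     (gfp_reserve ? 0 : rc->reserved);
		return rc->free_pages > floor;
	}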

Further, because it's just accounting and behavioural thresholds,
this allows the mm subsystem to control how the reserve pool is
accounted internally. e.g. clean, reclaimable pages in the page
cache could serve as reserve pool pages as they can be immediately
reclaimed for allocation. This could be achieved by setting reclaim
targets first to the reserve pool watermark, then the second target
is enough pages to satisfy the current allocation.

And, FWIW, there's nothing stopping this mechanism from having
order-based reserve thresholds. e.g. IB could really do with a 64k reserve
pool threshold and hence help solve the long standing problems they
have with filling the receive ring in GFP_ATOMIC context...

Sure, that's looking further down the track, but my point still
remains: we need a viable long term solution to this problem. Maybe
reservations are not the solution, but I don't see anyone else who
is thinking of how to address this architectural problem at a system
level right now.  We need to design and document the model first,
then review it, then we can start working at the code level to
implement the solution we've designed.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-23  0:45                                                 ` Dave Chinner
@ 2015-02-23  1:29                                                   ` Andrew Morton
  -1 siblings, 0 replies; 276+ messages in thread
From: Andrew Morton @ 2015-02-23  1:29 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Johannes Weiner, Tetsuo Handa, dchinner, oleg, xfs, mhocko,
	linux-mm, mgorman, rientjes, torvalds

On Mon, 23 Feb 2015 11:45:21 +1100 Dave Chinner <david@fromorbit.com> wrote:

> > > I really don't care about the OOM Killer corner cases - it's
> > > completely the wrong way line of development to be spending time on
> > > and you aren't going to convince me otherwise. The OOM killer a
> > > crutch used to justify having a memory allocation subsystem that
> > > can't provide forward progress guarantee mechanisms to callers that
> > > need it.
> > 
> > We can provide this.  Are all these callers able to preallocate?
> 
> Anything that allocates in transaction context (and therefor is
> GFP_NOFS by definition) can preallocate at transaction reservation
> time. However, preallocation is dumb, complex, CPU and memory
> intensive and will have a *massive* impact on performance.
> Allocating 10-100 pages to a reserve which we will almost *never
> use* and then free them again *on every single transaction* is a lot
> of unnecessary additional fast path overhead.  Hence a "preallocate
> for every context" reserve pool is not a viable solution.

Yup.

> Reservations are simply an *accounting* of the maximum amount of a
> reserve required by an operation to guarantee forwards progress. In
> filesystems, we do this for log space (transactions) and some do it
> for filesystem space (e.g. delayed allocation needs correct ENOSPC
> detection so we don't overcommit disk space).  The VM already has
> such concepts (e.g. watermarks and things like min_free_kbytes) that
> it uses to ensure that there are sufficient reserves for certain
> types of allocations to succeed.

Yes, as we do for __GFP_HIGH and PF_MEMALLOC etc.  Add a dynamic
reserve.  So to reserve N pages we increase the page allocator dynamic
reserve by N, do some reclaim if necessary then deposit N tokens into
the caller's task_struct (it'll be a set of zone/nr-pages tuples I
suppose).
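
One way those tokens could look, as a sketch (names hypothetical):

	struct reserve_token {
		struct zone	*zone;
		unsigned long	nr_pages;
	};

	/*
	 * task_struct would then carry something like
	 *	struct reserve_token reserves[MAX_NR_ZONES];
	 * which the allocator drains in preference to the freelists.
	 */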

When allocating pages the caller should drain its reserves in
preference to dipping into the regular freelist.  This guy has already
done his reclaim and shouldn't be penalised a second time.  I guess
Johannes's preallocation code should switch to doing this for the same
reason, plus the fact that snipping a page off
task_struct.prealloc_pages is super-fast and needs to be done sometime
anyway so why not do it by default.

Both reservation and preallocation are vulnerable to deadlocks - 10,000
tasks all trying to reserve/prealloc 100 pages, they all have 50 pages
and we ran out of memory.  Whoops.  We can undeadlock by returning
ENOMEM but I suspect there will still be problematic situations where
massive numbers of pages are temporarily AWOL.  Perhaps some form of
queuing and throttling will be needed, to limit the peak number of
reserved pages.  Per zone, I guess.
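
(To put that in concrete terms, assuming 4 KiB pages: 10,000 tasks
each sitting on 50 reserved pages pins 10,000 * 50 * 4 KiB, roughly
1.9 GiB, while every one of them still waits for its remaining 50.)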

And it'll be a huge pain handling order>0 pages.  I'd be inclined to
make it order-0 only, and tell the lamer callers that
vmap-is-thattaway.  Alas, one lame caller is slub.


But the biggest issue is how the heck does a caller work out how many
pages to reserve/prealloc?  Even a single sb_bread() - it's sitting on
loop on a sparse NTFS file on loop on a five-deep DM stack on a
six-deep MD stack on loop on NFS on an eleventy-deep networking stack. 
And then there will be an unknown number of slab allocations of unknown
size with unknown slabs-per-page rules - how many pages needed for
them?  And to make it much worse, how many pages of which orders? 
Bless its heart, slub will go and use a 1-order page for allocations
which should have been in 0-order pages.


^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-23  1:29                                                   ` Andrew Morton
@ 2015-02-23  7:32                                                     ` Dave Chinner
  -1 siblings, 0 replies; 276+ messages in thread
From: Dave Chinner @ 2015-02-23  7:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Tetsuo Handa, dchinner, oleg, xfs, mhocko,
	linux-mm, mgorman, rientjes, torvalds

On Sun, Feb 22, 2015 at 05:29:30PM -0800, Andrew Morton wrote:
> On Mon, 23 Feb 2015 11:45:21 +1100 Dave Chinner <david@fromorbit.com> wrote:
> 
> > > > I really don't care about the OOM Killer corner cases - it's
> > > > completely the wrong way line of development to be spending time on
> > > > and you aren't going to convince me otherwise. The OOM killer a
> > > > crutch used to justify having a memory allocation subsystem that
> > > > can't provide forward progress guarantee mechanisms to callers that
> > > > need it.
> > > 
> > > We can provide this.  Are all these callers able to preallocate?
> > 
> > Anything that allocates in transaction context (and therefore is
> > GFP_NOFS by definition) can preallocate at transaction reservation
> > time. However, preallocation is dumb, complex, CPU and memory
> > intensive and will have a *massive* impact on performance.
> > Allocating 10-100 pages to a reserve which we will almost *never
> > use* and then free them again *on every single transaction* is a lot
> > of unnecessary additional fast path overhead.  Hence a "preallocate
> > for every context" reserve pool is not a viable solution.
> 
> Yup.
> 
> > Reservations are simply an *accounting* of the maximum amount of a
> > reserve required by an operation to guarantee forwards progress. In
> > filesystems, we do this for log space (transactions) and some do it
> > for filesystem space (e.g. delayed allocation needs correct ENOSPC
> > detection so we don't overcommit disk space).  The VM already has
> > such concepts (e.g. watermarks and things like min_free_kbytes) that
> > it uses to ensure that there are sufficient reserves for certain
> > types of allocations to succeed.
> 
> Yes, as we do for __GFP_HIGH and PF_MEMALLOC etc.  Add a dynamic
> reserve.  So to reserve N pages we increase the page allocator dynamic
> reserve by N, do some reclaim if necessary then deposit N tokens into
> the caller's task_struct (it'll be a set of zone/nr-pages tuples I
> suppose).
> 
> When allocating pages the caller should drain its reserves in
> preference to dipping into the regular freelist.  This guy has already
> done his reclaim and shouldn't be penalised a second time.  I guess
> Johannes's preallocation code should switch to doing this for the same
> reason, plus the fact that snipping a page off
> task_struct.prealloc_pages is super-fast and needs to be done sometime
> anyway so why not do it by default.

That is at odds with the requirements of demand paging, which
allocates for objects that are reclaimable within the course of the
transaction. The reserve is there to ensure forward progress for
allocations for objects that aren't freed until after the
transaction completes, but if we drain it for reclaimable objects we
then have nothing left in the reserve pool when we actually need it.

We do not know ahead of time if the object we are allocating is
going to be modified and hence locked into the transaction. Hence we
can't say "use the reserve for this *specific* allocation", and so
the only guidance we can really give is "we will allocate and
*permanently consume* this much memory", and the reserve pool needs
to cover that consumption to guarantee forwards progress.

Forwards progress for all other allocations is guaranteed because
they are reclaimable objects - they are either freed directly back to
their source (slab, heap, page lists) or they are freed by shrinkers
once they have been released from the transaction.

Hence we need allocations to come from the free list and trigger
reclaim, regardless of the fact there is a reserve pool there. The
reserve pool needs to be a last resort once there are no other
avenues to allocate memory. i.e. it would be used to replace the OOM
killer for GFP_NOFAIL allocations.

> Both reservation and preallocation are vulnerable to deadlocks - 10,000
> tasks all trying to reserve/prealloc 100 pages, they all have 50 pages
> and we ran out of memory.  Whoops.

Yes, that's the big problem with preallocation, as well as your
proposed "deplete the reserved memory first" approach. They
*require* up front "preallocation" of free memory, either directly
by the application, or internally by the mm subsystem.

Hence my comments about appropriate classification of "reserved
memory". Reserved memory does not necessarily need to be on the free
list. It could be "immediately reclaimable" memory, so that
reserving memory doesn't need to immediately reclaim memory; instead
it can be pulled from the reclaimable memory reserves when
memory pressure occurs. If there is no memory pressure, we do
nothing because we have no need to do anything....

> We can undeadlock by returning ENOMEM but I suspect there will
> still be problematic situations where massive numbers of pages are
> temporarily AWOL.  Perhaps some form of queuing and throttling
> will be needed,

Yes, I think that is necessary, but I don't see it as necessary in the
MM subsystem. XFS already has a ticket-based queue mechanism for
throttling concurrent access to ensure we don't overcommit log space
and I'd want to tie the two together...

> to limit the peak number of reserved pages.  Per
> zone, I guess.

Internal implementation issue that I don't really care about.
When it comes to guaranteeing memory allocation, global context
is all I care about. Locality of allocation simply doesn't matter;
we want that page we reserved, no matter where it is located.

> And it'll be a huge pain handling order>0 pages.  I'd be inclined
> to make it order-0 only, and tell the lamer callers that
> vmap-is-thattaway.  Alas, one lame caller is slub.

Sure, but vmap requires GFP_KERNEL memory allocation and we're
talking about allocation in transactions, which are GFP_NOFS.

I've lost count of the number of times we've asked for that problem
to be fixed. Refusing to fix it has simply lead to the growing use
of ugly hacks around that problem (i.e. memalloc_noio_save() and
friends).

> But the biggest issue is how the heck does a caller work out how
> many pages to reserve/prealloc?  Even a single sb_bread() - it's
> sitting on loop on a sparse NTFS file on loop on a five-deep DM
> stack on a six-deep MD stack on loop on NFS on an eleventy-deep
> networking stack. 

Each subsystem needs to take care of itself first, then we can worry
about esoteric stacking requirements.

Besides, stacking requirements through the IO layer are still pretty
trivial - we only need to guarantee single IO progress from the
highest layer as it can be recycled again and again for every IO
that needs to be done.

And, because mempools already give that guarantee to most block
devices and drivers, we won't need to reserve memory for most block
devices to make forwards progress. It's only crazy "recurse through
filesystem" configurations where this will be an issue.
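
For reference, that mempool guarantee in miniature (struct my_request
and the pool size are hypothetical, just to illustrate the pattern):

	#include <linux/list.h>
	#include <linux/mempool.h>

	struct my_request {			/* hypothetical per-IO state */
		struct list_head list;
	};

	static mempool_t *req_pool;

	static int my_driver_init(void)
	{
		/* Reserve two requests so one IO can always make progress. */
		req_pool = mempool_create_kmalloc_pool(2,
						sizeof(struct my_request));
		return req_pool ? 0 : -ENOMEM;
	}

	/*
	 * mempool_alloc(req_pool, GFP_NOIO) waits for an element to come
	 * back via mempool_free() rather than failing, which is what lets
	 * a single reserved request be recycled for every IO.
	 */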

> And then there will be an unknown number of
> slab allocations of unknown size with unknown slabs-per-page rules
> - how many pages needed for them?

However many pages needed to allocate the number of objects we'll
consume from the slab.

> And to make it much worse, how
> many pages of which orders?  Bless its heart, slub will go and use
> a 1-order page for allocations which should have been in 0-order
> pages..

The majority of allocations will be order-0, though if we know that
there are going to be significant numbers of high-order allocations,
then it should be simple enough to tell the mm subsystem "need a
reserve of 32 order-0, 4 order-1 and 1 order-3 allocations" and have
memory compaction just do its stuff. But, IMO, we should cross that
bridge when somebody actually needs reservations to be that
specific....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: __GFP_NOFAIL and oom_killer_disabled?
  2015-02-22 14:48                                                     ` __GFP_NOFAIL and oom_killer_disabled? Tetsuo Handa
@ 2015-02-23 10:21                                                       ` Michal Hocko
  2015-02-23 13:03                                                         ` Tetsuo Handa
  0 siblings, 1 reply; 276+ messages in thread
From: Michal Hocko @ 2015-02-23 10:21 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, tytso, david, hannes, dchinner, linux-mm, rientjes, oleg,
	mgorman, torvalds

On Sun 22-02-15 23:48:01, Tetsuo Handa wrote:
> Andrew Morton wrote:
> > And yes, I agree that sites such as xfs's kmem_alloc() should be
> > passing __GFP_NOFAIL to tell the page allocator what's going on.  I
> > don't think it matters a lot whether kmem_alloc() retains its retry
> > loop.  If __GFP_NOFAIL is working correctly then it will never loop
> > anyway...
> 
> __GFP_NOFAIL fails to work correctly if oom_killer_disabled == true.
> I'm wondering how oom_killer_disable() interferes with __GFP_NOFAIL
> allocation. We had race check after setting oom_killer_disabled to true
> in 3.19.
[...]
> I worry that commit c32b3cbe0d067a9c "oom, PM: make OOM detection in
> the freezer path raceless" might have opened a race window for
> __alloc_pages_may_oom(__GFP_NOFAIL) allocation to fail when OOM killer
> is disabled.

This commit hasn't introduced any behavior change. __GFP_NOFAIL
allocations have failed when the OOM killer is disabled ever since
7f33d49a2ed5 (mm, PM/Freezer: Disable OOM killer when tasks are frozen).

> I think something like
> 
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -789,7 +789,7 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  	bool ret = false;
>  
>  	down_read(&oom_sem);
> -	if (!oom_killer_disabled) {
> +	if (!oom_killer_disabled || (gfp_mask & __GFP_NOFAIL)) {
>  		__out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill);
>  		ret = true;
>  	}
> 
> is needed.

> But such change can race with up_write() and wait_event() in
> oom_killer_disable(). 

Not only does it race with the above, it also breaks the core assumption
that no userspace task may interact with later stages of the suspend.

> While the comment of oom_killer_disable() says
> "The function cannot be called when there are runnable user tasks because
> the userspace would see unexpected allocation failures as a result.",
> aren't there still kernel threads which might do __GFP_NOFAIL allocations?

OK, this is a fair point. My assumption was that kernel threads rarely
do __GFP_NOFAIL allocations. It seems I was wrong here. This makes the
logic much trickier. I can see 3 possible ways to handle this:

1) move oom_killer_disable after kernel threads are frozen. This has a
   risk that the OOM victim wouldn't be able to finish because it would
   depend on an already frozen kernel thread. This would be really
   tricky to debug.
2) do not fail __GFP_NOFAIL allocations no matter what and risk a
   potential (and silent) endless loop during suspend. On the other
   hand, the chances that __GFP_NOFAIL comes from a freezable kernel
   thread rather than from deep in the pm suspend path are considerably
   higher. So now that I think about it, it indeed makes more sense to
   simply warn when OOM is disabled and retry the allocation (see the
   sketch after this list). Freezable kernel threads will loop and fail
   the suspend. Incidental allocations after kernel threads are frozen
   will at least dump a warning - if we are lucky and the serial
   console is still active of course...
3) do nothing ;)
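
A rough sketch of what 2) could look like in __alloc_pages_may_oom() -
illustrative only, not a tested patch:

	if (out_of_memory(zonelist, gfp_mask, order, nodemask, false)) {
		*did_some_progress = 1;
	} else if (gfp_mask & __GFP_NOFAIL) {
		/*
		 * The OOM killer is disabled but this caller cannot
		 * handle failure: warn and keep the slowpath retrying.
		 */
		WARN_ONCE(1, "__GFP_NOFAIL allocation with the OOM killer disabled\n");
		*did_some_progress = 1;
	}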

But whatever we do, there is simply no way to guarantee __GFP_NOFAIL
semantics after the OOM killer has been disabled. So we are choosing
between endless loops and possible crashes due to unexpected allocation
failures. Not a nice choice. We can only pick the less risky way, and it
sounds like 2) is that option. Considering that we haven't seen any
crashes with the current behavior, I would be tempted to simply declare
this a corner case which doesn't need any action, but well, I hate
debugging nasty issues so better be prepared...

What about something like the following?
---

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-21  3:20                                                   ` Theodore Ts'o
  (?)
@ 2015-02-23 10:26                                                     ` Michal Hocko
  -1 siblings, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2015-02-23 10:26 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Dave Chinner, Tetsuo Handa, hannes, dchinner, linux-mm, rientjes,
	oleg, akpm, mgorman, torvalds, xfs, linux-ext4

On Fri 20-02-15 22:20:00, Theodore Ts'o wrote:
[...]
> So based on akpm's sage advice and wisdom, I added back GFP_NOFAIL to
> ext4/jbd2.

I am currently going through open-coded GFP_NOFAIL allocations and have
this in my local branch. I assume you did the same, so I will drop mine
if you have already pushed yours.
---
>From dc49cef75dbd677d5542c9e5bd27bbfab9a7bc3a Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Fri, 20 Feb 2015 11:32:58 +0100
Subject: [PATCH] jbd2: revert must-not-fail allocation loops back to
 GFP_NOFAIL

This basically reverts 47def82672b3 (jbd2: Remove __GFP_NOFAIL from jbd2
layer). The deprecation of __GFP_NOFAIL was a bad choice because it led
to open coding the endless loop around the allocator rather than to
removing the dependency on the non-failing allocation. So the
deprecation was a clear failure, and reality tells us that __GFP_NOFAIL
is not even close to going away.

It is still true that __GFP_NOFAIL allocations are generally
discouraged and that new uses should be evaluated and an alternative
(pre-allocations or reservations) considered, but it doesn't make any
sense to lie to the allocator about the requirements. The allocator can
take steps to help make progress if it knows the requirements.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 fs/jbd2/journal.c     | 11 +----------
 fs/jbd2/transaction.c | 20 +++++++-------------
 2 files changed, 8 insertions(+), 23 deletions(-)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 1df94fabe4eb..878ed3e761f0 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -371,16 +371,7 @@ int jbd2_journal_write_metadata_buffer(transaction_t *transaction,
 	 */
 	J_ASSERT_BH(bh_in, buffer_jbddirty(bh_in));
 
-retry_alloc:
-	new_bh = alloc_buffer_head(GFP_NOFS);
-	if (!new_bh) {
-		/*
-		 * Failure is not an option, but __GFP_NOFAIL is going
-		 * away; so we retry ourselves here.
-		 */
-		congestion_wait(BLK_RW_ASYNC, HZ/50);
-		goto retry_alloc;
-	}
+	new_bh = alloc_buffer_head(GFP_NOFS|__GFP_NOFAIL);
 
 	/* keep subsequent assertions sane */
 	atomic_set(&new_bh->b_count, 1);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 5f09370c90a8..dac4523fa142 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -278,22 +278,16 @@ static int start_this_handle(journal_t *journal, handle_t *handle,
 
 alloc_transaction:
 	if (!journal->j_running_transaction) {
+		/*
+		 * If __GFP_FS is not present, then we may be being called from
+		 * inside the fs writeback layer, so we MUST NOT fail.
+		 */
+		if ((gfp_mask & __GFP_FS) == 0)
+			gfp_mask |= __GFP_NOFAIL;
 		new_transaction = kmem_cache_zalloc(transaction_cache,
 						    gfp_mask);
-		if (!new_transaction) {
-			/*
-			 * If __GFP_FS is not present, then we may be
-			 * being called from inside the fs writeback
-			 * layer, so we MUST NOT fail.  Since
-			 * __GFP_NOFAIL is going away, we will arrange
-			 * to retry the allocation ourselves.
-			 */
-			if ((gfp_mask & __GFP_FS) == 0) {
-				congestion_wait(BLK_RW_ASYNC, HZ/50);
-				goto alloc_transaction;
-			}
+		if (!new_transaction)
 			return -ENOMEM;
-		}
 	}
 
 	jbd_debug(3, "New handle %p going live.\n", handle);
-- 
2.1.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-22  0:20                                                       ` Johannes Weiner
  (?)
@ 2015-02-23 10:48                                                         ` Michal Hocko
  -1 siblings, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2015-02-23 10:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Theodore Ts'o, Dave Chinner, Tetsuo Handa,
	dchinner, linux-mm, rientjes, oleg, mgorman, torvalds, xfs,
	linux-ext4

On Sat 21-02-15 19:20:58, Johannes Weiner wrote:
> On Sat, Feb 21, 2015 at 01:19:07AM -0800, Andrew Morton wrote:
> > Short term, we need to fix 3.19.x and 3.20 and that appears to be by
> > applying Johannes's akpm-doesnt-know-why-it-works patch:
> > 
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> >  		if (high_zoneidx < ZONE_NORMAL)
> >  			goto out;
> >  		/* The OOM killer does not compensate for light reclaim */
> > -		if (!(gfp_mask & __GFP_FS))
> > +		if (!(gfp_mask & __GFP_FS)) {
> > +			/*
> > +			 * XXX: Page reclaim didn't yield anything,
> > +			 * and the OOM killer can't be invoked, but
> > +			 * keep looping as per should_alloc_retry().
> > +			 */
> > +			*did_some_progress = 1;
> >  			goto out;
> > +		}
> >  		/*
> >  		 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
> >  		 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
> > 
> > Have people adequately confirmed that this gets us out of trouble?
> 
> I'd be interested in this too.  Who is seeing these failures?
> 
> Andrew, can you please use the following changelog for this patch?
> 
> ---
> From: Johannes Weiner <hannes@cmpxchg.org>
> 
> mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change
> 
> Historically, !__GFP_FS allocations were not allowed to invoke the OOM
> killer once reclaim had failed, but nevertheless kept looping in the
> allocator.  9879de7373fc ("mm: page_alloc: embed OOM killing naturally
> into allocation slowpath"), which should have been a simple cleanup
> patch, accidentally changed the behavior to aborting the allocation at
> that point.  This creates problems with filesystem callers (?) that
> currently rely on the allocator waiting for other tasks to intervene.
> 
> Revert the behavior as it shouldn't have been changed as part of a
> cleanup patch.

OK, if this is a _short term_ change. I really think that all requests
except for __GFP_NOFAIL should be able to fail. I would argue that it is
the callers that should be fixed, but it is true that the patch was
introduced too late (rc7) and caught other subsystems unprepared, so
backporting to stable makes sense to me. But can we please move on and
stop pretending that allocations do not fail for the upcoming release?

> Fixes: 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into allocation slowpath")
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.cz>

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-23 10:48                                                         ` Michal Hocko
  (?)
@ 2015-02-23 11:23                                                           ` Tetsuo Handa
  -1 siblings, 0 replies; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-23 11:23 UTC (permalink / raw)
  To: mhocko, hannes
  Cc: akpm, tytso, david, dchinner, linux-mm, rientjes, oleg, mgorman,
	torvalds, xfs, linux-ext4

Michal Hocko wrote:
> On Sat 21-02-15 19:20:58, Johannes Weiner wrote:
> > On Sat, Feb 21, 2015 at 01:19:07AM -0800, Andrew Morton wrote:
> > > Short term, we need to fix 3.19.x and 3.20 and that appears to be by
> > > applying Johannes's akpm-doesnt-know-why-it-works patch:
> > > 
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> > >  		if (high_zoneidx < ZONE_NORMAL)
> > >  			goto out;
> > >  		/* The OOM killer does not compensate for light reclaim */
> > > -		if (!(gfp_mask & __GFP_FS))
> > > +		if (!(gfp_mask & __GFP_FS)) {
> > > +			/*
> > > +			 * XXX: Page reclaim didn't yield anything,
> > > +			 * and the OOM killer can't be invoked, but
> > > +			 * keep looping as per should_alloc_retry().
> > > +			 */
> > > +			*did_some_progress = 1;
> > >  			goto out;
> > > +		}
> > >  		/*
> > >  		 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
> > >  		 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
> > > 
> > > Have people adequately confirmed that this gets us out of trouble?
> > 
> > I'd be interested in this too.  Who is seeing these failures?

So far ext4 and xfs. I don't have an environment to test other filesystems.

> > 
> > Andrew, can you please use the following changelog for this patch?
> > 
> > ---
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > 
> > mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change
> > 
> > Historically, !__GFP_FS allocations were not allowed to invoke the OOM
> > killer once reclaim had failed, but nevertheless kept looping in the
> > allocator.  9879de7373fc ("mm: page_alloc: embed OOM killing naturally
> > into allocation slowpath"), which should have been a simple cleanup
> > patch, accidentally changed the behavior to aborting the allocation at
> > that point.  This creates problems with filesystem callers (?) that
> > currently rely on the allocator waiting for other tasks to intervene.
> > 
> > Revert the behavior as it shouldn't have been changed as part of a
> > cleanup patch.
> 
> OK, if this is a _short term_ change. I really think that all requests
> except for __GFP_NOFAIL should be able to fail. I would argue that it is
> the callers that should be fixed, but it is true that the patch was
> introduced too late (rc7) and caught other subsystems unprepared, so
> backporting to stable makes sense to me. But can we please move on and
> stop pretending that allocations do not fail for the upcoming release?
> 
> > Fixes: 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into allocation slowpath")
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Acked-by: Michal Hocko <mhocko@suse.cz>
> 

Without this patch, I think the system becomes unusable under OOM.
However, with this patch, I know the system may become unusable under
OOM. Please do write patches for handling the condition below.

  Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

Johannes's patch will get us out of filesystem error troubles, at
the cost of getting us back into stall troubles (as was the case until
3.19-rc6).

I retested http://marc.info/?l=linux-ext4&m=142443125221571&w=2
with debug printk patch shown below.

---------- debug printk patch ----------
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d503e9c..5144506 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -610,6 +610,8 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
 	spin_unlock(&zone_scan_lock);
 }
 
+atomic_t oom_killer_skipped_count = ATOMIC_INIT(0);
+
 /**
  * out_of_memory - kill the "best" process when we run out of memory
  * @zonelist: zonelist pointer
@@ -679,6 +681,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 				 nodemask, "Out of memory");
 		killed = 1;
 	}
+	else
+		atomic_inc(&oom_killer_skipped_count);
 out:
 	/*
 	 * Give the killed threads a good chance of exiting before trying to
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8e20f9c..eaea16b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		if (high_zoneidx < ZONE_NORMAL)
 			goto out;
 		/* The OOM killer does not compensate for light reclaim */
-		if (!(gfp_mask & __GFP_FS))
+		if (!(gfp_mask & __GFP_FS)) {
+			/*
+			 * XXX: Page reclaim didn't yield anything,
+			 * and the OOM killer can't be invoked, but
+			 * keep looping as per should_alloc_retry().
+			 */
+			*did_some_progress = 1;
 			goto out;
+		}
 		/*
 		 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
 		 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
@@ -2635,6 +2642,8 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
 }
 
+extern atomic_t oom_killer_skipped_count;
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -2649,6 +2658,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
 	bool deferred_compaction = false;
 	int contended_compaction = COMPACT_CONTENDED_NONE;
+	unsigned long first_retried_time = 0;
+	unsigned long next_warn_time = 0;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -2821,6 +2832,19 @@ retry:
 			if (!did_some_progress)
 				goto nopage;
 		}
+		if (!first_retried_time) {
+			first_retried_time = jiffies;
+			if (!first_retried_time)
+				first_retried_time = 1;
+			next_warn_time = first_retried_time + 5 * HZ;
+		} else if (time_after(jiffies, next_warn_time)) {
+			printk(KERN_INFO "%d (%s) : gfp 0x%X : %lu seconds : "
+			       "OOM-killer skipped %u\n", current->pid,
+			       current->comm, gfp_mask,
+			       (jiffies - first_retried_time) / HZ,
+			       atomic_read(&oom_killer_skipped_count));
+			next_warn_time = jiffies + 5 * HZ;
+		}
 		/* Wait for some write requests to complete then retry */
 		wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
 		goto retry;
---------- debug printk patch ----------

GFP_NOFS allocations stalled for 10 minutes waiting for somebody else
to volunteer memory. GFP_FS allocations stalled for 10 minutes waiting
for the OOM killer to kill somebody. The OOM killer stalled for 10
minutes waiting for GFP_NOFS allocations to complete.
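
(Decoding the gfp masks in the logs below against 3.19's
include/linux/gfp.h - my reading, so please double-check: 0x50 =
__GFP_WAIT|__GFP_IO = GFP_NOFS, and 0x201DA =
GFP_HIGHUSER_MOVABLE|__GFP_COLD, i.e. ordinary pagecache allocations.)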

I guess the system made forward progress because the number of remaining
a.out processes decreased over time.

(From http://I-love.SAKURA.ne.jp/tmp/serial-20150223-3.19-ext4-patched.txt.xz )
---------- ext4 / Linux 3.19 + patch ----------
[ 1335.187579] Out of memory: Kill process 14156 (a.out) score 760 or sacrifice child
[ 1335.189604] Killed process 14156 (a.out) total-vm:2167392kB, anon-rss:1360196kB, file-rss:0kB
[ 1335.191920] Kill process 14177 (a.out) sharing same memory
[ 1335.193465] Kill process 14178 (a.out) sharing same memory
[ 1335.195013] Kill process 14179 (a.out) sharing same memory
[ 1335.196580] Kill process 14180 (a.out) sharing same memory
[ 1335.198128] Kill process 14181 (a.out) sharing same memory
[ 1335.199674] Kill process 14182 (a.out) sharing same memory
[ 1335.201217] Kill process 14183 (a.out) sharing same memory
[ 1335.202768] Kill process 14184 (a.out) sharing same memory
[ 1335.204316] Kill process 14185 (a.out) sharing same memory
[ 1335.205871] Kill process 14186 (a.out) sharing same memory
[ 1335.207420] Kill process 14187 (a.out) sharing same memory
[ 1335.208974] Kill process 14188 (a.out) sharing same memory
[ 1335.210515] Kill process 14189 (a.out) sharing same memory
[ 1335.212063] Kill process 14190 (a.out) sharing same memory
[ 1335.213611] Kill process 14191 (a.out) sharing same memory
[ 1335.215165] Kill process 14192 (a.out) sharing same memory
[ 1335.216715] Kill process 14193 (a.out) sharing same memory
[ 1335.218286] Kill process 14194 (a.out) sharing same memory
[ 1335.219836] Kill process 14195 (a.out) sharing same memory
[ 1335.221378] Kill process 14196 (a.out) sharing same memory
[ 1335.222918] Kill process 14197 (a.out) sharing same memory
[ 1335.224461] Kill process 14198 (a.out) sharing same memory
[ 1335.225999] Kill process 14199 (a.out) sharing same memory
[ 1335.227545] Kill process 14200 (a.out) sharing same memory
[ 1335.229095] Kill process 14201 (a.out) sharing same memory
[ 1335.230643] Kill process 14202 (a.out) sharing same memory
[ 1335.232184] Kill process 14203 (a.out) sharing same memory
[ 1335.233738] Kill process 14204 (a.out) sharing same memory
[ 1335.235293] Kill process 14205 (a.out) sharing same memory
[ 1335.236834] Kill process 14206 (a.out) sharing same memory
[ 1335.238387] Kill process 14207 (a.out) sharing same memory
[ 1335.239930] Kill process 14208 (a.out) sharing same memory
[ 1335.241471] Kill process 14209 (a.out) sharing same memory
[ 1335.243011] Kill process 14210 (a.out) sharing same memory
[ 1335.244554] Kill process 14211 (a.out) sharing same memory
[ 1335.246101] Kill process 14212 (a.out) sharing same memory
[ 1335.247645] Kill process 14213 (a.out) sharing same memory
[ 1335.249182] Kill process 14214 (a.out) sharing same memory
[ 1335.250718] Kill process 14215 (a.out) sharing same memory
[ 1335.252305] Kill process 14216 (a.out) sharing same memory
[ 1335.253899] Kill process 14217 (a.out) sharing same memory
[ 1335.255443] Kill process 14218 (a.out) sharing same memory
[ 1335.256993] Kill process 14219 (a.out) sharing same memory
[ 1335.258531] Kill process 14220 (a.out) sharing same memory
[ 1335.260066] Kill process 14221 (a.out) sharing same memory
[ 1335.261616] Kill process 14222 (a.out) sharing same memory
[ 1335.263143] Kill process 14223 (a.out) sharing same memory
[ 1335.264647] Kill process 14224 (a.out) sharing same memory
[ 1335.266121] Kill process 14225 (a.out) sharing same memory
[ 1335.267598] Kill process 14226 (a.out) sharing same memory
[ 1335.269077] Kill process 14227 (a.out) sharing same memory
[ 1335.270560] Kill process 14228 (a.out) sharing same memory
[ 1335.272038] Kill process 14229 (a.out) sharing same memory
[ 1335.273508] Kill process 14230 (a.out) sharing same memory
[ 1335.274999] Kill process 14231 (a.out) sharing same memory
[ 1335.276469] Kill process 14232 (a.out) sharing same memory
[ 1335.277947] Kill process 14233 (a.out) sharing same memory
[ 1335.279428] Kill process 14234 (a.out) sharing same memory
[ 1335.280894] Kill process 14235 (a.out) sharing same memory
[ 1335.282361] Kill process 14236 (a.out) sharing same memory
[ 1335.283832] Kill process 14237 (a.out) sharing same memory
[ 1335.285304] Kill process 14238 (a.out) sharing same memory
[ 1335.286768] Kill process 14239 (a.out) sharing same memory
[ 1335.288242] Kill process 14240 (a.out) sharing same memory
[ 1335.289714] Kill process 14241 (a.out) sharing same memory
[ 1335.291196] Kill process 14242 (a.out) sharing same memory
[ 1335.292731] Kill process 14243 (a.out) sharing same memory
[ 1335.294258] Kill process 14244 (a.out) sharing same memory
[ 1335.295734] Kill process 14245 (a.out) sharing same memory
[ 1335.297215] Kill process 14246 (a.out) sharing same memory
[ 1335.298710] Kill process 14247 (a.out) sharing same memory
[ 1335.300188] Kill process 14248 (a.out) sharing same memory
[ 1335.301672] Kill process 14249 (a.out) sharing same memory
[ 1335.303157] Kill process 14250 (a.out) sharing same memory
[ 1335.304655] Kill process 14251 (a.out) sharing same memory
[ 1335.306141] Kill process 14252 (a.out) sharing same memory
[ 1335.307621] Kill process 14253 (a.out) sharing same memory
[ 1335.309107] Kill process 14254 (a.out) sharing same memory
[ 1335.310573] Kill process 14255 (a.out) sharing same memory
[ 1335.312052] Kill process 14256 (a.out) sharing same memory
[ 1335.313528] Kill process 14257 (a.out) sharing same memory
[ 1335.315039] Kill process 14258 (a.out) sharing same memory
[ 1335.316522] Kill process 14259 (a.out) sharing same memory
[ 1335.317992] Kill process 14260 (a.out) sharing same memory
[ 1335.319462] Kill process 14261 (a.out) sharing same memory
[ 1335.320965] Kill process 14262 (a.out) sharing same memory
[ 1335.322459] Kill process 14263 (a.out) sharing same memory
[ 1335.323958] Kill process 14264 (a.out) sharing same memory
[ 1335.325472] Kill process 14265 (a.out) sharing same memory
[ 1335.326966] Kill process 14266 (a.out) sharing same memory
[ 1335.328454] Kill process 14267 (a.out) sharing same memory
[ 1335.329945] Kill process 14268 (a.out) sharing same memory
[ 1335.331444] Kill process 14269 (a.out) sharing same memory
[ 1335.332944] Kill process 14270 (a.out) sharing same memory
[ 1335.334435] Kill process 14271 (a.out) sharing same memory
[ 1335.335930] Kill process 14272 (a.out) sharing same memory
[ 1335.337437] Kill process 14273 (a.out) sharing same memory
[ 1335.338927] Kill process 14274 (a.out) sharing same memory
[ 1335.340400] Kill process 14275 (a.out) sharing same memory
[ 1335.341890] Kill process 14276 (a.out) sharing same memory
[ 1339.640500] 464 (systemd-journal) : gfp 0x201DA : 5 seconds : OOM-killer skipped 22459181
[ 1339.649374] 615 (vmtoolsd) : gfp 0x201DA : 5 seconds : OOM-killer skipped 22459438
[ 1339.649611] 4079 (pool) : gfp 0x201DA : 5 seconds : OOM-killer skipped 22459447
[ 1340.343322] 14258 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478275
[ 1340.343331] 14194 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478275
[ 1340.343345] 14210 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478276
[ 1340.343360] 14179 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478277
[ 1340.345290] 14154 (su) : gfp 0x201DA : 5 seconds : OOM-killer skipped 22478339
[ 1340.345312] 14180 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478339
[ 1340.345319] 14260 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478339
[ 1340.345337] 14178 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478340
[ 1340.345345] 14245 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478340
[ 1340.345361] 14226 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478341
[ 1340.346119] 14256 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478368
[ 1340.346139] 14181 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478369
[ 1340.347082] 14274 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478402
[ 1340.347091] 14267 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478402
[ 1340.347095] 14189 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478402
[ 1340.347099] 14238 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478402
[ 1340.347107] 14276 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478403
[ 1340.347112] 14183 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478403
[ 1340.347397] 14254 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478413
[ 1340.347402] 14228 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478413
[ 1340.347414] 14185 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478414
[ 1340.347419] 14261 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478414
[ 1340.347423] 14217 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478414
[ 1340.347427] 14203 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478414
[ 1340.347439] 14234 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478415
[ 1340.347452] 14269 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478415
[ 1340.347461] 14255 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478416
[ 1340.347465] 14192 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478416
[ 1340.347473] 14259 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478416
[ 1340.347492] 14232 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478417
[ 1340.347497] 14223 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478417
[ 1340.347505] 14220 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478417
[ 1340.347523] 14252 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478418
[ 1340.347531] 14193 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478418
(...snipped...)
[ 1949.672951] 43 (kworker/1:1) : gfp 0x10 : 90 seconds : OOM-killer skipped 41315348
[ 1949.993045] 4079 (pool) : gfp 0x201DA : 615 seconds : OOM-killer skipped 41325108
[ 1950.694909] 14269 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41346727
[ 1950.703945] 14181 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41347003
[ 1950.742087] 14254 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41348208
[ 1950.744937] 14193 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41348299
[ 1950.748884] 2 (kthreadd) : gfp 0x2000D0 : 10 seconds : OOM-killer skipped 41348418
[ 1950.751565] 14203 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41348502
[ 1950.756955] 14232 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41348656
[ 1950.776918] 14185 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41349279
[ 1950.791214] 14217 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41349720
[ 1950.798961] 14179 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41349957
[ 1950.806551] 14255 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41350209
[ 1950.810860] 14234 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41350356
[ 1950.813821] 14258 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41350450
[ 1950.860422] 14261 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41351919
[ 1950.864015] 14210 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41352033
[ 1950.866636] 14226 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41352107
[ 1950.905003] 14238 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41353303
[ 1950.907813] 14180 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41353381
[ 1950.913963] 14276 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41353567
[ 1952.238344] 649 (chronyd) : gfp 0x201DA : 25 seconds : OOM-killer skipped 41393388
[ 1952.243228] 4030 (gnome-shell) : gfp 0x201DA : 25 seconds : OOM-killer skipped 41393566
[ 1952.247225] 592 (audispd) : gfp 0x201DA : 25 seconds : OOM-killer skipped 41393701
[ 1952.258265] 1 (systemd) : gfp 0x201DA : 35 seconds : OOM-killer skipped 41394041
[ 1952.269296] 1691 (rpcbind) : gfp 0x201DA : 35 seconds : OOM-killer skipped 41394365
[ 1952.299073] 702 (rtkit-daemon) : gfp 0x201DA : 95 seconds : OOM-killer skipped 41395288
[ 1952.301231] 627 (lsmd) : gfp 0x201DA : 105 seconds : OOM-killer skipped 41395385
[ 1952.350200] 464 (systemd-journal) : gfp 0x201DA : 165 seconds : OOM-killer skipped 41396935
[ 1952.472040] 543 (auditd) : gfp 0x201DA : 95 seconds : OOM-killer skipped 41400669
[ 1952.475211] 14154 (su) : gfp 0x201DA : 95 seconds : OOM-killer skipped 41400795
[ 1952.527084] 3514 (smbd) : gfp 0x201DA : 35 seconds : OOM-killer skipped 41402412
[ 1952.543205] 613 (irqbalance) : gfp 0x201DA : 35 seconds : OOM-killer skipped 41402892
[ 1952.568276] 12672 (pickup) : gfp 0x201DA : 35 seconds : OOM-killer skipped 41403656
[ 1952.572329] 770 (tuned) : gfp 0x201DA : 95 seconds : OOM-killer skipped 41403784
[ 1952.578076] 3392 (master) : gfp 0x201DA : 35 seconds : OOM-killer skipped 41403955
[ 1952.597273] 615 (vmtoolsd) : gfp 0x201DA : 105 seconds : OOM-killer skipped 41404520
[ 1952.619187] 14146 (sleep) : gfp 0x201DA : 105 seconds : OOM-killer skipped 41405206
[ 1952.621214] 811 (NetworkManager) : gfp 0x201DA : 105 seconds : OOM-killer skipped 41405265
[ 1952.765035] 3700 (gnome-settings-) : gfp 0x201DA : 315 seconds : OOM-killer skipped 41409551
[ 1952.776099] 603 (alsactl) : gfp 0x201DA : 315 seconds : OOM-killer skipped 41409856
[ 1952.823163] 661 (crond) : gfp 0x201DA : 325 seconds : OOM-killer skipped 41411303
[ 1953.201269] SysRq : Resetting
---------- ext4 / Linux 3.19 + patch ----------

I also tested on XFS, with two kernels: one plain Linux 3.19, the other
Linux 3.19 with the debug printk patch shown above. According to the
console logs, oom_kill_process() is trivially called via
pagefault_out_of_memory() on the former kernel. Due to !__GFP_FS
allocations giving up immediately?

(From http://I-love.SAKURA.ne.jp/tmp/serial-20150223-3.19-xfs-unpatched.txt.xz )
---------- xfs / Linux 3.19 ----------
[  793.283099] su invoked oom-killer: gfp_mask=0x0, order=0, oom_score_adj=0
[  793.283102] su cpuset=/ mems_allowed=0
[  793.283104] CPU: 3 PID: 9552 Comm: su Not tainted 3.19.0 #40
[  793.283159] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  793.283161]  0000000000000000 ffff88007ac03bf8 ffffffff816ae9d4 000000000000bebe
[  793.283162]  ffff880078b0d740 ffff88007ac03c98 ffffffff816ac7ac 0000000000000206
[  793.283163]  0000000481f30298 ffff880073e55850 ffff88007ac03c88 ffff88007a20bef8
[  793.283164] Call Trace:
[  793.283169]  [<ffffffff816ae9d4>] dump_stack+0x45/0x57
[  793.283171]  [<ffffffff816ac7ac>] dump_header+0x7f/0x1f1
[  793.283174]  [<ffffffff8114b36b>] oom_kill_process+0x22b/0x390
[  793.283177]  [<ffffffff810776d0>] ? has_capability_noaudit+0x20/0x30
[  793.283178]  [<ffffffff8114bb72>] out_of_memory+0x4b2/0x500
[  793.283179]  [<ffffffff8114bc37>] pagefault_out_of_memory+0x77/0x90
[  793.283180]  [<ffffffff816aab2c>] mm_fault_error+0x67/0x140
[  793.283182]  [<ffffffff8105a9f6>] __do_page_fault+0x3f6/0x580
[  793.283185]  [<ffffffff810aed1d>] ? remove_wait_queue+0x4d/0x60
[  793.283186]  [<ffffffff81070fcb>] ? do_wait+0x12b/0x240
[  793.283187]  [<ffffffff8105abb1>] do_page_fault+0x31/0x70
[  793.283189]  [<ffffffff816b83e8>] page_fault+0x28/0x30
---------- xfs / Linux 3.19 ----------

On the other hand, a stall is observed with the latter kernel.
I guess that this time the system failed to make forward progress, for
oom_killer_skipped_count is increasing over time while the number of
remaining a.out processes remains unchanged.

(From http://I-love.SAKURA.ne.jp/tmp/serial-20150223-3.19-xfs-patched.txt.xz )
---------- xfs / Linux 3.19 + patch ----------
[ 2062.847965] 505 (abrt-watch-log) : gfp 0x2015A : 682 seconds : OOM-killer skipped 22388568
[ 2062.850270] 515 (lsmd) : gfp 0x2015A : 674 seconds : OOM-killer skipped 22388662
[ 2062.850389] 491 (audispd) : gfp 0x2015A : 666 seconds : OOM-killer skipped 22388667
[ 2062.850400] 346 (systemd-journal) : gfp 0x2015A : 683 seconds : OOM-killer skipped 22388667
[ 2062.850402] 610 (rtkit-daemon) : gfp 0x2015A : 677 seconds : OOM-killer skipped 22388667
[ 2062.850424] 494 (alsactl) : gfp 0x2015A : 546 seconds : OOM-killer skipped 22388668
[ 2062.850446] 558 (crond) : gfp 0x2015A : 645 seconds : OOM-killer skipped 22388669
[ 2062.850451] 25532 (su) : gfp 0x2015A : 682 seconds : OOM-killer skipped 22388669
[ 2062.850456] 516 (vmtoolsd) : gfp 0x2015A : 683 seconds : OOM-killer skipped 22388669
[ 2062.850494] 741 (NetworkManager) : gfp 0x2015A : 530 seconds : OOM-killer skipped 22388670
[ 2062.850503] 3132 (master) : gfp 0x2015A : 644 seconds : OOM-killer skipped 22388671
[ 2062.850508] 3144 (pickup) : gfp 0x2015A : 604 seconds : OOM-killer skipped 22388671
[ 2062.850512] 3145 (qmgr) : gfp 0x2015A : 526 seconds : OOM-killer skipped 22388671
[ 2062.850540] 25653 (a.out) : gfp 0x102005A : 683 seconds : OOM-killer skipped 22388672
[ 2062.850561] 655 (tuned) : gfp 0x2015A : 682 seconds : OOM-killer skipped 22388673
[ 2062.852404] 10429 (kworker/0:14) : gfp 0x2040D0 : 683 seconds : OOM-killer skipped 22388748
[ 2062.852430] 543 (chronyd) : gfp 0x2015A : 293 seconds : OOM-killer skipped 22388749
[ 2062.852436] 13012 (goa-daemon) : gfp 0x2015A : 679 seconds : OOM-killer skipped 22388749
[ 2062.852449] 1454 (rpcbind) : gfp 0x2015A : 662 seconds : OOM-killer skipped 22388749
[ 2062.854288] 466 (auditd) : gfp 0x2015A : 626 seconds : OOM-killer skipped 22388751
[ 2062.854305] 25622 (a.out) : gfp 0x102005A : 683 seconds : OOM-killer skipped 22388751
[ 2062.854426] 1419 (dhclient) : gfp 0x2015A : 388 seconds : OOM-killer skipped 22388751
[ 2062.854443] 25638 (a.out) : gfp 0x204250 : 683 seconds : OOM-killer skipped 22388751
[ 2062.854450] 25582 (a.out) : gfp 0x102005A : 683 seconds : OOM-killer skipped 22388751
[ 2062.854462] 25400 (sleep) : gfp 0x2015A : 635 seconds : OOM-killer skipped 22388751
[ 2062.854469] 532 (smartd) : gfp 0x2015A : 246 seconds : OOM-killer skipped 22388751
[ 2062.854486] 2 (kthreadd) : gfp 0x2040D0 : 682 seconds : OOM-killer skipped 22388752
[ 2062.854497] 3867 (gnome-shell) : gfp 0x2015A : 683 seconds : OOM-killer skipped 22388752
[ 2062.854502] 3562 (gnome-settings-) : gfp 0x2015A : 676 seconds : OOM-killer skipped 22388752
[ 2062.854524] 25641 (a.out) : gfp 0x102005A : 683 seconds : OOM-killer skipped 22388753
[ 2062.854536] 25566 (a.out) : gfp 0x102005A : 683 seconds : OOM-killer skipped 22388753
[ 2062.908915] 61 (kworker/3:1) : gfp 0x2040D0 : 682 seconds : OOM-killer skipped 22390715
[ 2062.913407] 531 (irqbalance) : gfp 0x2015A : 679 seconds : OOM-killer skipped 22390894
[ 2064.988155] SysRq : Resetting
---------- xfs / Linux 3.19 + patch ----------

Oh, the current code gives too few hints for determining whether forward
progress is being made, for no kernel messages are printed when the OOM
victim fails to die immediately. I wish we had the debug printk patch
shown above and/or something
like http://marc.info/?l=linux-mm&m=141671829611143&w=2 .

^ permalink raw reply related	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
@ 2015-02-23 11:23                                                           ` Tetsuo Handa
  0 siblings, 0 replies; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-23 11:23 UTC (permalink / raw)
  To: mhocko, hannes
  Cc: tytso, dchinner, oleg, xfs, linux-mm, mgorman, rientjes, akpm,
	linux-ext4, torvalds

Michal Hocko wrote:
> On Sat 21-02-15 19:20:58, Johannes Weiner wrote:
> > On Sat, Feb 21, 2015 at 01:19:07AM -0800, Andrew Morton wrote:
> > > Short term, we need to fix 3.19.x and 3.20 and that appears to be by
> > > applying Johannes's akpm-doesnt-know-why-it-works patch:
> > > 
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> > >  		if (high_zoneidx < ZONE_NORMAL)
> > >  			goto out;
> > >  		/* The OOM killer does not compensate for light reclaim */
> > > -		if (!(gfp_mask & __GFP_FS))
> > > +		if (!(gfp_mask & __GFP_FS)) {
> > > +			/*
> > > +			 * XXX: Page reclaim didn't yield anything,
> > > +			 * and the OOM killer can't be invoked, but
> > > +			 * keep looping as per should_alloc_retry().
> > > +			 */
> > > +			*did_some_progress = 1;
> > >  			goto out;
> > > +		}
> > >  		/*
> > >  		 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
> > >  		 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
> > > 
> > > Have people adequately confirmed that this gets us out of trouble?
> > 
> > I'd be interested in this too.  Who is seeing these failures?

So far ext4 and xfs. I don't have an environment to test other filesystems.

> > 
> > Andrew, can you please use the following changelog for this patch?
> > 
> > ---
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > 
> > mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change
> > 
> > Historically, !__GFP_FS allocations were not allowed to invoke the OOM
> > killer once reclaim had failed, but nevertheless kept looping in the
> > allocator.  9879de7373fc ("mm: page_alloc: embed OOM killing naturally
> > into allocation slowpath"), which should have been a simple cleanup
> > patch, accidentally changed the behavior to aborting the allocation at
> > that point.  This creates problems with filesystem callers (?) that
> > currently rely on the allocator waiting for other tasks to intervene.
> > 
> > Revert the behavior as it shouldn't have been changed as part of a
> > cleanup patch.
> 
> OK, if this is a _short term_ change. I really think that all requests
> except for __GFP_NOFAIL should be able to fail. I would argue that it is
> the callers that should be fixed, but it is true that the patch was
> introduced too late (rc7) and caught other subsystems unprepared, so
> backporting to stable makes sense to me. But can we please move on and
> stop pretending that allocations do not fail for the upcoming release?
> 
> > Fixes: 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into allocation slowpath")
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Acked-by: Michal Hocko <mhocko@suse.cz>
> 

Without this patch, I think the system becomes unusable under OOM.
However, with this patch, I know the system may become unusable under
OOM. Please do write patches for handling the condition below.

  Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

Johannes's patch will get us out of filesystem error troubles, at
the cost of getting us back into stall troubles (as was the case until
3.19-rc6).

I retested http://marc.info/?l=linux-ext4&m=142443125221571&w=2
with debug printk patch shown below.

---------- debug printk patch ----------
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d503e9c..5144506 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -610,6 +610,8 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
 	spin_unlock(&zone_scan_lock);
 }
 
+atomic_t oom_killer_skipped_count = ATOMIC_INIT(0);
+
 /**
  * out_of_memory - kill the "best" process when we run out of memory
  * @zonelist: zonelist pointer
@@ -679,6 +681,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 				 nodemask, "Out of memory");
 		killed = 1;
 	}
+	else
+		atomic_inc(&oom_killer_skipped_count);
 out:
 	/*
 	 * Give the killed threads a good chance of exiting before trying to
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8e20f9c..eaea16b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		if (high_zoneidx < ZONE_NORMAL)
 			goto out;
 		/* The OOM killer does not compensate for light reclaim */
-		if (!(gfp_mask & __GFP_FS))
+		if (!(gfp_mask & __GFP_FS)) {
+			/*
+			 * XXX: Page reclaim didn't yield anything,
+			 * and the OOM killer can't be invoked, but
+			 * keep looping as per should_alloc_retry().
+			 */
+			*did_some_progress = 1;
 			goto out;
+		}
 		/*
 		 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
 		 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
@@ -2635,6 +2642,8 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
 }
 
+extern atomic_t oom_killer_skipped_count;
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -2649,6 +2658,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
 	bool deferred_compaction = false;
 	int contended_compaction = COMPACT_CONTENDED_NONE;
+	unsigned long first_retried_time = 0;
+	unsigned long next_warn_time = 0;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -2821,6 +2832,19 @@ retry:
 			if (!did_some_progress)
 				goto nopage;
 		}
+		if (!first_retried_time) {
+			first_retried_time = jiffies;
+			if (!first_retried_time)
+				first_retried_time = 1;
+			next_warn_time = first_retried_time + 5 * HZ;
+		} else if (time_after(jiffies, next_warn_time)) {
+			printk(KERN_INFO "%d (%s) : gfp 0x%X : %lu seconds : "
+			       "OOM-killer skipped %u\n", current->pid,
+			       current->comm, gfp_mask,
+			       (jiffies - first_retried_time) / HZ,
+			       atomic_read(&oom_killer_skipped_count));
+			next_warn_time = jiffies + 5 * HZ;
+		}
 		/* Wait for some write requests to complete then retry */
 		wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
 		goto retry;
---------- debug printk patch ----------

GFP_NOFS allocations stalled for 10 minutes waiting for somebody else
to volunteer memory. GFP_FS allocations stalled for 10 minutes waiting
for the OOM killer to kill somebody. The OOM killer stalled for 10
minutes waiting for GFP_NOFS allocations to complete.
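
(Decoding the gfp masks in the logs below against 3.19's
include/linux/gfp.h - my reading, so please double-check: 0x50 =
__GFP_WAIT|__GFP_IO = GFP_NOFS, and 0x201DA =
GFP_HIGHUSER_MOVABLE|__GFP_COLD, i.e. ordinary pagecache allocations.)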

* Re: How to handle TIF_MEMDIE stalls?
@ 2015-02-23 11:23                                                           ` Tetsuo Handa
  0 siblings, 0 replies; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-23 11:23 UTC (permalink / raw)
  To: mhocko, hannes
  Cc: akpm, tytso, david, dchinner, linux-mm, rientjes, oleg, mgorman,
	torvalds, xfs, linux-ext4

Michal Hocko wrote:
> On Sat 21-02-15 19:20:58, Johannes Weiner wrote:
> > On Sat, Feb 21, 2015 at 01:19:07AM -0800, Andrew Morton wrote:
> > > Short term, we need to fix 3.19.x and 3.20 and that appears to be by
> > > applying Johannes's akpm-doesnt-know-why-it-works patch:
> > > 
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> > >  		if (high_zoneidx < ZONE_NORMAL)
> > >  			goto out;
> > >  		/* The OOM killer does not compensate for light reclaim */
> > > -		if (!(gfp_mask & __GFP_FS))
> > > +		if (!(gfp_mask & __GFP_FS)) {
> > > +			/*
> > > +			 * XXX: Page reclaim didn't yield anything,
> > > +			 * and the OOM killer can't be invoked, but
> > > +			 * keep looping as per should_alloc_retry().
> > > +			 */
> > > +			*did_some_progress = 1;
> > >  			goto out;
> > > +		}
> > >  		/*
> > >  		 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
> > >  		 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
> > > 
> > > Have people adequately confirmed that this gets us out of trouble?
> > 
> > I'd be interested in this too.  Who is seeing these failures?

So far ext4 and xfs. I don't have an environment for testing other filesystems.

> > 
> > Andrew, can you please use the following changelog for this patch?
> > 
> > ---
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > 
> > mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change
> > 
> > Historically, !__GFP_FS allocations were not allowed to invoke the OOM
> > killer once reclaim had failed, but nevertheless kept looping in the
> > allocator.  9879de7373fc ("mm: page_alloc: embed OOM killing naturally
> > into allocation slowpath"), which should have been a simple cleanup
> > patch, accidentally changed the behavior to aborting the allocation at
> > that point.  This creates problems with filesystem callers (?) that
> > currently rely on the allocator waiting for other tasks to intervene.
> > 
> > Revert the behavior as it shouldn't have been changed as part of a
> > cleanup patch.
> 
> OK, if this is a _short term_ change. I really think that all the requests
> except for __GFP_NOFAIL should be able to fail. I would argue that it
> should be the caller who should be fixed, but it is true that the patch
> was introduced too late (rc7), so it caught other subsystems
> unprepared, and backporting to stable makes sense to me. But can we please
> move on and stop pretending that allocations do not fail for the
> upcoming release?
> 
> > Fixes: 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into allocation slowpath")
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Acked-by: Michal Hocko <mhocko@suse.cz>
> 

Without this patch, I think the system becomes unusable under OOM.
However, with this patch, I know the system may become unusable under
OOM. Please do write patches for handling the condition described below.

  Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

Johannes's patch will get us out of the filesystem error troubles, at
the cost of getting us back into the stall troubles (as before 3.19-rc6).

I retested http://marc.info/?l=linux-ext4&m=142443125221571&w=2
with the debug printk patch shown below.

---------- debug printk patch ----------
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d503e9c..5144506 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -610,6 +610,8 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
 	spin_unlock(&zone_scan_lock);
 }
 
+atomic_t oom_killer_skipped_count = ATOMIC_INIT(0);
+
 /**
  * out_of_memory - kill the "best" process when we run out of memory
  * @zonelist: zonelist pointer
@@ -679,6 +681,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 				 nodemask, "Out of memory");
 		killed = 1;
 	}
+	else
+		atomic_inc(&oom_killer_skipped_count);
 out:
 	/*
 	 * Give the killed threads a good chance of exiting before trying to
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8e20f9c..eaea16b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		if (high_zoneidx < ZONE_NORMAL)
 			goto out;
 		/* The OOM killer does not compensate for light reclaim */
-		if (!(gfp_mask & __GFP_FS))
+		if (!(gfp_mask & __GFP_FS)) {
+			/*
+			 * XXX: Page reclaim didn't yield anything,
+			 * and the OOM killer can't be invoked, but
+			 * keep looping as per should_alloc_retry().
+			 */
+			*did_some_progress = 1;
 			goto out;
+		}
 		/*
 		 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
 		 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
@@ -2635,6 +2642,8 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
 }
 
+extern atomic_t oom_killer_skipped_count;
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -2649,6 +2658,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
 	bool deferred_compaction = false;
 	int contended_compaction = COMPACT_CONTENDED_NONE;
+	unsigned long first_retried_time = 0;
+	unsigned long next_warn_time = 0;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -2821,6 +2832,19 @@ retry:
 			if (!did_some_progress)
 				goto nopage;
 		}
+		if (!first_retried_time) {
+			first_retried_time = jiffies;
+			if (!first_retried_time)
+				first_retried_time = 1;
+			next_warn_time = first_retried_time + 5 * HZ;
+		} else if (time_after(jiffies, next_warn_time)) {
+			printk(KERN_INFO "%d (%s) : gfp 0x%X : %lu seconds : "
+			       "OOM-killer skipped %u\n", current->pid,
+			       current->comm, gfp_mask,
+			       (jiffies - first_retried_time) / HZ,
+			       atomic_read(&oom_killer_skipped_count));
+			next_warn_time = jiffies + 5 * HZ;
+		}
 		/* Wait for some write requests to complete then retry */
 		wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
 		goto retry;
---------- debug printk patch ----------

GFP_NOFS allocations stalled for 10 minutes waiting for somebody else
to volunteer memory. GFP_FS allocations stalled for 10 minutes waiting
for the OOM killer to kill somebody. The OOM killer stalled for 10
minutes waiting for GFP_NOFS allocations to complete.

I guess the system made forward progress because the number of remaining
a.out processes decreased over time.

(From http://I-love.SAKURA.ne.jp/tmp/serial-20150223-3.19-ext4-patched.txt.xz )
---------- ext4 / Linux 3.19 + patch ----------
[ 1335.187579] Out of memory: Kill process 14156 (a.out) score 760 or sacrifice child
[ 1335.189604] Killed process 14156 (a.out) total-vm:2167392kB, anon-rss:1360196kB, file-rss:0kB
[ 1335.191920] Kill process 14177 (a.out) sharing same memory
[ 1335.193465] Kill process 14178 (a.out) sharing same memory
[ 1335.195013] Kill process 14179 (a.out) sharing same memory
[ 1335.196580] Kill process 14180 (a.out) sharing same memory
[ 1335.198128] Kill process 14181 (a.out) sharing same memory
[ 1335.199674] Kill process 14182 (a.out) sharing same memory
[ 1335.201217] Kill process 14183 (a.out) sharing same memory
[ 1335.202768] Kill process 14184 (a.out) sharing same memory
[ 1335.204316] Kill process 14185 (a.out) sharing same memory
[ 1335.205871] Kill process 14186 (a.out) sharing same memory
[ 1335.207420] Kill process 14187 (a.out) sharing same memory
[ 1335.208974] Kill process 14188 (a.out) sharing same memory
[ 1335.210515] Kill process 14189 (a.out) sharing same memory
[ 1335.212063] Kill process 14190 (a.out) sharing same memory
[ 1335.213611] Kill process 14191 (a.out) sharing same memory
[ 1335.215165] Kill process 14192 (a.out) sharing same memory
[ 1335.216715] Kill process 14193 (a.out) sharing same memory
[ 1335.218286] Kill process 14194 (a.out) sharing same memory
[ 1335.219836] Kill process 14195 (a.out) sharing same memory
[ 1335.221378] Kill process 14196 (a.out) sharing same memory
[ 1335.222918] Kill process 14197 (a.out) sharing same memory
[ 1335.224461] Kill process 14198 (a.out) sharing same memory
[ 1335.225999] Kill process 14199 (a.out) sharing same memory
[ 1335.227545] Kill process 14200 (a.out) sharing same memory
[ 1335.229095] Kill process 14201 (a.out) sharing same memory
[ 1335.230643] Kill process 14202 (a.out) sharing same memory
[ 1335.232184] Kill process 14203 (a.out) sharing same memory
[ 1335.233738] Kill process 14204 (a.out) sharing same memory
[ 1335.235293] Kill process 14205 (a.out) sharing same memory
[ 1335.236834] Kill process 14206 (a.out) sharing same memory
[ 1335.238387] Kill process 14207 (a.out) sharing same memory
[ 1335.239930] Kill process 14208 (a.out) sharing same memory
[ 1335.241471] Kill process 14209 (a.out) sharing same memory
[ 1335.243011] Kill process 14210 (a.out) sharing same memory
[ 1335.244554] Kill process 14211 (a.out) sharing same memory
[ 1335.246101] Kill process 14212 (a.out) sharing same memory
[ 1335.247645] Kill process 14213 (a.out) sharing same memory
[ 1335.249182] Kill process 14214 (a.out) sharing same memory
[ 1335.250718] Kill process 14215 (a.out) sharing same memory
[ 1335.252305] Kill process 14216 (a.out) sharing same memory
[ 1335.253899] Kill process 14217 (a.out) sharing same memory
[ 1335.255443] Kill process 14218 (a.out) sharing same memory
[ 1335.256993] Kill process 14219 (a.out) sharing same memory
[ 1335.258531] Kill process 14220 (a.out) sharing same memory
[ 1335.260066] Kill process 14221 (a.out) sharing same memory
[ 1335.261616] Kill process 14222 (a.out) sharing same memory
[ 1335.263143] Kill process 14223 (a.out) sharing same memory
[ 1335.264647] Kill process 14224 (a.out) sharing same memory
[ 1335.266121] Kill process 14225 (a.out) sharing same memory
[ 1335.267598] Kill process 14226 (a.out) sharing same memory
[ 1335.269077] Kill process 14227 (a.out) sharing same memory
[ 1335.270560] Kill process 14228 (a.out) sharing same memory
[ 1335.272038] Kill process 14229 (a.out) sharing same memory
[ 1335.273508] Kill process 14230 (a.out) sharing same memory
[ 1335.274999] Kill process 14231 (a.out) sharing same memory
[ 1335.276469] Kill process 14232 (a.out) sharing same memory
[ 1335.277947] Kill process 14233 (a.out) sharing same memory
[ 1335.279428] Kill process 14234 (a.out) sharing same memory
[ 1335.280894] Kill process 14235 (a.out) sharing same memory
[ 1335.282361] Kill process 14236 (a.out) sharing same memory
[ 1335.283832] Kill process 14237 (a.out) sharing same memory
[ 1335.285304] Kill process 14238 (a.out) sharing same memory
[ 1335.286768] Kill process 14239 (a.out) sharing same memory
[ 1335.288242] Kill process 14240 (a.out) sharing same memory
[ 1335.289714] Kill process 14241 (a.out) sharing same memory
[ 1335.291196] Kill process 14242 (a.out) sharing same memory
[ 1335.292731] Kill process 14243 (a.out) sharing same memory
[ 1335.294258] Kill process 14244 (a.out) sharing same memory
[ 1335.295734] Kill process 14245 (a.out) sharing same memory
[ 1335.297215] Kill process 14246 (a.out) sharing same memory
[ 1335.298710] Kill process 14247 (a.out) sharing same memory
[ 1335.300188] Kill process 14248 (a.out) sharing same memory
[ 1335.301672] Kill process 14249 (a.out) sharing same memory
[ 1335.303157] Kill process 14250 (a.out) sharing same memory
[ 1335.304655] Kill process 14251 (a.out) sharing same memory
[ 1335.306141] Kill process 14252 (a.out) sharing same memory
[ 1335.307621] Kill process 14253 (a.out) sharing same memory
[ 1335.309107] Kill process 14254 (a.out) sharing same memory
[ 1335.310573] Kill process 14255 (a.out) sharing same memory
[ 1335.312052] Kill process 14256 (a.out) sharing same memory
[ 1335.313528] Kill process 14257 (a.out) sharing same memory
[ 1335.315039] Kill process 14258 (a.out) sharing same memory
[ 1335.316522] Kill process 14259 (a.out) sharing same memory
[ 1335.317992] Kill process 14260 (a.out) sharing same memory
[ 1335.319462] Kill process 14261 (a.out) sharing same memory
[ 1335.320965] Kill process 14262 (a.out) sharing same memory
[ 1335.322459] Kill process 14263 (a.out) sharing same memory
[ 1335.323958] Kill process 14264 (a.out) sharing same memory
[ 1335.325472] Kill process 14265 (a.out) sharing same memory
[ 1335.326966] Kill process 14266 (a.out) sharing same memory
[ 1335.328454] Kill process 14267 (a.out) sharing same memory
[ 1335.329945] Kill process 14268 (a.out) sharing same memory
[ 1335.331444] Kill process 14269 (a.out) sharing same memory
[ 1335.332944] Kill process 14270 (a.out) sharing same memory
[ 1335.334435] Kill process 14271 (a.out) sharing same memory
[ 1335.335930] Kill process 14272 (a.out) sharing same memory
[ 1335.337437] Kill process 14273 (a.out) sharing same memory
[ 1335.338927] Kill process 14274 (a.out) sharing same memory
[ 1335.340400] Kill process 14275 (a.out) sharing same memory
[ 1335.341890] Kill process 14276 (a.out) sharing same memory
[ 1339.640500] 464 (systemd-journal) : gfp 0x201DA : 5 seconds : OOM-killer skipped 22459181
[ 1339.649374] 615 (vmtoolsd) : gfp 0x201DA : 5 seconds : OOM-killer skipped 22459438
[ 1339.649611] 4079 (pool) : gfp 0x201DA : 5 seconds : OOM-killer skipped 22459447
[ 1340.343322] 14258 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478275
[ 1340.343331] 14194 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478275
[ 1340.343345] 14210 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478276
[ 1340.343360] 14179 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478277
[ 1340.345290] 14154 (su) : gfp 0x201DA : 5 seconds : OOM-killer skipped 22478339
[ 1340.345312] 14180 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478339
[ 1340.345319] 14260 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478339
[ 1340.345337] 14178 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478340
[ 1340.345345] 14245 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478340
[ 1340.345361] 14226 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478341
[ 1340.346119] 14256 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478368
[ 1340.346139] 14181 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478369
[ 1340.347082] 14274 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478402
[ 1340.347091] 14267 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478402
[ 1340.347095] 14189 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478402
[ 1340.347099] 14238 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478402
[ 1340.347107] 14276 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478403
[ 1340.347112] 14183 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478403
[ 1340.347397] 14254 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478413
[ 1340.347402] 14228 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478413
[ 1340.347414] 14185 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478414
[ 1340.347419] 14261 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478414
[ 1340.347423] 14217 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478414
[ 1340.347427] 14203 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478414
[ 1340.347439] 14234 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478415
[ 1340.347452] 14269 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478415
[ 1340.347461] 14255 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478416
[ 1340.347465] 14192 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478416
[ 1340.347473] 14259 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478416
[ 1340.347492] 14232 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478417
[ 1340.347497] 14223 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478417
[ 1340.347505] 14220 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478417
[ 1340.347523] 14252 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478418
[ 1340.347531] 14193 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478418
(...snipped...)
[ 1949.672951] 43 (kworker/1:1) : gfp 0x10 : 90 seconds : OOM-killer skipped 41315348
[ 1949.993045] 4079 (pool) : gfp 0x201DA : 615 seconds : OOM-killer skipped 41325108
[ 1950.694909] 14269 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41346727
[ 1950.703945] 14181 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41347003
[ 1950.742087] 14254 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41348208
[ 1950.744937] 14193 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41348299
[ 1950.748884] 2 (kthreadd) : gfp 0x2000D0 : 10 seconds : OOM-killer skipped 41348418
[ 1950.751565] 14203 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41348502
[ 1950.756955] 14232 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41348656
[ 1950.776918] 14185 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41349279
[ 1950.791214] 14217 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41349720
[ 1950.798961] 14179 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41349957
[ 1950.806551] 14255 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41350209
[ 1950.810860] 14234 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41350356
[ 1950.813821] 14258 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41350450
[ 1950.860422] 14261 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41351919
[ 1950.864015] 14210 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41352033
[ 1950.866636] 14226 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41352107
[ 1950.905003] 14238 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41353303
[ 1950.907813] 14180 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41353381
[ 1950.913963] 14276 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41353567
[ 1952.238344] 649 (chronyd) : gfp 0x201DA : 25 seconds : OOM-killer skipped 41393388
[ 1952.243228] 4030 (gnome-shell) : gfp 0x201DA : 25 seconds : OOM-killer skipped 41393566
[ 1952.247225] 592 (audispd) : gfp 0x201DA : 25 seconds : OOM-killer skipped 41393701
[ 1952.258265] 1 (systemd) : gfp 0x201DA : 35 seconds : OOM-killer skipped 41394041
[ 1952.269296] 1691 (rpcbind) : gfp 0x201DA : 35 seconds : OOM-killer skipped 41394365
[ 1952.299073] 702 (rtkit-daemon) : gfp 0x201DA : 95 seconds : OOM-killer skipped 41395288
[ 1952.301231] 627 (lsmd) : gfp 0x201DA : 105 seconds : OOM-killer skipped 41395385
[ 1952.350200] 464 (systemd-journal) : gfp 0x201DA : 165 seconds : OOM-killer skipped 41396935
[ 1952.472040] 543 (auditd) : gfp 0x201DA : 95 seconds : OOM-killer skipped 41400669
[ 1952.475211] 14154 (su) : gfp 0x201DA : 95 seconds : OOM-killer skipped 41400795
[ 1952.527084] 3514 (smbd) : gfp 0x201DA : 35 seconds : OOM-killer skipped 41402412
[ 1952.543205] 613 (irqbalance) : gfp 0x201DA : 35 seconds : OOM-killer skipped 41402892
[ 1952.568276] 12672 (pickup) : gfp 0x201DA : 35 seconds : OOM-killer skipped 41403656
[ 1952.572329] 770 (tuned) : gfp 0x201DA : 95 seconds : OOM-killer skipped 41403784
[ 1952.578076] 3392 (master) : gfp 0x201DA : 35 seconds : OOM-killer skipped 41403955
[ 1952.597273] 615 (vmtoolsd) : gfp 0x201DA : 105 seconds : OOM-killer skipped 41404520
[ 1952.619187] 14146 (sleep) : gfp 0x201DA : 105 seconds : OOM-killer skipped 41405206
[ 1952.621214] 811 (NetworkManager) : gfp 0x201DA : 105 seconds : OOM-killer skipped 41405265
[ 1952.765035] 3700 (gnome-settings-) : gfp 0x201DA : 315 seconds : OOM-killer skipped 41409551
[ 1952.776099] 603 (alsactl) : gfp 0x201DA : 315 seconds : OOM-killer skipped 41409856
[ 1952.823163] 661 (crond) : gfp 0x201DA : 325 seconds : OOM-killer skipped 41411303
[ 1953.201269] SysRq : Resetting
---------- ext4 / Linux 3.19 + patch ----------

I also tested on XFS, comparing two kernels: plain Linux 3.19 and Linux
3.19 with the debug printk patch shown above. According to the console
logs, oom_kill_process() is trivially called via pagefault_out_of_memory()
on the former kernel. Is that because !__GFP_FS allocations give up
immediately?

(From http://I-love.SAKURA.ne.jp/tmp/serial-20150223-3.19-xfs-unpatched.txt.xz )
---------- xfs / Linux 3.19 ----------
[  793.283099] su invoked oom-killer: gfp_mask=0x0, order=0, oom_score_adj=0
[  793.283102] su cpuset=/ mems_allowed=0
[  793.283104] CPU: 3 PID: 9552 Comm: su Not tainted 3.19.0 #40
[  793.283159] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  793.283161]  0000000000000000 ffff88007ac03bf8 ffffffff816ae9d4 000000000000bebe
[  793.283162]  ffff880078b0d740 ffff88007ac03c98 ffffffff816ac7ac 0000000000000206
[  793.283163]  0000000481f30298 ffff880073e55850 ffff88007ac03c88 ffff88007a20bef8
[  793.283164] Call Trace:
[  793.283169]  [<ffffffff816ae9d4>] dump_stack+0x45/0x57
[  793.283171]  [<ffffffff816ac7ac>] dump_header+0x7f/0x1f1
[  793.283174]  [<ffffffff8114b36b>] oom_kill_process+0x22b/0x390
[  793.283177]  [<ffffffff810776d0>] ? has_capability_noaudit+0x20/0x30
[  793.283178]  [<ffffffff8114bb72>] out_of_memory+0x4b2/0x500
[  793.283179]  [<ffffffff8114bc37>] pagefault_out_of_memory+0x77/0x90
[  793.283180]  [<ffffffff816aab2c>] mm_fault_error+0x67/0x140
[  793.283182]  [<ffffffff8105a9f6>] __do_page_fault+0x3f6/0x580
[  793.283185]  [<ffffffff810aed1d>] ? remove_wait_queue+0x4d/0x60
[  793.283186]  [<ffffffff81070fcb>] ? do_wait+0x12b/0x240
[  793.283187]  [<ffffffff8105abb1>] do_page_fault+0x31/0x70
[  793.283189]  [<ffffffff816b83e8>] page_fault+0x28/0x30
---------- xfs / Linux 3.19 ----------

On the other hand, a stall is observed on the latter kernel.
I guess that this time the system failed to make forward progress, for
oom_killer_skipped_count kept increasing over time while the number of
remaining a.out processes remained unchanged.

(From http://I-love.SAKURA.ne.jp/tmp/serial-20150223-3.19-xfs-patched.txt.xz )
---------- xfs / Linux 3.19 + patch ----------
[ 2062.847965] 505 (abrt-watch-log) : gfp 0x2015A : 682 seconds : OOM-killer skipped 22388568
[ 2062.850270] 515 (lsmd) : gfp 0x2015A : 674 seconds : OOM-killer skipped 22388662
[ 2062.850389] 491 (audispd) : gfp 0x2015A : 666 seconds : OOM-killer skipped 22388667
[ 2062.850400] 346 (systemd-journal) : gfp 0x2015A : 683 seconds : OOM-killer skipped 22388667
[ 2062.850402] 610 (rtkit-daemon) : gfp 0x2015A : 677 seconds : OOM-killer skipped 22388667
[ 2062.850424] 494 (alsactl) : gfp 0x2015A : 546 seconds : OOM-killer skipped 22388668
[ 2062.850446] 558 (crond) : gfp 0x2015A : 645 seconds : OOM-killer skipped 22388669
[ 2062.850451] 25532 (su) : gfp 0x2015A : 682 seconds : OOM-killer skipped 22388669
[ 2062.850456] 516 (vmtoolsd) : gfp 0x2015A : 683 seconds : OOM-killer skipped 22388669
[ 2062.850494] 741 (NetworkManager) : gfp 0x2015A : 530 seconds : OOM-killer skipped 22388670
[ 2062.850503] 3132 (master) : gfp 0x2015A : 644 seconds : OOM-killer skipped 22388671
[ 2062.850508] 3144 (pickup) : gfp 0x2015A : 604 seconds : OOM-killer skipped 22388671
[ 2062.850512] 3145 (qmgr) : gfp 0x2015A : 526 seconds : OOM-killer skipped 22388671
[ 2062.850540] 25653 (a.out) : gfp 0x102005A : 683 seconds : OOM-killer skipped 22388672
[ 2062.850561] 655 (tuned) : gfp 0x2015A : 682 seconds : OOM-killer skipped 22388673
[ 2062.852404] 10429 (kworker/0:14) : gfp 0x2040D0 : 683 seconds : OOM-killer skipped 22388748
[ 2062.852430] 543 (chronyd) : gfp 0x2015A : 293 seconds : OOM-killer skipped 22388749
[ 2062.852436] 13012 (goa-daemon) : gfp 0x2015A : 679 seconds : OOM-killer skipped 22388749
[ 2062.852449] 1454 (rpcbind) : gfp 0x2015A : 662 seconds : OOM-killer skipped 22388749
[ 2062.854288] 466 (auditd) : gfp 0x2015A : 626 seconds : OOM-killer skipped 22388751
[ 2062.854305] 25622 (a.out) : gfp 0x102005A : 683 seconds : OOM-killer skipped 22388751
[ 2062.854426] 1419 (dhclient) : gfp 0x2015A : 388 seconds : OOM-killer skipped 22388751
[ 2062.854443] 25638 (a.out) : gfp 0x204250 : 683 seconds : OOM-killer skipped 22388751
[ 2062.854450] 25582 (a.out) : gfp 0x102005A : 683 seconds : OOM-killer skipped 22388751
[ 2062.854462] 25400 (sleep) : gfp 0x2015A : 635 seconds : OOM-killer skipped 22388751
[ 2062.854469] 532 (smartd) : gfp 0x2015A : 246 seconds : OOM-killer skipped 22388751
[ 2062.854486] 2 (kthreadd) : gfp 0x2040D0 : 682 seconds : OOM-killer skipped 22388752
[ 2062.854497] 3867 (gnome-shell) : gfp 0x2015A : 683 seconds : OOM-killer skipped 22388752
[ 2062.854502] 3562 (gnome-settings-) : gfp 0x2015A : 676 seconds : OOM-killer skipped 22388752
[ 2062.854524] 25641 (a.out) : gfp 0x102005A : 683 seconds : OOM-killer skipped 22388753
[ 2062.854536] 25566 (a.out) : gfp 0x102005A : 683 seconds : OOM-killer skipped 22388753
[ 2062.908915] 61 (kworker/3:1) : gfp 0x2040D0 : 682 seconds : OOM-killer skipped 22390715
[ 2062.913407] 531 (irqbalance) : gfp 0x2015A : 679 seconds : OOM-killer skipped 22390894
[ 2064.988155] SysRq : Resetting
---------- xfs / Linux 3.19 + patch ----------

The current code gives too few hints for determining whether forward
progress is being made, for no kernel messages are printed when an OOM
victim fails to die immediately. I wish we had the debug printk patch
shown above and/or something like
http://marc.info/?l=linux-mm&m=141671829611143&w=2 .


^ permalink raw reply related	[flat|nested] 276+ messages in thread

* Re: __GFP_NOFAIL and oom_killer_disabled?
  2015-02-23 10:21                                                       ` Michal Hocko
@ 2015-02-23 13:03                                                         ` Tetsuo Handa
  2015-02-24 18:14                                                           ` Michal Hocko
  0 siblings, 1 reply; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-23 13:03 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, tytso, david, hannes, dchinner, linux-mm, rientjes, oleg,
	mgorman, torvalds

Michal Hocko wrote:
> What about something like the following?

I'm fine with whatever approach as long as a retry is guaranteed.

But maybe we can use memory reserves as shown below? I think there will be
little risk because userspace processes are already frozen...

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a47f0b2..cea0a1b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2760,8 +2760,17 @@ retry:
 							&did_some_progress);
 			if (page)
 				goto got_pg;
-			if (!did_some_progress)
+			if (!did_some_progress && !(gfp_mask & __GFP_NOFAIL))
 				goto nopage;
+			/*
+			 * What!? __GFP_NOFAIL allocation failed to invoke
+			 * the OOM killer due to oom_killer_disabled == true?
+			 * Then, pretend ALLOC_NO_WATERMARKS request and let
+			 * __alloc_pages_high_priority() retry forever...
+			 */
+			WARN(1, "Retrying GFP_NOFAIL allocation...\n");
+			gfp_mask &= ~__GFP_NOMEMALLOC;
+			gfp_mask |= __GFP_MEMALLOC;
 		}
 		/* Wait for some write requests to complete then retry */
 		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);


^ permalink raw reply related	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-22  0:20                                                       ` Johannes Weiner
  (?)
@ 2015-02-23 21:33                                                         ` David Rientjes
  -1 siblings, 0 replies; 276+ messages in thread
From: David Rientjes @ 2015-02-23 21:33 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Theodore Ts'o, Dave Chinner, Tetsuo Handa,
	mhocko, dchinner, linux-mm, oleg, mgorman, torvalds, xfs,
	linux-ext4

On Sat, 21 Feb 2015, Johannes Weiner wrote:

> From: Johannes Weiner <hannes@cmpxchg.org>
> 
> mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change
> 
> Historically, !__GFP_FS allocations were not allowed to invoke the OOM
> killer once reclaim had failed, but nevertheless kept looping in the
> allocator.  9879de7373fc ("mm: page_alloc: embed OOM killing naturally
> into allocation slowpath"), which should have been a simple cleanup
> patch, accidentally changed the behavior to aborting the allocation at
> that point.  This creates problems with filesystem callers (?) that
> currently rely on the allocator waiting for other tasks to intervene.
> 
> Revert the behavior as it shouldn't have been changed as part of a
> cleanup patch.
> 
> Fixes: 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into allocation slowpath")
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Cc: stable@vger.kernel.org [3.19]
Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-17 11:57                               ` Tetsuo Handa
  2015-02-17 13:16                                 ` Johannes Weiner
@ 2015-02-23 22:08                                 ` David Rientjes
  2015-02-24 11:20                                   ` Tetsuo Handa
  1 sibling, 1 reply; 276+ messages in thread
From: David Rientjes @ 2015-02-23 22:08 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: hannes, mhocko, david, dchinner, linux-mm, oleg, akpm, mgorman, torvalds

On Tue, 17 Feb 2015, Tetsuo Handa wrote:

> Yes, basic idea would be same with
> http://marc.info/?l=linux-mm&m=142002495532320&w=2 .
> 
> But Michal and David do not like the timeout approach.
> http://marc.info/?l=linux-mm&m=141684783713564&w=2
> http://marc.info/?l=linux-mm&m=141686814824684&w=2
> 
> Unless they change their opinion in response to the discovery explained at
> http://lwn.net/Articles/627419/ , timeout patches will not be accepted.
> 

Unfortunately, timeout-based solutions aren't guaranteed to provide 
anything more helpful.  The problem you're referring to is when the oom 
kill victim is waiting on a mutex and cannot make forward progress even 
though it has access to memory reserves.  Threads that hold the mutex 
and allocate in a blockable context will cause the oom killer to 
defer forever because it sees the presence of a victim waiting to exit.

	TaskA			TaskB
	=====			=====
	mutex_lock(i_mutex)
	allocate memory
	oom kill TaskB
				mutex_lock(i_mutex)

In this scenario, nothing on the system will be able to allocate memory 
without some type of memory reserve, since at least one thread is holding 
the mutex that the victim needs and is looping forever, unless memory is 
freed by something else on the system, which would allow TaskA to allocate 
and drop the mutex.
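
For illustration, TaskA's side of the sequence could look like this minimal
sketch (taska_write_path() is a hypothetical function, not code from any
real filesystem):

#include <linux/fs.h>
#include <linux/gfp.h>
#include <linux/mutex.h>

/* Hypothetical sketch of TaskA's side of the deadlock above. */
static int taska_write_path(struct inode *inode)
{
	struct page *page;

	mutex_lock(&inode->i_mutex);	/* oom-killed TaskB blocks here */
	/*
	 * GFP_KERNEL may end up in the oom killer, which defers forever
	 * because victim TaskB has not exited -- and TaskB cannot exit
	 * until we drop i_mutex below.
	 */
	page = alloc_page(GFP_KERNEL);
	if (!page) {
		mutex_unlock(&inode->i_mutex);
		return -ENOMEM;
	}
	__free_page(page);
	mutex_unlock(&inode->i_mutex);
	return 0;
}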

In a timeout-based solution, this would be detected and another thread 
would be chosen for oom kill.  There's currently no way for the oom killer 
to select a process that isn't waiting for that same mutex, however.  If 
it does, then that process has been killed needlessly, since it cannot make 
forward progress itself without grabbing the mutex.

Certainly, it would be better to eventually kill something else in the 
hope that it does not need the mutex and will free some memory, which would 
allow TaskA, the thread that had originally been deferring forever in the 
oom killer waiting for the original victim TaskB to exit, to make progress.  
If that's the solution, then TaskA had been killed unnecessarily itself.

Perhaps we should consider an alternative: allow threads, such as TaskA, 
that have been deferring for a long amount of time to simply allocate with 
ALLOC_NO_WATERMARKS themselves in that scenario, in the hope that the 
allocation succeeding will eventually allow them to drop the mutex.  Two 
problems: (1) there's no guarantee that the simple allocation is all TaskA 
needs before it will drop the lock, and (2) another thread could 
immediately grab the same mutex and allocate, in which case the same series 
of events repeats.
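
A minimal sketch of that alternative in the allocator slowpath (assuming a
hypothetical oom_defer_start timestamp recording when the deferral began;
no such variable exists today):

		/*
		 * Hypothetical: we have been deferring to an existing
		 * TIF_MEMDIE victim for too long, so let this allocation
		 * dip into the memory reserves in the hope that the lock
		 * holder can make progress.
		 */
		if (time_after(jiffies, oom_defer_start + 10 * HZ))
			alloc_flags |= ALLOC_NO_WATERMARKS;

Both problems above still apply; the sketch only changes when they bite.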


^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-23 22:08                                 ` David Rientjes
@ 2015-02-24 11:20                                   ` Tetsuo Handa
  2015-02-24 15:20                                     ` Theodore Ts'o
  0 siblings, 1 reply; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-24 11:20 UTC (permalink / raw)
  To: rientjes
  Cc: hannes, mhocko, david, dchinner, linux-mm, oleg, akpm, mgorman,
	torvalds, fernando_b1

David Rientjes wrote:
> Perhaps we should consider an alternative: allow threads, such as TaskA, 
> that have been deferring for a long amount of time to simply allocate with 
> ALLOC_NO_WATERMARKS themselves in that scenario, in the hope that the 
> allocation succeeding will eventually allow them to drop the mutex.  Two 
> problems: (1) there's no guarantee that the simple allocation is all TaskA 
> needs before it will drop the lock, and (2) another thread could 
> immediately grab the same mutex and allocate, in which case the same series 
> of events repeats.

We can see that effectively-GFP_NOFAIL allocations with a lock held
(e.g. inside a filesystem transaction) do exist, can't we?

----------------------------------------
TaskA               TaskB               TaskC               TaskD               TaskE
                    call mutex_lock()
                                        call mutex_lock()
                                                            call mutex_lock()
                                                                                call mutex_lock()
call mutex_lock()
                    do GFP_NOFAIL allocation
                    oom kill TaskA
                    waiting for TaskA to die
                    will do something with allocated memory
                    will call mutex_unlock()
                                        will do GFP_NOFAIL allocation
                                        will wait for TaskA to die
                                        will do something with allocated memory
                                        will call mutex_unlock()
                                                            will do GFP_NOFAIL allocation
                                                            will wait for TaskA to die
                                                            will do something with allocated memory
                                                            will call mutex_unlock()
                                                                                will do GFP_NOFAIL allocation
                                                                                will wait for TaskA to die
                                                                                will do something with allocated memory
                                                                                will call mutex_unlock()
will do GFP_NOFAIL allocation
----------------------------------------

Allowing ALLOC_NO_WATERMARKS to TaskB helps nothing. We don't want to
allow ALLOC_NO_WATERMARKS to TaskC, TaskD, TaskE and TaskA when they
do the same sequence that TaskB did, or we will deplete the memory reserves.

> In a timeout-based solution, this would be detected and another thread 
> would be chosen for oom kill.  There's currently no way for the oom killer 
> to select a process that isn't waiting for that same mutex, however.  If 
> it does, then that process has been killed needlessly, since it cannot make 
> forward progress itself without grabbing the mutex.

Right. The OOM killer cannot understand that there is such a lock dependency.

And do you think a way will become available for the OOM killer to select
a process that isn't waiting for that same mutex in the near future?
(Remembering the address of the mutex a task is currently waiting for in
"struct task_struct" would do, but will not be accepted due to the
performance penalty. A simplified form would be to check
"struct task_struct"->state, but that will not be perfect.)
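
For example, the simplified form could be as small as this (a hypothetical
helper; oom_candidate_maybe_blocked() is not an existing function):

/*
 * Skip oom candidates that look blocked on a lock.  Checking ->state
 * is the simplified form mentioned above; it is not perfect because
 * TASK_UNINTERRUPTIBLE covers far more than mutex waits (disk I/O etc.).
 */
static bool oom_candidate_maybe_blocked(struct task_struct *p)
{
	return p->state == TASK_UNINTERRUPTIBLE;
}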

> Certainly, it would be better to eventually kill something else in the 
> hope that it does not need the mutex and will free some memory, which would 
> allow TaskA, the thread that had originally been deferring forever in the 
> oom killer waiting for the original victim TaskB to exit, to make progress.  
> If that's the solution, then TaskA had been killed unnecessarily itself.

Complaining about unnecessarily killed processes is preventing us from
making forward progress.

The memory reserves are something like a balloon. To guarantee forward
progress, the balloon must not become empty. All memory managing techniques
except the OOM killer try to control the "deflator" of the balloon via
various throttling heuristics. On the other hand, the OOM killer is the only
memory managing technique which tries to control the "inflator" of the balloon
via several throttling heuristics. The OOM killer is invoked when all memory
managing techniques except the OOM killer have failed to make forward progress.

Therefore, the OOM killer is responsible for making forward progress for
the "deflator" of the balloon, and is granted the prerogative to send SIGKILL
to any process.

Given the fact that the OOM killer cannot understand lock dependencies and
that effectively-GFP_NOFAIL allocations exist, it is inevitable that the
OOM killer sometimes fails to choose the one correct process that will make
forward progress.

Currently the OOM killer is invoked in one-shot mode. This mode helps us
reduce the possibility of depleting the memory reserves and killing
processes unnecessarily. But this mode is bothering people with the "silently
stalling forever" problem when the bullet from the OOM killer misses the
target. This mode is also bothering people with the "complete system crash"
problem when the bullet from SysRq-f misses the target, for they have to
resort to SysRq-i or SysRq-c or SysRq-b, which kill far more processes
unnecessarily, in order to resolve the OOM condition.

My proposal is to allow the OOM killer to be invoked in consecutive-shots
mode. Although consecutive-shots mode may increase the possibility of killing
processes unnecessarily, trying to kill an unkillable process in one-shot
mode is, after all, an unnecessary kill of processes too. The root cause is
the same (i.e. the OOM killer cannot understand the dependency). My patch can
stop bothering people with the "silently stalling forever" / "complete system
crash" problems by retrying the oom kill attempt rather than waiting forever.

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-24 11:20                                   ` Tetsuo Handa
@ 2015-02-24 15:20                                     ` Theodore Ts'o
  2015-02-24 21:02                                       ` Dave Chinner
  0 siblings, 1 reply; 276+ messages in thread
From: Theodore Ts'o @ 2015-02-24 15:20 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, hannes, mhocko, david, dchinner, linux-mm, oleg, akpm,
	mgorman, torvalds, fernando_b1

On Tue, Feb 24, 2015 at 08:20:11PM +0900, Tetsuo Handa wrote:
> > In a timeout based solution, this would be detected and another thread 
> > would be chosen for oom kill.  There's currently no way for the oom killer 
> > to select a process that isn't waiting for that same mutex, however.  If 
> > it does, then the process has been killed needlessly since it cannot make 
> > forward progress itself without grabbing the mutex.
> 
> Right. The OOM killer cannot understand that there is such lock dependency....

> The memory reserves are something like a balloon. To guarantee forward
> progress, the balloon must not become empty. All memory managing techniques
> except the OOM killer are trying to control the "deflator" of the balloon via
> various throttling heuristics. On the other hand, the OOM killer is the only
> memory managing technique which tries to operate the "inflator" of the
> balloon, by killing a process to release its memory.....

The mm developers have asked in the past whether we could solve
problems by preallocating memory in advance.  Sometimes this is very
hard to do because we don't know exactly how much memory we need, or
whether we will need it at all; or, in order to do this, we would need
to completely restructure the code because the memory allocation is
happening deep in the call stack, potentially in some other subsystem.

So I wonder if we can solve the problem by having a subsystem
reserving memory in advance of taking the mutexes.  We do something
like this in ext3/ext4 --- when we allocate a (sub-)transaction
handle, we give a worst case estimate of how many blocks we might need
to dirty under that handle, and if there isn't enough space in the
journal, we block in the start_handle() call while the current
transaction is closed, and the transaction handle will be attached to
the next transaction.

In the memory allocation scenario, it's a bit more complicated, since
the memory might be allocated in a slab that requires a higher-order
page allocation, but would it be sufficient if we do something rough
where the foreground kernel thread "reserves" a few pages before it
starts doing something that requires mutexes.  The reservation would
be made on an accounting basis, and a kernel codepath which has
reserved pages would get priority over kernel threads running under a
task_struct which has not reserved pages.  If the system doesn't
have enough pages available, then the reservation request would block
the process until more memory is available.
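
As a sketch of what such an interface might look like (none of these
functions exist today; the names and the worst-case estimate are
invented for illustration):

----------
/* Reserve a worst-case estimate of order-0 pages; may block. */
int mem_reserve_pages(unsigned long nr_pages);

/* Return whatever part of the reservation was not consumed. */
void mem_unreserve_pages(unsigned long nr_pages);

static DEFINE_MUTEX(subsystem_lock);	/* stand-in for a real lock */

static int read_path(void)
{
	int ret = mem_reserve_pages(8);	/* worst-case estimate */

	if (ret)
		return ret;	/* admission control: back off early */
	mutex_lock(&subsystem_lock);
	/* allocations here are charged against the reservation */
	mutex_unlock(&subsystem_lock);
	mem_unreserve_pages(8);	/* give back what we did not use */
	return 0;
}
----------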

This wouldn't necessarily help in cases where the memory is required for
cleaning dirty pages (although in those cases you really *do* want to
let the memory allocation succeed --- so maybe there should be a way
to hint to the mm subsystem that a memory allocation should be given
higher priority since it might help get the system out of the jam that
it is in).

However, for "normal" operations, where blocking a process who was
about to execute, say, a read(2) or a open(2) system call early,
*before* it takes some mutex, it owuld be good if we could provide a
certain amount of admission control when memory pressure is specially
high.

Would this be a viable strategy?

Even if this was a hint that wasn't perfect (i.e., in some cases a
kernel thread might end up requiring more pages than it had hinted,
which would not be considered fatal, although the excess requested
pages would be treated the same way as if no reservation was made at
all, meaning the memory allocation would be more likely to fail and a
GFP_NOFAIL allocation would loop for longer), I would think this could
only help us do a better job of "keeping the balloon from getting
completely deflated".

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: __GFP_NOFAIL and oom_killer_disabled?
  2015-02-23 13:03                                                         ` Tetsuo Handa
@ 2015-02-24 18:14                                                           ` Michal Hocko
  2015-02-25 11:22                                                             ` Tetsuo Handa
  0 siblings, 1 reply; 276+ messages in thread
From: Michal Hocko @ 2015-02-24 18:14 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, tytso, david, hannes, dchinner, linux-mm, rientjes, oleg,
	mgorman, torvalds

On Mon 23-02-15 22:03:25, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > What about something like the following?
> 
> I'm fine with whatever approaches as long as retry is guaranteed.
> 
> But maybe we can use memory reserves like below?

This sounds too risky to me and not really necessary. GFP_NOFAIL
allocations shouldn't be called while the system is not running any
tasks (aka from pm/device code). So we are primarily trying to help
those nofail allocations which come from kernel threads and their retry
will fail the suspend rather than blow up because of an unexpected
allocation failure.

> I think there will be little risk because userspace processes are
> already frozen...
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a47f0b2..cea0a1b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2760,8 +2760,17 @@ retry:
>  							&did_some_progress);
>  			if (page)
>  				goto got_pg;
> -			if (!did_some_progress)
> +			if (!did_some_progress && !(gfp_mask & __GFP_NOFAIL))
>  				goto nopage;
> +			/*
> +			 * What!? __GFP_NOFAIL allocation failed to invoke
> +			 * the OOM killer due to oom_killer_disabled == true?
> +			 * Then, pretend ALLOC_NO_WATERMARKS request and let
> +			 * __alloc_pages_high_priority() retry forever...
> +			 */
> +			WARN(1, "Retrying GFP_NOFAIL allocation...\n");
> +			gfp_mask &= ~__GFP_NOMEMALLOC;
> +			gfp_mask |= __GFP_MEMALLOC;
>  		}
>  		/* Wait for some write requests to complete then retry */
>  		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-24 15:20                                     ` Theodore Ts'o
@ 2015-02-24 21:02                                       ` Dave Chinner
  2015-02-25 14:31                                         ` Tetsuo Handa
  0 siblings, 1 reply; 276+ messages in thread
From: Dave Chinner @ 2015-02-24 21:02 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Tetsuo Handa, rientjes, hannes, mhocko, dchinner, linux-mm, oleg,
	akpm, mgorman, torvalds, fernando_b1

On Tue, Feb 24, 2015 at 10:20:33AM -0500, Theodore Ts'o wrote:
> On Tue, Feb 24, 2015 at 08:20:11PM +0900, Tetsuo Handa wrote:
> > > In a timeout based solution, this would be detected and another thread 
> > > would be chosen for oom kill.  There's currently no way for the oom killer 
> > > to select a process that isn't waiting for that same mutex, however.  If 
> > > it does, then the process has been killed needlessly since it cannot make 
> > > forward progress itself without grabbing the mutex.
> > 
> > Right. The OOM killer cannot understand that there is such lock dependency....
> 
> > The memory reserves are something like a balloon. To guarantee forward
> > progress, the balloon must not become empty. All memory managing techniques
> > except the OOM killer are trying to control the "deflator" of the balloon via
> > various throttling heuristics. On the other hand, the OOM killer is the only
> > memory managing technique which tries to operate the "inflator" of the
> > balloon, by killing a process to release its memory.....
> 
> The mm developers have asked in the past whether we could solve
> problems by preallocating memory in advance.  Sometimes this is very
> hard to do because we don't know exactly how much memory we need, or
> whether we will need it at all; or, in order to do this, we would need
> to completely restructure the code because the memory allocation is
> happening deep in the call stack, potentially in some other subsystem.
> 
> So I wonder if we can solve the problem by having a subsystem
> reserving memory in advance of taking the mutexes.  We do something
> like this in ext3/ext4 --- when we allocate a (sub-)transaction
> handle, we give a worst case estimate of how many blocks we might need
> to dirty under that handle, and if there isn't enough space in the
> journal, we block in the start_handle() call while the current
> transaction is closed, and the transaction handle will be attached to
> the next transaction.

This exact discussion is already underway.

My initial proposal:

http://oss.sgi.com/archives/xfs/2015-02/msg00314.html

Why mempools don't work but transaction based reservations will:

http://oss.sgi.com/archives/xfs/2015-02/msg00339.html

Reservation needs to be an accounting mechanisms, not preallocation:

http://oss.sgi.com/archives/xfs/2015-02/msg00456.html
http://oss.sgi.com/archives/xfs/2015-02/msg00457.html
http://oss.sgi.com/archives/xfs/2015-02/msg00458.html

And that's where the discussion currently sits.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: __GFP_NOFAIL and oom_killer_disabled?
  2015-02-24 18:14                                                           ` Michal Hocko
@ 2015-02-25 11:22                                                             ` Tetsuo Handa
  2015-02-25 16:02                                                               ` Michal Hocko
  0 siblings, 1 reply; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-25 11:22 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, tytso, david, hannes, dchinner, linux-mm, rientjes, oleg,
	mgorman, torvalds

Michal Hocko wrote:
> This commit hasn't introduced any behavior changes. GFP_NOFAIL
> allocations fail when OOM killer is disabled since beginning
> 7f33d49a2ed5 (mm, PM/Freezer: Disable OOM killer when tasks are frozen).

I thought that

-       out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false);
-       *did_some_progress = 1;
+       if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false))
+               *did_some_progress = 1;

in commit c32b3cbe0d067a9c "oom, PM: make OOM detection in the freezer
path raceless" introduced a code path which fails to set
*did_some_progress to non 0 value.

> "
> We haven't seen any bug reports since 2009 so I haven't marked the patch
> for stable. I have no problem to backport it to stable trees though if
> people think it is a good precaution.
> "

Until 3.18, GFP_NOFAIL for GFP_NOFS / GFP_NOIO did not fail with
oom_killer_disabled == true because of

----------
        if (!did_some_progress) {
                if (oom_gfp_allowed(gfp_mask)) {
                        if (oom_killer_disabled)
                                goto nopage;
			(...snipped...)
                        goto restart;
                }
        }
	(...snipped...)
	goto rebalance;
----------

and that might be the reason you did not see bug reports.
In 3.19, GFP_NOFAIL for GFP_NOFS / GFP_NOIO started to fail with
oom_killer_disabled == true because of

----------
        if (should_alloc_retry(gfp_mask, order, did_some_progress,
                                                pages_reclaimed)) {
                /*
                 * If we fail to make progress by freeing individual
                 * pages, but the allocation wants us to keep going,
                 * start OOM killing tasks.
                 */
                if (!did_some_progress) {
                        page = __alloc_pages_may_oom(gfp_mask, order, zonelist,
                                                high_zoneidx, nodemask,
                                                preferred_zone, classzone_idx,
                                                migratetype,&did_some_progress);
                        if (page)
                                goto got_pg;
                        if (!did_some_progress)
                                goto nopage;
                }
                /* Wait for some write requests to complete then retry */
                wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
                goto retry;
	} else
----------

----------
static inline struct page *
__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
        struct zonelist *zonelist, enum zone_type high_zoneidx,
        nodemask_t *nodemask, struct zone *preferred_zone,
        int classzone_idx, int migratetype, unsigned long *did_some_progress)
{
        struct page *page;

        *did_some_progress = 0;

        if (oom_killer_disabled)
                return NULL;
----------

and thus you might start seeing bug reports.

So, it is commit 9879de7373fc "mm: page_alloc: embed OOM killing naturally
into allocation slowpath" rather than commit c32b3cbe0d067a9c "oom, PM: make
OOM detection in the freezer path raceless" that introduced behavior changes?

> On Mon 23-02-15 22:03:25, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > What about something like the following?
> > 
> > I'm fine with whatever approaches as long as retry is guaranteed.
> > 
> > But maybe we can use memory reserves like below?
> 
> This sounds too risky to me and not really necessary. GFP_NOFAIL
> allocations shouldn't be called while the system is not running any
> tasks (aka from pm/device code). So we are primarily trying to help
> those nofail allocations which come from kernel threads and their retry
> will fail the suspend rather than blow up because of an unexpected
> allocation failure.

I meant "After all, don't we need to recheck after setting
oom_killer_disabled to true?" as "their retry will fail the suspend".

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-24 21:02                                       ` Dave Chinner
@ 2015-02-25 14:31                                         ` Tetsuo Handa
  2015-02-27  7:39                                           ` Dave Chinner
  0 siblings, 1 reply; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-25 14:31 UTC (permalink / raw)
  To: david
  Cc: tytso, rientjes, hannes, mhocko, dchinner, linux-mm, oleg, akpm,
	mgorman, torvalds, fernando_b1

Dave Chinner wrote:
> This exact discussion is already underway.
> 
> My initial proposal:
> 
> http://oss.sgi.com/archives/xfs/2015-02/msg00314.html
> 
> Why mempools don't work but transaction based reservations will:
> 
> http://oss.sgi.com/archives/xfs/2015-02/msg00339.html
> 
> Reservation needs to be an accounting mechanisms, not preallocation:
> 
> http://oss.sgi.com/archives/xfs/2015-02/msg00456.html
> http://oss.sgi.com/archives/xfs/2015-02/msg00457.html
> http://oss.sgi.com/archives/xfs/2015-02/msg00458.html
> 
> And that's where the discussion currently sits.

I got two problems (one is a stall at io_schedule(), the other is a kernel
panic due to xfs's assertion failure) using Linux 3.19. I guess those problems
are caused by not retrying !GFP_FS allocations under OOM. Will those problems
go away by using transaction based reservations? And if yes, are they simple
enough to backport to vendors' kernels?

(From http://I-love.SAKURA.ne.jp/tmp/serial-20150225-1.txt.xz )
----------
[ 1225.773411] kworker/3:0H    D ffff88007cadb4f8 11632    27      2 0x00000000
[ 1225.776911]  ffff88007cadb4f8 ffff88007cadb508 ffff88007cac6740 0000000000014080
[ 1225.780670]  ffffffff8101cd19 ffff88007cadbfd8 0000000000014080 ffff88007c28b740
[ 1225.784431]  ffff88007cac6740 ffff88007cadb540 ffff88007f8d4998 ffff88007cadb540
[ 1225.788766] Call Trace:
[ 1225.789988]  [<ffffffff8101cd19>] ? read_tsc+0x9/0x10
[ 1225.792444]  [<ffffffff812acbd9>] ? xfs_iunpin_wait+0x19/0x20
[ 1225.795228]  [<ffffffff816b2590>] io_schedule+0xa0/0x130
[ 1225.797802]  [<ffffffff812a9569>] __xfs_iunpin_wait+0xe9/0x140
[ 1225.800621]  [<ffffffff810af3b0>] ? autoremove_wake_function+0x40/0x40
[ 1225.803770]  [<ffffffff812acbd9>] xfs_iunpin_wait+0x19/0x20
[ 1225.806471]  [<ffffffff812a209c>] xfs_reclaim_inode+0x7c/0x360
[ 1225.809283]  [<ffffffff812a25d7>] xfs_reclaim_inodes_ag+0x257/0x370
[ 1225.812308]  [<ffffffff81340839>] ? radix_tree_gang_lookup_tag+0x89/0xd0
[ 1225.815532]  [<ffffffff8116fe58>] ? list_lru_walk_node+0x148/0x190
[ 1225.817951]  [<ffffffff812a2783>] xfs_reclaim_inodes_nr+0x33/0x40
[ 1225.819373]  [<ffffffff812b3545>] xfs_fs_free_cached_objects+0x15/0x20
[ 1225.820898]  [<ffffffff811c29e9>] super_cache_scan+0x169/0x170
[ 1225.822245]  [<ffffffff8115aed6>] shrink_node_slabs+0x1d6/0x370
[ 1225.823588]  [<ffffffff8115dd2a>] shrink_zone+0x20a/0x240
[ 1225.824830]  [<ffffffff8115e0dc>] do_try_to_free_pages+0x16c/0x460
[ 1225.826230]  [<ffffffff8115e48a>] try_to_free_pages+0xba/0x150
[ 1225.827570]  [<ffffffff81151542>] __alloc_pages_nodemask+0x5b2/0x9d0
[ 1225.829030]  [<ffffffff8119ecbc>] kmem_getpages+0x8c/0x200
[ 1225.830277]  [<ffffffff811a122b>] fallback_alloc+0x17b/0x230
[ 1225.831561]  [<ffffffff811a107b>] ____cache_alloc_node+0x18b/0x1c0
[ 1225.833061]  [<ffffffff811a3b00>] kmem_cache_alloc+0x330/0x5c0
[ 1225.834435]  [<ffffffff8133c9d9>] ? ida_pre_get+0x69/0x100
[ 1225.835719]  [<ffffffff8133c9d9>] ida_pre_get+0x69/0x100
[ 1225.836963]  [<ffffffff8133d312>] ida_simple_get+0x42/0xf0
[ 1225.838248]  [<ffffffff81086211>] create_worker+0x31/0x1c0
[ 1225.839519]  [<ffffffff81087831>] worker_thread+0x3d1/0x4d0
[ 1225.840800]  [<ffffffff81087460>] ? rescuer_thread+0x3a0/0x3a0
[ 1225.842123]  [<ffffffff8108c5e2>] kthread+0xd2/0xf0
[ 1225.843234]  [<ffffffff81010000>] ? perf_trace_xen_mmu_ptep_modify_prot+0x90/0xf0
[ 1225.844978]  [<ffffffff8108c510>] ? kthread_create_on_node+0x180/0x180
[ 1225.846481]  [<ffffffff816b63fc>] ret_from_fork+0x7c/0xb0
[ 1225.847718]  [<ffffffff8108c510>] ? kthread_create_on_node+0x180/0x180
[ 1225.849279] kswapd0         D ffff88007708f998 11552    45      2 0x00000000
[ 1225.850977]  ffff88007708f998 0000000000000000 ffff88007c28b740 0000000000014080
[ 1225.852798]  0000000000000003 ffff88007708ffd8 0000000000014080 ffff880077ff2740
[ 1225.854575]  ffff88007c28b740 0000000000000000 ffff88007948e3a8 ffff88007948e3ac
[ 1225.856358] Call Trace:
[ 1225.856928]  [<ffffffff816b2799>] schedule_preempt_disabled+0x29/0x70
[ 1225.858384]  [<ffffffff816b43d5>] __mutex_lock_slowpath+0x95/0x100
[ 1225.859799]  [<ffffffff816b4463>] mutex_lock+0x23/0x37
[ 1225.860983]  [<ffffffff812a264c>] xfs_reclaim_inodes_ag+0x2cc/0x370
[ 1225.862403]  [<ffffffff8109eb48>] ? __enqueue_entity+0x78/0x80
[ 1225.863742]  [<ffffffff810a5f37>] ? enqueue_entity+0x237/0x8f0
[ 1225.865100]  [<ffffffff81340839>] ? radix_tree_gang_lookup_tag+0x89/0xd0
[ 1225.866659]  [<ffffffff8116fe58>] ? list_lru_walk_node+0x148/0x190
[ 1225.868106]  [<ffffffff812a2783>] xfs_reclaim_inodes_nr+0x33/0x40
[ 1225.869522]  [<ffffffff812b3545>] xfs_fs_free_cached_objects+0x15/0x20
[ 1225.871015]  [<ffffffff811c29e9>] super_cache_scan+0x169/0x170
[ 1225.872338]  [<ffffffff8115aed6>] shrink_node_slabs+0x1d6/0x370
[ 1225.873679]  [<ffffffff8115dd2a>] shrink_zone+0x20a/0x240
[ 1225.874920]  [<ffffffff8115ed2d>] kswapd+0x4fd/0x9c0
[ 1225.876049]  [<ffffffff8115e830>] ? mem_cgroup_shrink_node_zone+0x140/0x140
[ 1225.877654]  [<ffffffff8108c5e2>] kthread+0xd2/0xf0
[ 1225.878762]  [<ffffffff81010000>] ? perf_trace_xen_mmu_ptep_modify_prot+0x90/0xf0
[ 1225.880495]  [<ffffffff8108c510>] ? kthread_create_on_node+0x180/0x180
[ 1225.881996]  [<ffffffff816b63fc>] ret_from_fork+0x7c/0xb0
[ 1225.883336]  [<ffffffff8108c510>] ? kthread_create_on_node+0x180/0x180
----------

(From http://I-love.SAKURA.ne.jp/tmp/serial-20150225-2.txt.xz +
http://I-love.SAKURA.ne.jp/tmp/crash-20150225-2.log.xz )
----------
[  189.586204] Out of memory: Kill process 3701 (a.out) score 834 or sacrifice child
[  189.586205] Killed process 3701 (a.out) total-vm:2167392kB, anon-rss:1465820kB, file-rss:4kB
[  189.586210] Kill process 3702 (a.out) sharing same memory
[  189.586211] Kill process 3714 (a.out) sharing same memory
[  189.586212] Kill process 3748 (a.out) sharing same memory
[  189.586213] Kill process 3755 (a.out) sharing same memory
[  189.593470] XFS: Assertion failed: XFS_FORCED_SHUTDOWN(mp), file: fs/xfs/xfs_inode.c, line: 1701
[  189.593491] ------------[ cut here ]------------
[  189.593492] kernel BUG at fs/xfs/xfs_message.c:106!
[  189.593493] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[  189.593511] Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_raw iptable_filter ip_tables coretemp crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel glue_helper lrw gf128mul ablk_helper cryptd dm_mirror dm_region_hash dm_log microcode dm_mod ppdev parport_pc pcspkr vmw_balloon serio_raw vmw_vmci parport shpchp i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc uinput ata_generic pata_acpi sd_mod ata_piix mptspi libata scsi_transport_spi e1000 mptscsih mptbase floppy
[  189.593512] CPU: 1 PID: 3755 Comm: a.out Not tainted 3.19.0 #42
[  189.593512] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  189.593513] task: ffff88007a848740 ti: ffff88005c064000 task.ti: ffff88005c064000
[  189.593517] RIP: 0010:[<ffffffff812af992>]  [<ffffffff812af992>] assfail+0x22/0x30
[  189.593517] RSP: 0000:ffff88005c067af8  EFLAGS: 00010292
[  189.593518] RAX: 0000000000000054 RBX: ffff880079349c00 RCX: 0000000000000050
[  189.593518] RDX: 0000000000005050 RSI: 0000000000000282 RDI: 0000000000000282
[  189.593519] RBP: ffff88005c067af8 R08: 0000000000000282 R09: 0000000000000000
[  189.593519] R10: ffffffff81ec95c8 R11: 656c696166206e6f R12: ffff88005ee92800
[  189.593519] R13: 00000000fffffff4 R14: ffffffff81838140 R15: ffff880064505390
[  189.593520] FS:  00007f62d93e0740(0000) GS:ffff88007f840000(0000) knlGS:0000000000000000
[  189.593521] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  189.593521] CR2: 00007fb901282763 CR3: 0000000077b00000 CR4: 00000000000407e0
[  189.593562] Stack:
[  189.593564]  ffff88005c067b38 ffffffff812ab2d7 ffff880079349e48 ffff88007a6feef0
[  189.593564]  ffff88005c067b38 ffff880079349c00 0000000000000001 ffff880079349db8
[  189.593565]  ffff88005c067b58 ffffffff812acb98 ffff880079349db8 ffff880079349c00
[  189.593565] Call Trace:
[  189.593568]  [<ffffffff812ab2d7>] xfs_inactive_truncate+0x67/0x150
[  189.593569]  [<ffffffff812acb98>] xfs_inactive+0x1c8/0x1f0
[  189.593570]  [<ffffffff812b3216>] xfs_fs_evict_inode+0x86/0xd0
[  189.593572]  [<ffffffff811da0f8>] evict+0xb8/0x190
[  189.593574]  [<ffffffff811daa15>] iput+0xf5/0x180
[  189.593575]  [<ffffffff811d5b58>] __dentry_kill+0x188/0x1f0
[  189.593576]  [<ffffffff811d5c65>] dput+0xa5/0x170
[  189.593577]  [<ffffffff811c0dbd>] __fput+0x16d/0x1e0
[  189.593578]  [<ffffffff811c0e7e>] ____fput+0xe/0x10
[  189.593580]  [<ffffffff8108ac9f>] task_work_run+0xaf/0xf0
[  189.593582]  [<ffffffff81071638>] do_exit+0x2d8/0xbe0
[  189.593583]  [<ffffffff8107a5df>] ? recalc_sigpending+0x1f/0x60
[  189.593584]  [<ffffffff81071fcf>] do_group_exit+0x3f/0xa0
[  189.593585]  [<ffffffff8107d322>] get_signal+0x1d2/0x6f0
[  189.593588]  [<ffffffff810134e8>] do_signal+0x28/0x720
[  189.593589]  [<ffffffff811c1825>] ? __sb_end_write+0x35/0x70
[  189.593591]  [<ffffffff811bf362>] ? vfs_write+0x172/0x1f0
[  189.593592]  [<ffffffff81013c2c>] do_notify_resume+0x4c/0x90
[  189.593594]  [<ffffffff816b6747>] int_signal+0x12/0x17
[  189.593602] Code: 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 f1 41 89 d0 48 c7 c6 48 8b 97 81 48 89 fa 31 c0 48 89 e5 31 ff e8 de fb ff ff <0f> 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 
[  189.593603] RIP  [<ffffffff812af992>] assfail+0x22/0x30
[  189.593604]  RSP <ffff88005c067af8>
----------

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: __GFP_NOFAIL and oom_killer_disabled?
  2015-02-25 11:22                                                             ` Tetsuo Handa
@ 2015-02-25 16:02                                                               ` Michal Hocko
  2015-02-25 21:48                                                                 ` Tetsuo Handa
  0 siblings, 1 reply; 276+ messages in thread
From: Michal Hocko @ 2015-02-25 16:02 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, tytso, david, hannes, dchinner, linux-mm, rientjes, oleg,
	mgorman, torvalds

On Wed 25-02-15 20:22:22, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > This commit hasn't introduced any behavior changes. GFP_NOFAIL
> > allocations fail when OOM killer is disabled since beginning
> > 7f33d49a2ed5 (mm, PM/Freezer: Disable OOM killer when tasks are frozen).
> 
> I thought that
> 
> -       out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false);
> -       *did_some_progress = 1;
> +       if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false))
> +               *did_some_progress = 1;
> 
> in commit c32b3cbe0d067a9c "oom, PM: make OOM detection in the freezer
> path raceless" introduced a code path which fails to set
> *did_some_progress to non 0 value.

But this commit also had the following hunk:
@@ -2317,9 +2315,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 
        *did_some_progress = 0;
 
-       if (oom_killer_disabled)
-               return NULL;
-

so we wouldn't even get down to out_of_memory and would have returned with
did_some_progress=0 right away. So the patch hasn't changed the logic.

> > "
> > We haven't seen any bug reports since 2009 so I haven't marked the patch
> > for stable. I have no problem to backport it to stable trees though if
> > people think it is a good precaution.
> > "
> 
> Until 3.18, GFP_NOFAIL for GFP_NOFS / GFP_NOIO did not fail with
> oom_killer_disabled == true because of
> 
> ----------
>         if (!did_some_progress) {
>                 if (oom_gfp_allowed(gfp_mask)) {
>                         if (oom_killer_disabled)
>                                 goto nopage;
> 			(...snipped...)
>                         goto restart;
>                 }
>         }
> 	(...snipped...)
> 	goto rebalance;
> ----------
> 
> and that might be the reason you did not see bug reports.
> In 3.19, GFP_NOFAIL for GFP_NOFS / GFP_NOIO started to fail with
> oom_killer_disabled == true because of

OK, that would change the behavior for __GFP_NOFAIL|~__GFP_FS
allocations. The patch from Johannes which reverts the GFP_NOFS failure mode
should go to stable and that should be sufficient IMO.
 
[...]

> So, it is commit 9879de7373fc "mm: page_alloc: embed OOM killing naturally
> into allocation slowpath" rather than commit c32b3cbe0d067a9c "oom, PM: make
> OOM detection in the freezer path raceless" that introduced behavior changes?

Yes.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: __GFP_NOFAIL and oom_killer_disabled?
  2015-02-25 16:02                                                               ` Michal Hocko
@ 2015-02-25 21:48                                                                 ` Tetsuo Handa
  2015-02-25 21:51                                                                   ` Andrew Morton
  0 siblings, 1 reply; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-25 21:48 UTC (permalink / raw)
  To: mhocko, hannes
  Cc: akpm, tytso, david, dchinner, linux-mm, rientjes, oleg, mgorman,
	torvalds

Michal Hocko wrote:
> On Wed 25-02-15 20:22:22, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > This commit hasn't introduced any behavior changes. GFP_NOFAIL
> > > allocations fail when OOM killer is disabled since beginning
> > > 7f33d49a2ed5 (mm, PM/Freezer: Disable OOM killer when tasks are frozen).
> > 
> > I thought that
> > 
> > -       out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false);
> > -       *did_some_progress = 1;
> > +       if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false))
> > +               *did_some_progress = 1;
> > 
> > in commit c32b3cbe0d067a9c "oom, PM: make OOM detection in the freezer
> > path raceless" introduced a code path which fails to set
> > *did_some_progress to non 0 value.
> 
> But this commit also had the following hunk:
> @@ -2317,9 +2315,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  
>         *did_some_progress = 0;
>  
> -       if (oom_killer_disabled)
> -               return NULL;
> -
> 
> so we wouldn't even get down to out_of_memory and would have returned with
> did_some_progress=0 right away. So the patch hasn't changed the logic.

OK.

> OK, that would change the behavior for __GFP_NOFAIL|~__GFP_FS
> allocations. The patch from Johannes which reverts the GFP_NOFS failure mode
> should go to stable and that should be sufficient IMO.
>  

mm-page_alloc-revert-inadvertent-__gfp_fs-retry-behavior-change.patch
fixes only the ~__GFP_NOFAIL|~__GFP_FS case. I think we need David's version
http://marc.info/?l=linux-mm&m=142489687015873&w=2 for 3.19-stable.

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: __GFP_NOFAIL and oom_killer_disabled?
  2015-02-25 21:48                                                                 ` Tetsuo Handa
@ 2015-02-25 21:51                                                                   ` Andrew Morton
  0 siblings, 0 replies; 276+ messages in thread
From: Andrew Morton @ 2015-02-25 21:51 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, hannes, tytso, david, dchinner, linux-mm, rientjes, oleg,
	mgorman, torvalds

On Thu, 26 Feb 2015 06:48:02 +0900 Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote:

> > OK, that would change the behavior for __GFP_NOFAIL|~__GFP_FS
> > allocations. The patch from Johannes which reverts the GFP_NOFS failure mode
> > should go to stable and that should be sufficient IMO.
> >  
> 
> mm-page_alloc-revert-inadvertent-__gfp_fs-retry-behavior-change.patch
> fixes only the ~__GFP_NOFAIL|~__GFP_FS case. I think we need David's version
> http://marc.info/?l=linux-mm&m=142489687015873&w=2 for 3.19-stable.

afaict nobody has even tested that.  If we want changes made to 3.19.x
then they will need to be well tested, well changelogged and signed off. 
Please.

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-25 14:31                                         ` Tetsuo Handa
@ 2015-02-27  7:39                                           ` Dave Chinner
  2015-02-27 12:42                                             ` Tetsuo Handa
  0 siblings, 1 reply; 276+ messages in thread
From: Dave Chinner @ 2015-02-27  7:39 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: tytso, rientjes, hannes, mhocko, dchinner, linux-mm, oleg, akpm,
	mgorman, torvalds, fernando_b1

On Wed, Feb 25, 2015 at 11:31:17PM +0900, Tetsuo Handa wrote:
> Dave Chinner wrote:
> > This exact discussion is already underway.
> > 
> > My initial proposal:
> > 
> > http://oss.sgi.com/archives/xfs/2015-02/msg00314.html
> > 
> > Why mempools don't work but transaction based reservations will:
> > 
> > http://oss.sgi.com/archives/xfs/2015-02/msg00339.html
> > 
> > Reservation needs to be an accounting mechanisms, not preallocation:
> > 
> > http://oss.sgi.com/archives/xfs/2015-02/msg00456.html
> > http://oss.sgi.com/archives/xfs/2015-02/msg00457.html
> > http://oss.sgi.com/archives/xfs/2015-02/msg00458.html
> > 
> > And that's where the discussion currently sits.
> 
> I got two problems (one is a stall at io_schedule()

This is a typical "blame the messenger" bug report. XFS is stuck in
inode reclaim waiting for log IO completion to occur, along with all
the other processes in xfs_log_force also stuck waiting for the
same IO completion.

You need to find where that IO completion that everything is waiting
on has got stuck or show that it's not a lost IO and actually an
XFS problem. e.g. has the IO stack got stuck on a mempool somewhere?

> , the other is a kernel panic
> due to xfs's assertion failure) using Linux 3.19.

> http://I-love.SAKURA.ne.jp/tmp/crash-20150225-2.log.xz )
> ----------
> [  189.586204] Out of memory: Kill process 3701 (a.out) score 834 or sacrifice child
> [  189.586205] Killed process 3701 (a.out) total-vm:2167392kB, anon-rss:1465820kB, file-rss:4kB
> [  189.586210] Kill process 3702 (a.out) sharing same memory
> [  189.586211] Kill process 3714 (a.out) sharing same memory
> [  189.586212] Kill process 3748 (a.out) sharing same memory
> [  189.586213] Kill process 3755 (a.out) sharing same memory
> [  189.593470] XFS: Assertion failed: XFS_FORCED_SHUTDOWN(mp), file: fs/xfs/xfs_inode.c, line: 1701

Which is a failure of xfs_trans_reserve(), and through the calling
context and parameters can only be from xfs_log_reserve().  That's
got a pretty clear cause:

        tic = xlog_ticket_alloc(log, unit_bytes, cnt, client, permanent,
                                KM_SLEEP | KM_MAYFAIL);
        if (!tic)
                return -ENOMEM;

And the reason for the ASSERT is pretty clear: we put it there
because we need to know - as developers - what failures (if any)
ever come through that path. This is called from evict():

> [  189.593565] Call Trace:
> [  189.593568]  [<ffffffff812ab2d7>] xfs_inactive_truncate+0x67/0x150
> [  189.593569]  [<ffffffff812acb98>] xfs_inactive+0x1c8/0x1f0
> [  189.593570]  [<ffffffff812b3216>] xfs_fs_evict_inode+0x86/0xd0
> [  189.593572]  [<ffffffff811da0f8>] evict+0xb8/0x190
> [  189.593574]  [<ffffffff811daa15>] iput+0xf5/0x180

And as such there is no mechanism for actually reporting the error
to userspace, and in failing here we are about to leak an inode.

When an XFS developer is testing new code, having a failure like
that get trapped is immensely useful. However, on production
systems, we can just keep going because it's not a fatal error and,
even more importantly, the leaked inode will get cleaned up by log
recovery next time the filesystem is mounted.

IOWs, when you run CONFIG_XFS_DEBUG=y, you'll often get failures
that are valuable to XFS developers but have no runtime effect on
production systems.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-27  7:39                                           ` Dave Chinner
@ 2015-02-27 12:42                                             ` Tetsuo Handa
  2015-02-27 13:12                                               ` Dave Chinner
  0 siblings, 1 reply; 276+ messages in thread
From: Tetsuo Handa @ 2015-02-27 12:42 UTC (permalink / raw)
  To: david
  Cc: tytso, rientjes, hannes, mhocko, dchinner, linux-mm, oleg, akpm,
	mgorman, torvalds, fernando_b1

Dave Chinner wrote:
> On Wed, Feb 25, 2015 at 11:31:17PM +0900, Tetsuo Handa wrote:
> > I got two problems (one is a stall at io_schedule()
> 
> This is a typical "blame the messenger" bug report. XFS is stuck in
> inode reclaim waiting for log IO completion to occur, along with all
> the other processes in xfs_log_force also stuck waiting for the
> same IO completion.

I wanted to know whether transaction based reservations can solve these
problems. Inside the filesystem layer, I guess you can calculate how much
memory is needed for your filesystem transaction. But I'm wondering
whether we can calculate how much memory is needed inside the block layer.
If the block layer fails to reserve memory, won't file I/O fail under
extreme memory pressure? And if __GFP_NOFAIL were used inside the block
layer, won't the OOM killer deadlock problem arise?

> 
> You need to find where that IO completion that everything is waiting
> on has got stuck or show that it's not a lost IO and actually an
> XFS problem. e.g. has the IO stack got stuck on a mempool somewhere?
> 

I didn't get a vmcore for this stall. But it seemed to me that

kworker/3:0H is doing

  xfs_fs_free_cached_objects()
  => xfs_reclaim_inodes_nr()
    => xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan)
      => xfs_reclaim_inode() because mutex_trylock(&pag->pag_ici_reclaim_lock)
         was successful
         => xfs_iunpin_wait(ip) because xfs_ipincount(ip) was non 0
           => __xfs_iunpin_wait()
             => waiting inside io_schedule() for somebody to unpin

kswapd0 is doing

  xfs_fs_free_cached_objects()
  => xfs_reclaim_inodes_nr()
    => xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan)
      => not calling xfs_reclaim_inode() because
         mutex_trylock(&pag->pag_ici_reclaim_lock) failed due to kworker/3:0H
      => SYNC_TRYLOCK is dropped for retry loop due to

            if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0) {
                    trylock = 0;
                    goto restart;
            }

      => calling mutex_lock(&pag->pag_ici_reclaim_lock) and gets blocked
         due to kworker/3:0H

kworker/3:0H is trying to free memory but somebody needs memory to make
forward progress. kswapd0 is also trying to free memory but is blocked by
kworker/3:0H already holding the lock. Since kswapd0 cannot make forward
progress, somebody can't allocate memory. Finally the system started
stalling. Is this decoding correct?

----------
[ 1225.773411] kworker/3:0H    D ffff88007cadb4f8 11632    27      2 0x00000000
[ 1225.776911]  ffff88007cadb4f8 ffff88007cadb508 ffff88007cac6740 0000000000014080
[ 1225.780670]  ffffffff8101cd19 ffff88007cadbfd8 0000000000014080 ffff88007c28b740
[ 1225.784431]  ffff88007cac6740 ffff88007cadb540 ffff88007f8d4998 ffff88007cadb540
[ 1225.788766] Call Trace:
[ 1225.789988]  [<ffffffff8101cd19>] ? read_tsc+0x9/0x10
[ 1225.792444]  [<ffffffff812acbd9>] ? xfs_iunpin_wait+0x19/0x20
[ 1225.795228]  [<ffffffff816b2590>] io_schedule+0xa0/0x130
[ 1225.797802]  [<ffffffff812a9569>] __xfs_iunpin_wait+0xe9/0x140
arch/x86/include/asm/atomic.h:27
fs/xfs/xfs_inode.c:2433
[ 1225.800621]  [<ffffffff810af3b0>] ? autoremove_wake_function+0x40/0x40
[ 1225.803770]  [<ffffffff812acbd9>] xfs_iunpin_wait+0x19/0x20
fs/xfs/xfs_inode.c:2443
[ 1225.806471]  [<ffffffff812a209c>] xfs_reclaim_inode+0x7c/0x360
include/linux/spinlock.h:309
fs/xfs/xfs_inode.h:144
fs/xfs/xfs_icache.c:920
[ 1225.809283]  [<ffffffff812a25d7>] xfs_reclaim_inodes_ag+0x257/0x370
fs/xfs/xfs_icache.c:1105
[ 1225.812308]  [<ffffffff81340839>] ? radix_tree_gang_lookup_tag+0x89/0xd0
[ 1225.815532]  [<ffffffff8116fe58>] ? list_lru_walk_node+0x148/0x190
[ 1225.817951]  [<ffffffff812a2783>] xfs_reclaim_inodes_nr+0x33/0x40
fs/xfs/xfs_icache.c:1166
[ 1225.819373]  [<ffffffff812b3545>] xfs_fs_free_cached_objects+0x15/0x20
[ 1225.820898]  [<ffffffff811c29e9>] super_cache_scan+0x169/0x170
[ 1225.822245]  [<ffffffff8115aed6>] shrink_node_slabs+0x1d6/0x370
[ 1225.823588]  [<ffffffff8115dd2a>] shrink_zone+0x20a/0x240
[ 1225.824830]  [<ffffffff8115e0dc>] do_try_to_free_pages+0x16c/0x460
[ 1225.826230]  [<ffffffff8115e48a>] try_to_free_pages+0xba/0x150
[ 1225.827570]  [<ffffffff81151542>] __alloc_pages_nodemask+0x5b2/0x9d0
[ 1225.829030]  [<ffffffff8119ecbc>] kmem_getpages+0x8c/0x200
[ 1225.830277]  [<ffffffff811a122b>] fallback_alloc+0x17b/0x230
[ 1225.831561]  [<ffffffff811a107b>] ____cache_alloc_node+0x18b/0x1c0
[ 1225.833061]  [<ffffffff811a3b00>] kmem_cache_alloc+0x330/0x5c0
[ 1225.834435]  [<ffffffff8133c9d9>] ? ida_pre_get+0x69/0x100
[ 1225.835719]  [<ffffffff8133c9d9>] ida_pre_get+0x69/0x100
[ 1225.836963]  [<ffffffff8133d312>] ida_simple_get+0x42/0xf0
[ 1225.838248]  [<ffffffff81086211>] create_worker+0x31/0x1c0
[ 1225.839519]  [<ffffffff81087831>] worker_thread+0x3d1/0x4d0
[ 1225.840800]  [<ffffffff81087460>] ? rescuer_thread+0x3a0/0x3a0
[ 1225.842123]  [<ffffffff8108c5e2>] kthread+0xd2/0xf0
[ 1225.843234]  [<ffffffff81010000>] ? perf_trace_xen_mmu_ptep_modify_prot+0x90/0xf0
[ 1225.844978]  [<ffffffff8108c510>] ? kthread_create_on_node+0x180/0x180
[ 1225.846481]  [<ffffffff816b63fc>] ret_from_fork+0x7c/0xb0
[ 1225.847718]  [<ffffffff8108c510>] ? kthread_create_on_node+0x180/0x180
[ 1225.849279] kswapd0         D ffff88007708f998 11552    45      2 0x00000000
[ 1225.850977]  ffff88007708f998 0000000000000000 ffff88007c28b740 0000000000014080
[ 1225.852798]  0000000000000003 ffff88007708ffd8 0000000000014080 ffff880077ff2740
[ 1225.854575]  ffff88007c28b740 0000000000000000 ffff88007948e3a8 ffff88007948e3ac
[ 1225.856358] Call Trace:
[ 1225.856928]  [<ffffffff816b2799>] schedule_preempt_disabled+0x29/0x70
[ 1225.858384]  [<ffffffff816b43d5>] __mutex_lock_slowpath+0x95/0x100
[ 1225.859799]  [<ffffffff816b4463>] mutex_lock+0x23/0x37
arch/x86/include/asm/current.h:14
kernel/locking/mutex.h:22
kernel/locking/mutex.c:103
[ 1225.860983]  [<ffffffff812a264c>] xfs_reclaim_inodes_ag+0x2cc/0x370
fs/xfs/xfs_icache.c:1034
[ 1225.862403]  [<ffffffff8109eb48>] ? __enqueue_entity+0x78/0x80
[ 1225.863742]  [<ffffffff810a5f37>] ? enqueue_entity+0x237/0x8f0
[ 1225.865100]  [<ffffffff81340839>] ? radix_tree_gang_lookup_tag+0x89/0xd0
[ 1225.866659]  [<ffffffff8116fe58>] ? list_lru_walk_node+0x148/0x190
[ 1225.868106]  [<ffffffff812a2783>] xfs_reclaim_inodes_nr+0x33/0x40
fs/xfs/xfs_icache.c:1166
[ 1225.869522]  [<ffffffff812b3545>] xfs_fs_free_cached_objects+0x15/0x20
[ 1225.871015]  [<ffffffff811c29e9>] super_cache_scan+0x169/0x170
[ 1225.872338]  [<ffffffff8115aed6>] shrink_node_slabs+0x1d6/0x370
[ 1225.873679]  [<ffffffff8115dd2a>] shrink_zone+0x20a/0x240
[ 1225.874920]  [<ffffffff8115ed2d>] kswapd+0x4fd/0x9c0
[ 1225.876049]  [<ffffffff8115e830>] ? mem_cgroup_shrink_node_zone+0x140/0x140
[ 1225.877654]  [<ffffffff8108c5e2>] kthread+0xd2/0xf0
[ 1225.878762]  [<ffffffff81010000>] ? perf_trace_xen_mmu_ptep_modify_prot+0x90/0xf0
[ 1225.880495]  [<ffffffff8108c510>] ? kthread_create_on_node+0x180/0x180
[ 1225.881996]  [<ffffffff816b63fc>] ret_from_fork+0x7c/0xb0
[ 1225.883336]  [<ffffffff8108c510>] ? kthread_create_on_node+0x180/0x180
----------

I killed mutex_lock() and memory allocation from shrinker functions
in drivers/gpu/drm/ttm/ttm_page_alloc[_dma].c because I observed that
kswapd0 was blocked for so long at mutex_lock().

If kswapd0 is blocked forever at e.g. mutex_lock() inside shrinker
functions, who else can make forward progress?

Shouldn't we avoid calling functions which could potentially block for
an unpredictable duration (e.g. unkillable locks and/or completions) from
shrinker functions?
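
For example, a shrinker could back off instead of sleeping, along the
lines of the sketch below (my_cache_lock and my_cache_trim() are
hypothetical; the point is only the mutex_trylock()/SHRINK_STOP
pattern):

----------
static DEFINE_MUTEX(my_cache_lock);

static unsigned long my_scan_objects(struct shrinker *shrink,
				     struct shrink_control *sc)
{
	unsigned long freed;

	/* Never sleep in reclaim context; let someone else proceed. */
	if (!mutex_trylock(&my_cache_lock))
		return SHRINK_STOP;
	freed = my_cache_trim(sc->nr_to_scan);	/* hypothetical helper */
	mutex_unlock(&my_cache_lock);
	return freed;
}
----------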



> IOWs, when you run CONFIG_XFS_DEBUG=y, you'll often get failures
> that are valuable to XFS developers but have no runtime effect on
> production systems.

Oh, I didn't know this failure is specific to CONFIG_XFS_DEBUG=y ...

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-27 12:42                                             ` Tetsuo Handa
@ 2015-02-27 13:12                                               ` Dave Chinner
  2015-03-04 12:41                                                 ` Tetsuo Handa
  0 siblings, 1 reply; 276+ messages in thread
From: Dave Chinner @ 2015-02-27 13:12 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: tytso, rientjes, hannes, mhocko, dchinner, linux-mm, oleg, akpm,
	mgorman, torvalds, fernando_b1

On Fri, Feb 27, 2015 at 09:42:55PM +0900, Tetsuo Handa wrote:
> Dave Chinner wrote:
> > On Wed, Feb 25, 2015 at 11:31:17PM +0900, Tetsuo Handa wrote:
> > > I got two problems (one is a stall at io_schedule()
> > 
> > This is a typical "blame the messenger" bug report. XFS is stuck in
> > inode reclaim waiting for log IO completion to occur, along with all
> > the other processes in xfs_log_force also stuck waiting for the
> > same IO completion.
> 
> I wanted to know whether transaction based reservations can solve these
> problems. Inside the filesystem layer, I guess you can calculate how much
> memory is needed for your filesystem transaction. But I'm wondering
> whether we can calculate how much memory is needed inside the block layer.
> If the block layer fails to reserve memory, won't file I/O fail under
> extreme memory pressure? And if __GFP_NOFAIL were used inside the block
> layer, won't the OOM killer deadlock problem arise?
> 
> > 
> > You need to find where that IO completion that everything is waiting
> > on has got stuck or show that it's not a lost IO and actually an
> > XFS problem. e.g. has the IO stack got stuck on a mempool somewhere?
> > 
> 
> I didn't get a vmcore for this stall. But it seemed to me that
> 
> kworker/3:0H is doing
> 
>   xfs_fs_free_cached_objects()
>   => xfs_reclaim_inodes_nr()
>     => xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan)
>       => xfs_reclaim_inode() because mutex_trylock(&pag->pag_ici_reclaim_lock)
>          was successful
>          => xfs_iunpin_wait(ip) because xfs_ipincount(ip) was non 0
>            => __xfs_iunpin_wait()
>              => waiting inside io_schedule() for somebody to unpin
> 
> kswapd0 is doing
> 
>   xfs_fs_free_cached_objects()
>   => xfs_reclaim_inodes_nr()
>     => xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan)
>       => not calling xfs_reclaim_inode() because
>          mutex_trylock(&pag->pag_ici_reclaim_lock) failed due to kworker/3:0H
>       => SYNC_TRYLOCK is dropped for retry loop due to
> 
>             if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0) {
>                     trylock = 0;
>                     goto restart;
>             }
> 
>       => calling mutex_lock(&pag->pag_ici_reclaim_lock) and gets blocked
>          due to kworker/3:0H
> 
> kworker/3:0H is trying to free memory but somebody needs memory to make
> forward progress. kswapd0 is also trying to free memory but is blocked by
> kworker/3:0H already holding the lock. Since kswapd0 cannot make forward
> progress, somebody can't allocate memory. Finally the system started
> stalling. Is this decoding correct?

Yes. The per-ag lock is a key throttling point for reclaim when
there are many more direct reclaimers than there are allocation
groups. System performance drops badly in low memory conditions if
we have more than one reclaimer operating on an allocation group at
a time as they interfere and contend with each other. Effectively,
multiple reclaimers within the one AG turn ascending-offset-order
inode writeback into random IO, which is orders of magnitude slower
than having a single thread clean and reclaim those same inodes.

Quite simply: if one thread can't make progress because it is stuck
waiting for IO, then another hundred threads trying to do the same
operations are unlikely to make progress, either.

Thing is, the IO layer below XFS that appears to be stuck does
GFP_NOIO allocations, and therefore direct reclaim for mempool
allocation in the block layer cannot get stuck on GFP_FS level
reclaim operations....

> I killed mutex_lock() and memory allocation from shrinker functions
> in drivers/gpu/drm/ttm/ttm_page_alloc[_dma].c because I observed that
> kswapd0 was blocked for so long at mutex_lock().

Which, to me, is fixing a symptom rather than understanding the root
cause of why lower layers are not making progress as they are
supposed to.

> If kswapd0 is blocked forever at e.g. mutex_lock() inside shrinker
> functions, who else can make forward progress?

You can't get into these filesystem shrinkers when you do GFP_NOIO
allocations, as the IO path does.

> Shouldn't we avoid calling functions which could potentially block for
> an unpredictable duration (e.g. unkillable locks and/or completions) from
> shrinker functions?

No, because otherwise we can't throttle allocation and reclaim to
the rate at which IO can clean dirty objects. i.e. we do this for
the same reason we throttle page cache dirtying to the rate at which
we can clean dirty pages....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-23  7:32                                                     ` Dave Chinner
@ 2015-02-27 18:24                                                       ` Vlastimil Babka
  0 siblings, 1 reply; 276+ messages in thread
From: Vlastimil Babka @ 2015-02-27 18:24 UTC (permalink / raw)
  To: Dave Chinner, Andrew Morton
  Cc: Johannes Weiner, Tetsuo Handa, dchinner, oleg, xfs, mhocko,
	linux-mm, mgorman, rientjes, torvalds

On 02/23/2015 08:32 AM, Dave Chinner wrote:
>> > And then there will be an unknown number of
>> > slab allocations of unknown size with unknown slabs-per-page rules
>> > - how many pages needed for them?
> However many pages needed to allocate the number of objects we'll
> consume from the slab.

I think the best way would be if slab could also learn to provide reserves
for individual objects: either just mark internally how many of them are
reserved, if a sufficient number are free, or translate this to the page
allocator reserves, as slab knows which order it uses for the given objects.
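
Expressed as an interface, the idea might look something like this
(entirely hypothetical; no such API exists in slab today):

----------
/* Set aside nr objects of this cache, on an accounting basis. */
int kmem_cache_reserve(struct kmem_cache *s, unsigned long nr);

/* Allocation that is allowed to dip into the per-cache reserve. */
void *kmem_cache_alloc_reserved(struct kmem_cache *s, gfp_t flags);
----------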

>> > And to make it much worse, how
>> > many pages of which orders?  Bless its heart, slub will go and use
>> > a 1-order page for allocations which should have been in 0-order
>> > pages..
> The majority of allocations will be order-0, though if we know that
> they are going to be significant numbers of high order allocations,
> then it should be simple enough to tell the mm subsystem "need a
> reserve of 32 order-0, 4 order-1 and 1 order-3 allocations" and have
> memory compaction just do it's stuff. But, IMO, we should cross that
> bridge when somebody actually needs reservations to be that
> specific....

Note that watermark checking for higher-order allocations is somewhat fuzzy
compared to order-0 checks, but I guess some kind of reservations could work
there too.

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-27 18:24                                                       ` Vlastimil Babka
@ 2015-02-28  0:03                                                         ` Dave Chinner
  0 siblings, 1 reply; 276+ messages in thread
From: Dave Chinner @ 2015-02-28  0:03 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm,
	mgorman, dchinner, rientjes, Andrew Morton, torvalds

On Fri, Feb 27, 2015 at 07:24:34PM +0100, Vlastimil Babka wrote:
> On 02/23/2015 08:32 AM, Dave Chinner wrote:
> >> > And then there will be an unknown number of
> >> > slab allocations of unknown size with unknown slabs-per-page rules
> >> > - how many pages needed for them?
> > However many pages needed to allocate the number of objects we'll
> > consume from the slab.
> 
> I think the best way would be if slab could also learn to provide reserves
> for individual objects: either just mark internally how many of them are
> reserved, if a sufficient number are free, or translate this to the page
> allocator reserves, as slab knows which order it uses for the given objects.

Which is effectively what a slab-based mempool is. Mempools don't
guarantee a reserve is available once it's been resized, however,
and we'd have to have mempools configured for every type of
allocation we are going to do. So from that perspective it's not
really a solution.

Further, the kmalloc heap is backed by slab caches. We do *lots* of
variable-sized kmalloc allocations in transactions, the sizes of which
aren't known until allocation time.  In that case, we have to assume
it's going to be a page per object, because the allocations could
actually be that size.

AFAICT, the worst case is a slab-backing page allocation for
every slab object that is allocated, so we may as well cater for
that case from the start...
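
For concreteness, here is a minimal sketch of the slab-backed mempool
being discussed; the cache name, object size and reserve depth are made
up for illustration:

#include <linux/init.h>
#include <linux/mempool.h>
#include <linux/slab.h>

static struct kmem_cache *foo_item_cache;	/* illustrative name/size */
static mempool_t *foo_item_pool;

static int __init foo_reserve_init(void)
{
	foo_item_cache = kmem_cache_create("foo_item", 256, 0, 0, NULL);
	if (!foo_item_cache)
		return -ENOMEM;

	/*
	 * Pre-fill 16 objects; mempool_alloc() dips into this reserve
	 * only when kmem_cache_alloc() fails under memory pressure.
	 */
	foo_item_pool = mempool_create_slab_pool(16, foo_item_cache);
	if (!foo_item_pool) {
		kmem_cache_destroy(foo_item_cache);
		return -ENOMEM;
	}
	return 0;
}

The objection above is exactly about this: the reserve depth (16 here)
has to be guessed per allocation type, and for variable-sized kmalloc
allocations there is no single cache to build such a pool on.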

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-28  0:03                                                         ` Dave Chinner
@ 2015-02-28 15:17                                                           ` Theodore Ts'o
  -1 siblings, 0 replies; 276+ messages in thread
From: Theodore Ts'o @ 2015-02-28 15:17 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm,
	mgorman, dchinner, rientjes, Andrew Morton, torvalds,
	Vlastimil Babka

On Sat, Feb 28, 2015 at 11:03:59AM +1100, Dave Chinner wrote:
> > I think the best way is if slab could also learn to provide reserves for
> > individual objects. Either just mark internally how many of them are reserved,
> > if sufficient number is free, or translate this to the page allocator reserves,
> > as slab knows which order it uses for the given objects.
> 
> Which is effectively what a slab based mempool is. Mempools don't
> guarantee a reserve is available once it's been resized, however,
> and we'd have to have mempools configured for every type of
> allocation we are going to do. So from that perspective it's not
> really a solution.

The bigger problem is it means that the upper layer which is making
the reservation before it starts taking locks won't necessarily know
exactly which slab objects it and all of the lower layers might need.

So it's much more flexible, and requires less accuracy, if we can just
request that (a) the mm subsystem reserves at least N pages, and (b)
tell it that, at this point in time, it's safe for the requesting
subsystem to block until N pages are available.
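
To make the shape of that interface concrete, a purely hypothetical
sketch -- neither function below exists; the names and semantics are
invented only to illustrate points (a) and (b):

/* Hypothetical interface, for illustration only. */
int mem_reserve(int nr_pages);		/* (a) reserve at least N pages */
int mem_reserve_wait(int nr_pages);	/* (b) block until N pages are
					 * allocatable; only safe while
					 * holding no locks */

/* A filesystem would use it roughly like this: */
static int fs_begin_transaction(int estimated_pages)
{
	int ret;

	/* No locks held yet, so blocking here is safe. */
	ret = mem_reserve_wait(estimated_pages);
	if (ret)
		return ret;	/* e.g. -ENOMEM, before any state changed */

	/*
	 * ... take locks and run the transaction; allocations made
	 * under the reservation are expected to succeed ...
	 */
	return 0;
}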

Can this be guaranteed to be accurate?  No, of course not.  And in
some cases, it may not even be possible to estimate, since it might
depend on whether the iSCSI device needs to reconnect to the target,
or do some sort of exception handling, before it can complete its I/O
request.

But it's better than what we have now, which is that once we've taken
certain locks, and/or started a complex transaction, we can't really
back out, so we end up looping either using GFP_NOFAIL, or around the
memory allocation request if there are still mm developers who are
delusional enough to believe, a la King Canute, that "You must
always be able to handle memory allocation at any point in the kernel
and GFP_NOFAIL is an indication of a subsystem bug!"

I can imagine using some adjustment factors, where a particular
voracious device might require a hint to the file system to boost its
memory allocation estimate by 30%, or 50%.  So yes, it's a very,
*very* rough estimate.  And if we guess wrong, we might end up having
to loop ala GFP_NOFAIL anyway.  But it's better than not having such
an estimate.

I also grant that this doesn't work very well for emergency writeback,
or background writeback, where we can't and shouldn't block waiting
for enough memory to become free, since page cleaning is one of the
ways that we might be able to make memory available.  But if that's
the only problem we have, we're in good shape, since that can be
solved by either (a) doing a better job throttling memory allocations
or memory reservation requests in the first place, and/or (b) starting
the background writeback much more aggressively and earlier.

    	       		      	   		- Ted

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-23  0:45                                                 ` Dave Chinner
@ 2015-02-28 16:29                                                   ` Johannes Weiner
  -1 siblings, 0 replies; 276+ messages in thread
From: Johannes Weiner @ 2015-02-28 16:29 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, akpm, torvalds

On Mon, Feb 23, 2015 at 11:45:21AM +1100, Dave Chinner wrote:
> On Sat, Feb 21, 2015 at 06:52:27PM -0500, Johannes Weiner wrote:
> > On Fri, Feb 20, 2015 at 09:52:17AM +1100, Dave Chinner wrote:
> > > I will actively work around anything that causes filesystem memory
> > > pressure to increase the chance of oom killer invocations. The OOM
> > > killer is not a solution - it is, by definition, a loose cannon and
> > > so we should be reducing dependencies on it.
> > 
> > Once we have a better-working alternative, sure.
> 
> Great, but first a simple request: please stop writing code and
> instead start architecting a solution to the problem. i.e. we need a
> design and have that documented before code gets written. If you
> watched my recent LCA talk, then you'll understand what I mean
> when I say: stop programming and start engineering.

This code was for the sake of argument, see below.

> > > I really don't care about the OOM Killer corner cases - it's
> > > completely the wrong way line of development to be spending time on
> > > and you aren't going to convince me otherwise. The OOM killer is a
> > > crutch used to justify having a memory allocation subsystem that
> > > can't provide forward progress guarantee mechanisms to callers that
> > > need it.
> > 
> > We can provide this.  Are all these callers able to preallocate?
> 
> Anything that allocates in transaction context (and therefore is
> GFP_NOFS by definition) can preallocate at transaction reservation
> time. However, preallocation is dumb, complex, CPU and memory
> intensive and will have a *massive* impact on performance.
> Allocating 10-100 pages to a reserve which we will almost *never
> use* and then free them again *on every single transaction* is a lot
> of unnecessary additional fast path overhead.  Hence a "preallocate
> for every context" reserve pool is not a viable solution.

You are missing the point of my question.  Whether we allocate right
away or make sure the memory is allocatable later on is a matter of
cost, but the logical outcome is the same.  That is not my concern
right now.

An OOM killer allows transactional allocation sites to get away
without planning ahead.  You are arguing that the OOM killer is a
cop-out on the MM side, but I see it as the opposite: it puts a lot of
complexity in the allocator so that callsites can maneuver themselves
into situations where they absolutely need to get memory - or corrupt
user data - without actually making sure their needs will be covered.

If we replace __GFP_NOFAIL + OOM killer with a reserve system, we are
putting the full responsibility on the user.  Are you sure this is
going to reduce our kernel-wide error rate?

> And, really, "reservation" != "preallocation".

That's an implementation detail.  Yes, the example implementation was
dumb and heavy-handed, but a reservation system that works based on
watermarks, and considers clean cache readily allocatable, is not much
more complex than that.

I'm trying to figure out if the current nofail allocators can get
their memory needs figured out beforehand.  And reliably so - what
good are estimates that are right 90% of the time, when failing the
allocation means corrupting user data?  What is the contingency plan?

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-28 16:29                                                   ` Johannes Weiner
@ 2015-02-28 16:41                                                     ` Theodore Ts'o
  -1 siblings, 0 replies; 276+ messages in thread
From: Theodore Ts'o @ 2015-02-28 16:41 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, akpm, torvalds

On Sat, Feb 28, 2015 at 11:29:43AM -0500, Johannes Weiner wrote:
> 
> I'm trying to figure out if the current nofail allocators can get
> their memory needs figured out beforehand.  And reliably so - what
> good are estimates that are right 90% of the time, when failing the
> allocation means corrupting user data?  What is the contingency plan?

In the ideal world, we can figure out the exact memory needs
beforehand.  But we live in an imperfect world, and given that block
devices *also* need memory, the answer is "of course not".  We can't
be perfect.  But we can at least give some kind of hint, and we can offer
to wait before we get into a situation where we need to loop in
GFP_NOWAIT --- which is the contingency/fallback plan.

I'm sure that's not very satisfying, but it's better than what we have
now.

					- Ted

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-23  0:45                                                 ` Dave Chinner
@ 2015-02-28 18:36                                                   ` Vlastimil Babka
  -1 siblings, 0 replies; 276+ messages in thread
From: Vlastimil Babka @ 2015-02-28 18:36 UTC (permalink / raw)
  To: Dave Chinner, Johannes Weiner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, akpm, torvalds

On 23.2.2015 1:45, Dave Chinner wrote:
> On Sat, Feb 21, 2015 at 06:52:27PM -0500, Johannes Weiner wrote:
>> On Fri, Feb 20, 2015 at 09:52:17AM +1100, Dave Chinner wrote:
>>> I will actively work around anything that causes filesystem memory
>>> pressure to increase the chance of oom killer invocations. The OOM
>>> killer is not a solution - it is, by definition, a loose cannon and
>>> so we should be reducing dependencies on it.
>>
>> Once we have a better-working alternative, sure.
> 
> Great, but first a simple request: please stop writing code and
> instead start architecting a solution to the problem. i.e. we need a
> design and have that documented before code gets written. If you
> watched my recent LCA talk, then you'll understand what I mean
> when I say: stop programming and start engineering.

About that... I guess good engineering also means looking at past solutions to
the same problem. I expect there would be a lot of academic work on this, which
might tell us what's (not) possible. And maybe even actual implementations with
real-life experience to learn from?

>>> I really don't care about the OOM Killer corner cases - it's
>>> completely the wrong way line of development to be spending time on
>>> and you aren't going to convince me otherwise. The OOM killer is a
>>> crutch used to justify having a memory allocation subsystem that
>>> can't provide forward progress guarantee mechanisms to callers that
>>> need it.
>>
>> We can provide this.  Are all these callers able to preallocate?
> 
> Anything that allocates in transaction context (and therefore is
> GFP_NOFS by definition) can preallocate at transaction reservation
> time. However, preallocation is dumb, complex, CPU and memory
> intensive and will have a *massive* impact on performance.
> Allocating 10-100 pages to a reserve which we will almost *never
> use* and then free them again *on every single transaction* is a lot
> of unnecessary additional fast path overhead.  Hence a "preallocate
> for every context" reserve pool is not a viable solution.

But won't even the reservation have a potentially large impact on performance
if, as you later suggest (IIUC), we don't actually dip into our reserves until
regular reclaim starts failing? Doesn't that mean a potentially large amount of
wasted memory? Right, it doesn't have to be if we allow clean reclaimable pages
to be part of the reserve, but still...

> And, really, "reservation" != "preallocation".
> 
> Maybe it's my filesystem background, but those two things are vastly
> different things.
> 
> Reservations are simply an *accounting* of the maximum amount of a
> reserve required by an operation to guarantee forwards progress. In
> filesystems, we do this for log space (transactions) and some do it
> for filesystem space (e.g. delayed allocation needs correct ENOSPC
> detection so we don't overcommit disk space).  The VM already has
> such concepts (e.g. watermarks and things like min_free_kbytes) that
> it uses to ensure that there are sufficient reserves for certain
> types of allocations to succeed.
> 
> A reserve memory pool is no different - every time a memory reserve
> occurs, a watermark is lifted to accommodate it, and the transaction
> is not allowed to proceed until the amount of free memory exceeds
> that watermark. The memory allocation subsystem then only allows
> *allocations* marked correctly to allocate pages from the
> reserve that watermark protects. e.g. only allocations using
> __GFP_RESERVE are allowed to dip into the reserve pool.
> 
> By using watermarks, freeing of memory will automatically top
> up the reserve pool which means that we guarantee that reclaimable
> memory allocated for demand paging during transactions doesn't
> deplete the reserve pool permanently.  As a result, when there is
> plenty of free and/or reclaimable memory, the reserve pool
> watermarks will have almost zero impact on performance and
> behaviour.
> 
> Further, because it's just accounting and behavioural thresholds,
> this allows the mm subsystem to control how the reserve pool is
> accounted internally. e.g. clean, reclaimable pages in the page
> cache could serve as reserve pool pages as they can be immediately
> reclaimed for allocation. This could be achieved by setting reclaim
> targets first to the reserve pool watermark, then the second target
> is enough pages to satisfy the current allocation.

Hmm, but what if the clean pages need us to take some locks to unmap, and some
process holding them is blocked... Also, we would need to potentially block a
process that wants to dirty a page; is that being done now?
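
For reference, the accounting described in the quoted proposal might look
roughly like the sketch below -- note that __GFP_RESERVE and the
reserve_pages field are hypothetical, not existing kernel code:

#include <linux/gfp.h>
#include <linux/mmzone.h>

/*
 * Hypothetical sketch of watermark-based reservation accounting;
 * __GFP_RESERVE and zone->reserve_pages do not exist upstream.
 */
static bool zone_watermark_ok_sketch(struct zone *z, unsigned long mark,
				     unsigned long free_pages, gfp_t gfp)
{
	/* Reservations lift the effective watermark for everyone... */
	unsigned long effective_mark = mark + z->reserve_pages;

	/* ...except allocations explicitly marked to dip into them. */
	if (gfp & __GFP_RESERVE)
		effective_mark = mark;

	return free_pages > effective_mark;
}

Freeing memory then "tops up" the reserve automatically, because the
check is against free pages rather than against a separate pool.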

> And, FWIW, there's nothing stopping this mechanism from having
> order-based reserve thresholds. e.g. IB could really do with a 64k reserve
> pool threshold and hence help solve the long standing problems they
> have with filling the receive ring in GFP_ATOMIC context...

I don't know the details here, but if the allocation is done for incoming
packets, i.e. something you can't predict, then how would you set the reserve
for that? If they could predict it, they would be able to preallocate the
necessary buffers already.

> Sure, that's looking further down the track, but my point still
> remains: we need a viable long term solution to this problem. Maybe
> reservations are not the solution, but I don't see anyone else who
> is thinking of how to address this architectural problem at a system
> level right now.  We need to design and document the model first,
> then review it, then we can start working at the code level to
> implement the solution we've designed.

Right. A conference to discuss this at could come in handy :)

> Cheers,
> 
> Dave.
> 

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-28 16:41                                                     ` Theodore Ts'o
@ 2015-02-28 22:15                                                       ` Johannes Weiner
  -1 siblings, 0 replies; 276+ messages in thread
From: Johannes Weiner @ 2015-02-28 22:15 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, akpm, torvalds

On Sat, Feb 28, 2015 at 11:41:58AM -0500, Theodore Ts'o wrote:
> On Sat, Feb 28, 2015 at 11:29:43AM -0500, Johannes Weiner wrote:
> > 
> > I'm trying to figure out if the current nofail allocators can get
> > their memory needs figured out beforehand.  And reliably so - what
> > good are estimates that are right 90% of the time, when failing the
> > allocation means corrupting user data?  What is the contingency plan?
> 
> In the ideal world, we can figure out the exact memory needs
> beforehand.  But we live in an imperfect world, and given that block
> devices *also* need memory, the answer is "of course not".  We can't
> be perfect.  But we can least give some kind of hint, and we can offer
> to wait before we get into a situation where we need to loop in
> GFP_NOWAIT --- which is the contingency/fallback plan.

Overestimating should be fine, the result would be a bit of false memory
pressure.  But underestimating and looping can't be an option or the
original lockups will still be there.  We need to guarantee forward
progress or the problem is somewhat mitigated at best - only now with
quite a bit more complexity in the allocator and the filesystems.

The block code would have to be looked at separately, but doesn't it
already use mempools etc. to guarantee progress?
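
(It does, at least on the core submission path: bio allocation is backed
by a mempool. A sketch of the guarantee being referred to:)

#include <linux/bio.h>

static struct bio *example_alloc_bio(void)
{
	/*
	 * bio_alloc() draws from the fs_bio_set mempool; with a
	 * blockable mask such as GFP_NOIO it may sleep, but it is
	 * guaranteed to eventually succeed.
	 */
	return bio_alloc(GFP_NOIO, 1);
}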

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-28 22:15                                                       ` Johannes Weiner
@ 2015-03-01 11:17                                                         ` Tetsuo Handa
  -1 siblings, 0 replies; 276+ messages in thread
From: Tetsuo Handa @ 2015-03-01 11:17 UTC (permalink / raw)
  To: hannes, tytso
  Cc: dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm,
	fernando_b1, torvalds

Johannes Weiner wrote:
> On Sat, Feb 28, 2015 at 11:41:58AM -0500, Theodore Ts'o wrote:
> > On Sat, Feb 28, 2015 at 11:29:43AM -0500, Johannes Weiner wrote:
> > > 
> > > I'm trying to figure out if the current nofail allocators can get
> > > their memory needs figured out beforehand.  And reliably so - what
> > > good are estimates that are right 90% of the time, when failing the
> > > allocation means corrupting user data?  What is the contingency plan?
> > 
> > In the ideal world, we can figure out the exact memory needs
> > beforehand.  But we live in an imperfect world, and given that block
> > devices *also* need memory, the answer is "of course not".  We can't
> > be perfect.  But we can least give some kind of hint, and we can offer
> > to wait before we get into a situation where we need to loop in
> > GFP_NOWAIT --- which is the contingency/fallback plan.
> 
> Overestimating should be fine, the result would a bit of false memory
> pressure.  But underestimating and looping can't be an option or the
> original lockups will still be there.  We need to guarantee forward
> progress or the problem is somewhat mitigated at best - only now with
> quite a bit more complexity in the allocator and the filesystems.
> 
> The block code would have to be looked at separately, but doesn't it
> already use mempools etc. to guarantee progress?
> 

If underestimating is tolerable, can we simply set different watermark
levels for GFP_ATOMIC / GFP_NOIO / GFP_NOFS / GFP_KERNEL allocations?
For example,

   GFP_KERNEL (or above) can fail if memory usage exceeds 95%
   GFP_NOFS can fail if memory usage exceeds 97%
   GFP_NOIO can fail if memory usage exceeds 98%
   GFP_ATOMIC can fail if memory usage exceeds 99%
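
In allocator terms, that tiering might look like the sketch below;
gfp_fail_threshold() is a hypothetical helper, and the flag tests follow
how GFP_ATOMIC / GFP_NOIO / GFP_NOFS / GFP_KERNEL are composed from
__GFP_WAIT / __GFP_IO / __GFP_FS:

#include <linux/gfp.h>

/* Hypothetical helper illustrating the tiered thresholds above. */
static int gfp_fail_threshold(gfp_t gfp_mask)
{
	if (!(gfp_mask & __GFP_WAIT))
		return 99;	/* GFP_ATOMIC and friends */
	if (!(gfp_mask & __GFP_IO))
		return 98;	/* GFP_NOIO */
	if (!(gfp_mask & __GFP_FS))
		return 97;	/* GFP_NOFS */
	return 95;		/* GFP_KERNEL and above */
}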

I think it is strange that the order-0 GFP_NOIO allocation shown below
enters a retry-forever loop as soon as a GFP_KERNEL (or above) allocation
starts waiting for reclaim. Using the same watermark prevents kernel worker
threads from processing their workqueues. While it is legal to do blocking
operations from a workqueue, being blocked forever amounts to exclusively
occupying the workqueue; other jobs in the workqueue get stuck.

[  907.302050] kworker/1:0     R  running task        0 10832      2 0x00000080
[  907.303961] Workqueue: events_freezable_power_ disk_events_workfn
[  907.305706]  ffff88007c8ab7d8 0000000000000046 ffff88007c8ab8a0 ffff88007c894190
[  907.307761]  0000000000012500 ffff88007c8abfd8 0000000000012500 ffff88007c894190
[  907.309894]  0000000000000020 ffff88007c8ab8b0 0000000000000002 ffffffff81848408
[  907.311949] Call Trace:
[  907.312989]  [<ffffffff8159f814>] _cond_resched+0x24/0x40
[  907.314578]  [<ffffffff81122119>] shrink_slab+0x139/0x150
[  907.316182]  [<ffffffff811252bf>] do_try_to_free_pages+0x35f/0x4d0
[  907.317889]  [<ffffffff811254c4>] try_to_free_pages+0x94/0xc0
[  907.319535]  [<ffffffff8111a793>] __alloc_pages_nodemask+0x4e3/0xa40
[  907.321259]  [<ffffffff8115a8ce>] alloc_pages_current+0x8e/0x100
[  907.322945]  [<ffffffff8125bed6>] bio_copy_user_iov+0x1d6/0x380
[  907.324606]  [<ffffffff8125e4cd>] ? blk_rq_init+0xed/0x160
[  907.326196]  [<ffffffff8125c119>] bio_copy_kern+0x49/0x100
[  907.327788]  [<ffffffff810a14a0>] ? prepare_to_wait_event+0x100/0x100
[  907.329549]  [<ffffffff81265e6f>] blk_rq_map_kern+0x6f/0x130
[  907.331184]  [<ffffffff8116393e>] ? kmem_cache_alloc+0x48e/0x4b0
[  907.332877]  [<ffffffff813a66cf>] scsi_execute+0x12f/0x160
[  907.334452]  [<ffffffff813a7f14>] scsi_execute_req_flags+0x84/0xf0
[  907.336156]  [<ffffffffa01e29cc>] sr_check_events+0xbc/0x2e0 [sr_mod]
[  907.337893]  [<ffffffff8109834c>] ? put_prev_entity+0x2c/0x3b0
[  907.339539]  [<ffffffffa01d6177>] cdrom_check_events+0x17/0x30 [cdrom]
[  907.341289]  [<ffffffffa01e2e5d>] sr_block_check_events+0x2d/0x30 [sr_mod]
[  907.343115]  [<ffffffff812701c6>] disk_check_events+0x56/0x1b0
[  907.344771]  [<ffffffff81270331>] disk_events_workfn+0x11/0x20
[  907.346421]  [<ffffffff8107ceaf>] process_one_work+0x13f/0x370
[  907.348057]  [<ffffffff8107de99>] worker_thread+0x119/0x500
[  907.349650]  [<ffffffff8107dd80>] ? rescuer_thread+0x350/0x350
[  907.351295]  [<ffffffff81082f7c>] kthread+0xdc/0x100
[  907.352765]  [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0
[  907.354520]  [<ffffffff815a383c>] ret_from_fork+0x7c/0xb0
[  907.356097]  [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0

If I change GFP_NOIO in scsi_execute() to GFP_ATOMIC, the above trace goes
away. If we can reserve some amount of memory for the block / filesystem
layers, rather than allowing non-critical allocations to consume it, the
above trace will likely go away as well.

Or, instead, maybe we can change GFP_NOIO to take the following steps, if
we can implement a freelist for GFP_NOIO (ditto for GFP_NOFS):

  (1) try the allocation using GFP_ATOMIC|GFP_NOWARN
  (2) try allocating from the freelist for GFP_NOIO
  (3) fail the allocation with a warning message
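
In code, that fallback chain might look like the sketch below;
noio_freelist_alloc() is a hypothetical helper for the proposed freelist:

#include <linux/gfp.h>

/* Hypothetical: allocate from a dedicated GFP_NOIO freelist. */
static struct page *noio_freelist_alloc(void);

static struct page *alloc_page_noio_sketch(void)
{
	struct page *page;

	/* (1) opportunistic attempt from the atomic reserves */
	page = alloc_page(GFP_ATOMIC | __GFP_NOWARN);
	if (page)
		return page;

	/* (2) hypothetical dedicated freelist for GFP_NOIO callers */
	page = noio_freelist_alloc();
	if (page)
		return page;

	/* (3) fail, with a warning so the problem is visible */
	WARN_ONCE(1, "GFP_NOIO freelist exhausted\n");
	return NULL;
}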

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-28 22:15                                                       ` Johannes Weiner
@ 2015-03-01 13:43                                                         ` Theodore Ts'o
  -1 siblings, 0 replies; 276+ messages in thread
From: Theodore Ts'o @ 2015-03-01 13:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, akpm, torvalds

On Sat, Feb 28, 2015 at 05:15:58PM -0500, Johannes Weiner wrote:
> Overestimating should be fine, the result would a bit of false memory
> pressure.  But underestimating and looping can't be an option or the
> original lockups will still be there.  We need to guarantee forward
> progress or the problem is somewhat mitigated at best - only now with
> quite a bit more complexity in the allocator and the filesystems.

We've lived with looping as it is and in practice it's actually worked
well.  I can only speak for ext4, but I do a lot of testing under very
high memory pressure situations, and it is used in *production* under
very high stress situations --- and the only time we've run into
trouble is when the looping behaviour somehow got accidentally
*removed*.

There have been MM experts who have been worrying about this situation
for a very long time, but honestly, it seems to be much more of a
theoretical than actual concern.  So if you don't want to get
hints/estimates about how much memory the file system is about to use,
when the file system is willing to wait or even potentially return
ENOMEM (although I suspect starting to return ENOMEM where most user
space applications don't expect it will cause more problems), I'm
personally happy to just use GFP_NOFAIL everywhere --- or to hard-code
my own infinite loops if the MM developers want to take GFP_NOFAIL
away.  Because in my experience, looping simply hasn't been as awful
as some folks on this thread have made it out to be.
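
(For the record, the open-coded loop in question is essentially this
pattern -- a sketch of roughly what XFS's kmem_alloc() already does when
asked not to fail:)

#include <linux/slab.h>
#include <linux/backing-dev.h>

static void *alloc_nofail_sketch(size_t size)
{
	void *ptr;

	/* Retry forever, backing off briefly between attempts. */
	while (!(ptr = kmalloc(size, GFP_NOFS | __GFP_NOWARN)))
		congestion_wait(BLK_RW_ASYNC, HZ / 50);
	return ptr;
}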

So if you don't like the complexity because the perfect is the enemy
of the good, we can just drop this and the file systems can simply
continue to loop around their memory allocation calls...  or if that
fails we can start adding subsystem-specific mempools, which would be
even more wasteful of memory and probably at least as complicated.

							- Ted

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-01 13:43                                                         ` Theodore Ts'o
@ 2015-03-01 16:15                                                           ` Johannes Weiner
  -1 siblings, 0 replies; 276+ messages in thread
From: Johannes Weiner @ 2015-03-01 16:15 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, akpm, torvalds

On Sun, Mar 01, 2015 at 08:43:22AM -0500, Theodore Ts'o wrote:
> On Sat, Feb 28, 2015 at 05:15:58PM -0500, Johannes Weiner wrote:
> > Overestimating should be fine, the result would a bit of false memory
> > pressure.  But underestimating and looping can't be an option or the
> > original lockups will still be there.  We need to guarantee forward
> > progress or the problem is somewhat mitigated at best - only now with
> > quite a bit more complexity in the allocator and the filesystems.
> 
> We've lived with looping as it is and in practice it's actually worked
> well.  I can only speak for ext4, but I do a lot of testing under very
> high memory pressure situations, and it is used in *production* under
> very high stress situations --- and the only time we've run into
> trouble is when the looping behaviour somehow got accidentally
> *removed*.
> 
> There have been MM experts who have been worrying about this situation
> for a very long time, but honestly, it seems to be much more of a
> theoretical than actual concern.

Well, looping is a valid thing to do in most situations because on a
loaded system there is a decent chance that an unrelated thread will
volunteer some unreclaimable memory, or exit altogether.  Right now,
we rely on this happening, and it works most of the time.  Maybe all
the time, depending on how your machine is used.  But when it doesn't,
machines do lock up in practice.

We had these lockups in cgroups with just a handful of threads, which
all got stuck in the allocator and there was nobody left to volunteer
unreclaimable memory.  When this was being addressed, we knew that the
same can theoretically happen on the system-level but weren't aware of
any reports.  Well now, here we are.

It's been argued in this thread that systems shouldn't be pushed to
such extremes in real life and that we simply expect failure at some
point.  If that's the consensus, then yes, we can stop this and tell
users that they should scale back.  But I'm not convinced just yet
that this is the best we can do.

> So if you don't want to get hints/estimates about how much memory
> the file system is about to use, when the file system is willing to
> wait or even potentially return ENOMEM (although I suspect starting
> to return ENOMEM where most user space application don't expect it
> will cause more problems), I'm personally happy to just use
> GFP_NOFAIL everywhere --- or to hard code my own infinite loops if
> the MM developers want to take GFP_NOFAIL away.  Because in my
> experience, looping simply hasn't been as awful as some folks on
> this thread have made it out to be.

As I've said before, I'd be happy to get estimates from the filesystem
so that we can adjust our reserves, instead of simply running against
the wall at some point and hoping that the OOM killer heuristics will
save the day.

Until then, I'd much prefer __GFP_NOFAIL over open-coded loops.  If
the OOM killer is too aggressive, we can tone it down, but as it
stands that mechanism is the last attempt at forward progress if
looping doesn't work out.  In addition, when we finally transition to
private memory reserves, we can easily find the callsites that need to
be annotated with __GFP_MAY_DIP_INTO_PRIVATE_RESERVES.

> So if you don't like the complexity because the perfect is the enemy
> of the good, we can just drop this and the file systems can simply
> continue to loop around their memory allocation calls...  or if that
> fails we can start adding subsystem specific mempools, which would be
> even more wasteful of memory and probably at least as complicated.

It really depends on what the goal here is.  You don't have to be
perfectly accurate, but if you can give us a worst-case estimate we
can actually guarantee forward progress and eliminate these lockups
entirely, like in the block layer.  Sure, there will be bugs and the
estimates won't be right from the start, but we can converge towards
the right answer.  If the allocations which are allowed to dip into
the reserves - the current nofail sites? - can be annotated with a gfp
flag, we can easily verify the estimates by serving those sites
exclusively from the private reserve pool and emit warnings when that
runs dry.  We wouldn't even have to stress the system for that.
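
A minimal sketch of that accounting, assuming a hypothetical
__GFP_RESERVE annotation and a pool sized from the filesystems'
estimates (neither exists today):

	/* Hypothetical: pages set aside according to the fs estimates. */
	#define RESERVE_ESTIMATE_PAGES	1024

	static atomic_long_t reserve_pages =
			ATOMIC_LONG_INIT(RESERVE_ESTIMATE_PAGES);

	static struct page *reserve_alloc(gfp_t gfp, unsigned int order)
	{
		/* __GFP_RESERVE is a made-up annotation for nofail sites */
		if ((gfp & __GFP_RESERVE) &&
		    atomic_long_sub_return(1L << order, &reserve_pages) < 0)
			WARN_ONCE(1, "nofail estimate too low, reserve ran dry\n");

		return alloc_pages(gfp, order);
	}

	static void reserve_free(struct page *page, unsigned int order,
				 gfp_t gfp)
	{
		if (gfp & __GFP_RESERVE)
			atomic_long_add(1L << order, &reserve_pages);
		__free_pages(page, order);
	}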

But there are legitimate concerns that this might never work.  For
example, the requirements could be so unpredictable, or assessing them
with reasonable accuracy could be so expensive, that the margin of
error would make the worst case estimate too big to be useful.  Big
enough that the reserves would harm well-behaved systems.  And if
useful worst-case estimates are unattainable, I don't think we need to
bother with reserves.  We can just stick with looping and OOM killing,
that works most of the time, too.

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-01 16:15                                                           ` Johannes Weiner
@ 2015-03-01 19:36                                                             ` Theodore Ts'o
  -1 siblings, 0 replies; 276+ messages in thread
From: Theodore Ts'o @ 2015-03-01 19:36 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, akpm, torvalds

On Sun, Mar 01, 2015 at 11:15:06AM -0500, Johannes Weiner wrote:
> 
> We had these lockups in cgroups with just a handful of threads, which
> all got stuck in the allocator and there was nobody left to volunteer
> unreclaimable memory.  When this was being addressed, we knew that the
> same can theoretically happen on the system-level but weren't aware of
> any reports.  Well now, here we are.

I think the "few threads in a small cgroup" problem is a little
different, because in those cases very often the global system has
enough memory, and there is always the possibility that we might relax
the memory cgroup guarantees a little in order to allow forward
progress.

In fact, arguably this *is* the right thing to do, because we have
situations where (a) the VFS takes the directory mutex, (b) the
directory blocks have been pushed out of memory, and so (c) a system
call running in a container with a small amount of memory and/or a small
amount of disk bandwidth allowed via its proportional I/O settings ends up
taking a very long time for the directory blocks to be read into
memory.  If a high priority process, like say a cluster management
daemon, also tries to read the same directory, it can end up
stalled for long enough for the software watchdog to take out the
entire machine from the cluster.

The hard problem here is that the lock is taken by the VFS, *before*
it calls into the file system specific layer, and so the VFS has no
idea (a) how much memory or disk bandwidth it needs, and (b) whether
it needs any memory or disk bandwidth in the first place in order to
service a directory lookup operation (most of the time, it doesn't).
So there may be situations where, in the restricted cgroup, it would be
useful for the file system to be able to say, "you know, we're holding
onto a lock and the fact that the disk controller is going to force
this low priority cgroup to wait over a minute for the I/O to even be
queued out to the disk, maybe we should make an exception and bust the
disk controller cgroup cap".

(There is a related problem where a cgroup with a low disk bandwidth
quota is slowing down writeback, and we are desperately short on
global memory, and where relaxing the disk bandwidth limit via some
kind of priority inheritance scheme would prevent "innocent" high-
priority cgroups from having some of their processes get OOM-killed.
I suppose one could claim that the high priority cgroups tend to
belong to the sysadmin, who set the stupid disk bandwidth caps in the
first place, so there is a certain justice in having the high priority
processes getting OOM killed, but still, it would be nice if we could
do the right thing automatically.)


But in any case, some of these workarounds, where we relax a
particularly tightly constrained cgroup limit, are obviously not going
to help when the entire system is low on memory.

> It really depends on what the goal here is.  You don't have to be
> perfectly accurate, but if you can give us a worst-case estimate we
> can actually guarantee forward progress and eliminate these lockups
> entirely, like in the block layer.  Sure, there will be bugs and the
> estimates won't be right from the start, but we can converge towards
> the right answer.  If the allocations which are allowed to dip into
> the reserves - the current nofail sites? - can be annotated with a gfp
> flag, we can easily verify the estimates by serving those sites
> exclusively from the private reserve pool and emit warnings when that
> runs dry.  We wouldn't even have to stress the system for that.
> 
> But there are legitimate concerns that this might never work.  For
> example, the requirements could be so unpredictable, or assessing them
> with reasonable accuracy could be so expensive, that the margin of
> error would make the worst case estimate too big to be useful.  Big
> enough that the reserves would harm well-behaved systems.  And if
> useful worst-case estimates are unattainable, I don't think we need to
> bother with reserves.  We can just stick with looping and OOM killing,
> that works most of the time, too.

I'm not sure that you want to reserve for the worst-case.  What might
work is if subsystems (probably primarily file systems) give you
estimates for the usual case and the worst case, and you reserve for
something in between these two bounds.  In practice there will be
a huge number of file system operations taking place in your typical
super-busy system, and if you reserve for the worst case, it probably
will be too much.  We need to make sure there is enough memory
available for some forward progress, and if we need to stall a few
operations with some sleeping loops, so be it.  So the "heads up"
estimates don't have to be strict reservations in the sense
that the memory will be available instantly without any sleeping or
looping.
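
Concretely, the sizing could be as simple as interpolating between the
two bounds (the weighting below is made up):

	/* Reserve somewhere between the usual and the worst case. */
	static unsigned long pick_reserve(unsigned long usual_pages,
					  unsigned long worst_pages)
	{
		/* e.g. the usual case plus a quarter of the slack */
		return usual_pages + (worst_pages - usual_pages) / 4;
	}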

I would also suggest that "reservations" be tied to a task struct and
not to some magic __GFP_* flag, since it's not just allocations done
by the file system, but also by the block device drivers, and if
certain write operations fail, the results will be catastrophic -- and
the block device can't tell the difference between an I/O operation
that must succeed (or we declare the file system as needing manual
recovery and potentially reboot the entire system) and an I/O operation
whose failure could be handled by reflecting ENOMEM back up to
userspace.  The difference is a property of the call stack, so the
simplest way of handling this is to store the reservation in the task
struct, and let
the reservation get automatically returned to the system when a
particular process makes a transition from kernel space to user space.
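
A sketch of that lifecycle, with a hypothetical reserve_pages field in
the task struct and a made-up global pool (none of this is an existing
interface):

	#define GLOBAL_RESERVE_PAGES	4096UL	/* assumed pool size */

	static atomic_long_t global_reserve =
			ATOMIC_LONG_INIT(GLOBAL_RESERVE_PAGES);

	/* assumes "unsigned long reserve_pages;" added to task_struct */
	static int task_mem_reserve(struct task_struct *tsk,
				    unsigned long pages)
	{
		if (atomic_long_sub_return(pages, &global_reserve) < 0) {
			atomic_long_add(pages, &global_reserve);
			return -ENOMEM;	/* or sleep and retry */
		}
		tsk->reserve_pages += pages;
		return 0;
	}

	/* on the kernel->user transition, give back what is left */
	static void task_mem_unreserve(struct task_struct *tsk)
	{
		atomic_long_add(tsk->reserve_pages, &global_reserve);
		tsk->reserve_pages = 0;
	}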

The bottom line is that I agree that looping and OOM-killing works
most of the time, and so I'm happy with something that makes life a
little bit better and a little bit more predictable for the VM, if
that makes the system behave a bit more smoothly under high memory
pressures.  But at the same time, we don't want to make things too
complicated; whether that means that we don't try to achieve
perfection, or simply not worry about the global memory pressure
situation, and instead try to think about other solutions to handle
the "small number of threads in a container, and try to OOM kill a bit
less frequently, and instead force it to loop/sleep for a bit, and
then let a random foreground kernel thread in the container allow to
"borrow" a small amount of memory to hopefully let it make forward
progress, especially if it is holding locks, or is in the process of
exiting, etc.

						- Ted

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-01 13:43                                                         ` Theodore Ts'o
@ 2015-03-01 20:17                                                           ` Johannes Weiner
  -1 siblings, 0 replies; 276+ messages in thread
From: Johannes Weiner @ 2015-03-01 20:17 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, akpm, torvalds

On Sun, Mar 01, 2015 at 08:43:22AM -0500, Theodore Ts'o wrote:
> On Sat, Feb 28, 2015 at 05:15:58PM -0500, Johannes Weiner wrote:
> > Overestimating should be fine, the result would be a bit of false memory
> > pressure.  But underestimating and looping can't be an option or the
> > original lockups will still be there.  We need to guarantee forward
> > progress or the problem is somewhat mitigated at best - only now with
> > quite a bit more complexity in the allocator and the filesystems.
> 
> We've lived with looping as it is and in practice it's actually worked
> well.  I can only speak for ext4, but I do a lot of testing under very
> high memory pressure situations, and it is used in *production* under
> very high stress situations --- and the only time we've run into
> trouble is when the looping behaviour somehow got accidentally
> *removed*.

Memory is a finite resource and there are (unlimited) consumers that
do not allow their share to be reclaimed/recycled.  Mainly this is the
kernel itself, but it also includes anon memory once swap space runs
out, as well as mlocked and dirty memory.  It's not a question of
whether there exists a true point of OOM (where not enough memory is
recyclable to satisfy new allocations).  That point inevitably exists.
It's a policy question of how to inform userspace once it is reached.

We agree that we can't unconditionally fail allocations, because we
might be in the middle of a transaction, where an allocation failure
can potentially corrupt userdata.  However, endlessly looping for
progress that can not happen at this point has the exact same effect:
the transaction won't finish.  Only the machine locks up in addition.
It's great that your setups don't ever truly go out of memory, but
that doesn't mean it can't happen in practice.

One answer to users at this point could certainly be to stay away from
the true point of OOM, and if you don't then that's your problem.  But
the issue I take with this answer is that, for the sake of memory
utilization, users kind of do want to get fairly close to this point,
and at the same time it's hard to reliably predict the memory
consumption of a workload in advance.  It can depend on the timing
between threads, it can depend on user/network-supplied input, and it
can simply be a bug in the application.  And if that OOM situation is
accidentally entered, I'd prefer we had a better answer than locking
up the machine and blaming the user.

So one attempt to make progress in this situation is to kill userspace
applications that are pinning unreclaimable memory.  This is what we
are doing now, but there are several problems with it.  For one, we
are doing a terrible job and might still get stuck sometimes, which
deteriorates the situation back to failing the allocation and
corrupting the filesystem.  Secondly, killing tasks is disruptive, and
because it's driven by heuristics we're never going to kill the
"right" one in all situations.

Reserves would allow us to look ahead and avoid starting transactions
that can not be finished given the available resources.  So we are at
least avoiding filesystem corruption.  The tasks could probably be put
to sleep for some time in the hope that ongoing transactions complete
and release memory, but there might not be any, and eventually the OOM
situation has to be communicated to userspace.  Arguably, an -ENOMEM
from a syscall at this point might be easier to handle than a SIGKILL
from the OOM killer in an unrelated task.
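
As an illustration, the look-ahead could sit at transaction start,
reusing something like the task_mem_reserve() sketch from earlier in
the thread (the entry point and the estimate are hypothetical):

	static int fs_trans_start(unsigned long estimate_pages)
	{
		int error;

		/* refuse to start a transaction we cannot finish */
		error = task_mem_reserve(current, estimate_pages);
		if (error) {
			/* give running transactions a moment to retire */
			schedule_timeout_killable(HZ / 10);
			error = task_mem_reserve(current, estimate_pages);
		}
		/* on failure, -ENOMEM reaches the syscall, not SIGKILL */
		return error;
	}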

So if we could pull off reserves, they look like the most attractive
solution to me.  If not, the OOM killer needs to be fixed to always
make forward progress instead.  I proposed a patch for that already.
But infinite loops that force the user to reboot the machine at the
point of OOM seem like a terrible policy.

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-01 19:36                                                             ` Theodore Ts'o
@ 2015-03-01 20:44                                                               ` Johannes Weiner
  -1 siblings, 0 replies; 276+ messages in thread
From: Johannes Weiner @ 2015-03-01 20:44 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, akpm, torvalds

On Sun, Mar 01, 2015 at 02:36:35PM -0500, Theodore Ts'o wrote:
> On Sun, Mar 01, 2015 at 11:15:06AM -0500, Johannes Weiner wrote:
> > 
> > We had these lockups in cgroups with just a handful of threads, which
> > all got stuck in the allocator and there was nobody left to volunteer
> > unreclaimable memory.  When this was being addressed, we knew that the
> > same can theoretically happen on the system-level but weren't aware of
> > any reports.  Well now, here we are.
> 
> I think the "few threads in a small cgroup" problem is a little
> different, because in those cases very often the global system has
> enough memory, and there is always the possibility that we might relax
> the memory cgroup guarantees a little in order to allow forward
> progress.

That's exactly how we fixed it.  __GFP_NOFAIL allocations are allowed to simply
bypass the cgroup memory limits when reclaim within the group fails to
make room for the allocation.  I'm just mentioning that because the
global case doesn't have the same out, but is susceptible to the same
deadlock situation when there are no other threads volunteering pages.
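
The shape of that fix, paraphrased rather than quoted from
mm/memcontrol.c (the helpers below are stand-ins for the real charge
and reclaim machinery):

	static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
			      unsigned int nr_pages)
	{
		int retries = 5;

		while (charge_exceeds_limit(memcg, nr_pages) && retries--)
			reclaim_from_group(memcg, gfp_mask, nr_pages);

		if (charge_exceeds_limit(memcg, nr_pages) &&
		    !(gfp_mask & __GFP_NOFAIL))
			return -ENOMEM;

		/* nofail: charge anyway, temporarily breaching the limit */
		commit_charge(memcg, nr_pages);
		return 0;
	}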

If your machines are loaded with hundreds or thousands of threads, it
is likely that a thread stuck in the allocator will be bailed out by
the other threads in the system (or that you run into CPU limits
first), but if you have only a handful of memory-intensive tasks, this
might not be the case.  The cgroup problem was closer to that second
scenario, where few threads split all available memory between them.

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-28 22:15                                                       ` Johannes Weiner
@ 2015-03-01 21:48                                                         ` Dave Chinner
  -1 siblings, 0 replies; 276+ messages in thread
From: Dave Chinner @ 2015-03-01 21:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Theodore Ts'o, Tetsuo Handa, dchinner, oleg, xfs, mhocko,
	linux-mm, mgorman, rientjes, akpm, torvalds

On Sat, Feb 28, 2015 at 05:15:58PM -0500, Johannes Weiner wrote:
> On Sat, Feb 28, 2015 at 11:41:58AM -0500, Theodore Ts'o wrote:
> > On Sat, Feb 28, 2015 at 11:29:43AM -0500, Johannes Weiner wrote:
> > > 
> > > I'm trying to figure out if the current nofail allocators can get
> > > their memory needs figured out beforehand.  And reliably so - what
> > > good are estimates that are right 90% of the time, when failing the
> > > allocation means corrupting user data?  What is the contingency plan?
> > 
> > In the ideal world, we can figure out the exact memory needs
> > beforehand.  But we live in an imperfect world, and given that block
> > devices *also* need memory, the answer is "of course not".  We can't
> > be perfect.  But we can at least give some kind of hint, and we can offer
> > to wait before we get into a situation where we need to loop in
> > GFP_NOWAIT --- which is the contingency/fallback plan.
> 
> Overestimating should be fine, the result would be a bit of false memory
> pressure.  But underestimating and looping can't be an option or the
> original lockups will still be there.  We need to guarantee forward
> progress or the problem is somewhat mitigated at best - only now with
> quite a bit more complexity in the allocator and the filesystems.

The additional complexity in XFS is actually quite minor, and
initial "rough worst case" memory usage estimates are not that hard
to measure....

> The block code would have to be looked at separately, but doesn't it
> already use mempools etc. to guarantee progress?

Yes, it does. I'm not concerned about the block layer.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-01 21:48                                                         ` Dave Chinner
@ 2015-03-02  0:17                                                           ` Dave Chinner
  -1 siblings, 0 replies; 276+ messages in thread
From: Dave Chinner @ 2015-03-02  0:17 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Theodore Ts'o, Tetsuo Handa, rientjes, oleg, xfs, mhocko,
	linux-mm, mgorman, dchinner, akpm, torvalds

On Mon, Mar 02, 2015 at 08:48:05AM +1100, Dave Chinner wrote:
> On Sat, Feb 28, 2015 at 05:15:58PM -0500, Johannes Weiner wrote:
> > On Sat, Feb 28, 2015 at 11:41:58AM -0500, Theodore Ts'o wrote:
> > > On Sat, Feb 28, 2015 at 11:29:43AM -0500, Johannes Weiner wrote:
> > > > 
> > > > I'm trying to figure out if the current nofail allocators can get
> > > > their memory needs figured out beforehand.  And reliably so - what
> > > > good are estimates that are right 90% of the time, when failing the
> > > > allocation means corrupting user data?  What is the contingency plan?
> > > 
> > > In the ideal world, we can figure out the exact memory needs
> > > beforehand.  But we live in an imperfect world, and given that block
> > > devices *also* need memory, the answer is "of course not".  We can't
> > > be perfect.  But we can at least give some kind of hint, and we can offer
> > > to wait before we get into a situation where we need to loop in
> > > GFP_NOWAIT --- which is the contingency/fallback plan.
> > 
> > Overestimating should be fine, the result would be a bit of false memory
> > pressure.  But underestimating and looping can't be an option or the
> > original lockups will still be there.  We need to guarantee forward
> > progress or the problem is somewhat mitigated at best - only now with
> > quite a bit more complexity in the allocator and the filesystems.
> 
> The additional complexity in XFS is actually quite minor, and
> initial "rough worst case" memory usage estimates are not that hard
> to measure....

And, just to point out that the OOM killer can be invoked without a
single transaction-based filesystem ENOMEM failure, here's what
xfs/084 does on 4.0-rc1:

[  148.820369] resvtest invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[  148.822113] resvtest cpuset=/ mems_allowed=0
[  148.823124] CPU: 0 PID: 4342 Comm: resvtest Not tainted 4.0.0-rc1-dgc+ #825
[  148.824648] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[  148.826471]  0000000000000000 ffff88003ba2b988 ffffffff81dcb570 000000000000000c
[  148.828220]  ffff88003bb06380 ffff88003ba2ba08 ffffffff81dc5c2f 0000000000000000
[  148.829958]  0000000000000000 ffff88003ba2b9a8 0000000000000206 ffff88003ba2b9d8
[  148.831734] Call Trace:
[  148.832325]  [<ffffffff81dcb570>] dump_stack+0x4c/0x65
[  148.833493]  [<ffffffff81dc5c2f>] dump_header.isra.12+0x79/0x1cb
[  148.834855]  [<ffffffff8117db69>] oom_kill_process+0x1c9/0x3b0
[  148.836195]  [<ffffffff810a7105>] ? has_capability_noaudit+0x25/0x40
[  148.837633]  [<ffffffff8117e0c5>] __out_of_memory+0x315/0x500
[  148.838925]  [<ffffffff8117e44b>] out_of_memory+0x5b/0x80
[  148.840162]  [<ffffffff811830d9>] __alloc_pages_nodemask+0x7d9/0x810
[  148.841592]  [<ffffffff811c0531>] alloc_pages_current+0x91/0x100
[  148.842950]  [<ffffffff8117a427>] __page_cache_alloc+0xa7/0xc0
[  148.844286]  [<ffffffff8117c688>] filemap_fault+0x1b8/0x420
[  148.845545]  [<ffffffff811a05ed>] __do_fault+0x3d/0x70
[  148.846706]  [<ffffffff811a4478>] handle_mm_fault+0x988/0x1230
[  148.848042]  [<ffffffff81090305>] __do_page_fault+0x1a5/0x460
[  148.849333]  [<ffffffff81090675>] trace_do_page_fault+0x45/0x130
[  148.850681]  [<ffffffff8108b8ce>] do_async_page_fault+0x1e/0xd0
[  148.852025]  [<ffffffff81dd1567>] ? schedule+0x37/0x90
[  148.853187]  [<ffffffff81dd8b88>] async_page_fault+0x28/0x30
[  148.854456] Mem-Info:
[  148.854986] Node 0 DMA per-cpu:
[  148.855727] CPU    0: hi:    0, btch:   1 usd:   0
[  148.856820] Node 0 DMA32 per-cpu:
[  148.857600] CPU    0: hi:  186, btch:  31 usd:   0
[  148.858688] active_anon:119251 inactive_anon:119329 isolated_anon:0
[  148.858688]  active_file:19 inactive_file:2 isolated_file:0
[  148.858688]  unevictable:0 dirty:0 writeback:0 unstable:0
[  148.858688]  free:1965 slab_reclaimable:2816 slab_unreclaimable:2184
[  148.858688]  mapped:3 shmem:2 pagetables:1259 bounce:0
[  148.858688]  free_cma:0
[  148.865606] Node 0 DMA free:3916kB min:60kB low:72kB high:88kB active_anon:5100kB inactive_anon:5324kB active_file:0kB inactive_file:8kB unevictable:0kB isolated(as
[  148.874431] lowmem_reserve[]: 0 966 966 966
[  148.875504] Node 0 DMA32 free:3944kB min:3944kB low:4928kB high:5916kB active_anon:471904kB inactive_anon:471992kB active_file:76kB inactive_file:0kB unevictable:0s
[  148.884817] lowmem_reserve[]: 0 0 0 0
[  148.885770] Node 0 DMA: 1*4kB (M) 1*8kB (U) 2*16kB (UM) 3*32kB (UM) 1*64kB (M) 1*128kB (M) 0*256kB 1*512kB (M) 1*1024kB (M) 1*2048kB (R) 0*4096kB = 3916kB
[  148.889385] Node 0 DMA32: 8*4kB (UEM) 2*8kB (UR) 3*16kB (M) 1*32kB (M) 2*64kB (MR) 1*128kB (R) 0*256kB 1*512kB (R) 1*1024kB (R) 1*2048kB (R) 0*4096kB = 3968kB
[  148.893068] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  148.894949] 47361 total pagecache pages
[  148.895816] 47334 pages in swap cache
[  148.896657] Swap cache stats: add 124669, delete 77335, find 83/169
[  148.898057] Free swap  = 0kB
[  148.898714] Total swap = 497976kB
[  148.899470] 262044 pages RAM
[  148.900145] 0 pages HighMem/MovableOnly
[  148.901006] 10253 pages reserved
[  148.901735] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[  148.903637] [ 1204]     0  1204     6039        1      15       3      163         -1000 udevd
[  148.905571] [ 1323]     0  1323     6038        1      14       3      165         -1000 udevd
[  148.907499] [ 1324]     0  1324     6038        1      14       3      164         -1000 udevd
[  148.909439] [ 2176]     0  2176     2524        0       6       2      571             0 dhclient
[  148.911427] [ 2227]     0  2227     9267        0      22       3       95             0 rpcbind
[  148.913392] [ 2632]     0  2632    64981       30      29       3      136             0 rsyslogd
[  148.915391] [ 2686]     0  2686     1062        1       6       3       36             0 acpid
[  148.917325] [ 2826]     0  2826     4753        0      12       2       44             0 atd
[  148.919209] [ 2877]     0  2877     6473        0      17       3       66             0 cron
[  148.921120] [ 2911]   104  2911     7078        1      17       3       81             0 dbus-daemon
[  148.923150] [ 3591]     0  3591    13731        0      28       2      165         -1000 sshd
[  148.925073] [ 3603]     0  3603    22024        0      43       2      215             0 winbindd
[  148.927066] [ 3612]     0  3612    22024        0      42       2      216             0 winbindd
[  148.929062] [ 3636]     0  3636     3722        1      11       3       41             0 getty
[  148.930981] [ 3637]     0  3637     3722        1      11       3       40             0 getty
[  148.932915] [ 3638]     0  3638     3722        1      11       3       39             0 getty
[  148.934835] [ 3639]     0  3639     3722        1      11       3       40             0 getty
[  148.936789] [ 3640]     0  3640     3722        1      11       3       40             0 getty
[  148.938704] [ 3641]     0  3641     3722        1      10       3       38             0 getty
[  148.940635] [ 3642]     0  3642     3677        1      11       3       40             0 getty
[  148.942550] [ 3643]     0  3643    25894        2      52       2      248             0 sshd
[  148.944469] [ 3649]     0  3649   146652        1      35       4      320             0 console-kit-dae
[  148.946578] [ 3716]     0  3716    48287        1      31       4      171             0 polkitd
[  148.948552] [ 3722]  1000  3722    25894        0      51       2      250             0 sshd
[  148.950457] [ 3723]  1000  3723     5435        3      15       3      495             0 bash
[  148.952375] [ 3742]     0  3742    17157        1      37       2      160             0 sudo
[  148.954275] [ 3743]     0  3743     3365        1      11       3      516             0 check
[  148.956229] [ 4130]     0  4130     3334        1      11       3      484             0 084
[  148.958108] [ 4342]     0  4342   314556   191159     619       4   119808             0 resvtest
[  148.960104] [ 4343]     0  4343     3334        0      11       3      485             0 084
[  148.961990] [ 4344]     0  4344     3334        0      11       3      485             0 084
[  148.963876] [ 4345]     0  4345     3305        0      11       3       36             0 sed
[  148.965766] [ 4346]     0  4346     3305        0      11       3       37             0 sed
[  148.967652] Out of memory: Kill process 4342 (resvtest) score 803 or sacrifice child
[  148.969390] Killed process 4342 (resvtest) total-vm:1258224kB, anon-rss:764636kB, file-rss:0kB
[  149.415288] XFS (vda): Unmounting Filesystem
[  150.211229] XFS (vda): Mounting V5 Filesystem
[  150.292092] XFS (vda): Ending clean mount
[  150.342307] XFS (vda): Unmounting Filesystem
[  150.346522] XFS (vdb): Unmounting Filesystem
[  151.264135] XFS: kmalloc allocations by trans type
[  151.265195] XFS: 3: count 7, bytes 3992, fails 0, max_size 1024
[  151.266479] XFS: 4: count 3, bytes 400, fails 0, max_size 144
[  151.267735] XFS: 7: count 9, bytes 2784, fails 0, max_size 536
[  151.269022] XFS: 16: count 1, bytes 696, fails 0, max_size 696
[  151.270286] XFS: 26: count 1, bytes 384, fails 0, max_size 384
[  151.271550] XFS: 35: count 1, bytes 696, fails 0, max_size 696
[  151.272833] XFS: slab allocations by trans type
[  151.273818] XFS: 3: count 22, bytes 0, fails 0, max_size 0
[  151.275010] XFS: 4: count 13, bytes 0, fails 0, max_size 0
[  151.276212] XFS: 7: count 12, bytes 0, fails 0, max_size 0
[  151.277406] XFS: 15: count 2, bytes 0, fails 0, max_size 0
[  151.278595] XFS: 16: count 10, bytes 0, fails 0, max_size 0
[  151.279854] XFS: 18: count 2, bytes 0, fails 0, max_size 0
[  151.281080] XFS: 26: count 3, bytes 0, fails 0, max_size 0
[  151.282275] XFS: 35: count 2, bytes 0, fails 0, max_size 0
[  151.283476] XFS: vmalloc allocations by trans type
[  151.284535] XFS: page allocations by trans type

Those XFS allocation stats are the largest measured allocations done
under transaction context, broken down by allocation and transaction
type.  No failures that would result in looping, even though the
system invoked the OOM killer on a filesystem workload....

I need to break the slab allocations down further by cache (other
workloads are generating over 50 slab allocations per transaction),
but another hour's work and a few days of observation of the stats
in my normal day-to-day work will get me all the information I need
to do a decent first pass at memory reservation requirements for
XFS.
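
The instrumentation behind that output could be little more than a
table of counters, along these lines (a sketch, not the actual debug
patch):

	#define XFS_TRANS_TYPE_MAX	50	/* assumed number of types */

	struct trans_alloc_stat {
		atomic64_t	count;
		atomic64_t	bytes;
		atomic64_t	fails;
		atomic64_t	max_size;
	};

	static struct trans_alloc_stat kmalloc_stats[XFS_TRANS_TYPE_MAX];

	static void account_trans_kmalloc(int trans_type, size_t size,
					  bool failed)
	{
		struct trans_alloc_stat *s = &kmalloc_stats[trans_type];

		if (failed) {
			atomic64_inc(&s->fails);
			return;
		}
		atomic64_inc(&s->count);
		atomic64_add(size, &s->bytes);
		if (size > atomic64_read(&s->max_size))
			atomic64_set(&s->max_size, size); /* racy; fine for stats */
	}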

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

[  151.273818] XFS: 3: count 22, bytes 0, fails 0, max_size 0
[  151.275010] XFS: 4: count 13, bytes 0, fails 0, max_size 0
[  151.276212] XFS: 7: count 12, bytes 0, fails 0, max_size 0
[  151.277406] XFS: 15: count 2, bytes 0, fails 0, max_size 0
[  151.278595] XFS: 16: count 10, bytes 0, fails 0, max_size 0
[  151.279854] XFS: 18: count 2, bytes 0, fails 0, max_size 0
[  151.281080] XFS: 26: count 3, bytes 0, fails 0, max_size 0
[  151.282275] XFS: 35: count 2, bytes 0, fails 0, max_size 0
[  151.283476] XFS: vmalloc allocations by trans type
[  151.284535] XFS: page allocations by trans type

Those XFS allocation stats are the largest measured allocations done
under transaction context, broken down by allocation and transaction
type.  There were no failures that would result in looping, even
though the system invoked the OOM killer on a filesystem workload....

I need to break the slab allocations down further by cache (other
workloads are generating over 50 slab allocations per transaction),
but another hour's work and a few days of observation of the stats
in my normal day-to-day work will get me all the information I need
to do a decent first pass at memory reservation requirements for
XFS.
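
(For reference, the accounting behind stats like these can be as
simple as the sketch below.  The structure and helper names are made
up for illustration; this is not the actual instrumentation patch.)

struct trans_alloc_stats {
	atomic64_t	count;		/* allocations under this trans type */
	atomic64_t	bytes;		/* total bytes requested */
	atomic64_t	fails;		/* failed attempts */
	atomic_t	max_size;	/* largest single request seen */
};

static struct trans_alloc_stats kmalloc_stats[XFS_TRANS_TYPE_MAX];

static void account_trans_kmalloc(int type, size_t size, bool failed)
{
	struct trans_alloc_stats *s = &kmalloc_stats[type];
	int old = atomic_read(&s->max_size);

	atomic64_inc(&s->count);
	atomic64_add(size, &s->bytes);
	if (failed)
		atomic64_inc(&s->fails);
	while ((int)size > old) {	/* racy max is fine for diagnostics */
		int prev = atomic_cmpxchg(&s->max_size, old, (int)size);
		if (prev == old)
			break;
		old = prev;
	}
}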

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-23  7:32                                                     ` Dave Chinner
@ 2015-03-02  9:39                                                       ` Vlastimil Babka
  -1 siblings, 0 replies; 276+ messages in thread
From: Vlastimil Babka @ 2015-03-02  9:39 UTC (permalink / raw)
  To: Dave Chinner, Andrew Morton
  Cc: Johannes Weiner, Tetsuo Handa, dchinner, oleg, xfs, mhocko,
	linux-mm, mgorman, rientjes, torvalds

On 02/23/2015 08:32 AM, Dave Chinner wrote:
> On Sun, Feb 22, 2015 at 05:29:30PM -0800, Andrew Morton wrote:
>> On Mon, 23 Feb 2015 11:45:21 +1100 Dave Chinner <david@fromorbit.com> wrote:
>>
>> Yes, as we do for __GFP_HIGH and PF_MEMALLOC etc.  Add a dynamic
>> reserve.  So to reserve N pages we increase the page allocator dynamic
>> reserve by N, do some reclaim if necessary then deposit N tokens into
>> the caller's task_struct (it'll be a set of zone/nr-pages tuples I
>> suppose).
>>
>> When allocating pages the caller should drain its reserves in
>> preference to dipping into the regular freelist.  This guy has already
>> done his reclaim and shouldn't be penalised a second time.  I guess
>> Johannes's preallocation code should switch to doing this for the same
>> reason, plus the fact that snipping a page off
>> task_struct.prealloc_pages is super-fast and needs to be done sometime
>> anyway so why not do it by default.
>
> That is at odds with the requirements of demand paging, which
> allocates memory for objects that are reclaimable within the course of the
> transaction. The reserve is there to ensure forward progress for
> allocations for objects that aren't freed until after the
> transaction completes, but if we drain it for reclaimable objects we
> then have nothing left in the reserve pool when we actually need it.
>
> We do not know ahead of time if the object we are allocating is
> going to be modified and hence locked into the transaction. Hence we
> can't say "use the reserve for this *specific* allocation", and so
> the only guidance we can really give is "we will allocate and
> *permanently consume* this much memory", and the reserve pool needs
> to cover that consumption to guarantee forwards progress.

I'm not sure I understand properly. You don't know if a specific
allocation is permanent or reclaimable, but you can tell in advance how
much in total will be permanent? Is that because you are conservative
and assume everything will be permanent, or is it something else?

Can you at least at some later point in the transaction recognize that "OK, 
this object was not permanent after all" and tell mm that it can lower 
your reserve?

> Forwards progress for all other allocations is guaranteed because
> they are reclaimable objects - they are either freed directly back to
> their source (slab, heap, page lists) or they are freed by shrinkers
> once they have been released from the transaction.

Which are the "all other allocations"? Above you wrote that all
allocations are treated as potentially permanent. Also, how does the
fact that an object is later reclaimable affect forward progress during
its allocation? Or are you talking about allocations from contexts that
don't use reserves?

> Hence we need allocations to come from the free list and trigger
> reclaim, regardless of the fact there is a reserve pool there. The
> reserve pool needs to be a last resort once there are no other
> avenues to allocate memory. i.e. it would be used to replace the OOM
> killer for GFP_NOFAIL allocations.

That's probably going to result in a lot of wasted memory, and I still
don't understand why it's needed if your reserve estimate is guaranteed
to cover the worst case.

>> Both reservation and preallocation are vulnerable to deadlocks - 10,000
>> tasks all trying to reserve/prealloc 100 pages, they all have 50 pages
>> and we ran out of memory.  Whoops.
>
> Yes, that's the big problem with preallocation, as well as your
> proposed "depelete the reserved memory first" approach. They
> *require* up front "preallocation" of free memory, either directly
> by the application, or internally by the mm subsystem.

I don't see why it would deadlock: if the mm can return ENOMEM at
reserve time, the reserver should be able to back out at that point.
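
(As a side note, a minimal sketch of the per-task reserve Andrew
describes above, with tokens charged up front and then drained ahead of
the regular freelists, could look like the following. The
raise_dynamic_reserve() and take_reserved_page() helpers are
hypothetical, not existing kernel functions.)

struct page_reserve {
	struct zone	*zone;
	unsigned long	nr_pages;	/* tokens remaining */
};

/* Charge the reserve up front; may reclaim, may fail cleanly. */
static int page_reserve_charge(struct page_reserve *res,
			       struct zone *zone, unsigned long nr)
{
	if (!raise_dynamic_reserve(zone, nr))	/* hypothetical */
		return -ENOMEM;	/* caller can still back out safely */
	res->zone = zone;
	res->nr_pages = nr;
	return 0;
}

/* Allocation path: drain the caller's tokens before the freelists. */
static struct page *page_reserve_alloc(struct page_reserve *res,
				       gfp_t gfp_mask)
{
	if (res && res->nr_pages) {
		res->nr_pages--;
		return take_reserved_page(res->zone);	/* hypothetical */
	}
	return alloc_pages(gfp_mask, 0);
}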


^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-02  0:17                                                           ` Dave Chinner
@ 2015-03-02 12:46                                                             ` Brian Foster
  -1 siblings, 0 replies; 276+ messages in thread
From: Brian Foster @ 2015-03-02 12:46 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Theodore Ts'o, Tetsuo Handa, Johannes Weiner, oleg, xfs,
	mhocko, linux-mm, mgorman, dchinner, rientjes, akpm, torvalds

On Mon, Mar 02, 2015 at 11:17:23AM +1100, Dave Chinner wrote:
> On Mon, Mar 02, 2015 at 08:48:05AM +1100, Dave Chinner wrote:
> > On Sat, Feb 28, 2015 at 05:15:58PM -0500, Johannes Weiner wrote:
> > > On Sat, Feb 28, 2015 at 11:41:58AM -0500, Theodore Ts'o wrote:
> > > > On Sat, Feb 28, 2015 at 11:29:43AM -0500, Johannes Weiner wrote:
> > > > > 
> > > > > I'm trying to figure out if the current nofail allocators can get
> > > > > their memory needs figured out beforehand.  And reliably so - what
> > > > > good are estimates that are right 90% of the time, when failing the
> > > > > allocation means corrupting user data?  What is the contingency plan?
> > > > 
> > > > In the ideal world, we can figure out the exact memory needs
> > > > beforehand.  But we live in an imperfect world, and given that block
> > > > devices *also* need memory, the answer is "of course not".  We can't
> > > > be perfect.  But we can at least give some kind of hint, and we can offer
> > > > to wait before we get into a situation where we need to loop in
> > > > GFP_NOWAIT --- which is the contingency/fallback plan.
> > > 
> > > Overestimating should be fine, the result would be a bit of false memory
> > > pressure.  But underestimating and looping can't be an option or the
> > > original lockups will still be there.  We need to guarantee forward
> > > progress or the problem is somewhat mitigated at best - only now with
> > > quite a bit more complexity in the allocator and the filesystems.
> > 
> > The additional complexity in XFS is actually quite minor, and
> > initial "rough worst case" memory usage estimates are not that hard
> > to measure....
> 
> And, just to point out that the OOM killer can be invoked without a
> single transaction-based filesystem ENOMEM failure, here's what
> xfs/084 does on 4.0-rc1:
> 
> [...]
> 
> Those XFS allocation stats are the largest measured allocations done
> under transaction context, broken down by allocation and transaction
> type.  There were no failures that would result in looping, even
> though the system invoked the OOM killer on a filesystem workload....
> 
> I need to break the slab allocations down further by cache (other
> workloads are generating over 50 slab allocations per transaction),
> but another hour's work and a few days of observation of the stats
> in my normal day-to-day work will get me all the information I need
> to do a decent first pass at memory reservation requirements for
> XFS.
> 

This sounds like something that would serve us well under sysfs,
particularly if we do adopt the kind of reservation model being
discussed in this thread. I wouldn't expect these values to change
drastically or that often, but they could certainly drift over time to
the point of being out of line with a reservation. A tool like this,
combined with Johannes' idea of a warning or something along those
lines for a reservation overrun, should give us all we need to identify
that something is wrong and the ability to fix it.
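
(For example, a single read-only sysfs attribute would do as a first
cut; the attribute name and the xfs_trans_resv_bytes() helper below are
hypothetical.)

static ssize_t trans_resv_show(struct kobject *kobj,
			       struct kobj_attribute *attr, char *buf)
{
	/* report the current worst-case per-transaction estimate */
	return snprintf(buf, PAGE_SIZE, "%lu\n", xfs_trans_resv_bytes());
}
static struct kobj_attribute trans_resv_attr = __ATTR_RO(trans_resv);

/* registered once at init time with something like:
 *	sysfs_create_file(xfs_kobj, &trans_resv_attr.attr);
 */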

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-23  0:45                                                 ` Dave Chinner
@ 2015-03-02 15:18                                                   ` Michal Hocko
  -1 siblings, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2015-03-02 15:18 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm,
	mgorman, rientjes, akpm, torvalds

On Mon 23-02-15 11:45:21, Dave Chinner wrote:
[...]
> A reserve memory pool is no different - every time a memory reserve
> occurs, a watermark is lifted to accommodate it, and the transaction
> is not allowed to proceed until the amount of free memory exceeds
> that watermark. The memory allocation subsystem then only allows
> *allocations* marked correctly to allocate pages from the
> reserve that watermark protects. e.g. only allocations using
> __GFP_RESERVE are allowed to dip into the reserve pool.

The idea is sound. But I am pretty sure we will find many corner
cases. E.g. what if the mere reservation attempt causes the system
to go OOM and trigger the OOM killer? Sure, that wouldn't be too much
different from an OOM triggered during the allocation, but there is one
major difference: reservations need to be estimated, and I expect the
estimates would be on the more conservative side, so an OOM might
not happen without them.
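
(For concreteness, the gating Dave describes could look roughly like
the sketch below; __GFP_RESERVE and the per-zone reserve_pages count
are hypothetical, only zone_page_state() is an existing helper.)

static bool zone_watermark_ok_resv(struct zone *z, unsigned int order,
				   unsigned long mark, gfp_t gfp_mask)
{
	unsigned long min = mark;

	/* ordinary allocations must leave the reserved range untouched */
	if (!(gfp_mask & __GFP_RESERVE))
		min += z->reserve_pages;

	return zone_page_state(z, NR_FREE_PAGES) > min + (1UL << order);
}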

> By using watermarks, freeing of memory will automatically top
> up the reserve pool which means that we guarantee that reclaimable
> memory allocated for demand paging during transactions doesn't
> deplete the reserve pool permanently.  As a result, when there is
> plenty of free and/or reclaimable memory, the reserve pool
> watermarks will have almost zero impact on performance and
> behaviour.

A typical busy system won't be very far away from the high watermark,
so there would be reclaim performed during the increased watermarks
(aka reservations), and that might lead to visible performance
degradation. This might be acceptable, but it also adds a certain level
of unpredictability, as performance characteristics might change
suddenly.

> Further, because it's just accounting and behavioural thresholds,
> this allows the mm subsystem to control how the reserve pool is
> accounted internally. e.g. clean, reclaimable pages in the page
> cache could serve as reserve pool pages as they can be immediately
> reclaimed for allocation.

But they can also turn hard or impossible to reclaim. Clean
pages might get dirty, and e.g. swap-backed pages can run out of their
backing storage. So I guess we cannot count on those pages without
reclaiming them first and hiding them in the reserve. Which is probably
what you suggest below, but I wasn't really sure...

> This could be achieved by setting reclaim targets first to the reserve
> pool watermark, then the second target is enough pages to satisfy the
> current allocation.
> 
> And, FWIW, there's nothing stopping this mechanism from having order
> based reserve thresholds. e.g. IB could really do with a 64k reserve
> pool threshold and hence help solve the long standing problems they
> have with filling the receive ring in GFP_ATOMIC context...
> 
> Sure, that's looking further down the track, but my point still
> remains: we need a viable long term solution to this problem. Maybe
> reservations are not the solution, but I don't see anyone else who
> is thinking of how to address this architectural problem at a system
> level right now.

I think the idea is good! It will just be quite tricky to get there
without causing more problems than those being solved. The biggest
question mark so far seems to be the reservation size estimation. If
it is hard for any caller to know the size beforehand (one which would
be really close to the actually used size), then the whole complexity
in the code sounds like overkill, and asking the administrator to tune
min_free_kbytes seems a better fit (we would still have to teach the
allocator to access reserves when really necessary), because the system
would behave more predictably (although some memory would be wasted).

> We need to design and document the model first, then review it, then
> we can start working at the code level to implement the solution we've
> designed.

I have already asked James to add this to the LSF agenda but nothing has
materialized on the schedule yet. I will poke him again.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-02 15:18                                                   ` Michal Hocko
@ 2015-03-02 16:05                                                     ` Johannes Weiner
  -1 siblings, 0 replies; 276+ messages in thread
From: Johannes Weiner @ 2015-03-02 16:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, dchinner, oleg, xfs, linux-mm, mgorman, rientjes,
	akpm, torvalds

On Mon, Mar 02, 2015 at 04:18:32PM +0100, Michal Hocko wrote:
> On Mon 23-02-15 11:45:21, Dave Chinner wrote:
> [...]
> > A reserve memory pool is no different - every time a memory reserve
> > occurs, a watermark is lifted to accommodate it, and the transaction
> > is not allowed to proceed until the amount of free memory exceeds
> > that watermark. The memory allocation subsystem then only allows
> > *allocations* marked correctly to allocate pages from the
> > reserve that watermark protects. e.g. only allocations using
> > __GFP_RESERVE are allowed to dip into the reserve pool.
> 
> The idea is sound. But I am pretty sure we will find many corner
> cases. E.g. what if the mere reservation attempt causes the system
> to go OOM and trigger the OOM killer? Sure, that wouldn't be too much
> different from an OOM triggered during the allocation, but there is one
> major difference: reservations need to be estimated, and I expect the
> estimates would be on the more conservative side, so an OOM might
> not happen without them.

The whole idea is that filesystems request the reserves while they can
still sleep for progress or fail the macro-operation with -ENOMEM.

And the estimate wouldn't just be on the conservative side, it would
have to be the worst-case scenario.  If we run out of reserves in an
allocation that can not fail that would be a bug that can lock up the
machine.  We can then fall back to the OOM killer in a last-ditch
effort to make forward progress, but as the victim tasks can get stuck
behind state/locks held by the allocation side, the machine might lock
up after all.

> > By using watermarks, freeing of memory will automatically top
> > up the reserve pool which means that we guarantee that reclaimable
> > memory allocated for demand paging during transactions doesn't
> > deplete the reserve pool permanently.  As a result, when there is
> > plenty of free and/or reclaimable memory, the reserve pool
> > watermarks will have almost zero impact on performance and
> > behaviour.
> 
> A typical busy system won't be very far away from the high watermark,
> so there would be reclaim performed during the increased watermarks
> (aka reservations), and that might lead to visible performance
> degradation. This might be acceptable, but it also adds a certain level
> of unpredictability, as performance characteristics might change
> suddenly.

There is usually a good deal of clean cache.  As Dave pointed out
before, clean cache can be considered re-allocatable from NOFS
contexts, and so we'd only have to maintain this invariant:

	min_wmark + private_reserves < free_pages + clean_cache
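
(Rendered literally as a check, with min_wmark and private_reserves as
hypothetical global accounting, and clean cache approximated as file
pages minus dirty pages, ignoring writeback for brevity:)

static bool reserve_invariant_holds(void)
{
	unsigned long free_pages  = global_page_state(NR_FREE_PAGES);
	unsigned long clean_cache = global_page_state(NR_FILE_PAGES) -
				    global_page_state(NR_FILE_DIRTY);

	return min_wmark + private_reserves < free_pages + clean_cache;
}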

> > Further, because it's just accounting and behavioural thresholds,
> > this allows the mm subsystem to control how the reserve pool is
> > accounted internally. e.g. clean, reclaimable pages in the page
> > cache could serve as reserve pool pages as they can be immediately
> > reclaimed for allocation.
> 
> But they can also turn hard or impossible to reclaim. Clean
> pages might get dirty, and e.g. swap-backed pages can run out of their
> backing storage. So I guess we cannot count on those pages without
> reclaiming them first and hiding them in the reserve. Which is probably
> what you suggest below, but I wasn't really sure...

Pages reserved for use by the page cleaning path can't be considered
dirtyable.  They have to be included in the dirty_balance_reserve.


^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-02 15:18                                                   ` Michal Hocko
@ 2015-03-02 16:39                                                     ` Theodore Ts'o
  -1 siblings, 0 replies; 276+ messages in thread
From: Theodore Ts'o @ 2015-03-02 16:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm,
	mgorman, rientjes, akpm, torvalds

On Mon, Mar 02, 2015 at 04:18:32PM +0100, Michal Hocko wrote:
> The idea is sound. But I am pretty sure we will find many corner
> cases. E.g. what if the mere reservation attempt causes the system
> to go OOM and trigger the OOM killer?

Doctor, doctor, it hurts when I do that....

So don't trigger the OOM killer.  We can let the caller decide
whether the reservation request should block or return ENOMEM, but the
whole point of the reservation request idea is that this happens
*before* we've taken any mutexes, so blocking won't prevent forward
progress.

The file system could send down a different flag if we are doing
writebacks for page cleaning purposes, in which case the reservation
request would be a "just a heads up, we *will* be needing this much
memory, but this is not something where we can block or return ENOMEM,
so please give us the highest priority for using the free reserves".
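
To sketch the shape such an interface might take (mem_reserve() and
the MRES_* flags are invented for illustration, not an existing kernel
API):

/* Hypothetical interface, invented for illustration only. */
#define MRES_WAIT	0x1	/* may block until the reservation is met */
#define MRES_CLEANING	0x2	/* page-cleaning writeback: may not block
				 * or fail, grant top priority to the
				 * free reserves instead */

int mem_reserve(unsigned long nr_pages, unsigned int flags);
void mem_unreserve(unsigned long nr_pages);

/* Called before any mutexes are taken, so blocking here is safe: */
static int fs_begin_op(unsigned long worst_case_pages)
{
	return mem_reserve(worst_case_pages, MRES_WAIT);
}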

> I think the idea is good! It will just be quite tricky to get there
> without causing more problems than those being solved. The biggest
> question mark so far seems to be the reservation size estimation. If
> it is hard for any caller to know the size beforehand (which would
> be really close to the actually used size) then the whole complexity
> in the code sounds like overkill, and asking the administrator to tune
> min_free_kbytes seems a better fit (we would still have to teach the
> allocator to access reserves when really necessary) because the system
> would behave more predictably (although some memory would be wasted).

If we do need to teach the allocator to access reserves when really
necessary, don't we have that already via GFP_NOIO/GFP_NOFS and
GFP_NOFAIL?  If the goal is to do something more fine-grained,
unfortunately at least for the short-term we'll need to preserve the
existing behaviour and issue warnings until the file system starts
adding GFP_NOFAIL to those memory allocations where previously
GFP_NOFS was effectively guaranteeing that failures would almost
never happen.
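
For illustration, that annotation would look something like the
following (the helper is an invented example site; GFP_NOFS and
__GFP_NOFAIL are the real flags):

/* Invented example, not actual ext4 code. */
static void *fs_alloc_must_succeed(size_t size)
{
	/*
	 * This used to rely on GFP_NOFS effectively never failing;
	 * the expectation now has to be spelled out.
	 */
	return kmalloc(size, GFP_NOFS | __GFP_NOFAIL);
}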

I know of at least one such place, discovered with a recent change
(and its revert), that I'll be fixing in ext4, but I suspect it won't
be the only one, especially in the block device drivers.

						- Ted

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-02 16:39                                                     ` Theodore Ts'o
@ 2015-03-02 16:58                                                       ` Michal Hocko
  -1 siblings, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2015-03-02 16:58 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm,
	mgorman, rientjes, akpm, torvalds

On Mon 02-03-15 11:39:13, Theodore Ts'o wrote:
> On Mon, Mar 02, 2015 at 04:18:32PM +0100, Michal Hocko wrote:
> > The idea is sound. But I am pretty sure we will find many corner
> > cases. E.g. what if the mere reservation attempt causes the system
> > to go OOM and trigger the OOM killer?
> 
> Doctor, doctor, it hurts when I do that....
> 
> So don't trigger the OOM killer.  We can let the caller decide whether
> the reservation request should block or return ENOMEM, but the whole
> point of the reservation request idea is that this happens *before*
> we've taken any mutexes, so blocking won't prevent forward progress.

Maybe I wasn't clear. I wasn't concerned about the context which
is doing the reservation. I was more concerned about all the other
allocation requests which might fail now (because they do not have
access to the reserves). So you think that we should simply disable
the OOM killer while there is any reservation active? Wouldn't that be
even more fragile if something goes terribly wrong?

> The file system could send down a different flag if we are doing
> writebacks for page cleaning purposes, in which case the reservation
> request would be a "just a heads up, we *will* be needing this much
> memory, but this is not something where we can block or return ENOMEM,
> so please give us the highest priority for using the free reserves".

Sure that thing is clear.
 
> > I think the idea is good! It will just be quite tricky to get there
> > without causing more problems than those being solved. The biggest
> > question mark so far seems to be the reservation size estimation. If
> > it is hard for any caller to know the size beforehand (which would
> > be really close to the actually used size) then the whole complexity
> > in the code sounds like overkill, and asking the administrator to tune
> > min_free_kbytes seems a better fit (we would still have to teach the
> > allocator to access reserves when really necessary) because the system
> > would behave more predictably (although some memory would be wasted).
> 
> If we do need to teach the allocator to access reserves when really
> necessary, don't we have that already via GFP_NOIO/GFP_NOFS and
> GFP_NOFAIL?

GFP_NOFAIL doesn't sound like the best fit. Not all NOFAIL callers need
to access reserves - e.g. if they are not blocking anybody from making
progress.

> If the goal is to do something more fine-grained,
> unfortunately at least for the short-term we'll need to preserve the
> existing behaviour and issue warnings until the file system starts
> adding GFP_NOFAIL to those memory allocations where previously
> GFP_NOFS was effectively guaranteeing that failures would almost
> never happen.

GFP_NOFS not failing is even worse than GFP_KERNEL not failing,
because the former has only very limited ways to perform reclaim. It
basically relies on somebody else to make progress, and that is
definitely a bad model.

> I know at least one place discovered with recent change (and revert)
> where I'll be fixing ext4, but I suspect it won't be the only one,
> especially in the block device drivers.
> 
> 						- Ted

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-02 16:05                                                     ` Johannes Weiner
@ 2015-03-02 17:10                                                       ` Michal Hocko
  -1 siblings, 0 replies; 276+ messages in thread
From: Michal Hocko @ 2015-03-02 17:10 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, linux-mm, mgorman, rientjes,
	akpm, torvalds

On Mon 02-03-15 11:05:37, Johannes Weiner wrote:
> On Mon, Mar 02, 2015 at 04:18:32PM +0100, Michal Hocko wrote:
[...]
> > Typical busy system won't be very far away from the high watermark
> > so there would be a reclaim performed during increased watermarks
> > (aka reservation) and that might lead to visible performance
> > degradation. This might be acceptable but it also adds a certain level
> > of unpredictability when performance characteristics might change
> > suddenly.
> 
> There is usually a good deal of clean cache.  As Dave pointed out
> before, clean cache can be considered re-allocatable from NOFS
> contexts, and so we'd only have to maintain this invariant:
> 
> 	min_wmark + private_reserves < free_pages + clean_cache

Do I understand you correctly that we do not have to reclaim clean pages
as per the above invariant?

If yes, how do you account for overcommit of the clean_cache by
multiple requestors (who are doing reservations)?
My point was that if we keep clean pages on the LRU rather than forcing
them to be reclaimed via increased watermarks, then it might happen that
different callers with access to reserves wouldn't get the promised
amount of reserved memory, because clean_cache is basically a shared
resource.

[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-02 17:10                                                       ` Michal Hocko
@ 2015-03-02 17:27                                                         ` Johannes Weiner
  -1 siblings, 0 replies; 276+ messages in thread
From: Johannes Weiner @ 2015-03-02 17:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, dchinner, oleg, xfs, linux-mm, mgorman, rientjes,
	akpm, torvalds

On Mon, Mar 02, 2015 at 06:10:58PM +0100, Michal Hocko wrote:
> On Mon 02-03-15 11:05:37, Johannes Weiner wrote:
> > On Mon, Mar 02, 2015 at 04:18:32PM +0100, Michal Hocko wrote:
> [...]
> > > Typical busy system won't be very far away from the high watermark
> > > so there would be a reclaim performed during increased watermarks
> > > (aka reservation) and that might lead to visible performance
> > > degradation. This might be acceptable but it also adds a certain level
> > > of unpredictability when performance characteristics might change
> > > suddenly.
> > 
> > There is usually a good deal of clean cache.  As Dave pointed out
> > before, clean cache can be considered re-allocatable from NOFS
> > contexts, and so we'd only have to maintain this invariant:
> > 
> > 	min_wmark + private_reserves < free_pages + clean_cache
> 
> Do I understand you correctly that we do not have to reclaim clean pages
> as per the above invariant?
> 
> If yes, how do you account for overcommit of the clean_cache by
> multiple requestors (who are doing reservations)?
> My point was that if we keep clean pages on the LRU rather than forcing
> them to be reclaimed via increased watermarks, then it might happen that
> different callers with access to reserves wouldn't get the promised
> amount of reserved memory, because clean_cache is basically a shared
> resource.

The sum of all private reservations has to be accounted globally; we
obviously can't overcommit the available resources in order to solve
problems stemming from overcommitting the available resources.

The page allocator can't hand out free pages and page reclaim cannot
reclaim clean cache unless that invariant is met.  They both have to
consider them consumed.  It's the same as pre-allocation; the only
thing we save is having to actually reclaim the pages and take them
off the freelist at reservation time - which is a good optimization
since the filesystem might not actually need them all.
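
A sketch of that global accounting (total_reserved and the other
inputs are hypothetical, not existing kernel state):

/* Sketch only: reserved pages are subtracted from what is usable. */
static atomic_long_t total_reserved;	/* sum of all private reservations */

static int mem_reserve_pages(unsigned long nr, unsigned long min_wmark,
			     unsigned long free_pages,
			     unsigned long clean_cache)
{
	atomic_long_add(nr, &total_reserved);
	/* Free pages and clean cache backing reservations are spoken for. */
	if (min_wmark + atomic_long_read(&total_reserved) >=
	    free_pages + clean_cache) {
		atomic_long_sub(nr, &total_reserved);
		return -ENOMEM;		/* or reclaim and retry */
	}
	return 0;
}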

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-23  7:32                                                     ` Dave Chinner
@ 2015-03-02 20:22                                                       ` Johannes Weiner
  -1 siblings, 0 replies; 276+ messages in thread
From: Johannes Weiner @ 2015-03-02 20:22 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, Andrew Morton, torvalds

On Mon, Feb 23, 2015 at 06:32:35PM +1100, Dave Chinner wrote:
> On Sun, Feb 22, 2015 at 05:29:30PM -0800, Andrew Morton wrote:
> > When allocating pages the caller should drain its reserves in
> > preference to dipping into the regular freelist.  This guy has already
> > done his reclaim and shouldn't be penalised a second time.  I guess
> > Johannes's preallocation code should switch to doing this for the same
> > reason, plus the fact that snipping a page off
> > task_struct.prealloc_pages is super-fast and needs to be done sometime
> > anyway so why not do it by default.
> 
> That is at odds with the requirements of demand paging, which
> allocates for objects that are reclaimable within the course of the
> transaction. The reserve is there to ensure forward progress for
> allocations for objects that aren't freed until after the
> transaction completes, but if we drain it for reclaimable objects we
> then have nothing left in the reserve pool when we actually need it.
>
> We do not know ahead of time if the object we are allocating is
> going to be modified and hence locked into the transaction. Hence we
> can't say "use the reserve for this *specific* allocation", and so
> the only guidance we can really give is "we will allocate and
> *permanently consume* this much memory", and the reserve pool needs
> to cover that consumption to guarantee forwards progress.
> 
> Forwards progress for all other allocations is guaranteed because
> they are reclaimable objects - they either freed directly back to
> their source (slab, heap, page lists) or they are freed by shrinkers
> once they have been released from the transaction.
> 
> Hence we need allocations to come from the free list and trigger
> reclaim, regardless of the fact there is a reserve pool there. The
> reserve pool needs to be a last resort once there are no other
> avenues to allocate memory. i.e. it would be used to replace the OOM
> killer for GFP_NOFAIL allocations.

That won't work.  Clean cache can be temporarily unavailable and
off-LRU for several reasons - compaction, migration, pending page
promotion, other reclaimers.  How many times do we retry before we dip
into the reserve pool?  As you have noticed, the OOM killer goes off
seemingly prematurely at times, and the reason for that is that we
simply don't KNOW the exact point when we ran out of reclaimable
memory.  We cannot take an atomic snapshot of all zones, of all nodes,
of all tasks running in order to determine this reliably, we have to
approximate it.  That's why OOM is defined as "we have scanned a great
many pages and couldn't free any of them."

So unless you tell us which allocations should come from previously
declared reserves, and which ones should rely on reclaim and may fail,
the reserves can deplete prematurely and we're back to square one.

> > And to make it much worse, how
> > many pages of which orders?  Bless its heart, slub will go and use
> > a 1-order page for allocations which should have been in 0-order
> > pages..

It can always fall back to the minimum order.

> The majority of allocations will be order-0, though if we know that
> they are going to be significant numbers of high order allocations,
> then it should be simple enough to tell the mm subsystem "need a
> reserve of 32 order-0, 4 order-1 and 1 order-3 allocations" and have
> memory compaction just do its stuff. But, IMO, we should cross that
> bridge when somebody actually needs reservations to be that
> specific....

Compaction can be at an impasse for the same reasons mentioned above.
It can not just stop_machine() to guarantee it can assemble a higher
order page from a bunch of in-use order-0 cache pages.  If you need
higher-order allocations in a transaction, you have to pre-allocate.

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-02  9:39                                                       ` Vlastimil Babka
@ 2015-03-02 22:31                                                         ` Dave Chinner
  -1 siblings, 0 replies; 276+ messages in thread
From: Dave Chinner @ 2015-03-02 22:31 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm,
	mgorman, dchinner, rientjes, Andrew Morton, torvalds

On Mon, Mar 02, 2015 at 10:39:54AM +0100, Vlastimil Babka wrote:
> On 02/23/2015 08:32 AM, Dave Chinner wrote:
> >On Sun, Feb 22, 2015 at 05:29:30PM -0800, Andrew Morton wrote:
> >>On Mon, 23 Feb 2015 11:45:21 +1100 Dave Chinner <david@fromorbit.com> wrote:
> >>
> >>Yes, as we do for __GFP_HIGH and PF_MEMALLOC etc.  Add a dynamic
> >>reserve.  So to reserve N pages we increase the page allocator dynamic
> >>reserve by N, do some reclaim if necessary then deposit N tokens into
> >>the caller's task_struct (it'll be a set of zone/nr-pages tuples I
> >>suppose).
> >>
> >>When allocating pages the caller should drain its reserves in
> >>preference to dipping into the regular freelist.  This guy has already
> >>done his reclaim and shouldn't be penalised a second time.  I guess
> >>Johannes's preallocation code should switch to doing this for the same
> >>reason, plus the fact that snipping a page off
> >>task_struct.prealloc_pages is super-fast and needs to be done sometime
> >>anyway so why not do it by default.
> >
> >That is at odds with the requirements of demand paging, which
> >allocates for objects that are reclaimable within the course of the
> >transaction. The reserve is there to ensure forward progress for
> >allocations for objects that aren't freed until after the
> >transaction completes, but if we drain it for reclaimable objects we
> >then have nothing left in the reserve pool when we actually need it.
> >
> >We do not know ahead of time if the object we are allocating is
> >going to be modified and hence locked into the transaction. Hence we
> >can't say "use the reserve for this *specific* allocation", and so
> >the only guidance we can really give is "we will allocate and
> >*permanently consume* this much memory", and the reserve pool needs
> >to cover that consumption to guarantee forwards progress.
> 
> I'm not sure I understand properly. You don't know if a specific
> allocation is permanent or reclaimable, but you can tell in advance
> how much in total will be permanent? Is it because you are
> conservative and assume everything will be permanent, or how?

Because we know the worst case object modification constraints
*exactly* (e.g. see fs/xfs/libxfs/xfs_trans_resv.c), we know
exactly what in memory objects we lock into the transaction and what
memory is required to modify and track those objects. E.g. for a
data extent allocation, the log reservation is as follows:

/*
 * In a write transaction we can allocate a maximum of 2
 * extents.  This gives:
 *    the inode getting the new extents: inode size
 *    the inode's bmap btree: max depth * block size
 *    the agfs of the ags from which the extents are allocated: 2 * sector
 *    the superblock free block counter: sector size
 *    the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size
 * And the bmap_finish transaction can free bmap blocks in a join:
 *    the agfs of the ags containing the blocks: 2 * sector size
 *    the agfls of the ags containing the blocks: 2 * sector size
 *    the super block free block counter: sector size
 *    the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size
 */
STATIC uint
xfs_calc_write_reservation(
        struct xfs_mount        *mp)
{
        return XFS_DQUOT_LOGRES(mp) +
                MAX((xfs_calc_inode_res(mp, 1) +
                     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK),
                                      XFS_FSB_TO_B(mp, 1)) +
                     xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
                     xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 2),
                                      XFS_FSB_TO_B(mp, 1))),
                    (xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) +
                     xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 2),
                                      XFS_FSB_TO_B(mp, 1))));
}

It's trivial to extend this logic to memory allocation
requirements, because the above is an exact encoding of all the
objects we "permanently consume" memory for within the transaction.
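
Purely as an illustration of the shape such a calculation could take
(xfs_calc_write_memres() is invented; the macros are the same ones
used in the log reservation above):

STATIC uint
xfs_calc_write_memres(
        struct xfs_mount        *mp)
{
        /* one buffer header plus one backing page per object we may dirty */
        return (1 + XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) + 3 +
                XFS_ALLOCFREE_LOG_COUNT(mp, 2)) *
               (sizeof(struct xfs_buf) + PAGE_SIZE);
}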

What we don't know is how many objects we might need to scan to find
the objects we will eventually modify.  Here's an (admittedly
extreme) example to demonstrate a worst case scenario: allocate a
64k data extent. Because it is an exact size allocation, we look it
up in the by-size free space btree. Free space is fragmented, so
there are about a million 64k free space extents in the tree.

Once we find the first 64k extent, we search them to find the best
locality target match.  The btree records are 16 bytes each, so we
fit roughly 500 to a 4k block. Say we search half the extents to
find the best match - i.e. we walk a thousand leaf blocks before
finding the match we want, and modify that leaf block.

Now, the modification removed an entry from the leaf and that
triggers leaf merge thresholds, so a merge with the 1002nd block
occurs. That block is now demand paged in, and we then modify and join
it to the transaction. Now we walk back up the btree to update
indexes, merging blocks all the way back up to the root.  We have a
worst case size btree (5 levels) and we merge at every level meaning
we demand page another 8 btree blocks and modify them.

In this case, we've demand paged ~1010 btree blocks, but only
modified 10 of them. i.e. the memory we consumed permanently was
only 10 4k buffers (approx. 10 slab and 10 page allocations), but
the allocation demand was 2 orders of magnitude more than the
unreclaimable memory consumption of the btree modification.

I hope you start to see the scope of the problem now...

> Can you at least at some later point in transaction recognize that
> "OK, this object was not permanent after all" and tell mm that it
> can lower your reserve?

I'm not including in the reserve any memory used by objects we know
won't be locked into the transaction. Demand paged object memory is
essentially unbounded but is easily reclaimable. That reclaim will
give us forward progress guarantees on the memory required here.

> >Yes, that's the big problem with preallocation, as well as your
> >proposed "deplete the reserved memory first" approach. They
> >*require* up front "preallocation" of free memory, either directly
> >by the application, or internally by the mm subsystem.
> 
> I don't see why it would deadlock, if during reserve time the mm can
> return ENOMEM as the reserver should be able to back out at that
> point.

Preallocated reserves do not allow for unbounded demand paging of
reclaimable objects within reserved allocation contexts.

Cheers

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-02 20:22                                                       ` Johannes Weiner
@ 2015-03-02 23:12                                                         ` Dave Chinner
  -1 siblings, 0 replies; 276+ messages in thread
From: Dave Chinner @ 2015-03-02 23:12 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, Andrew Morton, torvalds

On Mon, Mar 02, 2015 at 03:22:28PM -0500, Johannes Weiner wrote:
> On Mon, Feb 23, 2015 at 06:32:35PM +1100, Dave Chinner wrote:
> > On Sun, Feb 22, 2015 at 05:29:30PM -0800, Andrew Morton wrote:
> > > When allocating pages the caller should drain its reserves in
> > > preference to dipping into the regular freelist.  This guy has already
> > > done his reclaim and shouldn't be penalised a second time.  I guess
> > > Johannes's preallocation code should switch to doing this for the same
> > > reason, plus the fact that snipping a page off
> > > task_struct.prealloc_pages is super-fast and needs to be done sometime
> > > anyway so why not do it by default.
> > 
> > That is at odds with the requirements of demand paging, which
> > allocates for objects that are reclaimable within the course of the
> > transaction. The reserve is there to ensure forward progress for
> > allocations for objects that aren't freed until after the
> > transaction completes, but if we drain it for reclaimable objects we
> > then have nothing left in the reserve pool when we actually need it.
> >
> > We do not know ahead of time if the object we are allocating is
> > going to be modified and hence locked into the transaction. Hence we
> > can't say "use the reserve for this *specific* allocation", and so
> > the only guidance we can really give is "we will allocate and
> > *permanently consume* this much memory", and the reserve pool needs
> > to cover that consumption to guarantee forwards progress.
> > 
> > Forwards progress for all other allocations is guaranteed because
> > they are reclaimable objects - they either freed directly back to
> > their source (slab, heap, page lists) or they are freed by shrinkers
> > once they have been released from the transaction.
> > 
> > Hence we need allocations to come from the free list and trigger
> > reclaim, regardless of the fact there is a reserve pool there. The
> > reserve pool needs to be a last resort once there are no other
> > avenues to allocate memory. i.e. it would be used to replace the OOM
> > killer for GFP_NOFAIL allocations.
> 
> That won't work.

I don't see why not...

> Clean cache can be temporarily unavailable and
> off-LRU for several reasons - compaction, migration, pending page
> promotion, other reclaimers.  How many times do we retry before we dip
> into the reserve pool?  As you have noticed, the OOM killer goes off
> seemingly prematurely at times, and the reason for that is that we
> simply don't KNOW the exact point when we ran out of reclaimable
> memory.

Sure, but that's irrelevant to the problem at hand. At some point,
the mm subsystem is going to decide "we're at OOM" - it's *what
happens next* that matters.

> We cannot take an atomic snapshot of all zones, of all nodes,
> of all tasks running in order to determine this reliably, we have to
> approximate it.  That's why OOM is defined as "we have scanned a great
> many pages and couldn't free any of them."

Yes, and reserve pools *do not change* the logic that leads to that
decision. What changes is that we don't "kick the OOM killer";
instead we "allocate from the reserve pool."  The reserve pool
*replaces* the OOM killer as a method of guaranteeing forwards
allocation progress for those subsystems that can use reservations.
If there is no reserve pool for the current task, then you can still
kick the OOM killer....
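
As a sketch of that control flow (task_reserve, current->mem_reserve
and both helpers are hypothetical names):

/* Sketch of the allocator's "no progress" path under this model. */
static struct page *alloc_no_progress(gfp_t gfp_mask, unsigned int order)
{
	struct task_reserve *res = current->mem_reserve;

	/* A task that reserved up front gets forward progress here... */
	if (res)
		return take_from_reserve(res, order);

	/* ...everyone else still falls back to the OOM killer as before. */
	return page_from_oom_kill(gfp_mask, order);
}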

> So unless you tell us which allocations should come from previously
> declared reserves, and which ones should rely on reclaim and may fail,
> the reserves can deplete prematurely and we're back to square one.

Like the OOM killer, filesystems are not omnipotent and are not
perfect.  Requiring us to be so is entirely unreasonable, and is
*entirely unnecessary* from the POV of the mm subsystem.

Reservations give the mm subsystem a *strong model* for guaranteeing
forwards allocation progress, and it can be independently verified
and tested without having to care about how some subsystem uses it.
The mm subsystem supplies the *mechanism*, and mm developers are
entirely focussed around ensuring the mechanism works and is
verifiable.  i.e. you could write some debug kernel module to
exercise, verify and regression test the model behaviour, which is
something that simply cannot be done with the OOM killer.

Reservation sizes required by a subsystem are *policy*. They are not
a problem the mm subsystem needs to be concerned with as the
subsystem has to get the reservations right for the mechanism to
work. i.e. Managing reservation sizes is my responsibility as a
subsystem maintainer, just like it's currently my responsibility for
ensuring that transient ENOMEM conditions don't result in a
filesystem shutdown....

> Compaction can be at an impasse for the same reasons mentioned above.
> It can not just stop_machine() to guarantee it can assemble a higher
> order page from a bunch of in-use order-0 cache pages.  If you need
> higher-order allocations in a transaction, you have to pre-allocate.

It's much simpler just to use order-0 reservations and vmalloc if we
can't get high order allocations. We already do this in most places
where high order allocations are required, so there's really no
change needed here. ;)
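
That pattern is roughly the following (illustrative helper; XFS's
kmem_zalloc_large() does something very similar):

/* Fall back to building the buffer from order-0 pages. */
static void *fs_alloc_large(size_t size)
{
	void *p = kmalloc(size, GFP_NOFS | __GFP_NOWARN);

	if (!p)
		p = __vmalloc(size, GFP_NOFS | __GFP_HIGHMEM, PAGE_KERNEL);
	return p;
}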

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-02 23:12                                                         ` Dave Chinner
@ 2015-03-03  2:50                                                           ` Johannes Weiner
  -1 siblings, 0 replies; 276+ messages in thread
From: Johannes Weiner @ 2015-03-03  2:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, Andrew Morton, torvalds

On Tue, Mar 03, 2015 at 10:12:06AM +1100, Dave Chinner wrote:
> On Mon, Mar 02, 2015 at 03:22:28PM -0500, Johannes Weiner wrote:
> > On Mon, Feb 23, 2015 at 06:32:35PM +1100, Dave Chinner wrote:
> > > On Sun, Feb 22, 2015 at 05:29:30PM -0800, Andrew Morton wrote:
> > > > When allocating pages the caller should drain its reserves in
> > > > preference to dipping into the regular freelist.  This guy has already
> > > > done his reclaim and shouldn't be penalised a second time.  I guess
> > > > Johannes's preallocation code should switch to doing this for the same
> > > > reason, plus the fact that snipping a page off
> > > > task_struct.prealloc_pages is super-fast and needs to be done sometime
> > > > anyway so why not do it by default.
> > > 
> > > That is at odds with the requirements of demand paging, which
> > > allocate for objects that are reclaimable within the course of the
> > > transaction. The reserve is there to ensure forward progress for
> > > allocations for objects that aren't freed until after the
> > > transaction completes, but if we drain it for reclaimable objects we
> > > then have nothing left in the reserve pool when we actually need it.
> > >
> > > We do not know ahead of time if the object we are allocating is
> > > going to be modified and hence locked into the transaction. Hence we
> > > can't say "use the reserve for this *specific* allocation", and so
> > > the only guidance we can really give is "we will allocate and
> > > *permanently consume* this much memory", and the reserve pool needs
> > > to cover that consumption to guarantee forwards progress.
> > > 
> > > Forwards progress for all other allocations is guaranteed because
> > > they are reclaimable objects - they are either freed directly back to
> > > their source (slab, heap, page lists) or they are freed by shrinkers
> > > once they have been released from the transaction.
> > > 
> > > Hence we need allocations to come from the free list and trigger
> > > reclaim, regardless of the fact there is a reserve pool there. The
> > > reserve pool needs to be a last resort once there are no other
> > > avenues to allocate memory. i.e. it would be used to replace the OOM
> > > killer for GFP_NOFAIL allocations.
> > 
> > That won't work.
> 
> I don't see why not...
> 
> > Clean cache can be temporarily unavailable and
> > off-LRU for several reasons - compaction, migration, pending page
> > promotion, other reclaimers.  How often are we trying before we dip
> > into the reserve pool?  As you have noticed, the OOM killer goes off
> > seemingly prematurely at times, and the reason for that is that we
> > simply don't KNOW the exact point when we ran out of reclaimable
> > memory.
> 
> Sure, but that's irrelevant to the problem at hand. At some point,
> the MM subsystem is going to decide "we're at OOM" - it's *what
> happens next* that matters.

It's not irrelevant at all.  That point is an arbitrary magic number
that is a byproduct of many implementation details and concurrency in
the memory management layer.  It's completely fine to tie allocations
which can fail to this point, but you can't reasonably calibrate your
emergency reserves, which are supposed to guarantee progress, to such
an unpredictable variable.

When you reserve based on the share of allocations that you know will
be unreclaimable, you are assuming that all other allocations will be
reclaimable, and that is simply flawed.  There is so much concurrency
in the MM subsystem that you can't reasonably expect a single scanner
instance to recover the majority of theoretically reclaimable memory.

> > We cannot take an atomic snapshot of all zones, of all nodes,
> > of all tasks running in order to determine this reliably, we have to
> > approximate it.  That's why OOM is defined as "we have scanned a great
> > many pages and couldn't free any of them."
> 
> Yes, and reserve pools *do not change* the logic that leads to that
> decision. What changes is that we don't "kick the OOM killer",
> instead we "allocate from the reserve pool." The reserve pool
> *replaces* the OOM killer as a method of guaranteeing forwards
> allocation progress for those subsystems that can use reservations.

In order to replace the OOM killer in its role as progress guarantee,
the reserves can't run dry during the transaction.  Because what are
we going to do in that case?

> If there is no reserve pool for the current task, then you can still
> kick the OOM killer....

... so we are not actually replacing the OOM killer, we just defer it
with reserves that were calibrated to an anecdotal snapshot of a fuzzy
quantity of reclaim activity?  Is the idea here to just pile sh*tty,
mostly-working mechanisms on top of each other in the hope that one of
them will kick things along just enough to avoid locking up?

> > So unless you tell us which allocations should come from previously
> > declared reserves, and which ones should rely on reclaim and may fail,
> > the reserves can deplete prematurely and we're back to square one.
> 
> Like the OOM killer, filesystems are not omnipotent and are not
> perfect.  Requiring us to be so is entirely unreasonable, and is
> *entirely unnecessary* from the POV of the mm subsystem.
> 
> Reservations give the mm subsystem a *strong model* for guaranteeing
> forwards allocation progress, and it can be independently verified
> and tested without having to care about how some subsystem uses it.
> The mm subsystem supplies the *mechanism*, and mm developers are
> entirely focussed around ensuring the mechanism works and is
> verifiable.  i.e. you could write some debug kernel module to
> exercise, verify and regression test the model behaviour, which is
> something that simply cannot be done with the OOM killer.
> 
> Reservation sizes required by a subsystem are *policy*. They are not
> a problem the mm subsystem needs to be concerned with as the
> subsystem has to get the reservations right for the mechanism to
> work. i.e. Managing reservation sizes is my responsibility as a
> subsystem maintainer, just like it's currently my responsibility for
> ensuring that transient ENOMEM conditions don't result in a
> filesystem shutdown....

Anything that depends on the point at which the memory management
system gives up reclaiming pages is not verifiable in the slightest.
It will vary from kernel to kernel, from workload to workload, from
run to run.  It will regress in the blink of an eye.

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-02 22:31                                                         ` Dave Chinner
@ 2015-03-03  9:13                                                           ` Vlastimil Babka
  -1 siblings, 0 replies; 276+ messages in thread
From: Vlastimil Babka @ 2015-03-03  9:13 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm,
	mgorman, dchinner, rientjes, Andrew Morton, torvalds

On 03/02/2015 11:31 PM, Dave Chinner wrote:
> On Mon, Mar 02, 2015 at 10:39:54AM +0100, Vlastimil Babka wrote:
>> On 02/23/2015 08:32 AM, Dave Chinner wrote:
>> >On Sun, Feb 22, 2015 at 05:29:30PM -0800, Andrew Morton wrote:
>> >>On Mon, 23 Feb 2015 11:45:21 +1100 Dave Chinner <david@fromorbit.com> wrote:
>> >We do not know ahead of time if the object we are allocating is
>> >going to be modified and hence locked into the transaction. Hence we
>> >can't say "use the reserve for this *specific* allocation", and so
>> >the only guidance we can really give is "we will allocate and
>> >*permanently consume* this much memory", and the reserve pool needs
>> >to cover that consumption to guarantee forwards progress.
>> 
>> I'm not sure I understand properly. You don't know if a specific
>> allocation is permanent or reclaimable, but you can tell in advance
>> how much in total will be permanent? Is it because you are
>> conservative and assume everything will be permanent, or how?
> 
> Because we know the worst case object modification constraints
> *exactly* (e.g. see fs/xfs/libxfs/xfs_trans_resv.c), we know
> exactly what in-memory objects we lock into the transaction and what
> memory is required to modify and track those objects. e.g: for a
> data extent allocation, the log reservation is as such:
> 
> /*
>  * In a write transaction we can allocate a maximum of 2
>  * extents.  This gives:
>  *    the inode getting the new extents: inode size
>  *    the inode's bmap btree: max depth * block size
>  *    the agfs of the ags from which the extents are allocated: 2 * sector
>  *    the superblock free block counter: sector size
>  *    the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size
>  * And the bmap_finish transaction can free bmap blocks in a join:
>  *    the agfs of the ags containing the blocks: 2 * sector size
>  *    the agfls of the ags containing the blocks: 2 * sector size
>  *    the super block free block counter: sector size
>  *    the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size
>  */
> STATIC uint
> xfs_calc_write_reservation(
>         struct xfs_mount        *mp)
> {
>         return XFS_DQUOT_LOGRES(mp) +
>                 MAX((xfs_calc_inode_res(mp, 1) +
>                      xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK),
>                                       XFS_FSB_TO_B(mp, 1)) +
>                      xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
>                      xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 2),
>                                       XFS_FSB_TO_B(mp, 1))),
>                     (xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) +
>                      xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 2),
>                                       XFS_FSB_TO_B(mp, 1))));
> }
> 
> It's trivial to extend this logic to memory allocation
> requirements, because the above is an exact encoding of all the
> objects we "permanently consume" memory for within the transaction.
> 
> What we don't know is how many objects we might need to scan to find
> the objects we will eventually modify.  Here's an (admittedly
> extreme) example to demonstrate a worst case scenario: allocate a
> 64k data extent. Because it is an exact size allocation, we look it
> up in the by-size free space btree. Free space is fragmented, so
> there are about a million 64k free space extents in the tree.
> 
> Once we find the first 64k extent, we search them to find the best
> locality target match.  The btree records are 8 bytes each, so we
> fit roughly 500 to a 4k block. Say we search half the extents to
> find the best match - i.e. we walk a thousand leaf blocks before
> finding the match we want, and modify that leaf block.
> 
> Now, the modification removed an entry from the leaf and that
> triggers leaf merge thresholds, so a merge with the 1002nd block
> occurs. That block now demand pages in and we then modify and join
> it to the transaction. Now we walk back up the btree to update
> indexes, merging blocks all the way back up to the root.  We have a
> worst case size btree (5 levels) and we merge at every level meaning
> we demand page another 8 btree blocks and modify them.
> 
> In this case, we've demand paged ~1010 btree blocks, but only
> modified 10 of them. i.e. the memory we consumed permanently was
> only 10 4k buffers (approx. 10 slab and 10 page allocations), but
> the allocation demand was 2 orders of magnitude more than the
> unreclaimable memory consumption of the btree modification.
> 
> I hope you start to see the scope of the problem now...

Thanks, that example did help me understand your position much better.
So you would need to reserve for the worst-case number of objects you modify,
plus some slack for the demand-paged objects that you need to temporarily
access, before you can drop and reclaim them (I suppose that in some of the tree
operations, you need to be holding references to e.g. two nodes at a time, or
maybe the full depth). Or maybe since all these temporary objects are
potentially modifiable, it's already accounted for in the "might be modified" part.
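
(A sketch of what that extension might look like, reusing the macros
from the quoted log reservation; the function and the per-object
sizing are invented for illustration, this is not existing XFS code:)

STATIC uint
xfs_calc_write_memres(
	struct xfs_mount	*mp)
{
	/*
	 * One buffer header plus one page of memory per object we
	 * might permanently consume, mirroring the worst-case object
	 * counts encoded in xfs_calc_write_reservation() above.
	 */
	uint	nbufs = 1 +		/* the inode */
		XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) +
		3 +			/* agf/superblock buffers */
		XFS_ALLOCFREE_LOG_COUNT(mp, 2);

	return nbufs * (PAGE_SIZE + sizeof(struct xfs_buf));
}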

>> Can you at least at some later point in transaction recognize that
>> "OK, this object was not permanent after all" and tell mm that it
>> can lower your reserve?
> 
> I'm not including any memory used by objects we know won't be locked
> into the transaction in the reserve. Demand paged object memory is
> essentially unbound but is easily reclaimable. That reclaim will
> give us forward progress guarantees on the memory required here.
> 
>> >Yes, that's the big problem with preallocation, as well as your
>> >proposed "depelete the reserved memory first" approach. They
>> >*require* up front "preallocation" of free memory, either directly
>> >by the application, or internally by the mm subsystem.
>> 
>> I don't see why it would deadlock, if during reserve time the mm can
>> return ENOMEM as the reserver should be able to back out at that
>> point.
> 
> Preallocated reserves do not allow for unbound demand paging of
> reclaimable objects within reserved allocation contexts.

OK I think I get the point now.

So, lots of the concerns by me and others were about the wasted memory due to
reservations, and increased pressure on the rest of the system. I was thinking,
are you able, at the beginning of the transaction (for these purposes, I think of
transaction as the work that starts with the memory reservation, then it cannot
rollback and relies on the reserves, until it commits and frees the memory),
determine whether the transaction cannot be blocked in its progress by any other
transaction, and the only thing that would block it would be inability to
allocate memory during its course?

If that was the case, we could "share" the reserved memory for all ongoing
transactions of a single class (i.e. xfs transactions). If a transaction knows
it cannot be blocked by anything else, only then it passes the
GFP_CAN_USE_RESERVE flag to the allocator. Once the allocator gives part of the
reserve to one such transaction, it will deny the reserves to other such
transactions, until the first one finishes. In practice it would be more complex
of course, but it should guarantee forward progress without lots of
wasted memory (maybe we wouldn't have to rely on treating clean reclaimable pages
as reserve in that case, which was also pointed out to be problematic).

Of course it all depends on whether you are able to determine the "guaranteed to
not block". I can however easily imagine it's not possible...

> Cheers
> 
> Dave.
> 

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-03  9:13                                                           ` Vlastimil Babka
@ 2015-03-04  1:33                                                             ` Dave Chinner
  -1 siblings, 0 replies; 276+ messages in thread
From: Dave Chinner @ 2015-03-04  1:33 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm,
	mgorman, dchinner, rientjes, Andrew Morton, torvalds

On Tue, Mar 03, 2015 at 10:13:04AM +0100, Vlastimil Babka wrote:
> On 03/02/2015 11:31 PM, Dave Chinner wrote:
> > On Mon, Mar 02, 2015 at 10:39:54AM +0100, Vlastimil Babka wrote:
> > 
> > /*
> >  * In a write transaction we can allocate a maximum of 2
> >  * extents.  This gives:
> >  *    the inode getting the new extents: inode size
> >  *    the inode's bmap btree: max depth * block size
> >  *    the agfs of the ags from which the extents are allocated: 2 * sector
> >  *    the superblock free block counter: sector size
> >  *    the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.....
> Thanks, that example did help me understand your position much better.
> So you would need to reserve for a worst case number of the objects you modify,
> plus some slack for the demand-paged objects that you need to temporarily
> access, before you can drop and reclaim them (I suppose that in some of the tree
> operations, you need to be holding references to e.g. two nodes at a time, or
> maybe the full depth). Or maybe since all these temporary objects are
> potentially modifiable, it's already accounted for in the "might be modified" part.

Already accounted for in the "might be modified path".

> >> Can you at least at some later point in transaction recognize that
> >> "OK, this object was not permanent after all" and tell mm that it
> >> can lower your reserve?
> > 
> > I'm not including any memory used by objects we know won't be locked
> > into the transaction in the reserve. Demand paged object memory is
> > essentially unbound but is easily reclaimable. That reclaim will
> > give us forward progress guarantees on the memory required here.
> > 
> >> >Yes, that's the big problem with preallocation, as well as your
> >> >proposed "depelete the reserved memory first" approach. They
> >> >*require* up front "preallocation" of free memory, either directly
> >> >by the application, or internally by the mm subsystem.
> >> 
> >> I don't see why it would deadlock, if during reserve time the mm can
> >> return ENOMEM as the reserver should be able to back out at that
> >> point.
> > 
> > Preallocated reserves do not allow for unbound demand paging of
> > reclaimable objects within reserved allocation contexts.
> 
> OK I think I get the point now.
> 
> So, lots of the concerns by me and others were about the wasted memory due to
> reservations, and increased pressure on the rest of the system. I was thinking,
> are you able, at the beginning of the transaction (for these purposes, I think of
> transaction as the work that starts with the memory reservation, then it cannot
> rollback and relies on the reserves, until it commits and frees the memory),
> determine whether the transaction cannot be blocked in its progress by any other
> transaction, and the only thing that would block it would be inability to
> allocate memory during its course?

No. e.g. any transaction that requires allocation or freeing of an
inode or extent can get stuck behind any other transaction that is
allocating/freeing an inode/extent. And this will happen when
holding inode locks, which means other transactions on that inode
will then get stuck on the inode lock, and so on. Blocking
dependencies within transactions are everywhere and cannot be
avoided.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-03  2:50                                                           ` Johannes Weiner
@ 2015-03-04  6:52                                                             ` Dave Chinner
  -1 siblings, 0 replies; 276+ messages in thread
From: Dave Chinner @ 2015-03-04  6:52 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, Andrew Morton, torvalds

On Mon, Mar 02, 2015 at 09:50:23PM -0500, Johannes Weiner wrote:
> On Tue, Mar 03, 2015 at 10:12:06AM +1100, Dave Chinner wrote:
> > On Mon, Mar 02, 2015 at 03:22:28PM -0500, Johannes Weiner wrote:
> > > On Mon, Feb 23, 2015 at 06:32:35PM +1100, Dave Chinner wrote:
> > > > On Sun, Feb 22, 2015 at 05:29:30PM -0800, Andrew Morton wrote:
> > > > > When allocating pages the caller should drain its reserves in
> > > > > preference to dipping into the regular freelist.  This guy has already
> > > > > done his reclaim and shouldn't be penalised a second time.  I guess
> > > > > Johannes's preallocation code should switch to doing this for the same
> > > > > reason, plus the fact that snipping a page off
> > > > > task_struct.prealloc_pages is super-fast and needs to be done sometime
> > > > > anyway so why not do it by default.
> > > > 
> > > > That is at odds with the requirements of demand paging, which
> > > > allocate for objects that are reclaimable within the course of the
> > > > transaction. The reserve is there to ensure forward progress for
> > > > allocations for objects that aren't freed until after the
> > > > transaction completes, but if we drain it for reclaimable objects we
> > > > then have nothing left in the reserve pool when we actually need it.
> > > >
> > > > We do not know ahead of time if the object we are allocating is
> > > > going to be modified and hence locked into the transaction. Hence we
> > > > can't say "use the reserve for this *specific* allocation", and so
> > > > the only guidance we can really give is "we will allocate and
> > > > *permanently consume* this much memory", and the reserve pool needs
> > > > to cover that consumption to guarantee forwards progress.
> > > > 
> > > > Forwards progress for all other allocations is guaranteed because
> > > > they are reclaimable objects - they are either freed directly back to
> > > > their source (slab, heap, page lists) or they are freed by shrinkers
> > > > once they have been released from the transaction.
> > > > 
> > > > Hence we need allocations to come from the free list and trigger
> > > > reclaim, regardless of the fact there is a reserve pool there. The
> > > > reserve pool needs to be a last resort once there are no other
> > > > avenues to allocate memory. i.e. it would be used to replace the OOM
> > > > killer for GFP_NOFAIL allocations.
> > > 
> > > That won't work.
> > 
> > I don't see why not...
> > 
> > > Clean cache can be temporarily unavailable and
> > > off-LRU for several reasons - compaction, migration, pending page
> > > promotion, other reclaimers.  How often are we trying before we dip
> > > into the reserve pool?  As you have noticed, the OOM killer goes off
> > > seemingly prematurely at times, and the reason for that is that we
> > > simply don't KNOW the exact point when we ran out of reclaimable
> > > memory.
> > 
> > Sure, but that's irrelevant to the problem at hand. At some point,
> > the MM subsystem is going to decide "we're at OOM" - it's *what
> > happens next* that matters.
> 
> It's not irrelevant at all.  That point is an arbitrary magic number
> that is a byproduct of many implementation details and concurrency in
> the memory management layer.  It's completely fine to tie allocations
> which can fail to this point, but you can't reasonably calibrate your
> emergency reserves, which are supposed to guarantee progress, to such
> an unpredictable variable.
> 
> When you reserve based on the share of allocations that you know will
> be unreclaimable, you are assuming that all other allocations will be
> reclaimable, and that is simply flawed.  There is so much concurrency
> in the MM subsystem that you can't reasonably expect a single scanner
> instance to recover the majority of theoretically reclaimable memory.

On one hand you say "memory accounting is unreliable, so detecting
OOM is unreliable, and so we have an unreliable trigger point."

On the other hand you say "single scanner instance can't reclaim all
memory", again stating we have an unreliable trigger point.

On the gripping hand, that unreliable trigger point is what
kicks the OOM killer.

Yet you consider that point to be reliable enough to kick the OOM
killer, but too unreliable to trigger allocation from a reserve
pool?

Say what?

I suspect you've completely misunderstood what I've been suggesting.

By definition, we have the pages we reserved in the reserve pool,
and unless we've exhausted that reservation with permanent
allocations we should always be able to allocate from it. If the
pool got emptied by demand page allocations, then we back off and
retry reclaim until the reclaimable objects are released back into
the reserve pool. i.e. reclaim fills reserve pools first, then when
they are full pages can go back on free lists for normal
allocations.  This provides the mechanism for forwards progress, and
it's essentially the same mechanism that mempools use to guarantee
forwards progress. The only difference is that reserve pool refilling
comes through reclaim via shrinker invocation...
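
(For comparison, the mempool guarantee referred to here - this is the
real mempool API, with the cache and reserve size invented:)

#include <linux/mempool.h>

/* my_cache is an invented kmem_cache, 16 an invented reserve size. */
mempool_t *pool = mempool_create(16, mempool_alloc_slab,
				 mempool_free_slab, my_cache);

/*
 * With a blocking gfp mask this cannot fail: if the slab allocator
 * is empty, an element comes from the reserved pool, and if that is
 * empty too, the caller sleeps until mempool_free() refills it.
 */
void *obj = mempool_alloc(pool, GFP_NOFS);
mempool_free(obj, pool);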

In reality, though, I don't really care how the mm subsystem
implements that pool as long as it handles the cases I've described
(e.g http://oss.sgi.com/archives/xfs/2015-03/msg00039.html). I don't
think we're making progress here, anyway, so unless you come up with
some other solution this thread is going to die here....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-04  1:33                                                             ` Dave Chinner
@ 2015-03-04  8:50                                                               ` Vlastimil Babka
  -1 siblings, 0 replies; 276+ messages in thread
From: Vlastimil Babka @ 2015-03-04  8:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm,
	mgorman, dchinner, rientjes, Andrew Morton, torvalds

On 03/04/2015 02:33 AM, Dave Chinner wrote:
> On Tue, Mar 03, 2015 at 10:13:04AM +0100, Vlastimil Babka wrote:
>>>
>>> Preallocated reserves do not allow for unbound demand paging of
>>> reclaimable objects within reserved allocation contexts.
>>
>> OK I think I get the point now.
>>
>> So, lots of the concerns by me and others were about the wasted memory due to
>> reservations, and increased pressure on the rest of the system. I was thinking,
>> are you able, at the beginning of the transaction (for these purposes, I think of
>> transaction as the work that starts with the memory reservation, then it cannot
>> rollback and relies on the reserves, until it commits and frees the memory),
>> determine whether the transaction cannot be blocked in its progress by any other
>> transaction, and the only thing that would block it would be inability to
>> allocate memory during its course?
>
> No. e.g. any transaction that requires allocation or freeing of an
> inode or extent can get stuck behind any other transaction that is
> allocating/freeing an inode/extent. And this will happen when
> holding inode locks, which means other transactions on that inode
> will then get stuck on the inode lock, and so on. Blocking
> dependencies within transactions are everywhere and cannot be
> avoided.

Hm, I see. I thought that perhaps to avoid deadlocks between 
transactions (which you already have to do somehow), either the 
dependencies have to be structured in a way that there's always some 
transaction that can't be blocked by others. Or you have a way to detect 
potential deadlocks before they happen, and stall somebody who tries to 
lock. Both should (at least theoretically) mean that you would be able 
to point to such transaction, although I can imagine the cost of being 
able to do that could be prohibitive.

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-04  8:50                                                               ` Vlastimil Babka
@ 2015-03-04 11:03                                                                 ` Dave Chinner
  -1 siblings, 0 replies; 276+ messages in thread
From: Dave Chinner @ 2015-03-04 11:03 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm,
	mgorman, dchinner, rientjes, Andrew Morton, torvalds

On Wed, Mar 04, 2015 at 09:50:58AM +0100, Vlastimil Babka wrote:
> On 03/04/2015 02:33 AM, Dave Chinner wrote:
> >On Tue, Mar 03, 2015 at 10:13:04AM +0100, Vlastimil Babka wrote:
> >>>
> >>>Preallocated reserves do not allow for unbound demand paging of
> >>>reclaimable objects within reserved allocation contexts.
> >>
> >>OK I think I get the point now.
> >>
> >>So, lots of the concerns by me and others were about the wasted memory due to
> >>reservations, and increased pressure on the rest of the system. I was thinking,
> >>are you able, at the beginning of the transaction (for this purposes, I think of
> >>transaction as the work that starts with the memory reservation, then it cannot
> >>rollback and relies on the reserves, until it commits and frees the memory),
> >>determine whether the transaction cannot be blocked in its progress by any other
> >>transaction, and the only thing that would block it would be inability to
> >>allocate memory during its course?
> >
> >No. e.g. any transaction that requires allocation or freeing of an
> >inode or extent can get stuck behind any other transaction that is
> >allocating/freeing an inode/extent. And this will happen when
> >holding inode locks, which means other transactions on that inode
> >will then get stuck on the inode lock, and so on. Blocking
> >dependencies within transactions are everywhere and cannot be
> >avoided.
> 
> Hm, I see. I thought that perhaps to avoid deadlocks between
> transactions (which you already have to do somehow),

Of course, by following lock ordering rules, rules about holding
locks over transaction reservations, allowing bulk reservations for
rolling transactions that don't unlock objects between transaction
commits, having allocation group ordering rules, block allocation
ordering rules, transactional lock recursion support to prevent
a transaction deadlocking by walking over objects already locked into
the transaction, etc.
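
(As an illustration of the first of those rules - a minimal sketch in
generic kernel style, not XFS code: if every path that needs two inode
locks takes them in a single global order, no two transactions can each
hold one lock while waiting for the other.)

/*
 * Illustrative sketch only: enforce a global lock order (here, by
 * address) so two tasks can never deadlock on a pair of inodes.
 */
static void lock_two_inodes(struct inode *a, struct inode *b)
{
	if (a > b)
		swap(a, b);		/* fixed global ordering */
	mutex_lock(&a->i_mutex);
	if (a != b)
		mutex_lock_nested(&b->i_mutex, SINGLE_DEPTH_NESTING);
}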

By following those rules, we guarantee forwards progress in the
transaction subsystem. If we can also guarantee forwards progress in
memory allocation inside transaction context (like Irix did all
those years ago :P), then we can guarantee that transactions will
always complete unless there is a bug or corruption is detected
during an operation...

> either the
> dependencies have to be structured in a way that there's always some
> transaction that can't block on others. Or you have a way to detect
> potential deadlocks before they happen, and stall somebody who tries
> to lock.

$ git grep ASSERT fs/xfs |wc -l
1716

About 3% of the code in XFS is ASSERT statements used to verify
context-specific state is correct in CONFIG_XFS_DEBUG=y builds.
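
(ASSERT here is XFS's debug-build assertion; roughly - a sketch from
memory, not the verbatim fs/xfs definition - it expands to a check that
calls assfail() on debug builds and compiles away entirely otherwise:)

#ifdef DEBUG
#define ASSERT(expr)	\
	(unlikely(expr) ? (void)0 : assfail(#expr, __FILE__, __LINE__))
#else
#define ASSERT(expr)	((void)0)
#endif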

FYI, from cloc:

Subsystem      files          blank        comment	   code
-------------------------------------------------------------------------------
fs/xfs		157          10841          25339          69140
mm/		 97          13923          25534          67870
fs/btrfs	 86          14443          15097          85065

Cheers,

Dave.

PS: XFS userspace has another 110,000 lines of code in xfsprogs and
60,000 lines of code in xfsdump, and there's also 80,000 lines of
test code in xfstests.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-27 13:12                                               ` Dave Chinner
@ 2015-03-04 12:41                                                 ` Tetsuo Handa
  2015-03-04 13:25                                                   ` Dave Chinner
  0 siblings, 1 reply; 276+ messages in thread
From: Tetsuo Handa @ 2015-03-04 12:41 UTC (permalink / raw)
  To: david
  Cc: tytso, rientjes, hannes, mhocko, dchinner, linux-mm, oleg, akpm,
	mgorman, torvalds, fernando_b1

Dave Chinner wrote:
> On Fri, Feb 27, 2015 at 09:42:55PM +0900, Tetsuo Handa wrote:
> > If kswapd0 is blocked forever at e.g. mutex_lock() inside shrinker
> > functions, who else can make forward progress?
> 
> You can't get into these filesystem shrinkers when you do GFP_NOIO
> allocations, as the IO path does.
> 
> > Shouldn't we avoid calling functions which could potentially block for
> > unpredictable duration (e.g. unkillable locks and/or completion) from
> > shrinker functions?
> 
> No, because otherwise we can't throttle allocation and reclaim to
> the rate at which IO can clean dirty objects. i.e. we do this for
> the same reason we throttle page cache dirtying to the rate at which
> we can clean dirty pages....

I'm misunderstanding something. The description for kswapd() function
in mm/vmscan.c says "This basically trickles out pages so that we have
_some_ free memory available even if there is no other activity that frees
anything up".

Forever blocking kswapd0 somewhere inside filesystem shrinker functions is
equivalent to removing the kswapd() function, because it also prevents
non-filesystem shrinker functions from being called by kswapd0, doesn't it?
Then, the description will become "We won't have _some_ free memory available
if there is no other activity that frees anything up", won't it?

Does kswapd0 exist only for reducing the delay caused by reclaiming
synchronously? Does disabling kswapd0 affect functionality at all?
The system can make forward progress even if nobody can call non-filesystem
shrinkers, can't it?

If yes, then why do we need the special handling that excludes
kswapd0 in the

	while (unlikely(too_many_isolated(zone, file, sc))) {
		congestion_wait(BLK_RW_ASYNC, HZ/10);

		/* We are about to die and free our memory. Return now. */
		if (fatal_signal_pending(current))
			return SWAP_CLUSTER_MAX;
	}

loop inside shrink_inactive_list()?
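
(For reference, the exclusion in question sits at the top of
too_many_isolated() itself in mm/vmscan.c, which - quoting roughly -
returns early for kswapd so that only direct reclaimers wait in that loop:)

	static int too_many_isolated(struct zone *zone, int file,
				     struct scan_control *sc)
	{
		if (current_is_kswapd())
			return 0;	/* kswapd never waits here */
		...
	}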

I can't understand the difference between "kswapd0 sleeping forever at
too_many_isolated() loop inside shrink_inactive_list()" and "kswapd0
sleeping forever at mutex_lock() inside xfs_reclaim_inodes_ag()".


^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-02 16:58                                                       ` Michal Hocko
@ 2015-03-04 12:52                                                         ` Dave Chinner
  -1 siblings, 0 replies; 276+ messages in thread
From: Dave Chinner @ 2015-03-04 12:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Theodore Ts'o, Tetsuo Handa, dchinner, oleg, xfs,
	Johannes Weiner, linux-mm, mgorman, rientjes, akpm, torvalds

On Mon, Mar 02, 2015 at 05:58:23PM +0100, Michal Hocko wrote:
> On Mon 02-03-15 11:39:13, Theodore Ts'o wrote:
> > On Mon, Mar 02, 2015 at 04:18:32PM +0100, Michal Hocko wrote:
> > > The idea is sound. But I am pretty sure we will find many corner
> > > cases. E.g. what if the mere reservation attempt causes the system
> > > to go OOM and trigger the OOM killer?
> > 
> > Doctor, doctor, it hurts when I do that....
> > 
> > So don't trigger the OOM killer.  We can let the caller decide whether
> > the reservation request should block or return ENOMEM, but the whole
> > point of the reservation request idea is that this happens *before*
> > we've taken any mutexes, so blocking won't prevent forward progress.
> 
> Maybe I wasn't clear. I wasn't concerned about the context which
> is doing the reservation. I was more concerned about all the other
> allocation requests which might fail now (because they do not have
> access to the reserves). So you think that we should simply disable OOM
> killer while there is any reservation active? Wouldn't that be even more
> fragile when something goes terribly wrong?

That's a silly strawman.  Why wouldn't you simply block them until
the reserves are released when the transaction completes and the
unused memory goes back to the free pool?

Let me try another tack. My qualifications are as a
distributed control system engineer, not a computer scientist. I
see everything as a system of interconnected feedback loops: an
operating system is nothing but a set of very complex, tightly
interconnected control systems.

Precedent? IO-less dirty throttling - that came about after I'd
been advocating a control theory based algorithm for several years
to solve the breakdown problems of dirty page throttling.  We look
at the code Fengguang Wu wrote as one of the major success stories of
Linux - the writeback code just works and nobody ever has to tune it
anymore.

I see the problem of direct memory reclaim as being very similar to
the problems the old IO based write throttling had: it has unbound
concurrency, severe unfairness and breaks down badly when heavily
loaded.  As a control system, it has the same terrible design
as the IO-based write throttling had.

There are many other similarities, too.

Allocation can only take place at the rate at which reclaim occurs,
and we only have a limited budget of allocatable pages. This is the
same as the dirty page throttling - dirtying pages is limited to the
rate at which we can clean pages, and there is a limited budget of
dirty pages in the system.

Reclaiming pages is also done most efficiently by a single thread
per zone where lots of internal context can be kept (kswapd). This
is similar to how optimal writeback of dirty pages requires a
single thread with internal context per block device.

Waiting for free pages to arrive can be done by an ordered queuing
system, and we can account for the number of pages each allocation
requires in the queueing system and hence only need to wake the number
of waiters that will consume the memory just freed. Just like we do
with the dirty page throttling queue.

As such, the same solutions could be applied. As the allocation
demand exceeds the supply of free pages, we throttle allocation by
sleeping on an ordered queue and only waking waiters at the rate
at which kswapd reclaim can free pages. It's trivial to account
accurately, and the feedback loop is relatively simple, too.
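
(A minimal sketch of such an ordered queue - illustrative only, every
name hypothetical: each waiter records how many pages it needs, and
reclaim wakes waiters strictly in order, only as far as the pages it
just freed will reach.)

struct alloc_waiter {
	struct list_head	list;
	unsigned long		nr_pages;	/* pages this waiter needs */
	struct completion	done;
};

static LIST_HEAD(alloc_queue);
static DEFINE_SPINLOCK(alloc_queue_lock);

/* Called from reclaim after freeing @nr_freed pages. */
static void alloc_queue_wake(unsigned long nr_freed)
{
	struct alloc_waiter *w, *next;

	spin_lock(&alloc_queue_lock);
	list_for_each_entry_safe(w, next, &alloc_queue, list) {
		if (w->nr_pages > nr_freed)
			break;	/* strict FIFO: stop at first unsatisfied waiter */
		nr_freed -= w->nr_pages;
		list_del(&w->list);
		complete(&w->done);	/* this waiter's allocation proceeds */
	}
	spin_unlock(&alloc_queue_lock);
}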

We can also easily maintain a reserve of free pages this way, usable
only by allocations marked with special flags.  The reserve threshold
can be dynamic, and tasks that request it to change can be blocked
until the reserve has been built up to meet caller requirements.
Allocations that are allowed to dip into the reserve may do so
rather than being added to the queue that waits for reclaim.

Reclaim would always fill the reserve back up to its limits first,
and tasks that have reservations can release them gradually as they
mark them as consumed by the reservation context (e.g. when a
filesystem joins an object to a transaction and modifies it),
thereby reducing the reserve that task has available as it
progresses.

So, there's yet another possible solution to the allocation
reservation problem, and one that solves several other problems that
are being described as making reservation pools difficult or even
impossible to implement.

Seriously, I'm not expecting this problem to be solved tomorrow;
what I want is reliable, deterministic memory allocation behaviour
from the mm subsystem. I want people to be thinking about how to
achieve that rather than limiting their solutions by what we have
now and can hack into the current code, because otherwise we'll
never end up with a reliable memory allocation reservation system....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-04 12:41                                                 ` Tetsuo Handa
@ 2015-03-04 13:25                                                   ` Dave Chinner
  2015-03-04 14:11                                                     ` Tetsuo Handa
  0 siblings, 1 reply; 276+ messages in thread
From: Dave Chinner @ 2015-03-04 13:25 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: tytso, rientjes, hannes, mhocko, dchinner, linux-mm, oleg, akpm,
	mgorman, torvalds, fernando_b1

On Wed, Mar 04, 2015 at 09:41:01PM +0900, Tetsuo Handa wrote:
> Dave Chinner wrote:
> > On Fri, Feb 27, 2015 at 09:42:55PM +0900, Tetsuo Handa wrote:
> > > If kswapd0 is blocked forever at e.g. mutex_lock() inside shrinker
> > > functions, who else can make forward progress?
> > 
> > You can't get into these filesystem shrinkers when you do GFP_NOIO
> > allocations, as the IO path does.
> > 
> > > Shouldn't we avoid calling functions which could potentially block for
> > > unpredictable duration (e.g. unkillable locks and/or completion) from
> > > shrinker functions?
> > 
> > No, because otherwise we can't throttle allocation and reclaim to
> > the rate at which IO can clean dirty objects. i.e. we do this for
> > the same reason we throttle page cache dirtying to the rate at which
> > we can clean dirty pages....
> 
> I'm misunderstanding something. The description for kswapd() function
> in mm/vmscan.c says "This basically trickles out pages so that we have
> _some_ free memory available even if there is no other activity that frees
> anything up".

Sure.

> Forever blocking kswapd0 somewhere inside filesystem shrinker functions is
> equivalent to removing the kswapd() function, because it also prevents
> non-filesystem shrinker functions from being called by kswapd0, doesn't it?

Yes, but that's not intentional. Remember, we keep talking about the
filesystem not being able to guarantee forwards progress if
allocations block forever? Well...

> Then, the description will become "We won't have _some_ free memory available
> if there is no other activity that frees anything up", won't it?

... we've ended up blocking kswapd because it's waiting on a journal
commit to complete, and that journal commit is blocked waiting for
forwards progress in memory allocation...

Yes, it's another one of those nasty dependencies I keep pointing
out that filesystems have, and that can only be solved by
guaranteeing we can always make forwards allocation progress from
transaction reserve to transaction commit.

> Does kswapd0 exist only for reducing the delay caused by reclaiming
> synchronously? Does disabling kswapd0 affect functionality at all?
> The system can make forward progress even if nobody can call non-filesystem
> shrinkers, can't it?

The throttling is required to control the unbound parallelism of
direct reclaim. If we don't do this, inode cache reclaim causes
random inode writeback and thrashes the disks with random IO,
causing severe degradation in performance under heavy memory
pressure. So we throttle inode reclaim to a single thread per AG so
we get nice sequential IO patterns from inode cache reclaim - the
difference is that we can reclaim several hundred thousand dirty
inodes per second versus a few hundred...
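
(Schematically - a simplified sketch with made-up names, not the real
xfs_reclaim_inodes_ag() - the one-walker-per-AG throttling looks like:)

for_each_ag(mp, ag) {				/* hypothetical iterator */
	if (!mutex_trylock(&ag->reclaim_lock)) {
		skipped++;			/* someone else owns this AG */
		continue;
	}
	reclaim_inodes_in_ag(ag);		/* sequential, batched writeback */
	mutex_unlock(&ag->reclaim_lock);
}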

And because memory allocation is bound by reclaim speed, we throttle
the direct reclaimers to prevent IO breakdown conditions from
occurring and hence keep performance under memory pressure
relatively high and mostly predictable.

It's rare that kswapd actually gets stuck like this - I've only ever
seen it once, and I've never had anyone running a production system
report deadlocks like this...

> I can't understand the difference between "kswapd0 sleeping forever at
> too_many_isolated() loop inside shrink_inactive_list()" and "kswapd0
> sleeping forever at mutex_lock() inside xfs_reclaim_inodes_ag()".

I don't really care.

The direct reclaim behaviour is a much bigger problem, and the risk
of occasionally having problems with kswapd is miniscule in
comparison. Sure, you can provoke it, but unless you are intentionally
doing nasty things to production systems, it will never be a problem
that you trip over.

We can't solve every problem with the current memory
allocation/reclaim design - we've chosen the lesser evil here, and
we're going to have to live with it until we get a more robust
memory allocation subsystem implementation.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-04 13:25                                                   ` Dave Chinner
@ 2015-03-04 14:11                                                     ` Tetsuo Handa
  2015-03-05  1:36                                                       ` Dave Chinner
  0 siblings, 1 reply; 276+ messages in thread
From: Tetsuo Handa @ 2015-03-04 14:11 UTC (permalink / raw)
  To: david
  Cc: tytso, rientjes, hannes, mhocko, dchinner, linux-mm, oleg, akpm,
	mgorman, torvalds, fernando_b1

Dave Chinner wrote:
> > Forever blocking kswapd0 somewhere inside filesystem shrinker functions is
> > equivalent to removing the kswapd() function, because it also prevents
> > non-filesystem shrinker functions from being called by kswapd0, doesn't it?
> 
> Yes, but that's not intentional. Remember, we keep talking about the
> filesystem not being able to guarantee forwards progress if
> allocations block forever? Well...
> 
> > Then, the description will become "We won't have _some_ free memory available
> > if there is no other activity that frees anything up", won't it?
> 
> ... we've ended up blocking kswapd because it's waiting on a journal
> commit to complete, and that journal commit is blocked waiting for
> forwards progress in memory allocation...
> 
> Yes, it's another one of those nasty dependencies I keep pointing
> out that filesystems have, and that can only be solved by
> guaranteeing we can always make forwards allocation progress from
> transaction reserve to transaction commit.

If this is an unexpected deadlock, don't we want the change below for
xfs_reclaim_inodes_ag()?

-	if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0) {
+	if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0 && !current_is_kswapd()) {
 		trylock = 0;
 		goto restart;
 	}

> It's rare that kswapd actually gets stuck like this - I've only ever
> seen it once, and I've never had anyone running a production system
> report deadlocks like this...

I guess we are unlikely to see this again, since so far it has been
observed only with Linux 3.19, which lacks commit cc87317726f8 ("mm:
page_alloc: revert inadvertent !__GFP_FS retry behavior change").


^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-04  6:52                                                             ` Dave Chinner
@ 2015-03-04 15:04                                                               ` Johannes Weiner
  -1 siblings, 0 replies; 276+ messages in thread
From: Johannes Weiner @ 2015-03-04 15:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, Andrew Morton, torvalds

On Wed, Mar 04, 2015 at 05:52:42PM +1100, Dave Chinner wrote:
> I suspect you've completely misunderstood what I've been suggesting.
> 
> By definition, we have the pages we reserved in the reserve pool,
> and unless we've exhausted that reservation with permanent
> allocations we should always be able to allocate from it. If the
> pool got emptied by demand page allocations, then we back off and
> retry reclaim until the reclaimable objects are released back into
> the reserve pool. i.e. reclaim fills reserve pools first, then when
> they are full pages can go back on free lists for normal
> allocations.  This provides the mechanism for forwards progress, and
> it's essentially the same mechanism that mempools use to guarantee
> forwards progress. The only difference is that reserve pool refilling
> comes through reclaim via shrinker invocation...

Yes, I had something else in mind.

In order to rely on replenishing through reclaim, you have to make
sure that all allocations taken out of the pool are guaranteed to come
back in a reasonable time frame.  So once Ted said that the filesystem
will not be able to declare which allocations of a task are allowed to
dip into its reserves, and thus allocations of indefinite lifetime can
enter the picture, my mind went to a one-off reserve pool that doesn't
rely on replenishing in order to make forward progress.  You declare
the worst-case, finish the transaction, and return what is left of the
reserves.  This obviously conflicts with the estimation model that you
are proposing; I hope it's now clear where our misunderstanding lies.

Yes, we can make this work if you can tell us which allocations have
limited/controllable lifetime.
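
(Concretely, the one-off model above would look something like this
sketch - every name here is hypothetical, not a real interface: charge
the pool once up front, and hand back whatever the transaction did not
permanently consume.)

static atomic_long_t reserve_pool = ATOMIC_LONG_INIT(RESERVE_POOL_PAGES);

/* Declare the worst case before taking any locks. */
int reserve_acquire(unsigned long worst_case_pages)
{
	if (atomic_long_sub_return(worst_case_pages, &reserve_pool) < 0) {
		atomic_long_add(worst_case_pages, &reserve_pool);
		return -ENOMEM;		/* or block until the pool refills */
	}
	return 0;
}

/* Transaction committed: return what was not permanently consumed. */
void reserve_return(unsigned long unused_pages)
{
	atomic_long_add(unused_pages, &reserve_pool);
}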


^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-04 15:04                                                               ` Johannes Weiner
@ 2015-03-04 17:38                                                                 ` Theodore Ts'o
  -1 siblings, 0 replies; 276+ messages in thread
From: Theodore Ts'o @ 2015-03-04 17:38 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, rientjes, oleg, xfs, mhocko, linux-mm, mgorman,
	dchinner, Andrew Morton, torvalds

On Wed, Mar 04, 2015 at 10:04:36AM -0500, Johannes Weiner wrote:
> Yes, we can make this work if you can tell us which allocations have
> limited/controllable lifetime.

It may be helpful to be a bit precise about definitions here.  There
are a number of different object lifetimes:

a) will be released before the kernel thread returns control to
userspace

b) will be released once the current I/O operation finishes.  (In the
case of nbd where the remote server has unexpectedly gone away, this
might be quite a while, but I'm not sure how much we care about that
scenario)

c) can be trivially released if the mm subsystem asks via calling a
shrinker

d) can be released only after doing some amount of bounded work (i.e.,
cleaning a dirty page)

e) impossible to predict when it can be released (e.g., dcache, inodes
attached to open file descriptors, buffer heads that won't be freed
until the file system is umounted, etc.)


I'm guessing that what you mean is (b), but what about cases such as
(c)?

Would the mm subsystem find it helpful if it had more information
about object lifetime?  For example, the CMA folks seem to really care
about knowing whether a memory allocation falls in category (e) or not.
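
(If it would, the categories above could be expressed as an allocation
annotation - a purely hypothetical sketch:)

enum alloc_lifetime {
	ALLOC_LT_SYSCALL,	/* (a) gone before return to userspace */
	ALLOC_LT_IO,		/* (b) gone when the current I/O completes */
	ALLOC_LT_SHRINKABLE,	/* (c) trivially freed via a shrinker */
	ALLOC_LT_BOUNDED,	/* (d) freed after bounded work */
	ALLOC_LT_UNBOUNDED,	/* (e) impossible to predict */
};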

						- Ted


^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-04 17:38                                                                 ` Theodore Ts'o
@ 2015-03-04 23:17                                                                   ` Dave Chinner
  -1 siblings, 0 replies; 276+ messages in thread
From: Dave Chinner @ 2015-03-04 23:17 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm,
	mgorman, dchinner, rientjes, Andrew Morton, torvalds

On Wed, Mar 04, 2015 at 12:38:41PM -0500, Theodore Ts'o wrote:
> On Wed, Mar 04, 2015 at 10:04:36AM -0500, Johannes Weiner wrote:
> > Yes, we can make this work if you can tell us which allocations have
> > limited/controllable lifetime.
> 
> It may be helpful to be a bit precise about definitions here.  There
> are a number of different object lifetimes:
> 
> a) will be released before the kernel thread returns control to
> userspace
> 
> b) will be released once the current I/O operation finishes.  (In the
> case of nbd where the remote server has unexpectedly gone away, this
> might be quite a while, but I'm not sure how much we care about that
> scenario)
> 
> c) can be trivially released if the mm subsystem asks via calling a
> shrinker
> 
> d) can be released only after doing some amount of bounded work (i.e.,
> cleaning a dirty page)
> 
> e) impossible to predict when it can be released (e.g., dcache, inodes
> attached to open file descriptors, buffer heads that won't be freed
> until the file system is umounted, etc.)
> 
> 
> I'm guessing that what you mean is (b), but what about cases such as
> (c)?

The thing is, in the XFS transaction case we are hitting e) for
every allocation, and only after IO and/or some processing do we
know whether it will fall into c), d) or whether it will be
permanently consumed.

> Would the mm subsystem find it helpful if it had more information
> about object lifetime?  For example, the CMA folks seem to really care
> about knowing whether a memory allocation falls in category (e) or not.

The problem is that most filesystem allocations fall into category
(e). Worse is that the state of an object can change without
allocations having taken place, e.g. an object on a reclaimable LRU
can be found via a cache lookup, then joined to and modified in a
transaction. Hence objects can change state from "reclaimable" to
"permanently consumed" without actually going through memory reclaim
and allocation.

IOWs, what is really required is the ability to say "this amount of
allocation reserve is now consumed" /some time after/ we've done the
allocation. i.e. when we join the object to the transaction and
modify it, that's when we need to be able to reduce the reservation
limit as that memory is now permanently consumed by the transaction
context. Objects that fall into c) and d) don't need anything
special done, because reclaim will eventually free the memory they
hold once the allocating context releases them.

Indeed, this model works even when we find those c) and d) objects
in cache rather than allocating them. They would get correctly
accounted as "consumed reserve" because we no longer need to
allocate that memory in transaction context and so that reserve can
be released back to the free pool....
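
(As a sketch - all names here hypothetical - the accounting hook would
sit where an object is joined to the transaction and dirtied, not where
it is allocated:)

/* Charge the reserve at the moment an object becomes permanently
 * consumed, whether it was freshly allocated or found in cache. */
static void trans_join_and_dirty(struct transaction *tp, struct object *obj)
{
	join_to_transaction(tp, obj);		/* hypothetical helper */
	mark_dirty(obj);			/* hypothetical helper */
	reserve_consume(tp->reserve, obj->nr_pages);
}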

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-04 14:11                                                     ` Tetsuo Handa
@ 2015-03-05  1:36                                                       ` Dave Chinner
  0 siblings, 0 replies; 276+ messages in thread
From: Dave Chinner @ 2015-03-05  1:36 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: tytso, rientjes, hannes, mhocko, dchinner, linux-mm, oleg, akpm,
	mgorman, torvalds, fernando_b1

On Wed, Mar 04, 2015 at 11:11:48PM +0900, Tetsuo Handa wrote:
> Dave Chinner wrote:
> > > Forever blocking kswapd0 somewhere inside filesystem shrinker functions is
> > > equivalent to removing the kswapd() function, because it also prevents
> > > non-filesystem shrinker functions from being called by kswapd0, doesn't it?
> > 
> > Yes, but that's not intentional. Remember, we keep talking about the
> > filesystem not being able to guarantee forwards progress if
> > allocations block forever? Well...
> > 
> > > Then, the description will become "We won't have _some_ free memory available
> > > if there is no other activity that frees anything up", won't it?
> > 
> > ... we've ended up blocking kswapd because it's waiting on a journal
> > commit to complete, and that journal commit is blocked waiting for
> > forwards progress in memory allocation...
> > 
> > Yes, it's another one of those nasty dependencies I keep pointing
> > out that filesystems have, and that can only be solved by
> > guaranteeing we can always make forwards allocation progress from
> > transaction reserve to transaction commit.
> 
> If this is an unexpected deadlock, don't we want the change below for
> xfs_reclaim_inodes_ag()?
> 
> -	if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0) {
> +	if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0 && !current_is_kswapd()) {
>  		trylock = 0;
>  		goto restart;
>  	}

What, so when direct reclaim has completely choked up all the inode
reclaim slots, kswapd just burns CPU spinning while it fails to make
progress?

Besides, that does not address the actual issue that caused kswapd
to block on a log force. That's caused by the SYNC_WAIT flag telling
reclaim to wait for IO completion - this is the reclaim throttling
mechanism we need to prevent reclaim from degrading to random IO
patterns and completely trashing reclaim rates.  Hence reclaiming an
inode waits in xfs_iunpin_wait() for the log to be flushed before
reclaiming an inode that is pinned by an unflushed transaction.

This works because there is also a background reclaim worker running
doing fast, highly efficient, sequential order, non-blocking
asynchronous inode writeback. Hence, more often than not, reclaim
does not block on more than one dirty inode per scan because the
rest of the inodes it walks have already been cleaned and are ready
for immediate reclaim.
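
(The shape of that background worker - a simplified sketch; XFS's real
one is xfs_reclaim_worker - is just a self-rearming piece of delayed
work:)

static void background_reclaim_worker(struct work_struct *work)
{
	struct delayed_work *dwork = to_delayed_work(work);

	/* hypothetical helper: non-blocking, sequential inode writeback */
	reclaim_inodes_async();

	/* rearm so reclaim keeps trickling along in the background */
	queue_delayed_work(system_wq, dwork, msecs_to_jiffies(5000));
}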

We have multiple layers of reclaim work going on in XFS even within
each cache/shrinker infrastructure. Indeed, if I start having to
explain how this inode shrinker algorithm ties back into journal
tail pushing to optimise async metadata flushing so that the XFS
buffer cache shrinker hits clean inode buffers and hence can reclaim
the memory the inode shrinker consumes doing inode writeback as
quickly as possible, then I think heads might start to explode.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-01 11:17                                                         ` Tetsuo Handa
@ 2015-03-06 11:53                                                           ` Tetsuo Handa
  -1 siblings, 0 replies; 276+ messages in thread
From: Tetsuo Handa @ 2015-03-06 11:53 UTC (permalink / raw)
  To: david
  Cc: tytso, hannes, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, akpm, fernando_b1, torvalds

Tetsuo Handa wrote:
> If underestimating is tolerable, can we simply set different watermark
> levels for GFP_ATOMIC / GFP_NOIO / GFP_NOFS / GFP_KERNEL allocations?
> For example,
> 
>    GFP_KERNEL (or above) can fail if memory usage exceeds 95%
>    GFP_NOFS can fail if memory usage exceeds 97%
>    GFP_NOIO can fail if memory usage exceeds 98%
>    GFP_ATOMIC can fail if memory usage exceeds 99%
> 
> I think it sounds strange that the order-0 GFP_NOIO allocation below enters
> a retry-forever loop when a GFP_KERNEL (or above) allocation starts waiting
> for reclaim. Use of the same watermark is preventing kernel worker threads
> from processing the workqueue. While it is legal to do blocking operations
> from a workqueue, being blocked forever monopolizes the workqueue;
> other jobs in the workqueue get stuck.
> 

The experimental patch below, which raises the zone watermark, works for me.

----------
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d77432..92233e1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1710,6 +1710,7 @@ struct task_struct {
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 	unsigned long	task_state_change;
 #endif
+	gfp_t gfp_mask;
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7abfa70..1a6b830 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1810,6 +1810,12 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
 		min -= min / 2;
 	if (alloc_flags & ALLOC_HARDER)
 		min -= min / 4;
+	if (min == mark) {
+		if (current->gfp_mask & __GFP_FS)
+			min <<= 1;
+		if (current->gfp_mask & __GFP_IO)
+			min <<= 1;
+	}
 #ifdef CONFIG_CMA
 	/* If allocation can't use CMA areas don't use free CMA pages */
 	if (!(alloc_flags & ALLOC_CMA))
@@ -2810,6 +2816,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		.nodemask = nodemask,
 		.migratetype = gfpflags_to_migratetype(gfp_mask),
 	};
+	gfp_t orig_gfp_mask;
 
 	gfp_mask &= gfp_allowed_mask;
 
@@ -2831,6 +2838,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	if (IS_ENABLED(CONFIG_CMA) && ac.migratetype == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;
 
+	orig_gfp_mask = current->gfp_mask;
+	current->gfp_mask = gfp_mask;
 retry_cpuset:
 	cpuset_mems_cookie = read_mems_allowed_begin();
 
@@ -2873,6 +2882,7 @@ out:
 	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
 		goto retry_cpuset;
 
+	current->gfp_mask = orig_gfp_mask;
 	return page;
 }
 EXPORT_SYMBOL(__alloc_pages_nodemask);
----------

Thanks again to Jonathan Corbet for writing https://lwn.net/Articles/635354/ .
Is Dave Chinner's "reservations" suggestion conceptually the same as the patch above?

Dave's suggestion is to ask each GFP_NOFS and GFP_NOIO user to estimate
how many pages they need for their transaction, like

	if (min == mark) {
		if (current->gfp_mask & __GFP_FS)
			min += atomic_read(&reservation_for_gfp_fs);
		if (current->gfp_mask & __GFP_IO)
			min += atomic_read(&reservation_for_gfp_io);
	}

rather than asking the administrator to specify a static amount, like

	if (min == mark) {
		if (current->gfp_mask & __GFP_FS)
			min += sysctl_reservation_for_gfp_fs;
		if (current->gfp_mask & __GFP_IO)
			min += sysctl_reservation_for_gfp_io;
	}

?

The retry-forever loop will still happen if we underestimate, won't it?
Then, how do we handle it when the OOM killer misses the target (due to
__GFP_FS) or the OOM killer cannot be invoked (due to !__GFP_FS)?


^ permalink raw reply related	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-02 22:31                                                         ` Dave Chinner
@ 2015-03-07  0:20                                                           ` Johannes Weiner
  -1 siblings, 0 replies; 276+ messages in thread
From: Johannes Weiner @ 2015-03-07  0:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, rientjes, oleg, xfs, mhocko, linux-mm, mgorman,
	dchinner, Andrew Morton, torvalds, Vlastimil Babka

On Tue, Mar 03, 2015 at 09:31:54AM +1100, Dave Chinner wrote:
> What we don't know is how many objects we might need to scan to find
> the objects we will eventually modify.  Here's an (admittedly
> extreme) example to demonstrate a worst case scenario: allocate a
> 64k data extent. Because it is an exact size allocation, we look it
> up in the by-size free space btree. Free space is fragmented, so
> there are about a million 64k free space extents in the tree.
> 
> Once we find the first 64k extent, we search them to find the best
> locality target match.  The btree records are 16 bytes each, so we
> fit roughly 500 to a 4k block. Say we search half the extents to
> find the best match - i.e. we walk a thousand leaf blocks before
> finding the match we want, and modify that leaf block.
> 
> Now, the modification removed an entry from the leaf and that
> triggers leaf merge thresholds, so a merge with the 1002nd block
> occurs. That block now demand pages in and we then modify and join
> it to the transaction. Now we walk back up the btree to update
> indexes, merging blocks all the way back up to the root.  We have a
> worst case size btree (5 levels) and we merge at every level meaning
> we demand page another 8 btree blocks and modify them.
> 
> In this case, we've demand paged ~1010 btree blocks, but only
> modified 10 of them. i.e. the memory we consumed permanently was
> only 10 4k buffers (approx. 10 slab and 10 page allocations), but
> the allocation demand was 2 orders of magnitude more than the
> unreclaimable memory consumption of the btree modification.
> 
> I hope you start to see the scope of the problem now...

Isn't this bounded one way or another?  Sure, the inaccuracy itself is
high, but when you put the absolute numbers in perspective it really
doesn't seem to matter: with your extreme case of 3MB per transaction,
you can still run 5k+ of them in parallel on a small 16G machine.
Occupy a generous 75% of RAM with anonymous pages, and you can STILL
run over a thousand transactions concurrently.  That would seem like a
decent pipeline to keep the storage device occupied.
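
(For reference, a stand-alone check of the arithmetic behind those
figures, using only the numbers quoted above:)

	#include <stdio.h>

	int main(void)
	{
		unsigned long long ram = 16ULL << 30;	/* 16G machine */
		unsigned long long tx  = 3ULL << 20;	/* ~3MB worst case per transaction */

		/* all of RAM: ~5461 transactions -> "5k+" */
		printf("%llu\n", ram / tx);
		/* 75% anonymous pages leaves 4G: ~1365 -> "over a thousand" */
		printf("%llu\n", (ram / 4) / tx);
		return 0;
	}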

The level of precision that you are asking for comes with complexity
and fragility that I'm not convinced is necessary, or justified.

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-07  0:20                                                           ` Johannes Weiner
@ 2015-03-07  3:43                                                             ` Dave Chinner
  -1 siblings, 0 replies; 276+ messages in thread
From: Dave Chinner @ 2015-03-07  3:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, rientjes, oleg, xfs, mhocko, linux-mm, mgorman,
	dchinner, Andrew Morton, torvalds, Vlastimil Babka

On Fri, Mar 06, 2015 at 07:20:55PM -0500, Johannes Weiner wrote:
> On Tue, Mar 03, 2015 at 09:31:54AM +1100, Dave Chinner wrote:
> > What we don't know is how many objects we might need to scan to find
> > the objects we will eventually modify.  Here's an (admittedly
> > extreme) example to demonstrate a worst case scenario: allocate a
> > 64k data extent. Because it is an exact size allocation, we look it
> > up in the by-size free space btree. Free space is fragmented, so
> > there are about a million 64k free space extents in the tree.
> > 
> > Once we find the first 64k extent, we search them to find the best
> > locality target match.  The btree records are 16 bytes each, so we
> > fit roughly 500 to a 4k block. Say we search half the extents to
> > find the best match - i.e. we walk a thousand leaf blocks before
> > finding the match we want, and modify that leaf block.
> > 
> > Now, the modification removed an entry from the leaf and that
> > triggers leaf merge thresholds, so a merge with the 1002nd block
> > occurs. That block now demand pages in and we then modify and join
> > it to the transaction. Now we walk back up the btree to update
> > indexes, merging blocks all the way back up to the root.  We have a
> > worst case size btree (5 levels) and we merge at every level meaning
> > we demand page another 8 btree blocks and modify them.
> > 
> > In this case, we've demand paged ~1010 btree blocks, but only
> > modified 10 of them. i.e. the memory we consumed permanently was
> > only 10 4k buffers (approx. 10 slab and 10 page allocations), but
> > the allocation demand was 2 orders of magnitude more than the
> > unreclaimable memory consumption of the btree modification.
> > 
> > I hope you start to see the scope of the problem now...
> 
> Isn't this bounded one way or another?

For a single transaction? No.

> Sure, the inaccuracy itself is
> high, but when you put the absolute numbers in perspective it really
> doesn't seem to matter: with your extreme case of 3MB per transaction,
> you can still run 5k+ of them in parallel on a small 16G machine.

No you can't. The number of concurrent transactions is bounded by
the size of the log and the amount of unused space available for
reservation in the log. Under heavy modification loads, that's
usually somewhere between 15-25% of the log, so worst case is a few
hundred megabytes. The memory reservation demand is in the same
order of magnitude as the log space reservation demand.....
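
(To make that bound concrete with assumed numbers, not Dave's: a 2GB
log with ~20% usable for reservations gives ~400MB, and at the ~3MB
per-transaction figure quoted earlier that caps concurrency at roughly
a hundred transactions, nowhere near 5k:)

	#include <stdio.h>

	int main(void)
	{
		unsigned long long log_size  = 2ULL << 30;	/* assumed 2GB log */
		unsigned long long resv_room = log_size / 5;	/* ~20% usable */
		unsigned long long per_tx    = 3ULL << 20;	/* ~3MB per transaction */

		/* ~136 concurrent transactions */
		printf("%llu\n", resv_room / per_tx);
		return 0;
	}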

> Occupy a generous 75% of RAM with anonymous pages, and you can STILL
> run over a thousand transactions concurrently.  That would seem like a
> decent pipeline to keep the storage device occupied.

Typical systems won't ever get to that - they don't do more than a
handful of concurrent transactions at a time - the "thousands of
transactions" occur on dedicated storage servers like petabyte scale
NFS servers that have hundreds of gigabytes of RAM and
hundreds-to-thousands of processing threads to keep the request
pipeline full. The memory in those machines is entirely dedicated to
the filesystem, so keeping a usable pool of a few gigabytes for
transaction reservations isn't a big deal.

The point here is that you're taking what I'm describing as the
requirements of a reservation pool and then applying the worst case
to situations where it is completely inappropriate. That's what I mean
when I told Michal to stop building silly strawman situations; large
amounts of concurrency are required for huge machines, not your
desktop workstation.

And, realistically, sizing that reservation pool appropriately is my
problem to solve - it will depend on many factors, one of which is
the actual geometry of the filesystem itself. You need to stop
thinking that you can control how applications use the memory
allocation and reclaim subsystem, and start to trust that we will
manage our memory usage appropriately to maintain maximum system
throughput.

After all, we already do that for all the filesystem caches the mm
subsystem doesn't control - why do you think I have had such an
interest in shrinker scalability? For XFS, the only cache we
actually don't control reclaim from is user data in the page cache -
we control everything else directly from custom shrinkers.....

> The level of precision that you are asking for comes with complexity
> and fragility that I'm not convinced is necessary, or justified.

Look, if you don't think reservations will work, then how about you
suggest something that will. I don't really care what you implement,
as long as it meets the needs of demand paging, gives me direct
control over memory usage and concurrency policy, and guarantees
forward progress without needing the OOM killer.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-07  3:43                                                             ` Dave Chinner
@ 2015-03-07 15:08                                                               ` Johannes Weiner
  -1 siblings, 0 replies; 276+ messages in thread
From: Johannes Weiner @ 2015-03-07 15:08 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, rientjes, oleg, xfs, mhocko, linux-mm, mgorman,
	dchinner, Andrew Morton, torvalds, Vlastimil Babka

On Sat, Mar 07, 2015 at 02:43:47PM +1100, Dave Chinner wrote:
> On Fri, Mar 06, 2015 at 07:20:55PM -0500, Johannes Weiner wrote:
> > On Tue, Mar 03, 2015 at 09:31:54AM +1100, Dave Chinner wrote:
> > > What we don't know is how many objects we might need to scan to find
> > > the objects we will eventually modify.  Here's an (admittedly
> > > extreme) example to demonstrate a worst case scenario: allocate a
> > > 64k data extent. Because it is an exact size allocation, we look it
> > > up in the by-size free space btree. Free space is fragmented, so
> > > there are about a million 64k free space extents in the tree.
> > > 
> > > Once we find the first 64k extent, we search them to find the best
> > > locality target match.  The btree records are 16 bytes each, so we
> > > fit roughly 500 to a 4k block. Say we search half the extents to
> > > find the best match - i.e. we walk a thousand leaf blocks before
> > > finding the match we want, and modify that leaf block.
> > > 
> > > Now, the modification removed an entry from the leaf and that
> > > triggers leaf merge thresholds, so a merge with the 1002nd block
> > > occurs. That block now demand pages in and we then modify and join
> > > it to the transaction. Now we walk back up the btree to update
> > > indexes, merging blocks all the way back up to the root.  We have a
> > > worst case size btree (5 levels) and we merge at every level meaning
> > > we demand page another 8 btree blocks and modify them.
> > > 
> > > In this case, we've demand paged ~1010 btree blocks, but only
> > > modified 10 of them. i.e. the memory we consumed permanently was
> > > only 10 4k buffers (approx. 10 slab and 10 page allocations), but
> > > the allocation demand was 2 orders of magnitude more than the
> > > unreclaimable memory consumption of the btree modification.
> > > 
> > > I hope you start to see the scope of the problem now...
> > 
> > Isn't this bounded one way or another?
> 
> For a single transaction? No.

So you can have an infinite number of allocations in the context of a
transaction, and only the objects that are going to be locked in are
bounded?

> > Sure, the inaccuracy itself is
> > high, but when you put the absolute numbers in perspective it really
> > doesn't seem to matter: with your extreme case of 3MB per transaction,
> > you can still run 5k+ of them in parallel on a small 16G machine.
> 
> No you can't. The number of concurrent transactions is bounded by
> the size of the log and the amount of unused space available for
> reservation in the log. Under heavy modification loads, that's
> usually somewhere between 15-25% of the log, so worst case is a few
> hundred megabytes. The memory reservation demand is in the same
> order of magnitude as the log space reservation demand.....
> 
> > Occupy a generous 75% of RAM with anonymous pages, and you can STILL
> > run over a thousand transactions concurrently.  That would seem like a
> > decent pipeline to keep the storage device occupied.
> 
> Typical systems won't ever get to that - they don't do more than a
> handful of concurrent transactions at a time - the "thousands of
> transactions" occur on dedicated storage servers like petabyte scale
> NFS servers that have hundreds of gigabytes of RAM and
> hundreds-to-thousands of processing threads to keep the request
> pipeline full. The memory in those machines is entirely dedicated to
> the filesystem, so keeping a usable pool of a few gigabytes for
> transaction reservations isn't a big deal.
> 
> The point here is that you're taking what I'm describing as the
> requirements of a reservation pool and then applying the worst case
> to situations where it is completely inappropriate. That's what I mean
> when I told Michal to stop building silly strawman situations; large
> amounts of concurrency are required for huge machines, not your
> desktop workstation.

Why do you have to take everything I say in bad faith and choose to be
smug instead of constructive?  This is unnecessary.  OF COURSE you
know your constraints better than we do.  Now explain how they matter
in practice, because that's what dictates the design in engineering.

I'm trying to figure out your requirements to find the simplest model,
and yes I'm obviously going to follow up when you give me incomplete
information.  I'm responding to this:

: What we don't know is how many objects we might need to scan to find
: the objects we will eventually modify.  Here's an (admittedly
: extreme) example to demonstrate a worst case scenario:

You gave us numbers that you called "worst case", so I took them and
put them in a scenario where it looks like memory wouldn't be the
bottleneck in real life, even if we just had simple pre-allocation
semantics.  If it was a silly example, why not provide a better one?

I'm fine with reservations and I'm fine with adding more complexity
when you demonstrate that it's needed.  Your argument seems to have
been that worst-case estimates are way off, but can you please just
demonstrate why it matters in practice?  Instead of having me do it
and calling my attempts strawman arguments?  I can just guess your
constraints, it's up to you to make a case for your requirements.

Here is another example where you responded to akpm:

---
> When allocating pages the caller should drain its reserves in
> preference to dipping into the regular freelist.  This guy has already
> done his reclaim and shouldn't be penalised a second time.  I guess
> Johannes's preallocation code should switch to doing this for the same
> reason, plus the fact that snipping a page off
> task_struct.prealloc_pages is super-fast and needs to be done sometime
> anyway so why not do it by default.

That is at odds with the requirements of demand paging, which
allocates for objects that are reclaimable within the course of the
transaction. The reserve is there to ensure forward progress for
allocations for objects that aren't freed until after the
transaction completes, but if we drain it for reclaimable objects we
then have nothing left in the reserve pool when we actually need it.

We do not know ahead of time if the object we are allocating is
going to be modified and hence locked into the transaction. Hence we
can't say "use the reserve for this *specific* allocation", and so
the only guidance we can really give is "we will allocate and
*permanently consume* this much memory", and the reserve pool needs
to cover that consumption to guarantee forwards progress.

Forwards progress for all other allocations is guaranteed because
they are reclaimable objects - they are either freed directly back to
their source (slab, heap, page lists) or they are freed by shrinkers
once they have been released from the transaction.

Hence we need allocations to come from the free list and trigger
reclaim, regardless of the fact there is a reserve pool there. The
reserve pool needs to be a last resort once there are no other
avenues to allocate memory. i.e. it would be used to replace the OOM
killer for GFP_NOFAIL allocations.
---
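
(A minimal sketch of the allocation ordering Dave describes in the
quote above; struct mem_reserve and reserve_alloc_page() are
hypothetical, nothing like them exists in the kernel:)

	#include <linux/gfp.h>

	struct mem_reserve;	/* hypothetical per-transaction reserve pool */
	struct page *reserve_alloc_page(struct mem_reserve *pool);

	struct page *tx_alloc_page(struct mem_reserve *pool, gfp_t gfp)
	{
		struct page *page;

		/* Normal path first: free lists, then direct reclaim. */
		page = alloc_page(gfp & ~__GFP_NOFAIL);
		if (page)
			return page;

		/* Last resort: dip into the reserve instead of
		 * invoking the OOM killer. */
		if (gfp & __GFP_NOFAIL)
			return reserve_alloc_page(pool);

		return NULL;
	}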

Andrew makes a proposal and backs it up with real life benefits:
simpler, faster.  You on the other hand follow up with a list of
unfounded claims and your only counter-argument really seems to be
that Andrew's proposal differs from what you've had in mind.  What you
had in mind was obviously driven by constraints known to you, but it's
not an argument until you actually include them.  We're not taking
your claims at face value, that's not how this ever works.

Just explain why and how your requirements, demand paging reserves in
this case, matter in real life.  Then we can take them seriously.

> And, realistically, sizing that reservation pool appropriately is my
> problem to solve - it will depend on many factors, one of which is
> the actual geometry of the filesystem itself. You need to stop
> thinking that you can control how applications use the memory
> allocation and reclaim subsystem, and start to trust that we will
> manage our memory usage appropriately to maintain maximum system
> throughput.

You've been working on the kernel long enough to know that this is not
how it goes.  I don't care about getting a list of things you claim
you need and implementing them blindly, trusting that you know what
you're doing when it comes to memory.  If you want us to expose an
interface, which puts constraints on our implementation, then you
better provide justification for every single requirement.

> After all, we already do that for all the filesystem caches the mm
> subsystem doesn't control - why do you think I have had such an
> interest in shrinker scalability? For XFS, the only cache we
> actually don't control reclaim from is user data in the page cache -
> we control everything else directly from custom shrinkers.....

You mean those global object pools that are aged through unrelated and
independent per-zone pressure values?

Look, we are specialized in different subsystems, which means we know
the details in front of us better than the details in the surrounding
areas.  You are quick to dismiss constraints and scalability concerns
in the memory subsystem, and I do the same for memory users.  We are
having this discussion in order to explore where our problem spaces
intersect, and we could be making more progress if you stopped
assuming that everybody else is an idiot and you already found the
perfect solution.

We need data on your parameters in order to make a basic cost-benefit
analysis of any proposed solutions.  Don't just propose something and
talk down to us when we ask for clarifications on your constraints.
It's not getting us anywhere.  Explore the problem space with us,
explain your constraints and exact requirements based on real life
data, and then we can look for potential solutions.  That is how we
evaluate every single proposal for the kernel, and it's how it's going
to work in this case.  It's not that complicated.

> > The level of precision that you are asking for comes with complexity
> > and fragility that I'm not convinced is necessary, or justified.
> 
> Look, if you don't think reservations will work, then how about you
> suggest something that will. I don't really care what you implement,
> as long as it meets the needs of demand paging, gives me direct
> control over memory usage and concurrency policy, and guarantees
> forward progress without needing the OOM killer.

Reservations are fine and I also want them to replace the OOM killer,
we agree on that.

The only thing my email was about was that, in light of the worst-case
numbers you quoted, it didn't look like the demand paging requirement
is strictly necessary to make the system work in practice, which is
why I'm questioning that particular requirement and prompting you to
clarify your position.  You have yet to address this.

Until then, the simplest semantics are preallocation semantics, where
you in advance establish private reserve pools (which can be backed by
clean cache) from which you allocate directly using __GFP_RESERVE.  If
the pool is empty it's immediately detectable and attributable to the
culprit, and the other reserves are not impacted by it.
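
(Sketching those semantics; __GFP_RESERVE was only ever proposed, and
the mem_reserve_*() helpers below are equally hypothetical:)

	#include <linux/gfp.h>

	/* All hypothetical: neither the flag nor the API was merged. */
	#define __GFP_RESERVE	((__force gfp_t)0x40000000u)
	struct mem_reserve;
	struct mem_reserve *mem_reserve_create(const char *name, int pages);
	void mem_reserve_destroy(struct mem_reserve *res);

	void example(void)
	{
		struct mem_reserve *res;
		struct page *page;

		/* Establish a private pool up front, sized for the
		 * worst case; it can be backed by clean cache. */
		res = mem_reserve_create("xfs-trans", 128);

		/* Allocations marked __GFP_RESERVE draw against the
		 * caller's pool (assumed attached to the task here). */
		page = alloc_page(GFP_NOFS | __GFP_RESERVE);
		if (!page) {
			/* Pool empty: immediately detectable and
			 * attributable to this pool's owner. */
			mem_reserve_destroy(res);
			return;
		}

		__free_page(page);
		mem_reserve_destroy(res);
	}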

A globally shared demand-paged pool is much more fragile because you
trust other participants in the system to keep their promise and not
pin more objects than they reserved for.  Otherwise, they deadlock
your transaction and corrupt your userdata.  How does "XFS filesystem
corrupted because it shares its emergency memory pool to ensure data
integrity with some buggy driver" sound to you?

It's also harder to verify.  If one of the participants misbehaves and
pins more objects than they initially reserved for, how do we identify
the culprit when the system locks up?

Make an actual case why preallocation semantics are unworkable on real
systems with real memory and real filesystems and real data on them,
then we can consider making the model more complex and fragile.

^ permalink raw reply	[flat|nested] 276+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-11 18:59                                       ` Oleg Nesterov
@ 2015-03-14 13:03                                         ` Tetsuo Handa
  0 siblings, 0 replies; 276+ messages in thread
From: Tetsuo Handa @ 2015-03-14 13:03 UTC (permalink / raw)
  To: oleg
  Cc: mhocko, hannes, david, dchinner, linux-mm, rientjes, akpm,
	mgorman, torvalds

Oleg Nesterov wrote:
> On 02/11, Oleg Nesterov wrote:
> >
> > On 02/11, Tetsuo Handa wrote:
> > >
> > > (Asking Oleg this time.)
> >
> > Well, sorry, I ignored the previous discussion, not sure I understand you
> > correctly.
> >
> > > > Though, more serious behavior with this reproducer is (B) where the system
> > > > stalls forever without kernel messages being saved to /var/log/messages .
> > > > out_of_memory() does not select victims until the coredump to pipe can make
> > > > progress whereas the coredump to pipe can't make progress until memory
> > > > allocation succeeds or fails.
> > >
> > > This behavior is related to commit d003f371b2701635 ("oom: don't assume
> > > that a coredumping thread will exit soon"). That commit tried to take
> > > SIGNAL_GROUP_COREDUMP into account, but actually it is failing to do so.
> >
> > Heh. Please see the changelog. This "fix" is obviously very limited, it does
> > not even try to solve all problems (even with coredump in particular).
> >
> > Note also that SIGNAL_GROUP_COREDUMP is not even set if the process (not a
> > sub-thread) shares the memory with the coredumping task. It would be better
> > to check mm->core_state != NULL instead, but this needs the locking. Plus
> > that process likely sleeps in D state in exit_mm(), so this can't help.
> >
> > And that is why we set SIGNAL_GROUP_COREDUMP in zap_threads(), not in
> > zap_process(). We probably want to make that "wait for coredump_finish()"
> > sleep in exit_mm() killable, but this is not simple.
> 
> on second thought, perhaps it makes sense to set SIGNAL_GROUP_COREDUMP
> anyway, even if a CLONE_VM process participating in coredump is not killable.
> I'll recheck tomorrow.

Ping?

> 
> > Sorry for noise if the above is not relevant.
> >
> > Oleg.
> 
> 

I tried https://lkml.org/lkml/2015/3/11/707 with retry_allocation_attempts == 1
(with http://marc.info/?l=linux-mm&m=141671829611143&w=2 for debug printk() ).

Although a 0x2015a (i.e. !__GFP_FS) allocation likely fails within a few
jiffies under the TIF_MEMDIE condition, the TIF_MEMDIE condition itself
cannot be resolved until the SIGNAL_GROUP_COREDUMP patch is proposed.

----------
XFS: possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
XFS: possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
warn_alloc_failed: 212565 callbacks suppressed
crond: page allocation failure: order:0, mode:0x2015a
rngd: page allocation failure: order:0, mode:0x2015a
CPU: 3 PID: 1667 Comm: rngd Not tainted 4.0.0-rc3+ #37
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
 0000000000000000 00000000ce4cec53 0000000000000000 ffffffff815f30c4
 000000000002015a ffffffff8111063e ffff88007fffdb00 0000000000000000
 0000000000000040 ffff88007c223db0 0000000000000000 00000000ce4cec53
Call Trace:
 [<ffffffff815f30c4>] ? dump_stack+0x40/0x50
 [<ffffffff8111063e>] ? warn_alloc_failed+0xee/0x150
 [<ffffffff81113b03>] ? __alloc_pages_nodemask+0x623/0xa10
 [<ffffffff81150c57>] ? alloc_pages_current+0x87/0x100
 [<ffffffff8110d30d>] ? filemap_fault+0x1bd/0x400
 [<ffffffff812e3dbc>] ? radix_tree_next_chunk+0x5c/0x240
 [<ffffffff8112f85b>] ? __do_fault+0x4b/0xe0
 [<ffffffff81134465>] ? handle_mm_fault+0xc85/0x1640
 [<ffffffff81051c9a>] ? __do_page_fault+0x16a/0x430
 [<ffffffff81051f90>] ? do_page_fault+0x30/0x70
 [<ffffffff815fb03f>] ? error_exit+0x1f/0x60
 [<ffffffff815fae18>] ? page_fault+0x28/0x30
----------
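
(For reference, that allocation mode decodes as follows, using the GFP
bit values of the 4.0 era:)

	/* 0x2015a =
	 *   __GFP_HIGHMEM (0x02) | __GFP_MOVABLE  (0x08) |
	 *   __GFP_WAIT    (0x10) | __GFP_IO       (0x40) |
	 *   __GFP_COLD   (0x100) | __GFP_HARDWALL (0x20000)
	 * i.e. GFP_HIGHUSER_MOVABLE plus __GFP_COLD but without
	 * __GFP_FS -- a page-fault pagecache allocation that may not
	 * recurse into the filesystem, hence "!__GFP_FS" above. */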

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 276+ messages in thread

end of thread, other threads:[~2015-03-14 13:53 UTC | newest]

Thread overview: 276+ messages
-- links below jump to the message on this page --
2014-12-12 13:54 [RFC PATCH] oom: Don't count on mm-less current process Tetsuo Handa
2014-12-16 12:47 ` Michal Hocko
2014-12-17 11:54   ` Tetsuo Handa
2014-12-17 13:08     ` Michal Hocko
2014-12-18 12:11       ` Tetsuo Handa
2014-12-18 15:33         ` Michal Hocko
2014-12-19 12:07           ` Tetsuo Handa
2014-12-19 12:49             ` Michal Hocko
2014-12-20  9:13               ` Tetsuo Handa
2014-12-20 11:42                 ` Tetsuo Handa
2014-12-22 20:25                   ` Michal Hocko
2014-12-23  1:00                     ` Tetsuo Handa
2014-12-23  9:51                       ` Michal Hocko
2014-12-23 11:46                         ` Tetsuo Handa
2014-12-23 11:57                           ` Tetsuo Handa
2014-12-23 12:12                             ` Tetsuo Handa
2014-12-23 12:27                             ` Michal Hocko
2014-12-23 12:24                           ` Michal Hocko
2014-12-23 13:00                             ` Tetsuo Handa
2014-12-23 13:09                               ` Michal Hocko
2014-12-23 13:20                                 ` Tetsuo Handa
2014-12-23 13:43                                   ` Michal Hocko
2014-12-23 14:11                                     ` Tetsuo Handa
2014-12-23 14:57                                       ` Michal Hocko
2014-12-19 12:22           ` How to handle TIF_MEMDIE stalls? Tetsuo Handa
2014-12-20  2:03             ` Dave Chinner
2014-12-20 12:41               ` Tetsuo Handa
2014-12-20 22:35                 ` Dave Chinner
2014-12-21  8:45                   ` Tetsuo Handa
2014-12-21 20:42                     ` Dave Chinner
2014-12-22 16:57                       ` Michal Hocko
2014-12-22 21:30                         ` Dave Chinner
2014-12-23  9:41                           ` Johannes Weiner
2014-12-24  1:06                             ` Dave Chinner
2014-12-24  2:40                               ` Linus Torvalds
2014-12-29 18:19                     ` Michal Hocko
2014-12-30  6:42                       ` Tetsuo Handa
2014-12-30 11:21                         ` Michal Hocko
2014-12-30 13:33                           ` Tetsuo Handa
2014-12-31 10:24                             ` Tetsuo Handa
2015-02-09 11:44                           ` Tetsuo Handa
2015-02-10 13:58                             ` Tetsuo Handa
2015-02-10 15:19                               ` Johannes Weiner
2015-02-11  2:23                                 ` Tetsuo Handa
2015-02-11 13:37                                   ` Tetsuo Handa
2015-02-11 18:50                                     ` Oleg Nesterov
2015-02-11 18:59                                       ` Oleg Nesterov
2015-03-14 13:03                                         ` Tetsuo Handa
2015-02-17 12:23                                   ` Tetsuo Handa
2015-02-17 12:53                                     ` Johannes Weiner
2015-02-17 15:38                                       ` Michal Hocko
2015-02-17 22:54                                       ` Dave Chinner
2015-02-17 22:54                                         ` Dave Chinner
2015-02-17 23:32                                         ` Dave Chinner
2015-02-17 23:32                                           ` Dave Chinner
2015-02-18  8:25                                         ` Michal Hocko
2015-02-18  8:25                                           ` Michal Hocko
2015-02-18 10:48                                           ` Dave Chinner
2015-02-18 10:48                                             ` Dave Chinner
2015-02-18 12:16                                             ` Michal Hocko
2015-02-18 12:16                                               ` Michal Hocko
2015-02-18 21:31                                               ` Dave Chinner
2015-02-18 21:31                                                 ` Dave Chinner
2015-02-19  9:40                                                 ` Michal Hocko
2015-02-19  9:40                                                   ` Michal Hocko
2015-02-19 22:03                                                   ` Dave Chinner
2015-02-19 22:03                                                     ` Dave Chinner
2015-02-20  9:27                                                     ` Michal Hocko
2015-02-20  9:27                                                       ` Michal Hocko
2015-02-19 11:01                                               ` Johannes Weiner
2015-02-19 11:01                                                 ` Johannes Weiner
2015-02-19 12:29                                                 ` Michal Hocko
2015-02-19 12:29                                                   ` Michal Hocko
2015-02-19 12:58                                                   ` Michal Hocko
2015-02-19 12:58                                                     ` Michal Hocko
2015-02-19 15:29                                                     ` Tetsuo Handa
2015-02-19 15:29                                                       ` Tetsuo Handa
2015-02-19 15:29                                                       ` Tetsuo Handa
2015-02-19 21:53                                                       ` Tetsuo Handa
2015-02-19 21:53                                                         ` Tetsuo Handa
2015-02-19 21:53                                                         ` Tetsuo Handa
2015-02-20  9:13                                                       ` Michal Hocko
2015-02-20  9:13                                                         ` Michal Hocko
2015-02-20 13:37                                                         ` Stefan Ring
2015-02-20 13:37                                                           ` Stefan Ring
2015-02-19 13:29                                                   ` Tetsuo Handa
2015-02-19 13:29                                                     ` Tetsuo Handa
2015-02-19 13:29                                                     ` Tetsuo Handa
2015-02-20  9:10                                                     ` Michal Hocko
2015-02-20  9:10                                                       ` Michal Hocko
2015-02-20 12:20                                                       ` Tetsuo Handa
2015-02-20 12:20                                                         ` Tetsuo Handa
2015-02-20 12:20                                                         ` Tetsuo Handa
2015-02-20 12:38                                                         ` Michal Hocko
2015-02-20 12:38                                                           ` Michal Hocko
2015-02-19 21:43                                                   ` Dave Chinner
2015-02-19 21:43                                                     ` Dave Chinner
2015-02-20 12:48                                                     ` Michal Hocko
2015-02-20 12:48                                                       ` Michal Hocko
2015-02-20 23:09                                                       ` Dave Chinner
2015-02-20 23:09                                                         ` Dave Chinner
2015-02-19 10:24                                         ` Johannes Weiner
2015-02-19 10:24                                           ` Johannes Weiner
2015-02-19 22:52                                           ` Dave Chinner
2015-02-19 22:52                                             ` Dave Chinner
2015-02-20 10:36                                             ` Tetsuo Handa
2015-02-20 10:36                                               ` Tetsuo Handa
2015-02-20 23:15                                               ` Dave Chinner
2015-02-20 23:15                                                 ` Dave Chinner
2015-02-21  3:20                                                 ` Theodore Ts'o
2015-02-21  3:20                                                   ` Theodore Ts'o
2015-02-21  9:19                                                   ` Andrew Morton
2015-02-21  9:19                                                     ` Andrew Morton
2015-02-21 13:48                                                     ` Tetsuo Handa
2015-02-21 13:48                                                       ` Tetsuo Handa
2015-02-21 13:48                                                       ` Tetsuo Handa
2015-02-21 21:38                                                     ` Dave Chinner
2015-02-21 21:38                                                       ` Dave Chinner
2015-02-21 21:38                                                       ` Dave Chinner
2015-02-22  0:20                                                     ` Johannes Weiner
2015-02-22  0:20                                                       ` Johannes Weiner
2015-02-23 10:48                                                       ` Michal Hocko
2015-02-23 10:48                                                         ` Michal Hocko
2015-02-23 10:48                                                         ` Michal Hocko
2015-02-23 11:23                                                         ` Tetsuo Handa
2015-02-23 11:23                                                           ` Tetsuo Handa
2015-02-23 11:23                                                           ` Tetsuo Handa
2015-02-23 21:33                                                       ` David Rientjes
2015-02-23 21:33                                                         ` David Rientjes
2015-02-23 21:33                                                         ` David Rientjes
2015-02-22 14:48                                                     ` __GFP_NOFAIL and oom_killer_disabled? Tetsuo Handa
2015-02-23 10:21                                                       ` Michal Hocko
2015-02-23 13:03                                                         ` Tetsuo Handa
2015-02-24 18:14                                                           ` Michal Hocko
2015-02-25 11:22                                                             ` Tetsuo Handa
2015-02-25 16:02                                                               ` Michal Hocko
2015-02-25 21:48                                                                 ` Tetsuo Handa
2015-02-25 21:51                                                                   ` Andrew Morton
2015-02-21 12:00                                                   ` How to handle TIF_MEMDIE stalls? Tetsuo Handa
2015-02-21 12:00                                                     ` Tetsuo Handa
2015-02-21 12:00                                                     ` Tetsuo Handa
2015-02-23 10:26                                                   ` Michal Hocko
2015-02-23 10:26                                                     ` Michal Hocko
2015-02-23 10:26                                                     ` Michal Hocko
2015-02-21 11:12                                                 ` Tetsuo Handa
2015-02-21 11:12                                                   ` Tetsuo Handa
2015-02-21 21:48                                                   ` Dave Chinner
2015-02-21 21:48                                                     ` Dave Chinner
2015-02-21 23:52                                             ` Johannes Weiner
2015-02-21 23:52                                               ` Johannes Weiner
2015-02-23  0:45                                               ` Dave Chinner
2015-02-23  0:45                                                 ` Dave Chinner
2015-02-23  1:29                                                 ` Andrew Morton
2015-02-23  1:29                                                   ` Andrew Morton
2015-02-23  7:32                                                   ` Dave Chinner
2015-02-23  7:32                                                     ` Dave Chinner
2015-02-27 18:24                                                     ` Vlastimil Babka
2015-02-27 18:24                                                       ` Vlastimil Babka
2015-02-28  0:03                                                       ` Dave Chinner
2015-02-28 15:17                                                         ` Theodore Ts'o
2015-03-02  9:39                                                     ` Vlastimil Babka
2015-03-02 22:31                                                       ` Dave Chinner
2015-03-03  9:13                                                         ` Vlastimil Babka
2015-03-04  1:33                                                           ` Dave Chinner
2015-03-04  8:50                                                             ` Vlastimil Babka
2015-03-04 11:03                                                               ` Dave Chinner
2015-03-07  0:20                                                         ` Johannes Weiner
2015-03-07  3:43                                                           ` Dave Chinner
2015-03-07 15:08                                                             ` Johannes Weiner
2015-03-02 20:22                                                     ` Johannes Weiner
2015-03-02 23:12                                                       ` Dave Chinner
2015-03-03  2:50                                                         ` Johannes Weiner
2015-03-04  6:52                                                           ` Dave Chinner
2015-03-04 15:04                                                             ` Johannes Weiner
2015-03-04 17:38                                                               ` Theodore Ts'o
2015-03-04 23:17                                                                 ` Dave Chinner
2015-02-28 16:29                                                 ` Johannes Weiner
2015-02-28 16:41                                                   ` Theodore Ts'o
2015-02-28 22:15                                                     ` Johannes Weiner
2015-03-01 11:17                                                       ` Tetsuo Handa
2015-03-06 11:53                                                         ` Tetsuo Handa
2015-03-01 13:43                                                       ` Theodore Ts'o
2015-03-01 16:15                                                         ` Johannes Weiner
2015-03-01 19:36                                                           ` Theodore Ts'o
2015-03-01 20:44                                                             ` Johannes Weiner
2015-03-01 20:17                                                         ` Johannes Weiner
2015-03-01 21:48                                                       ` Dave Chinner
2015-03-02  0:17                                                         ` Dave Chinner
2015-03-02 12:46                                                           ` Brian Foster
2015-02-28 18:36                                                 ` Vlastimil Babka
2015-03-02 15:18                                                 ` Michal Hocko
2015-03-02 16:05                                                   ` Johannes Weiner
2015-03-02 17:10                                                     ` Michal Hocko
2015-03-02 17:27                                                       ` Johannes Weiner
2015-03-02 16:39                                                   ` Theodore Ts'o
2015-03-02 16:58                                                     ` Michal Hocko
2015-03-04 12:52                                                       ` Dave Chinner
2015-02-17 14:59                                     ` Michal Hocko
2015-02-17 14:50                                 ` Michal Hocko
2015-02-17 14:37                             ` Michal Hocko
2015-02-17 14:44                               ` Michal Hocko
2015-02-16 11:23                           ` Tetsuo Handa
2015-02-16 15:42                             ` Johannes Weiner
2015-02-17 11:57                               ` Tetsuo Handa
2015-02-17 13:16                                 ` Johannes Weiner
2015-02-17 16:50                                   ` Michal Hocko
2015-02-17 23:25                                     ` Dave Chinner
2015-02-18  8:48                                       ` Michal Hocko
2015-02-18 11:23                                         ` Tetsuo Handa
2015-02-18 12:29                                           ` Michal Hocko
2015-02-18 14:06                                             ` Tetsuo Handa
2015-02-18 14:25                                               ` Michal Hocko
2015-02-19 10:48                                                 ` Tetsuo Handa
2015-02-20  8:26                                                   ` Michal Hocko
2015-02-23 22:08                                 ` David Rientjes
2015-02-24 11:20                                   ` Tetsuo Handa
2015-02-24 15:20                                     ` Theodore Ts'o
2015-02-24 21:02                                       ` Dave Chinner
2015-02-25 14:31                                         ` Tetsuo Handa
2015-02-27  7:39                                           ` Dave Chinner
2015-02-27 12:42                                             ` Tetsuo Handa
2015-02-27 13:12                                               ` Dave Chinner
2015-03-04 12:41                                                 ` Tetsuo Handa
2015-03-04 13:25                                                   ` Dave Chinner
2015-03-04 14:11                                                     ` Tetsuo Handa
2015-03-05  1:36                                                       ` Dave Chinner
2015-02-17 16:33                             ` Michal Hocko
2014-12-29 17:40                   ` [PATCH] mm: get rid of radix tree gfp mask for pagecache_get_page (was: Re: How to handle TIF_MEMDIE stalls?) Michal Hocko
2014-12-29 18:45                     ` Linus Torvalds
2014-12-29 19:33                       ` Michal Hocko
2014-12-30 13:42                         ` Michal Hocko
2014-12-30 21:45                           ` Linus Torvalds
