linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory
@ 2016-03-29 13:27 Michal Hocko
  2016-03-29 13:45 ` Tetsuo Handa
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Michal Hocko @ 2016-03-29 13:27 UTC (permalink / raw)
  To: linux-mm
  Cc: David Rientjes, Johannes Weiner, Tetsuo Handa, Andrew Morton,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

__alloc_pages_may_oom is the central place to decide when out_of_memory
should be invoked. This is a good approach for most checks there because
they are page allocator specific and the allocation fails right after.

The notable exception is the GFP_NOFS context, which fakes
did_some_progress and keeps the page allocator looping even though there
couldn't have been any progress from the OOM killer. This patch doesn't
change this behavior because we are not ready to allow those allocation
requests to fail yet. Instead, the __GFP_FS check is moved down to
out_of_memory and prevents OOM victim selection there. There are
two reasons for that:
	- OOM notifiers might release some memory even from this context
	  as none of the registered notifiers seems to be FS related
	- this might help a dying thread to get access to memory
	  reserves and move on, which will make the behavior more
	  consistent with the case when the task gets killed from a
	  different context.

Keep a comment in __alloc_pages_may_oom to make sure we do not forget
how GFP_NOFS is special and that we really want to do something about
it.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---

Hi,
I am sending this as an RFC now even though I think this makes more
sense than what we have right now. Maybe there are some side effects
I do not see, though. A more tricky part is the OOM notifier part
because future notifiers might decide to depend on the FS and we can
lock up. Is this something to worry about, though? Would such a notifier
be correct at all? I would call it broken as it would put the OOM killer
out of the way on a contended system, which is a plain bug IMHO.

If this looks like a reasonable approach I would go on to think about how
we can extend this for the oom_reaper and queue the current thread for
the reaper to free some of the memory.

Any thoughts?

 mm/oom_kill.c   |  4 ++++
 mm/page_alloc.c | 24 ++++++++++--------------
 2 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 86349586eacb..1c2b7a82f0c4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -876,6 +876,10 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 	}
 
+	/* The OOM killer does not compensate for IO-less reclaim. */
+	if (!(oc->gfp_mask & __GFP_FS))
+		return true;
+
 	/*
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1b889dba7bd4..736ea28abfcf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2872,22 +2872,18 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		/* The OOM killer does not needlessly kill tasks for lowmem */
 		if (ac->high_zoneidx < ZONE_NORMAL)
 			goto out;
-		/* The OOM killer does not compensate for IO-less reclaim */
-		if (!(gfp_mask & __GFP_FS)) {
-			/*
-			 * XXX: Page reclaim didn't yield anything,
-			 * and the OOM killer can't be invoked, but
-			 * keep looping as per tradition.
-			 *
-			 * But do not keep looping if oom_killer_disable()
-			 * was already called, for the system is trying to
-			 * enter a quiescent state during suspend.
-			 */
-			*did_some_progress = !oom_killer_disabled;
-			goto out;
-		}
 		if (pm_suspended_storage())
 			goto out;
+		/*
+		 * XXX: GFP_NOFS allocations should rather fail than rely on
+		 * other requests to make forward progress.
+		 * We are in an unfortunate situation where out_of_memory cannot
+		 * do much for this context but let's try it to at least get
+		 * access to memory reserves if the current task is killed (see
+		 * out_of_memory). Once filesystems are ready to handle allocation
+		 * failures more gracefully we should just bail out here.
+		 */
+
 		/* The OOM killer may not free memory on a specific node */
 		if (gfp_mask & __GFP_THISNODE)
 			goto out;
-- 
2.7.0

* Re: [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory
  2016-03-29 13:27 [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory Michal Hocko
@ 2016-03-29 13:45 ` Tetsuo Handa
  2016-03-29 14:22   ` Michal Hocko
  2016-03-29 14:14 ` Michal Hocko
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Tetsuo Handa @ 2016-03-29 13:45 UTC (permalink / raw)
  To: mhocko, linux-mm; +Cc: rientjes, hannes, akpm, linux-kernel, mhocko

Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> __alloc_pages_may_oom is the central place to decide when the
> out_of_memory should be invoked. This is a good approach for most checks
> there because they are page allocator specific and the allocation fails
> right after.
> 
> The notable exception is GFP_NOFS context which is faking
> did_some_progress and keep the page allocator looping even though there
> couldn't have been any progress from the OOM killer. This patch doesn't
> change this behavior because we are not ready to allow those allocation
> requests to fail yet. Instead __GFP_FS check is moved down to
> out_of_memory and prevent from OOM victim selection there. There are
> two reasons for that
> 	- OOM notifiers might release some memory even from this context
> 	  as none of the registered notifier seems to be FS related
> 	- this might help a dying thread to get an access to memory
>           reserves and move on which will make the behavior more
>           consistent with the case when the task gets killed from a
>           different context.

Allowing !__GFP_FS allocations to get TIF_MEMDIE by calling the shortcuts in
out_of_memory() would be fine. But I don't like the direction you want to go.

I don't like failing !__GFP_FS allocations without selecting OOM victim
( http://lkml.kernel.org/r/201603252054.ADH30264.OJQFFLMOHFSOVt@I-love.SAKURA.ne.jp ).

Also, I suggested removing all shortcuts by setting TIF_MEMDIE from oom_kill_process()
( http://lkml.kernel.org/r/1458529634-5951-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp ).

> 
> Keep a comment in __alloc_pages_may_oom to make sure we do not forget
> how GFP_NOFS is special and that we really want to do something about
> it.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
> 
> Hi,
> I am sending this as an RFC now even though I think this makes more
> sense than what we have right now. Maybe there are some side effects
> I do not see, though. A more tricky part is the OOM notifier part
> becasue future notifiers might decide to depend on the FS and we can
> lockup. Is this something to worry about, though? Would such a notifier
> be correct at all? I would call it broken as it would put OOM killer out
> of the way on the contended system which is a plain bug IMHO.
> 
> If this looks like a reasonable approach I would go on think about how
> we can extend this for the oom_reaper and queue the current thread for
> the reaper to free some of the memory.
> 
> Any thoughts

* Re: [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory
  2016-03-29 13:27 [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory Michal Hocko
  2016-03-29 13:45 ` Tetsuo Handa
@ 2016-03-29 14:14 ` Michal Hocko
  2016-03-29 22:13 ` David Rientjes
  2016-04-05 11:12 ` Tetsuo Handa
  3 siblings, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2016-03-29 14:14 UTC (permalink / raw)
  To: linux-mm
  Cc: David Rientjes, Johannes Weiner, Tetsuo Handa, Andrew Morton, LKML

On Tue 29-03-16 15:27:35, Michal Hocko wrote:
[...]
> If this looks like a reasonable approach I would go on think about how
> we can extend this for the oom_reaper and queue the current thread for
> the reaper to free some of the memory.

And this is what I came up with (untested yet). Doesn't look too bad to me:
---
From 1129d802a6feff7fa04b582701d11f556a149f12 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Tue, 29 Mar 2016 16:04:10 +0200
Subject: [PATCH] oom, oom_reaper: Try to reap tasks which skip regular OOM
 killer path

If either the current task is already killed or PF_EXITING, or a selected
task is PF_EXITING, then the oom killer is suppressed and so is the oom
reaper. This patch adds try_oom_reaper which checks the given task
and queues it for the oom reaper if that is safe to do, meaning
that the task doesn't share the mm with an alive process.

This might help to release the memory pressure while the task tries to
exit.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 64 insertions(+), 18 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1c2b7a82f0c4..2f637728b12a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -412,6 +412,25 @@ bool oom_killer_disabled __read_mostly;
 
 #define K(x) ((x) << (PAGE_SHIFT-10))
 
+/*
+ * task->mm can be NULL if the task is the exited group leader.  So to
+ * determine whether the task is using a particular mm, we examine all the
+ * task's threads: if one of those is using this mm then this task was also
+ * using it.
+ */
+static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
+{
+	struct task_struct *t;
+
+	for_each_thread(p, t) {
+		struct mm_struct *t_mm = READ_ONCE(t->mm);
+		if (t_mm)
+			return t_mm == mm;
+	}
+	return false;
+}
+
+
 #ifdef CONFIG_MMU
 /*
  * OOM Reaper kernel thread which tries to reap the memory used by the OOM
@@ -563,6 +582,45 @@ static void wake_oom_reaper(struct task_struct *tsk)
 	wake_up(&oom_reaper_wait);
 }
 
+/* Check if we can reap the given task. This has to be called with stable
+ * tsk->mm
+ */
+static void try_oom_reaper(struct task_struct *tsk)
+{
+	struct mm_struct *mm = tsk->mm;
+	bool can_oom_reap = true;
+	struct task_struct *p;
+
+	if (!mm)
+		return;
+
+	/*
+	 * There might be other threads/processes which are either not
+	 * dying or even not killable.
+	 */
+	if (atomic_read(&mm->mm_users) > 1) {
+		rcu_read_lock();
+		for_each_process(p) {
+			if (!process_shares_mm(p, mm))
+				continue;
+			if (same_thread_group(p, tsk))
+				continue;
+			/*
+			 * other process sharing the mm is not dying so we cannot
+			 * simply reap the address space.
+			 */
+			if (!fatal_signal_pending(p) || !task_will_free_mem(p)) {
+				can_oom_reap = false;
+				break;
+			}
+		}
+		rcu_read_unlock();
+	}
+
+	if (can_oom_reap)
+		wake_oom_reaper(tsk);
+}
+
 static int __init oom_init(void)
 {
 	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
@@ -575,6 +633,10 @@ static int __init oom_init(void)
 }
 subsys_initcall(oom_init)
 #else
+static void try_oom_reaper(struct task_struct *tsk)
+{
+}
+
 static void wake_oom_reaper(struct task_struct *tsk)
 {
 }
@@ -653,24 +715,6 @@ void oom_killer_enable(void)
 }
 
 /*
- * task->mm can be NULL if the task is the exited group leader.  So to
- * determine whether the task is using a particular mm, we examine all the
- * task's threads: if one of those is using this mm then this task was also
- * using it.
- */
-static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
-{
-	struct task_struct *t;
-
-	for_each_thread(p, t) {
-		struct mm_struct *t_mm = READ_ONCE(t->mm);
-		if (t_mm)
-			return t_mm == mm;
-	}
-	return false;
-}
-
-/*
  * Must be called while holding a reference to p, which will be released upon
  * returning.
  */
@@ -694,6 +738,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	task_lock(p);
 	if (p->mm && task_will_free_mem(p)) {
 		mark_oom_victim(p);
+		try_oom_reaper(p);
 		task_unlock(p);
 		put_task_struct(p);
 		return;
@@ -873,6 +918,7 @@ bool out_of_memory(struct oom_control *oc)
 	if (current->mm &&
 	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
 		mark_oom_victim(current);
+		try_oom_reaper(current);
 		return true;
 	}
 
-- 
2.7.0

-- 
Michal Hocko
SUSE Labs

* Re: [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory
  2016-03-29 13:45 ` Tetsuo Handa
@ 2016-03-29 14:22   ` Michal Hocko
  2016-03-29 15:29     ` Tetsuo Handa
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2016-03-29 14:22 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, hannes, akpm, linux-kernel

On Tue 29-03-16 22:45:40, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > __alloc_pages_may_oom is the central place to decide when the
> > out_of_memory should be invoked. This is a good approach for most checks
> > there because they are page allocator specific and the allocation fails
> > right after.
> > 
> > The notable exception is GFP_NOFS context which is faking
> > did_some_progress and keep the page allocator looping even though there
> > couldn't have been any progress from the OOM killer. This patch doesn't
> > change this behavior because we are not ready to allow those allocation
> > requests to fail yet. Instead __GFP_FS check is moved down to
> > out_of_memory and prevent from OOM victim selection there. There are
> > two reasons for that
> > 	- OOM notifiers might release some memory even from this context
> > 	  as none of the registered notifier seems to be FS related
> > 	- this might help a dying thread to get an access to memory
> >           reserves and move on which will make the behavior more
> >           consistent with the case when the task gets killed from a
> >           different context.
> 
> Allowing !__GFP_FS allocations to get TIF_MEMDIE by calling the shortcuts in
> out_of_memory() would be fine. But I don't like the direction you want to go.
> 
> I don't like failing !__GFP_FS allocations without selecting OOM victim
> ( http://lkml.kernel.org/r/201603252054.ADH30264.OJQFFLMOHFSOVt@I-love.SAKURA.ne.jp ).

I didn't get to read and digest that email yet but from a quick glance
it doesn't seem to be directly related to this patch. Even if we decide
that the __GFP_FS vs. OOM killer logic is flawed for some reason, this
would build on top of it, as granting access to memory reserves is not
against it.

> Also, I suggested removing all shortcuts by setting TIF_MEMDIE from oom_kill_process()
> ( http://lkml.kernel.org/r/1458529634-5951-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp ).

I personally do not like this much. I believe we have already tried to
explain why we have (some of) those shortcuts. They might be too
optimistic and there is room for improvement for sure, but I am not
convinced we can get rid of them that easily.
-- 
Michal Hocko
SUSE Labs

* Re: [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory
  2016-03-29 14:22   ` Michal Hocko
@ 2016-03-29 15:29     ` Tetsuo Handa
  0 siblings, 0 replies; 14+ messages in thread
From: Tetsuo Handa @ 2016-03-29 15:29 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, hannes, akpm, linux-kernel

Michal Hocko wrote:
> On Tue 29-03-16 22:45:40, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > > 
> > > __alloc_pages_may_oom is the central place to decide when the
> > > out_of_memory should be invoked. This is a good approach for most checks
> > > there because they are page allocator specific and the allocation fails
> > > right after.
> > > 
> > > The notable exception is GFP_NOFS context which is faking
> > > did_some_progress and keep the page allocator looping even though there
> > > couldn't have been any progress from the OOM killer. This patch doesn't
> > > change this behavior because we are not ready to allow those allocation
> > > requests to fail yet. Instead __GFP_FS check is moved down to
> > > out_of_memory and prevent from OOM victim selection there. There are
> > > two reasons for that
> > > 	- OOM notifiers might release some memory even from this context
> > > 	  as none of the registered notifier seems to be FS related
> > > 	- this might help a dying thread to get an access to memory
> > >           reserves and move on which will make the behavior more
> > >           consistent with the case when the task gets killed from a
> > >           different context.
> > 
> > Allowing !__GFP_FS allocations to get TIF_MEMDIE by calling the shortcuts in
> > out_of_memory() would be fine. But I don't like the direction you want to go.
> > 
> > I don't like failing !__GFP_FS allocations without selecting OOM victim
> > ( http://lkml.kernel.org/r/201603252054.ADH30264.OJQFFLMOHFSOVt@I-love.SAKURA.ne.jp ).
> 
> I didn't get to read and digest that email yet but from a quick glance
> it doesn't seem to be directly related to this patch. Even if we decide
> that __GFP_FS vs. OOM killer logic is flawed for some reason then would
> build on top as granting the access to memory reserves is not against
> it.
> 

I think that removing these shortcuts is better.

> > Also, I suggested removing all shortcuts by setting TIF_MEMDIE from oom_kill_process()
> > ( http://lkml.kernel.org/r/1458529634-5951-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp ).
> 
> I personally do not like this much. I believe we have already tried to
> explain why we have (some of) those shortcuts. They might be too
> optimistic and there is a room for improvements for sure but I am not
> convinced we can get rid of them that easily.

These shortcuts are too optimistic. They assume that the target thread can call
exit_oom_victim() but the reality is that the target task can get stuck at
down_read(&mm->mmap_sem) in exit_mm(). If SIGKILL were sent to all thread
groups sharing that mm, the possibility of the target thread getting stuck at
down_read(&mm->mmap_sem) in exit_mm() is significantly reduced.

http://lkml.kernel.org/r/20160329141442.GD4466@dhcp22.suse.cz tried to let
the OOM reaper call exit_oom_victim() on behalf of the target thread
by waking up the OOM reaper. But the OOM reaper won't call exit_oom_victim()
because it will fail to reap memory when some thread sharing
that mm and holding mm->mmap_sem for write does not receive SIGKILL, which
is what happens if we use these shortcuts. As far as I know, all existing
explanations for why we have these shortcuts ignore the possibility of such
a thread.
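
For reference, the relevant part of exit_mm() looks roughly like this
(a hand-simplified and annotated sketch, not the verbatim kernel/exit.c
source):

static void exit_mm(struct task_struct *tsk)
{
	struct mm_struct *mm = tsk->mm;

	mm_release(tsk, mm);
	if (!mm)
		return;
	/*
	 * Serialize with any possible pending coredump. If another thread
	 * sharing this mm holds mmap_sem for write (e.g. in mmap/munmap)
	 * and is itself stuck waiting for memory, the TIF_MEMDIE victim
	 * blocks here and never reaches exit_oom_victim().
	 */
	down_read(&mm->mmap_sem);
	/* ... clear tsk->mm, up_read(&mm->mmap_sem), mmput(mm) ... */
}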

* Re: [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory
  2016-03-29 13:27 [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory Michal Hocko
  2016-03-29 13:45 ` Tetsuo Handa
  2016-03-29 14:14 ` Michal Hocko
@ 2016-03-29 22:13 ` David Rientjes
  2016-03-30  9:47   ` Michal Hocko
  2016-04-05 11:12 ` Tetsuo Handa
  3 siblings, 1 reply; 14+ messages in thread
From: David Rientjes @ 2016-03-29 22:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Johannes Weiner, Tetsuo Handa, Andrew Morton, LKML,
	Michal Hocko

On Tue, 29 Mar 2016, Michal Hocko wrote:

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 86349586eacb..1c2b7a82f0c4 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -876,6 +876,10 @@ bool out_of_memory(struct oom_control *oc)
>  		return true;
>  	}
>  
> +	/* The OOM killer does not compensate for IO-less reclaim. */
> +	if (!(oc->gfp_mask & __GFP_FS))
> +		return true;
> +
>  	/*
>  	 * Check if there were limitations on the allocation (only relevant for
>  	 * NUMA) that may require different handling.

I don't object to this necessarily, but I think we need input from those 
that have taken the time to implement their own oom notifier to see if 
they agree.  In the past, they would only be called if reclaim has 
completely failed; now, they can be called in low memory situations when 
reclaim has had very little chance to be successful.  Getting an ack from 
them would be helpful.

I also think we have discussed this before, but I think the oom notifier 
handling should be done in the page allocator proper, i.e. in 
__alloc_pages_may_oom().  We can leave out_of_memory() for a clearly defined 
purpose: to kill a process when all reclaim has failed.
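
Something along these lines is what I mean (an untested sketch only;
oom_notify_list would have to be made visible outside mm/oom_kill.c):

static inline struct page *
__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
		      const struct alloc_context *ac, unsigned long *did_some_progress)
{
	unsigned long freed = 0;

	/* Let registered OOM notifiers release memory before killing anything. */
	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
	if (freed > 0) {
		/* The notifiers freed something - retry the allocation. */
		*did_some_progress = 1;
		return NULL;
	}

	/* ... existing checks and the call to out_of_memory() as today ... */
	return NULL;
}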

* Re: [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory
  2016-03-29 22:13 ` David Rientjes
@ 2016-03-30  9:47   ` Michal Hocko
  2016-03-30 11:46     ` Tetsuo Handa
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2016-03-30  9:47 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, Johannes Weiner, Tetsuo Handa, Andrew Morton, LKML

On Tue 29-03-16 15:13:54, David Rientjes wrote:
> On Tue, 29 Mar 2016, Michal Hocko wrote:
> 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 86349586eacb..1c2b7a82f0c4 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -876,6 +876,10 @@ bool out_of_memory(struct oom_control *oc)
> >  		return true;
> >  	}
> >  
> > +	/* The OOM killer does not compensate for IO-less reclaim. */
> > +	if (!(oc->gfp_mask & __GFP_FS))
> > +		return true;
> > +
> >  	/*
> >  	 * Check if there were limitations on the allocation (only relevant for
> >  	 * NUMA) that may require different handling.
> 
> I don't object to this necessarily, but I think we need input from those 
> that have taken the time to implement their own oom notifier to see if 
> they agree.  In the past, they would only be called if reclaim has 
> completely failed; now, they can be called in low memory situations when 
> reclaim has had very little chance to be successful.  Getting an ack from 
> them would be helpful.

I will make sure to put them on the CC and mention this in the changelog
when I post this next time. I personally think that this shouldn't make
much difference in real life because GFP_NOFS-only loads are rare
and we should rather help by releasing memory when it is available
than rely on something else to do it for us. Waiting for Godot is
never a good strategy.

> I also think we have discussed this before, but I think the oom notifier 
> handling should be in done in the page allocator proper, i.e. in 
> __alloc_pages_may_oom().  We can leave out_of_memory() for a clear defined 
> purpose: to kill a process when all reclaim has failed.

I vaguely remember there was some issue with that the last time we
discussed it. It was the duplication from the page fault and allocator
paths AFAIR. Nothing that cannot be handled though, but the OOM notifier
API is just too ugly to spread outside the OOM proper I guess. Why can't
we move those users to the proper shrinker interface (after it gets
extended by a priority of some sort and releases some objects only after
we are really in trouble)? Something for a separate discussion,
though...

-- 
Michal Hocko
SUSE Labs

* Re: [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory
  2016-03-30  9:47   ` Michal Hocko
@ 2016-03-30 11:46     ` Tetsuo Handa
  2016-03-30 12:11       ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Tetsuo Handa @ 2016-03-30 11:46 UTC (permalink / raw)
  To: mhocko, rientjes; +Cc: linux-mm, hannes, akpm, linux-kernel

Michal Hocko wrote:
> On Tue 29-03-16 15:13:54, David Rientjes wrote:
> > On Tue, 29 Mar 2016, Michal Hocko wrote:
> > 
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index 86349586eacb..1c2b7a82f0c4 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -876,6 +876,10 @@ bool out_of_memory(struct oom_control *oc)
> > >  		return true;
> > >  	}
> > >  
> > > +	/* The OOM killer does not compensate for IO-less reclaim. */
> > > +	if (!(oc->gfp_mask & __GFP_FS))
> > > +		return true;
> > > +

This patch will disable pagefault_out_of_memory() because currently
pagefault_out_of_memory() is passing oc->gfp_mask == 0.

Because of the current behavior, calling oom notifiers from a !__GFP_FS
context seems to be safe.
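
For reference, pagefault_out_of_memory() currently does roughly the
following (paraphrased, not the verbatim mm/oom_kill.c source):

void pagefault_out_of_memory(void)
{
	struct oom_control oc = {
		.zonelist = NULL,
		.nodemask = NULL,
		.gfp_mask = 0,		/* no gfp context, so no __GFP_FS bit */
		.order = 0,
	};

	if (mem_cgroup_oom_synchronize(true))
		return;

	if (!mutex_trylock(&oom_lock))
		return;
	out_of_memory(&oc);
	mutex_unlock(&oom_lock);
}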

> > >  	/*
> > >  	 * Check if there were limitations on the allocation (only relevant for
> > >  	 * NUMA) that may require different handling.
> > 
> > I don't object to this necessarily, but I think we need input from those 
> > that have taken the time to implement their own oom notifier to see if 
> > they agree.  In the past, they would only be called if reclaim has 
> > completely failed; now, they can be called in low memory situations when 
> > reclaim has had very little chance to be successful.  Getting an ack from 
> > them would be helpful.
> 
> I will make sure to put them on the CC and mention this in the changelog
> when I post this next time. I personally think that this shouldn't make
> much difference in the real life because GFP_NOFS only loads are rare

GFP_NOFS-only loads are rare. But some GFP_KERNEL load which got TIF_MEMDIE
might be waiting for GFP_NOFS or GFP_NOIO loads to make progress.

I think we are not ready to handle situations where out_of_memory() is called
again after the current thread got TIF_MEMDIE due to a __GFP_NOFAIL allocation
request when we ran out of memory reserves. We should not assume that the
victim target thread does not have TIF_MEMDIE yet. I think we can handle it
by making mark_oom_victim() return a bool and returning via the shortcut only
if mark_oom_victim() successfully set TIF_MEMDIE (sketched below). Though I
don't like the shortcut approach, which lacks a guaranteed unlocking mechanism.
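
An untested sketch of that idea:

/* Return true only if we are the ones who set TIF_MEMDIE. */
bool mark_oom_victim(struct task_struct *tsk)
{
	WARN_ON(oom_killer_disabled);
	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
		return false;	/* somebody else marked it already */
	atomic_inc(&tsk->signal->oom_victims);
	__thaw_task(tsk);
	return true;
}

and the shortcut in out_of_memory() would then become:

	if (current->mm &&
	    (fatal_signal_pending(current) || task_will_free_mem(current)) &&
	    mark_oom_victim(current))
		return true;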

> and we should rather help by releasing memory when it is available
> rather than rely on something else to do it for us. Waiting for Godot is
> never a good strategy.
> 
> > I also think we have discussed this before, but I think the oom notifier 
> > handling should be in done in the page allocator proper, i.e. in 
> > __alloc_pages_may_oom().  We can leave out_of_memory() for a clear defined 
> > purpose: to kill a process when all reclaim has failed.
> 
> I vaguely remember there was some issue with that the last time we have
> discussed that. It was the duplication from the page fault and allocator
> paths AFAIR. Nothing that cannot be handled though but the OOM notifier
> API is just too ugly to spread outside OOM proper I guess. Why we cannot
> move those users to use proper shrinkers interface (after it gets
> extended by a priority of some sort and release some objects only after
> we are really in troubles)? Something for a separate discussion,
> though...

Calling oom notifiers from SysRq-f is what we want?

* Re: [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory
  2016-03-30 11:46     ` Tetsuo Handa
@ 2016-03-30 12:11       ` Michal Hocko
  2016-03-31 11:56         ` Tetsuo Handa
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2016-03-30 12:11 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: rientjes, linux-mm, hannes, akpm, linux-kernel

On Wed 30-03-16 20:46:48, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 29-03-16 15:13:54, David Rientjes wrote:
> > > On Tue, 29 Mar 2016, Michal Hocko wrote:
> > > 
> > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > index 86349586eacb..1c2b7a82f0c4 100644
> > > > --- a/mm/oom_kill.c
> > > > +++ b/mm/oom_kill.c
> > > > @@ -876,6 +876,10 @@ bool out_of_memory(struct oom_control *oc)
> > > >  		return true;
> > > >  	}
> > > >  
> > > > +	/* The OOM killer does not compensate for IO-less reclaim. */
> > > > +	if (!(oc->gfp_mask & __GFP_FS))
> > > > +		return true;
> > > > +
> 
> This patch will disable pagefault_out_of_memory() because currently
> pagefault_out_of_memory() is passing oc->gfp_mask == 0.
> 
> Because of current behavior, calling oom notifiers from !__GFP_FS seems
> to be safe.

You are right! I have completely missed that and thought we were
providing GFP_KERNEL there. So we have two choices. Either we do
use GFP_KERNEL (same as we do for sysrq+f) or we special case
pagefault_out_of_memory in some way. The second option seems to be safer
because the gfp_mask has to contain at least ___GFP_DIRECT_RECLAIM to
trigger the OOM path.
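
The second option could be as simple as keeping the new check but excluding
the zero mask (sketch):

	/*
	 * The OOM killer does not compensate for IO-less reclaim.
	 * pagefault_out_of_memory passes a zero gfp_mask, so exclude it;
	 * all other callers have at least ___GFP_DIRECT_RECLAIM set by the
	 * time they get here.
	 */
	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
		return true;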

> > > >  	/*
> > > >  	 * Check if there were limitations on the allocation (only relevant for
> > > >  	 * NUMA) that may require different handling.
> > > 
> > > I don't object to this necessarily, but I think we need input from those 
> > > that have taken the time to implement their own oom notifier to see if 
> > > they agree.  In the past, they would only be called if reclaim has 
> > > completely failed; now, they can be called in low memory situations when 
> > > reclaim has had very little chance to be successful.  Getting an ack from 
> > > them would be helpful.
> > 
> > I will make sure to put them on the CC and mention this in the changelog
> > when I post this next time. I personally think that this shouldn't make
> > much difference in the real life because GFP_NOFS only loads are rare
> 
> GFP_NOFS only loads are rare. But some GFP_KERNEL load which got TIF_MEMDIE
> might be waiting for GFP_NOFS or GFP_NOIO loads to make progress.

How would that matter to oom notifiers?

> I think we are not ready to handle situations where out_of_memory() is called
> again after current thread got TIF_MEMDIE due to __GFP_NOFAIL allocation
> request when we ran out of memory reserves. We should not assume that the
> victim target thread does not have TIF_MEMDIE yet. I think we can handle it
> by making mark_oom_victim() return a bool and return via shortcut only if
> mark_oom_victim() successfully set TIF_MEMDIE. Though I don't like the
> shortcut approach that lacks a guaranteed unlocking mechanism.

That would lead to a premature follow-up OOM when the TIF_MEMDIE task makes
some progress, just not in time.
 
> > and we should rather help by releasing memory when it is available
> > rather than rely on something else to do it for us. Waiting for Godot is
> > never a good strategy.
> > 
> > > I also think we have discussed this before, but I think the oom notifier 
> > > handling should be in done in the page allocator proper, i.e. in 
> > > __alloc_pages_may_oom().  We can leave out_of_memory() for a clear defined 
> > > purpose: to kill a process when all reclaim has failed.
> > 
> > I vaguely remember there was some issue with that the last time we have
> > discussed that. It was the duplication from the page fault and allocator
> > paths AFAIR. Nothing that cannot be handled though but the OOM notifier
> > API is just too ugly to spread outside OOM proper I guess. Why we cannot
> > move those users to use proper shrinkers interface (after it gets
> > extended by a priority of some sort and release some objects only after
> > we are really in troubles)? Something for a separate discussion,
> > though...
> 
> Calling oom notifiers from SysRq-f is what we want?

I am not really sure about that to be honest. The semantics are really
weak, but what would be the downside? This operation shouldn't be fatal
and a dropped object can be reconstructed.
-- 
Michal Hocko
SUSE Labs

* Re: [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory
  2016-03-30 12:11       ` Michal Hocko
@ 2016-03-31 11:56         ` Tetsuo Handa
  2016-03-31 15:11           ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Tetsuo Handa @ 2016-03-31 11:56 UTC (permalink / raw)
  To: mhocko; +Cc: rientjes, linux-mm, hannes, akpm, linux-kernel

Michal Hocko wrote:
> On Wed 30-03-16 20:46:48, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Tue 29-03-16 15:13:54, David Rientjes wrote:
> > > > On Tue, 29 Mar 2016, Michal Hocko wrote:
> > > > 
> > > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > > index 86349586eacb..1c2b7a82f0c4 100644
> > > > > --- a/mm/oom_kill.c
> > > > > +++ b/mm/oom_kill.c
> > > > > @@ -876,6 +876,10 @@ bool out_of_memory(struct oom_control *oc)
> > > > >  		return true;
> > > > >  	}
> > > > >  
> > > > > +	/* The OOM killer does not compensate for IO-less reclaim. */
> > > > > +	if (!(oc->gfp_mask & __GFP_FS))
> > > > > +		return true;
> > > > > +
> > 
> > This patch will disable pagefault_out_of_memory() because currently
> > pagefault_out_of_memory() is passing oc->gfp_mask == 0.
> > 
> > Because of current behavior, calling oom notifiers from !__GFP_FS seems
> > to be safe.
> 
> You are right! I have completely missed that and thought we were
> providing GFP_KERNEL there. So we have two choices. Either we do
> use GFP_KERNEL (same as we do for sysrq+f) or we special case
> pagefault_out_of_memory in some way. The second option seems to be safer
> because the gfp_mask has to contain at least ___GFP_DIRECT_RECLAIM to
> trigger the OOM path.

Oops, I missed that this patch also disables out_of_memory() for !__GFP_FS &&
__GFP_NOFAIL allocation requests.

> > I think we are not ready to handle situations where out_of_memory() is called
> > again after current thread got TIF_MEMDIE due to __GFP_NOFAIL allocation
> > request when we ran out of memory reserves. We should not assume that the
> > victim target thread does not have TIF_MEMDIE yet. I think we can handle it
> > by making mark_oom_victim() return a bool and return via shortcut only if
> > mark_oom_victim() successfully set TIF_MEMDIE. Though I don't like the
> > shortcut approach that lacks a guaranteed unlocking mechanism.
> 
> That would lead to premature follow up OOM when TIF_MEMDIE makes some
> progress just not in time.

We can never know whether the OOM killer prematurely killed a victim.
It is possible that get_page_from_freelist() will succeed even if
select_bad_process() did not find a TIF_MEMDIE thread. You said you don't
want to violate the layer
( http://lkml.kernel.org/r/20160129152307.GF32174@dhcp22.suse.cz ).

What we can do is tolerate possible premature OOM killer invocation using
some threshold. You are proposing such a change as the OOM detection rework,
which might possibly cause premature OOM killer invocation.
Waiting forever unconditionally (e.g.
http://lkml.kernel.org/r/201602092349.ACG81273.OSVtMJQHLOFOFF@I-love.SAKURA.ne.jp )
is no good. Suppressing OOM killer invocation forever unconditionally (e.g.
deciding based only on !__GFP_FS, or only on TIF_MEMDIE) is no good.

Even if we stop returning via the shortcut by making mark_oom_victim() return a
bool, select_bad_process() will work as a hold-off mechanism. By combining that
with a timeout (or something finite) for TIF_MEMDIE, we can tolerate possible
premature OOM killer invocation. It is much better than being OOM-livelocked
forever.

* Re: [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory
  2016-03-31 11:56         ` Tetsuo Handa
@ 2016-03-31 15:11           ` Michal Hocko
  0 siblings, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2016-03-31 15:11 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: rientjes, linux-mm, hannes, akpm, linux-kernel

On Thu 31-03-16 20:56:23, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 30-03-16 20:46:48, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Tue 29-03-16 15:13:54, David Rientjes wrote:
> > > > > On Tue, 29 Mar 2016, Michal Hocko wrote:
> > > > > 
> > > > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > > > index 86349586eacb..1c2b7a82f0c4 100644
> > > > > > --- a/mm/oom_kill.c
> > > > > > +++ b/mm/oom_kill.c
> > > > > > @@ -876,6 +876,10 @@ bool out_of_memory(struct oom_control *oc)
> > > > > >  		return true;
> > > > > >  	}
> > > > > >  
> > > > > > +	/* The OOM killer does not compensate for IO-less reclaim. */
> > > > > > +	if (!(oc->gfp_mask & __GFP_FS))
> > > > > > +		return true;
> > > > > > +
> > > 
> > > This patch will disable pagefault_out_of_memory() because currently
> > > pagefault_out_of_memory() is passing oc->gfp_mask == 0.
> > > 
> > > Because of current behavior, calling oom notifiers from !__GFP_FS seems
> > > to be safe.
> > 
> > You are right! I have completely missed that and thought we were
> > providing GFP_KERNEL there. So we have two choices. Either we do
> > use GFP_KERNEL (same as we do for sysrq+f) or we special case
> > pagefault_out_of_memory in some way. The second option seems to be safer
> > because the gfp_mask has to contain at least ___GFP_DIRECT_RECLAIM to
> > trigger the OOM path.
> 
> Oops, I missed that this patch also disables out_of_memory() for !__GFP_FS &&
> __GFP_NOFAIL allocation requests.

True. The following should take care of that:

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 54aa4ec06889..32d8210b8773 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -882,7 +882,7 @@ bool out_of_memory(struct oom_control *oc)
 	 * make sure exclude 0 mask - all other users should have at least
 	 * ___GFP_DIRECT_RECLAIM to get here.
 	 */
-	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
+	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
 		return true;
 
 	/*

Thanks for spotting this!

[...]
-- 
Michal Hocko
SUSE Labs

* Re: [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory
  2016-03-29 13:27 [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory Michal Hocko
                   ` (2 preceding siblings ...)
  2016-03-29 22:13 ` David Rientjes
@ 2016-04-05 11:12 ` Tetsuo Handa
  2016-04-06 10:28   ` Tetsuo Handa
  2016-04-06 12:41   ` Michal Hocko
  3 siblings, 2 replies; 14+ messages in thread
From: Tetsuo Handa @ 2016-04-05 11:12 UTC (permalink / raw)
  To: mhocko, linux-mm; +Cc: rientjes, hannes, akpm, linux-kernel, mhocko

I did an OOM torture test using Linux 4.6-rc2 with the kmallocwd patch
on xfs and ext4 filesystems using the reproducer shown below.

---------- Reproducer start ----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/prctl.h>
#include <signal.h>
#include <sys/mman.h>

static char buffer[4096] = { };

static int writer(void *unused)
{
	const int fd = open("/proc/self/exe", O_RDONLY);
	sleep(2);
	while (1) {
		void *ptr = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
		munmap(ptr, 4096);
	}
	return 0;
}

static int file_io(void *unused)
{
	const int fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600);
	sleep(2);
	while (write(fd, buffer, sizeof(buffer)) > 0);
	close(fd);
	return 0;
}

int main(int argc, char *argv[])
{
	int i;
	if (chdir("/tmp"))
		return 1;
	for (i = 0; i < 64; i++)
		if (fork() == 0) {
			const int idx = i;
			char buffer2[64] = { };
			const int fd = open("/proc/self/oom_score_adj", O_WRONLY);
			write(fd, "1000", 4);
			close(fd);
			snprintf(buffer, sizeof(buffer), "file_io.%02u", idx);
			prctl(PR_SET_NAME, (unsigned long) buffer, 0, 0, 0);
			for (i = 0; i < 16; i++)
				clone(file_io, malloc(1024) + 1024, CLONE_VM, NULL);
			snprintf(buffer2, sizeof(buffer2), "writer.%02u", idx);
			prctl(PR_SET_NAME, (unsigned long) buffer2, 0, 0, 0);
			for (i = 0; i < 16; i++)
				clone(writer, malloc(1024) + 1024, CLONE_VM, NULL);
			while (1)
				pause();
		}
	{ /* A dummy process for invoking the OOM killer. */
		char *buf = NULL;
		unsigned long i;
		unsigned long size = 0;
		prctl(PR_SET_NAME, (unsigned long) "memeater", 0, 0, 0);
		for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
			char *cp = realloc(buf, size);
			if (!cp) {
				size >>= 1;
				break;
			}
			buf = cp;
		}
		sleep(4);
		for (i = 0; i < size; i += 4096)
			buf[i] = '\0'; /* Will cause OOM due to overcommit */
	}
	kill(-1, SIGKILL);
	return * (char *) NULL; /* Not reached. */
}
---------- Reproducer end ----------

What I can observe under the OOM livelock condition is a three-way dependency loop.

  (1) An OOM victim (which has TIF_MEMDIE) is unable to make forward progress
      because it is blocked on an unkillable lock, waiting for another
      thread's memory allocation.

  (2) A filesystem writeback work item is unable to make forward progress
      because it is waiting for a GFP_NOFS memory allocation to be satisfied
      while storage I/O is stalling.

  (3) A disk I/O work item is unable to make forward progress because it is
      waiting for a GFP_NOIO memory allocation to be satisfied: the OOM
      victim does not release memory, yet the OOM reaper does not clear
      TIF_MEMDIE (so no further victim is selected).

Complete log for xfs is at http://I-love.SAKURA.ne.jp/tmp/serial-20160404.txt.xz
----------
[   98.749616] Killed process 1424 (file_io.08) total-vm:4332kB, anon-rss:0kB, file-rss:4kB, shmem-rss:0kB
[  143.136457] MemAlloc-Info: stalling=2 dying=178 exiting=31 victim=1 oom_count=2324984/335679
[  143.143740] MemAlloc: kswapd0(49) flags=0xa40840 switches=466 uninterruptible
[  143.149661] kswapd0         D 0000000000000001     0    49      2 0x00000000
[  143.155312]  ffff88003689c6c0 ffff8800368a4000 ffff8800368a38b0 ffff88003c251c10
[  143.161566]  ffff88003c251c28 ffff8800368a39e8 0000000000000001 ffffffff81556fbc
[  143.167957]  ffff88003689c6c0 ffffffff81559108 0000000000000000 ffff88003c251c18
[  143.174116] Call Trace:
[  143.176643]  [<ffffffff81556fbc>] ? schedule+0x2c/0x80
[  143.180854]  [<ffffffff81559108>] ? rwsem_down_read_failed+0xf8/0x150
[  143.186358]  [<ffffffff810a20b0>] ? wait_woken+0x80/0x80
[  143.190572]  [<ffffffff8126d5e4>] ? call_rwsem_down_read_failed+0x14/0x30
[  143.196129]  [<ffffffff81558a67>] ? down_read+0x17/0x20
[  143.200356]  [<ffffffffa021c19e>] ? xfs_map_blocks+0x7e/0x150 [xfs]
[  143.205430]  [<ffffffffa021cffa>] ? xfs_do_writepage+0x16a/0x510 [xfs]
[  143.210701]  [<ffffffffa021d3d1>] ? xfs_vm_writepage+0x31/0x70 [xfs]
[  143.215819]  [<ffffffff811225f2>] ? pageout.isra.43+0x182/0x230
[  143.220678]  [<ffffffff811239eb>] ? shrink_page_list+0x84b/0xb20
[  143.225484]  [<ffffffff8112444b>] ? shrink_inactive_list+0x20b/0x490
[  143.230481]  [<ffffffff81125071>] ? shrink_zone_memcg+0x5d1/0x790
[  143.235430]  [<ffffffff8117553d>] ? mem_cgroup_iter+0x14d/0x2b0
[  143.240220]  [<ffffffff81125307>] ? shrink_zone+0xd7/0x2f0
[  143.244725]  [<ffffffff811261c6>] ? kswapd+0x406/0x7d0
[  143.248903]  [<ffffffff81125dc0>] ? mem_cgroup_shrink_node_zone+0xb0/0xb0
[  143.254293]  [<ffffffff81083b68>] ? kthread+0xc8/0xe0
[  143.258401]  [<ffffffff8155a502>] ? ret_from_fork+0x22/0x40
[  143.262846]  [<ffffffff81083aa0>] ? kthread_create_on_node+0x1a0/0x1a0
[  143.267956] MemAlloc: kworker/2:1(61) flags=0x4208860 switches=75880 seq=4 gfp=0x2400000(GFP_NOIO) order=0 delay=39526 uninterruptible
[  143.277844] kworker/2:1     R  running task        0    61      2 0x00000000
[  143.283598] Workqueue: events_freezable_power_ disk_events_workfn
[  143.288592]  ffff880036940880 ffff88000013c000 ffff88000013b768 ffff88003f64dfc0
[  143.295797]  ffff88000013b700 00000000fffd98ea 0000000000000017 ffffffff81556fbc
[  143.301706]  ffff88003f64dfc0 ffffffff8155965e 0000000000000000 0000000000000286
[  143.307659] Call Trace:
[  143.309941]  [<ffffffff81556fbc>] ? schedule+0x2c/0x80
[  143.314292]  [<ffffffff8155965e>] ? schedule_timeout+0x11e/0x1c0
[  143.319045]  [<ffffffff810c0270>] ? cascade+0x80/0x80
[  143.323122]  [<ffffffff8112e9f7>] ? wait_iff_congested+0xd7/0x120
[  143.327887]  [<ffffffff810a20b0>] ? wait_woken+0x80/0x80
[  143.332129]  [<ffffffff8112454f>] ? shrink_inactive_list+0x30f/0x490
[  143.337349]  [<ffffffff81125071>] ? shrink_zone_memcg+0x5d1/0x790
[  143.342071]  [<ffffffff8117553d>] ? mem_cgroup_iter+0x14d/0x2b0
[  143.346651]  [<ffffffff81125307>] ? shrink_zone+0xd7/0x2f0
[  143.350955]  [<ffffffff8112586a>] ? do_try_to_free_pages+0x15a/0x3e0
[  143.355829]  [<ffffffff81125b85>] ? try_to_free_pages+0x95/0xc0
[  143.360409]  [<ffffffff8111a38f>] ? __alloc_pages_nodemask+0x63f/0xc40
[  143.365433]  [<ffffffff8115dcef>] ? alloc_pages_current+0x7f/0x100
[  143.370275]  [<ffffffff8123456b>] ? bio_copy_kern+0xbb/0x170
[  143.374695]  [<ffffffff8123d53a>] ? blk_rq_map_kern+0x6a/0x120
[  143.379295]  [<ffffffff81237ca2>] ? blk_get_request+0x72/0xd0
[  143.383477]  [<ffffffff81388cf2>] ? scsi_execute+0x122/0x150
[  143.388023]  [<ffffffff81388df5>] ? scsi_execute_req_flags+0x85/0xf0
[  143.392883]  [<ffffffffa01dd719>] ? sr_check_events+0xb9/0x2b0 [sr_mod]
[  143.397909]  [<ffffffffa01d114f>] ? cdrom_check_events+0xf/0x30 [cdrom]
[  143.403016]  [<ffffffff8124772a>] ? disk_check_events+0x5a/0x140
[  143.407606]  [<ffffffff8107e484>] ? process_one_work+0x134/0x310
[  143.412245]  [<ffffffff8107e77d>] ? worker_thread+0x11d/0x4a0
[  143.416729]  [<ffffffff81556a51>] ? __schedule+0x271/0x7b0
[  143.421047]  [<ffffffff8107e660>] ? process_one_work+0x310/0x310
[  143.425624]  [<ffffffff81083b68>] ? kthread+0xc8/0xe0
[  143.429595]  [<ffffffff8155a502>] ? ret_from_fork+0x22/0x40
[  143.433897]  [<ffffffff81083aa0>] ? kthread_create_on_node+0x1a0/0x1a0
[  143.440187] MemAlloc: kworker/u128:2(270) flags=0x4a28860 switches=68907 seq=90 gfp=0x2400240(GFP_NOFS|__GFP_NOWARN) order=0 delay=60000 uninterruptible
[  143.450674] kworker/u128:2  D 0000000000000017     0   270      2 0x00000000
[  143.456069] Workqueue: writeback wb_workfn (flush-8:0)
[  143.460752]  ffff880036034180 ffff880039ffc000 ffff880039ffae68 ffff88003f66dfc0
[  143.466560]  ffff880039ffae00 00000000fffd99b1 0000000000000017 ffffffff810c041f
[  143.472246]  ffff88003f66dfc0 ffffffff8155965e 0000000000000000 0000000000000286
[  143.478837] Call Trace:
[  143.481096]  [<ffffffff81556fbc>] ? schedule+0x2c/0x80
[  143.485192]  [<ffffffff8155965e>] ? schedule_timeout+0x11e/0x1c0
[  143.489958]  [<ffffffff810c0270>] ? cascade+0x80/0x80
[  143.494002]  [<ffffffff8112e9f7>] ? wait_iff_congested+0xd7/0x120
[  143.498750]  [<ffffffff810a20b0>] ? wait_woken+0x80/0x80
[  143.502968]  [<ffffffff8112454f>] ? shrink_inactive_list+0x30f/0x490
[  143.507907]  [<ffffffff81125071>] ? shrink_zone_memcg+0x5d1/0x790
[  143.512611]  [<ffffffff8117553d>] ? mem_cgroup_iter+0x14d/0x2b0
[  143.517315]  [<ffffffff81125307>] ? shrink_zone+0xd7/0x2f0
[  143.521575]  [<ffffffff8112586a>] ? do_try_to_free_pages+0x15a/0x3e0
[  143.526428]  [<ffffffff81125b85>] ? try_to_free_pages+0x95/0xc0
[  143.530957]  [<ffffffff8111a38f>] ? __alloc_pages_nodemask+0x63f/0xc40
[  143.536014]  [<ffffffff8115dcef>] ? alloc_pages_current+0x7f/0x100
[  143.541053]  [<ffffffffa02539c2>] ? xfs_buf_allocate_memory+0x16a/0x2a5 [xfs]
[  143.546614]  [<ffffffffa022251b>] ? xfs_buf_get_map+0xeb/0x140 [xfs]
[  143.551461]  [<ffffffffa0222a03>] ? xfs_buf_read_map+0x23/0xd0 [xfs]
[  143.556319]  [<ffffffffa024a827>] ? xfs_trans_read_buf_map+0x87/0x190 [xfs]
[  143.561610]  [<ffffffffa01fdc22>] ? xfs_btree_read_buf_block.constprop.29+0x72/0xc0 [xfs]
[  143.568068]  [<ffffffffa01fdce8>] ? xfs_btree_lookup_get_block+0x78/0xe0 [xfs]
[  143.573722]  [<ffffffffa0202262>] ? xfs_btree_lookup+0xc2/0x570 [xfs]
[  143.578671]  [<ffffffffa01e9712>] ? xfs_alloc_fixup_trees+0x282/0x350 [xfs]
[  143.583941]  [<ffffffffa01eb7af>] ? xfs_alloc_ag_vextent_near+0x55f/0x910 [xfs]
[  143.589444]  [<ffffffffa01ebc55>] ? xfs_alloc_ag_vextent+0xf5/0x120 [xfs]
[  143.594584]  [<ffffffffa01ec72b>] ? xfs_alloc_vextent+0x3bb/0x470 [xfs]
[  143.599674]  [<ffffffffa01f9de7>] ? xfs_bmap_btalloc+0x3d7/0x760 [xfs]
[  143.604422]  [<ffffffffa01fab34>] ? xfs_bmapi_write+0x474/0xa20 [xfs]
[  143.609329]  [<ffffffffa022de73>] ? xfs_iomap_write_allocate+0x163/0x380 [xfs]
[  143.614804]  [<ffffffffa021c255>] ? xfs_map_blocks+0x135/0x150 [xfs]
[  143.619661]  [<ffffffffa021cffa>] ? xfs_do_writepage+0x16a/0x510 [xfs]
[  143.624496]  [<ffffffff8111c9fe>] ? write_cache_pages+0x1ae/0x400
[  143.629218]  [<ffffffffa021ce90>] ? xfs_aops_discard_page+0x130/0x130 [xfs]
[  143.634413]  [<ffffffffa021ccbf>] ? xfs_vm_writepages+0x5f/0xa0 [xfs]
[  143.639403]  [<ffffffff811aa9fc>] ? __writeback_single_inode+0x2c/0x170
[  143.644474]  [<ffffffff811ab013>] ? writeback_sb_inodes+0x223/0x4e0
[  143.649194]  [<ffffffff811ab352>] ? __writeback_inodes_wb+0x82/0xb0
[  143.654019]  [<ffffffff811ab56c>] ? wb_writeback+0x1ec/0x220
[  143.658215]  [<ffffffff811aba5e>] ? wb_workfn+0xde/0x290
[  143.662373]  [<ffffffff8107e484>] ? process_one_work+0x134/0x310
[  143.667058]  [<ffffffff8107e77d>] ? worker_thread+0x11d/0x4a0
[  143.671623]  [<ffffffff81556a51>] ? __schedule+0x271/0x7b0
[  143.676393]  [<ffffffff8107e660>] ? process_one_work+0x310/0x310
[  143.681168]  [<ffffffff81083b68>] ? kthread+0xc8/0xe0
[  143.685169]  [<ffffffff8155a502>] ? ret_from_fork+0x22/0x40
[  143.689497]  [<ffffffff81083aa0>] ? kthread_create_on_node+0x1a0/0x1a0
(...snipped...)
[  143.791611] MemAlloc: file_io.08(1424) flags=0x400040 switches=1058 uninterruptible dying victim
[  143.798403] file_io.08      D ffff88003c285d98     0  1424      1 0x00100084
[  143.803820]  ffff88003d36e180 ffff88003d374000 ffff88003d373d80 ffff88003c285d94
[  143.809802]  ffff88003d36e180 00000000ffffffff ffff88003c285d98 ffffffff81556fbc
[  143.815638]  ffff88003c285d90 ffffffff81557255 ffffffff81558604 ffff88003d37fd30
[  143.821210] Call Trace:
[  143.823431]  [<ffffffff81556fbc>] ? schedule+0x2c/0x80
[  143.828700]  [<ffffffff81557255>] ? schedule_preempt_disabled+0x5/0x10
[  143.833661]  [<ffffffff81558604>] ? __mutex_lock_slowpath+0xb4/0x130
[  143.838552]  [<ffffffff81558696>] ? mutex_lock+0x16/0x25
[  143.842614]  [<ffffffffa022687c>] ? xfs_file_buffered_aio_write+0x5c/0x1e0 [xfs]
[  143.847945]  [<ffffffff810226ad>] ? __switch_to+0x20d/0x3f0
[  143.852188]  [<ffffffffa0226a86>] ? xfs_file_write_iter+0x86/0x140 [xfs]
[  143.857179]  [<ffffffff811838cb>] ? __vfs_write+0xcb/0x100
[  143.861441]  [<ffffffff81184478>] ? vfs_write+0x98/0x190
[  143.865629]  [<ffffffff81556a51>] ? __schedule+0x271/0x7b0
[  143.869902]  [<ffffffff8118583d>] ? SyS_write+0x4d/0xc0
[  143.874031]  [<ffffffff810034a7>] ? do_syscall_64+0x57/0xf0
[  143.878258]  [<ffffffff8155a3a1>] ? entry_SYSCALL64_slow_path+0x25/0x25
(...snipped...)
[  165.512677] Mem-Info:
[  165.514925] active_anon:166683 inactive_anon:1640 isolated_anon:0
[  165.514925]  active_file:10870 inactive_file:49863 isolated_file:68
[  165.514925]  unevictable:0 dirty:49806 writeback:112 unstable:0
[  165.514925]  slab_reclaimable:3373 slab_unreclaimable:7156
[  165.514925]  mapped:10566 shmem:1703 pagetables:1606 bounce:0
[  165.514925]  free:1854 free_pcp:130 free_cma:0
[  165.541474] Node 0 DMA free:3932kB min:60kB low:72kB high:84kB active_anon:7596kB inactive_anon:176kB active_file:328kB inactive_file:976kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:976kB writeback:0kB mapped:404kB shmem:176kB slab_reclaimable:128kB slab_unreclaimable:488kB kernel_stack:144kB pagetables:140kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:8636 all_unreclaimable? yes
[  165.574685] lowmem_reserve[]: 0 968 968 968
[  165.578352] Node 0 DMA32 free:3484kB min:3812kB low:4804kB high:5796kB active_anon:659136kB inactive_anon:6384kB active_file:43152kB inactive_file:198476kB unevictable:0kB isolated(anon):0kB isolated(file):272kB present:1032064kB managed:996224kB mlocked:0kB dirty:198248kB writeback:448kB mapped:41860kB shmem:6636kB slab_reclaimable:13364kB slab_unreclaimable:28136kB kernel_stack:7792kB pagetables:6284kB unstable:0kB bounce:0kB free_pcp:520kB local_pcp:216kB free_cma:0kB writeback_tmp:0kB pages_scanned:201090336 all_unreclaimable? yes
[  165.612568] lowmem_reserve[]: 0 0 0 0
[  165.615805] Node 0 DMA: 23*4kB (UM) 30*8kB (UM) 21*16kB (U) 6*32kB (U) 4*64kB (U) 2*128kB (U) 0*256kB 3*512kB (UM) 1*1024kB (U) 0*2048kB 0*4096kB = 3932kB
[  165.626447] Node 0 DMA32: 759*4kB (UE) 54*8kB (U) 1*16kB (U) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3484kB
[  165.635697] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  165.642340] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  165.648792] 62515 total pagecache pages
[  165.652093] 0 pages in swap cache
[  165.655325] Swap cache stats: add 0, delete 0, find 0/0
[  165.659472] Free swap  = 0kB
[  165.662094] Total swap = 0kB
[  165.664813] 262013 pages RAM
[  165.667364] 0 pages HighMem/MovableOnly
[  165.670595] 8981 pages reserved
[  165.673400] 0 pages cma reserved
[  165.676333] 0 pages hwpoisoned
[  165.679103] Showing busy workqueues and worker pools:
[  165.683077] workqueue events: flags=0x0
[  165.686367]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[  165.690779]     pending: vmpressure_work_fn
[  165.694084]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  165.698960]     pending: vmw_fb_dirty_flush [vmwgfx]
[  165.703112] workqueue events_freezable_power_: flags=0x84
[  165.707516]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  165.711938]     in-flight: 61:disk_events_workfn
[  165.715500] workqueue writeback: flags=0x4e
[  165.719068]   pwq 128: cpus=0-63 flags=0x4 nice=0 active=2/256
[  165.723890]     in-flight: 270:wb_workfn wb_workfn
[  165.728443] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=4 idle: 209 3311 23
[  165.734327] pool 128: cpus=0-63 flags=0x4 nice=0 hung=0s workers=5 idle: 6 51 277 276
[  165.740618] MemAlloc-Info: stalling=2 dying=178 exiting=31 victim=1 oom_count=3071760/430759
----------

Complete log for ext4 is at http://I-love.SAKURA.ne.jp/tmp/serial-20160405.txt.xz
----------
[  186.620979] Out of memory: Kill process 4458 (file_io.24) score 997 or sacrifice child
[  186.627897] Killed process 4458 (file_io.24) total-vm:4336kB, anon-rss:116kB, file-rss:1024kB, shmem-rss:0kB
[  186.688345] oom_reaper: reaped process 4458 (file_io.24), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
(...snipped...)
[  187.089562] Killed process 3499 (writer.26) total-vm:4344kB, anon-rss:80kB, file-rss:64kB, shmem-rss:0kB
[  242.174775] MemAlloc-Info: stalling=9 dying=31 exiting=0 victim=1 oom_count=752788/16556
[  242.183365] MemAlloc: kswapd0(49) flags=0xa40840 switches=994137
[  242.188759] kswapd0         R  running task        0    49      2 0x00000000
[  242.195022]  ffff88003af2fd20 ffff88003af30000 ffff88003af2fde8 ffff88003f62dfc0
[  242.201296]  ffff88003af2fd80 00000000ffff1d70 ffff88003ffde000 ffffffff81587dec
[  242.207771]  ffff88003f62dfc0 ffffffff8158a48e ffffffff811249c7 0000000000000286
[  242.213864] Call Trace:
[  242.216691]  [<ffffffff81587dec>] ? schedule+0x2c/0x80
[  242.221078]  [<ffffffff811249c7>] ? shrink_zone+0xd7/0x2f0
[  242.225342]  [<ffffffff810c00a0>] ? cascade+0x80/0x80
[  242.229333]  [<ffffffff81125b89>] ? kswapd+0x709/0x7d0
[  242.233452]  [<ffffffff810a1ee0>] ? wait_woken+0x80/0x80
[  242.237618]  [<ffffffff81125480>] ? mem_cgroup_shrink_node_zone+0xb0/0xb0
[  242.242602]  [<ffffffff81083b18>] ? kthread+0xc8/0xe0
[  242.246718]  [<ffffffff8158b342>] ? ret_from_fork+0x22/0x40
[  242.250939]  [<ffffffff81083a50>] ? kthread_create_on_node+0x1a0/0x1a0
[  242.257505] MemAlloc: kworker/u128:1(51) flags=0x4a08860 switches=80360 seq=18 gfp=0x2400040(GFP_NOFS) order=0 delay=60000 uninterruptible
[  242.266407] kworker/u128:1  D 0000000000000017     0    51      2 0x00000000
[  242.272485] Workqueue: writeback wb_workfn (flush-8:0)
[  242.276909]  ffff880036814740 ffff88003681c000 ffff88003681b278 ffff88003f64dfc0
[  242.282635]  00000000a6a32935 00000000ffff1d5e ffff88003681b278 ffffffff81587dec
[  242.288840]  ffff88003f64dfc0 ffffffff8158a496 0000000000000000 0000000000000286
[  242.294612] Call Trace:
[  242.297193]  [<ffffffff81587dec>] ? schedule+0x2c/0x80
[  242.301418]  [<ffffffff8158a48e>] ? schedule_timeout+0x11e/0x1c0
[  242.306184]  [<ffffffff810c00a0>] ? cascade+0x80/0x80
[  242.310356]  [<ffffffff8112df97>] ? wait_iff_congested+0xd7/0x120
[  242.314891]  [<ffffffff810a1ee0>] ? wait_woken+0x80/0x80
[  242.319372]  [<ffffffff81123c0f>] ? shrink_inactive_list+0x30f/0x490
[  242.324695]  [<ffffffff81124731>] ? shrink_zone_memcg+0x5d1/0x790
[  242.329526]  [<ffffffff81095f29>] ? check_preempt_wakeup+0x119/0x230
[  242.334118]  [<ffffffff81094d6f>] ? dequeue_entity+0x23f/0x8e0
[  242.339120]  [<ffffffff811249c7>] ? shrink_zone+0xd7/0x2f0
[  242.343704]  [<ffffffff81124f2a>] ? do_try_to_free_pages+0x15a/0x3e0
[  242.348572]  [<ffffffff81125245>] ? try_to_free_pages+0x95/0xc0
[  242.353213]  [<ffffffff81119a4f>] ? __alloc_pages_nodemask+0x63f/0xc40
[  242.358128]  [<ffffffff8115d2df>] ? alloc_pages_current+0x7f/0x100
[  242.362766]  [<ffffffff81110445>] ? pagecache_get_page+0x85/0x240
[  242.367679]  [<ffffffff81228fb7>] ? ext4_mb_load_buddy_gfp+0x357/0x440
[  242.372621]  [<ffffffff8122b599>] ? ext4_mb_regular_allocator+0x169/0x470
[  242.377834]  [<ffffffff81094d6f>] ? dequeue_entity+0x23f/0x8e0
[  242.382677]  [<ffffffff8122d059>] ? ext4_mb_new_blocks+0x369/0x440
[  242.387572]  [<ffffffff81222bc0>] ? ext4_ext_map_blocks+0x10c0/0x1770
[  242.392153]  [<ffffffff8111e373>] ? release_pages+0x243/0x350
[  242.396704]  [<ffffffff81110bb3>] ? find_get_pages_tag+0xd3/0x1b0
[  242.401379]  [<ffffffff81110099>] ? __lock_page+0x49/0xf0
[  242.405824]  [<ffffffff81201412>] ? ext4_map_blocks+0x122/0x510
[  242.410186]  [<ffffffff8120490c>] ? ext4_writepages+0x53c/0xb10
[  242.414687]  [<ffffffff811a968c>] ? __writeback_single_inode+0x2c/0x170
[  242.419531]  [<ffffffff811a9ca3>] ? writeback_sb_inodes+0x223/0x4e0
[  242.424284]  [<ffffffff811a9fe2>] ? __writeback_inodes_wb+0x82/0xb0
[  242.429196]  [<ffffffff811aa1fc>] ? wb_writeback+0x1ec/0x220
[  242.433267]  [<ffffffff811aa6ee>] ? wb_workfn+0xde/0x290
[  242.437275]  [<ffffffff8107e434>] ? process_one_work+0x134/0x310
[  242.441492]  [<ffffffff8107e72d>] ? worker_thread+0x11d/0x4a0
[  242.445781]  [<ffffffff8107e610>] ? process_one_work+0x310/0x310
[  242.450190]  [<ffffffff81083b18>] ? kthread+0xc8/0xe0
[  242.454366]  [<ffffffff8158b342>] ? ret_from_fork+0x22/0x40
[  242.458870]  [<ffffffff81083a50>] ? kthread_create_on_node+0x1a0/0x1a0
[  242.465699] MemAlloc: kworker/0:2(285) flags=0x4208860 switches=275666 seq=15 gfp=0x2400000(GFP_NOIO) order=0 delay=58093
[  242.474600] kworker/0:2     R  running task        0   285      2 0x00000000
[  242.479981] Workqueue: events_freezable_power_ disk_events_workfn
[  242.484669]  ffff8800396f8600 0000000000000286 ffff8800396ff768 ffff88003f60dfc0
[  242.490850]  ffff8800396ff700 ffff8800396ff700 0000000000000017 ffffffff81587dec
[  242.496493]  ffff88003f60dfc0 ffffffff8158a48e 0000000000000000 0000000000000286
[  242.502195] Call Trace:
[  242.504347]  [<ffffffff810c01dc>] ? try_to_del_timer_sync+0x4c/0x80
[  242.509164]  [<ffffffff81587dec>] ? schedule+0x2c/0x80
[  242.513106]  [<ffffffff8158a48e>] ? schedule_timeout+0x11e/0x1c0
[  242.517333]  [<ffffffff810c00a0>] ? cascade+0x80/0x80
[  242.521263]  [<ffffffff8112df6f>] ? wait_iff_congested+0xaf/0x120
[  242.525472]  [<ffffffff810a1ee0>] ? wait_woken+0x80/0x80
[  242.529443]  [<ffffffff81123c0f>] ? shrink_inactive_list+0x30f/0x490
[  242.534392]  [<ffffffff81124731>] ? shrink_zone_memcg+0x5d1/0x790
[  242.539076]  [<ffffffff81094910>] ? update_curr+0x90/0xd0
[  242.543052]  [<ffffffff81174b0d>] ? mem_cgroup_iter+0x14d/0x2b0
[  242.547529]  [<ffffffff811249c7>] ? shrink_zone+0xd7/0x2f0
[  242.551904]  [<ffffffff81124f2a>] ? do_try_to_free_pages+0x15a/0x3e0
[  242.556629]  [<ffffffff81125245>] ? try_to_free_pages+0x95/0xc0
[  242.561000]  [<ffffffff81119c77>] ? __alloc_pages_nodemask+0x867/0xc40
[  242.566133]  [<ffffffff8115d2df>] ? alloc_pages_current+0x7f/0x100
[  242.570852]  [<ffffffff81265b3b>] ? bio_copy_kern+0xbb/0x170
[  242.575036]  [<ffffffff8126eb0a>] ? blk_rq_map_kern+0x6a/0x120
[  242.579227]  [<ffffffff81269272>] ? blk_get_request+0x72/0xd0
[  242.583721]  [<ffffffff813ba2e2>] ? scsi_execute+0x122/0x150
[  242.588072]  [<ffffffff813ba3e5>] ? scsi_execute_req_flags+0x85/0xf0
[  242.592773]  [<ffffffffa01cf719>] ? sr_check_events+0xb9/0x2b0 [sr_mod]
[  242.597639]  [<ffffffffa01c314f>] ? cdrom_check_events+0xf/0x30 [cdrom]
[  242.602455]  [<ffffffff81278cfa>] ? disk_check_events+0x5a/0x140
[  242.606821]  [<ffffffff8107e434>] ? process_one_work+0x134/0x310
[  242.611191]  [<ffffffff8107e72d>] ? worker_thread+0x11d/0x4a0
[  242.615560]  [<ffffffff81587881>] ? __schedule+0x271/0x7b0
[  242.619988]  [<ffffffff8107e610>] ? process_one_work+0x310/0x310
[  242.624618]  [<ffffffff81083b18>] ? kthread+0xc8/0xe0
[  242.628245]  [<ffffffff8158b342>] ? ret_from_fork+0x22/0x40
[  242.632185]  [<ffffffff81083a50>] ? kthread_create_on_node+0x1a0/0x1a0
(...snipped...)
[  245.572082] MemAlloc: file_io.24(4715) flags=0x400040 switches=8650 uninterruptible dying victim
[  245.578876] file_io.24      D 0000000000000000     0  4715      1 0x00100084
[  245.584122]  ffff88002fd9c000 ffff88002fda4000 ffff880036221870 00000000000035a2
[  245.589618]  0000000000000000 ffff880036221870 0000000000000000 ffffffff81587dec
[  245.595428]  ffff880036221800 ffffffff8123b821 0000000000000000 ffff88002fd9c000
[  245.601370] Call Trace:
[  245.603428]  [<ffffffff81587dec>] ? schedule+0x2c/0x80
[  245.607680]  [<ffffffff8123b821>] ? wait_transaction_locked+0x81/0xc0
[  245.613586]  [<ffffffff810a1ee0>] ? wait_woken+0x80/0x80
[  245.618074]  [<ffffffff8123ba9a>] ? add_transaction_credits+0x21a/0x2a0
[  245.623497]  [<ffffffff81178abc>] ? mem_cgroup_commit_charge+0x7c/0xf0
[  245.628352]  [<ffffffff8123bceb>] ? start_this_handle+0x18b/0x400
[  245.632755]  [<ffffffff8110fb6e>] ? add_to_page_cache_lru+0x6e/0xd0
[  245.637274]  [<ffffffff8123c294>] ? jbd2__journal_start+0xf4/0x190
[  245.642298]  [<ffffffff81205ca4>] ? ext4_da_write_begin+0x114/0x360
[  245.647035]  [<ffffffff8111116e>] ? generic_perform_write+0xce/0x1d0
[  245.651651]  [<ffffffff8119c440>] ? file_update_time+0xc0/0x110
[  245.656166]  [<ffffffff81111f2d>] ? __generic_file_write_iter+0x16d/0x1c0
[  245.660835]  [<ffffffff811fbafa>] ? ext4_file_write_iter+0x12a/0x340
[  245.665292]  [<ffffffff810226ad>] ? __switch_to+0x20d/0x3f0
[  245.669604]  [<ffffffff81182ddb>] ? __vfs_write+0xcb/0x100
[  245.673904]  [<ffffffff81183968>] ? vfs_write+0x98/0x190
[  245.678174]  [<ffffffff81184d2d>] ? SyS_write+0x4d/0xc0
[  245.682376]  [<ffffffff810034a7>] ? do_syscall_64+0x57/0xf0
[  245.686845]  [<ffffffff8158b1e1>] ? entry_SYSCALL64_slow_path+0x25/0x25
(...snipped...)
[  246.216363] Mem-Info:
[  246.218425] active_anon:183099 inactive_anon:2734 isolated_anon:0
[  246.218425]  active_file:2006 inactive_file:36363 isolated_file:0
[  246.218425]  unevictable:0 dirty:36369 writeback:0 unstable:0
[  246.218425]  slab_reclaimable:2055 slab_unreclaimable:9453
[  246.218425]  mapped:2266 shmem:3080 pagetables:1480 bounce:0
[  246.218425]  free:1814 free_pcp:197 free_cma:0
[  246.245998] Node 0 DMA free:3928kB min:60kB low:72kB high:84kB active_anon:7868kB inactive_anon:112kB active_file:188kB inactive_file:1504kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:1440kB writeback:0kB mapped:132kB shmem:120kB slab_reclaimable:184kB slab_unreclaimable:592kB kernel_stack:624kB pagetables:304kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:42274336 all_unreclaimable? yes
[  246.281121] lowmem_reserve[]: 0 968 968 968
[  246.284938] Node 0 DMA32 free:3328kB min:3812kB low:4804kB high:5796kB active_anon:724528kB inactive_anon:10824kB active_file:7836kB inactive_file:143948kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1032064kB managed:996008kB mlocked:0kB dirty:144036kB writeback:0kB mapped:8932kB shmem:12200kB slab_reclaimable:8036kB slab_unreclaimable:37220kB kernel_stack:23680kB pagetables:5616kB unstable:0kB bounce:0kB free_pcp:788kB local_pcp:116kB free_cma:0kB writeback_tmp:0kB pages_scanned:22926424 all_unreclaimable? yes
[  246.319945] lowmem_reserve[]: 0 0 0 0
[  246.323303] Node 0 DMA: 32*4kB (UME) 35*8kB (UME) 18*16kB (UE) 9*32kB (UE) 6*64kB (ME) 2*128kB (UE) 3*256kB (E) 3*512kB (UME) 0*1024kB 0*2048kB 0*4096kB = 3928kB
[  246.334695] Node 0 DMA32: 332*4kB (UE) 244*8kB (U) 3*16kB (U) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3328kB
[  246.344693] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  246.351599] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  246.357749] 41456 total pagecache pages
[  246.360874] 0 pages in swap cache
[  246.363717] Swap cache stats: add 0, delete 0, find 0/0
[  246.368022] Free swap  = 0kB
[  246.370769] Total swap = 0kB
[  246.373444] 262013 pages RAM
[  246.376115] 0 pages HighMem/MovableOnly
[  246.379669] 9035 pages reserved
[  246.382654] 0 pages cma reserved
[  246.385675] 0 pages hwpoisoned
[  246.388597] Showing busy workqueues and worker pools:
[  246.392477] workqueue events: flags=0x0
[  246.395797]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[  246.400741]     pending: vmw_fb_dirty_flush [vmwgfx]
[  246.405129] workqueue events_freezable_power_: flags=0x84
[  246.409390]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  246.413910]     in-flight: 285:disk_events_workfn
[  246.417932] workqueue writeback: flags=0x4e
[  246.421660]   pwq 128: cpus=0-63 flags=0x4 nice=0 active=2/256
[  246.426158]     in-flight: 51:wb_workfn wb_workfn
[  246.430208] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=4 idle: 42 3280 4
[  246.435871] pool 128: cpus=0-63 flags=0x4 nice=0 hung=0s workers=4 idle: 260 6 259
[  246.441342] MemAlloc-Info: stalling=9 dying=31 exiting=0 victim=1 oom_count=783613/16904
----------

If I apply
----------
diff --git a/block/bio.c b/block/bio.c
index f124a0a..03250e86 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1504,6 +1504,8 @@ struct bio *bio_copy_kern(struct request_queue *q, void *data, unsigned int len,
 	void *p = data;
 	int nr_pages = 0;

+	gfp_mask |= __GFP_HIGH;
+
 	/*
 	 * Overflow, abort
 	 */
----------
then the disk_events_workfn stall is gone. If I also apply
----------
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 9a2191b..448f61e 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -55,7 +55,7 @@ static kmem_zone_t *xfs_buf_zone;
 #endif

 #define xb_to_gfp(flags) \
-	((((flags) & XBF_READ_AHEAD) ? __GFP_NORETRY : GFP_NOFS) | __GFP_NOWARN)
+	((((flags) & XBF_READ_AHEAD) ? __GFP_NORETRY : (GFP_NOFS | __GFP_HIGH)) | __GFP_NOWARN)


 static inline int
----------
then both the disk_events_workfn stall and the wb_workfn stall are gone,
and I can no longer reproduce the OOM livelock using this reproducer.
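
For what it's worth, the reason __GFP_HIGH makes a difference here is that
it is translated into ALLOC_HIGH, which lowers the effective watermark in
the zone watermark check, so such allocations may dip below the min:
watermark that ordinary GFP_KERNEL / GFP_NOFS / GFP_NOIO requests are held
to. A minimal stand-alone sketch of that logic (the ALLOC_* values and the
helper name are illustrative and only paraphrase mm/page_alloc.c of this
era):
----------
#include <stdbool.h>

#define ALLOC_HIGH      0x20    /* caller passed __GFP_HIGH */
#define ALLOC_HARDER    0x10    /* e.g. atomic allocations */

static bool watermark_ok_sketch(unsigned long free_pages, unsigned long min,
                                int alloc_flags)
{
        if (alloc_flags & ALLOC_HIGH)
                min -= min / 2;         /* __GFP_HIGH may dip halfway below min: */
        if (alloc_flags & ALLOC_HARDER)
                min -= min / 4;         /* and further for "harder" requests */
        /* without these flags the full min: watermark applies */
        return free_pages > min;
}
----------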

Therefore, I think that the root cause of the OOM livelock is that

  (A) We use the same watermark for GFP_KERNEL / GFP_NOFS / GFP_NOIO
      allocation requests.

  (B) We allow GFP_KERNEL allocation requests to consume memory to
      min: watermark.

  (C) GFP_KERNEL allocation requests might depend on GFP_NOFS
      allocation requests, and GFP_NOFS allocation requests
      might depend on GFP_NOIO allocation requests.

  (D) TIF_MEMDIE thread might wait forever for other thread's
      GFP_NOFS / GFP_NOIO allocation requests.

There is no gfp flag that prevents GFP_KERNEL from consuming memory down to
the min: watermark. Thus, it is inevitable that GFP_KERNEL allocations
consume memory down to the min: watermark and invoke the OOM killer. But if
we allow memory allocations which might block writeback operations to
utilize memory reserves, it is likely that allocations from workqueue items
will no longer stall, even without depending on mmap_sem, which is a
weakness of the OOM reaper.

Of course, there is no guarantee that allowing such GFP_NOFS / GFP_NOIO
allocations to utilize memory reserves always avoids OOM livelock. But
at least we don't need to give up on GFP_NOFS / GFP_NOIO allocations
immediately without trying to utilize memory reserves.
Therefore, I object to this comment

Michal Hocko wrote:
> +		/*
> +		 * XXX: GFP_NOFS allocations should rather fail than rely on
> +		 * other request to make a forward progress.
> +		 * We are in an unfortunate situation where out_of_memory cannot
> +		 * do much for this context but let's try it to at least get
> +		 * access to memory reserved if the current task is killed (see
> +		 * out_of_memory). Once filesystems are ready to handle allocation
> +		 * failures more gracefully we should just bail out here.
> +		 */
> +

that tries to make !__GFP_FS allocations fail.

It is possible that such GFP_NOFS / GFP_NOIO allocations need to select the
next OOM victim. If we add a guaranteed unlocking mechanism (the simplest
way is a timeout), such GFP_NOFS / GFP_NOIO allocations will succeed, and
we can avoid a loss of reliability of async write operations.
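
A minimal sketch of what such a timeout based unlocking could look like;
the helper, the place the victim's timestamp would be stored, and the 10
second limit are all made up here purely for illustration:
----------
#include <linux/jiffies.h>

/*
 * Hypothetical helper: once a victim has held TIF_MEMDIE longer than some
 * limit, stop waiting for it and allow the OOM killer to select the next
 * victim.  @marked would be a timestamp recorded when TIF_MEMDIE was set;
 * where to store it and the 10 second limit are made up.
 */
static bool oom_victim_timed_out(unsigned long marked)
{
        return time_after(jiffies, marked + 10 * HZ);
}
----------
The victim-selection path would then skip (rather than abort on) an existing
TIF_MEMDIE victim for which oom_victim_timed_out() returns true.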

(By the way, can swap in/out work even if GFP_NOIO fails?)

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory
  2016-04-05 11:12 ` Tetsuo Handa
@ 2016-04-06 10:28   ` Tetsuo Handa
  2016-04-06 12:41   ` Michal Hocko
  1 sibling, 0 replies; 14+ messages in thread
From: Tetsuo Handa @ 2016-04-06 10:28 UTC (permalink / raw)
  To: mhocko, linux-mm; +Cc: rientjes, hannes, akpm, linux-kernel, mhocko

This ext4 livelock case shows a race window which commit 36324a990cf5
("oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space")
did not take into account.

----------
[  186.620979] Out of memory: Kill process 4458 (file_io.24) score 997 or sacrifice child
[  186.627897] Killed process 4458 (file_io.24) total-vm:4336kB, anon-rss:116kB, file-rss:1024kB, shmem-rss:0kB
[  186.688345] oom_reaper: reaped process 4458 (file_io.24), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

[  245.572082] MemAlloc: file_io.24(4715) flags=0x400040 switches=8650 uninterruptible dying victim
[  245.578876] file_io.24      D 0000000000000000     0  4715      1 0x00100084
[  245.584122]  ffff88002fd9c000 ffff88002fda4000 ffff880036221870 00000000000035a2
[  245.589618]  0000000000000000 ffff880036221870 0000000000000000 ffffffff81587dec
[  245.595428]  ffff880036221800 ffffffff8123b821 0000000000000000 ffff88002fd9c000
[  245.601370] Call Trace:
[  245.603428]  [<ffffffff81587dec>] ? schedule+0x2c/0x80
[  245.607680]  [<ffffffff8123b821>] ? wait_transaction_locked+0x81/0xc0           /* linux-4.6-rc2/fs/jbd2/transaction.c:163 */
[  245.613586]  [<ffffffff810a1ee0>] ? wait_woken+0x80/0x80                        /* linux-4.6-rc2/kernel/sched/wait.c:292   */
[  245.618074]  [<ffffffff8123ba9a>] ? add_transaction_credits+0x21a/0x2a0         /* linux-4.6-rc2/fs/jbd2/transaction.c:191 */
[  245.623497]  [<ffffffff81178abc>] ? mem_cgroup_commit_charge+0x7c/0xf0
[  245.628352]  [<ffffffff8123bceb>] ? start_this_handle+0x18b/0x400               /* linux-4.6-rc2/fs/jbd2/transaction.c:357 */
[  245.632755]  [<ffffffff8110fb6e>] ? add_to_page_cache_lru+0x6e/0xd0
[  245.637274]  [<ffffffff8123c294>] ? jbd2__journal_start+0xf4/0x190              /* linux-4.6-rc2/fs/jbd2/transaction.c:459 */
[  245.642298]  [<ffffffff81205ca4>] ? ext4_da_write_begin+0x114/0x360             /* linux-4.6-rc2/fs/ext4/inode.c:2883      */
[  245.647035]  [<ffffffff8111116e>] ? generic_perform_write+0xce/0x1d0            /* linux-4.6-rc2/mm/filemap.c:2639         */
[  245.651651]  [<ffffffff8119c440>] ? file_update_time+0xc0/0x110
[  245.656166]  [<ffffffff81111f2d>] ? __generic_file_write_iter+0x16d/0x1c0       /* linux-4.6-rc2/mm/filemap.c:2765         */
[  245.660835]  [<ffffffff811fbafa>] ? ext4_file_write_iter+0x12a/0x340            /* linux-4.6-rc2/fs/ext4/file.c:170        */
[  245.665292]  [<ffffffff810226ad>] ? __switch_to+0x20d/0x3f0
[  245.669604]  [<ffffffff81182ddb>] ? __vfs_write+0xcb/0x100
[  245.673904]  [<ffffffff81183968>] ? vfs_write+0x98/0x190
[  245.678174]  [<ffffffff81184d2d>] ? SyS_write+0x4d/0xc0
[  245.682376]  [<ffffffff810034a7>] ? do_syscall_64+0x57/0xf0
[  245.686845]  [<ffffffff8158b1e1>] ? entry_SYSCALL64_slow_path+0x25/0x25
----------

ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from) {
  ret = __generic_file_write_iter(iocb, from) {
    written = generic_perform_write(file, from, iocb->ki_pos) {
      if (fatal_signal_pending(current)) {
        status = -EINTR;
        break;
      }
      status = a_ops->write_begin(file, mapping, pos, bytes, flags, &page, &fsdata) /* ext4_da_write_begin */ { /***** Event1 *****/
        handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, ext4_da_write_credits(inode, pos, len)) /* __ext4_journal_start */ {
          __ext4_journal_start_sb(inode->i_sb, line, type, blocks, rsv_blocks) {
            jbd2__journal_start(journal, blocks, rsv_blocks, GFP_NOFS, type, line) {
              err = start_this_handle(journal, handle, gfp_mask) {
                if (!journal->j_running_transaction) {
                  /*
                   * If __GFP_FS is not present, then we may be being called from
                   * inside the fs writeback layer, so we MUST NOT fail.
                   */
                  if ((gfp_mask & __GFP_FS) == 0)
                    gfp_mask |= __GFP_NOFAIL;
                  new_transaction = kmem_cache_zalloc(transaction_cache, gfp_mask); /***** Event2 *****/
                  if (!new_transaction)
                    return -ENOMEM;
                }
                /* We may have dropped j_state_lock - restart in that case */
                add_transaction_credits(journal, blocks, rsv_blocks) {
                  /*
                   * If the current transaction is locked down for commit, wait
                   * for the lock to be released.
                   */
                  if (t->t_state == T_LOCKED) { /***** Event3 *****/
                    wait_transaction_locked(journal); /***** Event4 *****/
                    return 1;
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Event1 ... The OOM killer sent SIGKILL to file_io.24(4715) because
           file_io.24(4715) was sharing memory with file_io.24(4458).

Event2 ... file_io.24(4715) silently got TIF_MEMDIE via the
           fatal_signal_pending(current) shortcut in out_of_memory()
           (sketched below), because kmem_cache_zalloc() is allowed to
           call out_of_memory() due to __GFP_NOFAIL.

Event3 ... The OOM reaper completed reaping memory used by file_io.24(4458)
           and marked file_io.24(4458) as no longer OOM-killable by now.
           But since the OOM reaper cleared TIF_MEMDIE from only
           file_io.24(4458), TIF_MEMDIE in file_io.24(4715) still remains.

Event4 ... file_io.24(4715) (which used GFP_NOFS | __GFP_NOFAIL) is waiting
           for kworker/u128:1(51) (which used GFP_NOFS) to complete wb_workfn.
           But both kworker/u128:1(51) (which used GFP_NOFS) and kworker/0:2(285)
           (which used GFP_NOIO) cannot make forward progress because the OOM
           reaper does not clear TIF_MEMDIE from file_io.24(4715), and the OOM
           killer does not select next OOM victim due to TIF_MEMDIE in
           file_io.24(4715).
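
The Event2 shortcut is roughly the following check at the top of
out_of_memory() in this kernel version (paraphrased, not an exact quote):
----------
        /*
         * A task that already has a fatal signal pending (or is exiting) is
         * made the victim on the spot, i.e. it gets TIF_MEMDIE without going
         * through victim selection.
         */
        if (current->mm &&
            (fatal_signal_pending(current) || task_will_free_mem(current))) {
                mark_oom_victim(current);       /* sets TIF_MEMDIE */
                return true;
        }
----------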

If we remove these shortcuts, set TIF_MEMDIE on all OOM-killed threads
sharing the victim's memory at oom_kill_process(), and clear TIF_MEMDIE from
all threads sharing the victim's memory at __oom_reap_task() (or do the
equivalent using a per-signal_struct flag, a per-mm_struct flag, or a timer),
we wouldn't have hit this race window. Thus, I say again that removing
these shortcuts is better, unless we add a guaranteed unlocking mechanism
like a timer.
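
A hypothetical sketch of the marking side of that suggestion (locking and
thread iteration details are glossed over; mark_oom_victim() is the existing
helper that sets TIF_MEMDIE, the rest of the names are made up):
----------
#include <linux/sched.h>
#include <linux/oom.h>

static void mark_all_mm_sharers(struct mm_struct *mm)
{
        struct task_struct *p;

        rcu_read_lock();
        for_each_process(p) {
                /* simplification: a real version needs task_lock()/threads */
                if (p->mm == mm)
                        mark_oom_victim(p);
        }
        rcu_read_unlock();
}
----------
The reaper side would walk the same set of tasks and clear TIF_MEMDIE from
all of them once the victim's memory has been reaped.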

Also, I want to say again that letting the current thread's allocation
request complete by giving it TIF_MEMDIE does not guarantee that the current
thread will be able to arrive at do_exit() shortly. It is possible that
the current thread gets blocked at an unkillable wait even if the current
allocation succeeds.

Also, is it acceptable to make the allocation requests by kworker/u128:1(51)
and kworker/0:2(285) fail because they are !__GFP_FS && !__GFP_NOFAIL, when
file_io.24(4715) has managed to allocate memory for the journal's transaction
using GFP_NOFS | __GFP_NOFAIL?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory
  2016-04-05 11:12 ` Tetsuo Handa
  2016-04-06 10:28   ` Tetsuo Handa
@ 2016-04-06 12:41   ` Michal Hocko
  1 sibling, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2016-04-06 12:41 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, hannes, akpm, linux-kernel

On Tue 05-04-16 20:12:51, Tetsuo Handa wrote:
[...]
> What I can observe under OOM livelock condition is a three-way dependency loop.
> 
>   (1) An OOM victim (which has TIF_MEMDIE) is unable to make forward progress
>       due to blocked at unkillable lock waiting for other thread's memory
>       allocation.
> 
>   (2) A filesystem writeback work item is unable to make forward progress
>       due to waiting for GFP_NOFS memory allocation to be satisfied because
>       storage I/O is stalling.
> 
>   (3) A disk I/O work item is unable to make forward progress due to
>       waiting for GFP_NOIO memory allocation to be satisfied because
>       an OOM victim does not release memory but the OOM reaper does not
>       unlock TIF_MEMDIE.

It is true that find_lock_task_mm might have returned NULL, in which case
we cannot reap anything. I guess we want to clear TIF_MEMDIE for such a
task because it wouldn't have been selected in the next oom victim
selection round anyway, so we can argue this would be acceptable. After more
thought about this, we can also clear it for tasks which block the oom_reaper
because of mmap_sem contention, because those would still be sitting on the
memory and we can retry selecting them later, so we cannot end up in a
worse state than we are in now. I will prepare a patch for that.

[...]

>   (A) We use the same watermark for GFP_KERNEL / GFP_NOFS / GFP_NOIO
>       allocation requests.
> 
>   (B) We allow GFP_KERNEL allocation requests to consume memory to
>       min: watermark.
> 
>   (C) GFP_KERNEL allocation requests might depend on GFP_NOFS
>       allocation requests, and GFP_NOFS allocation requests
>       might depend on GFP_NOIO allocation requests.
> 
>   (D) TIF_MEMDIE thread might wait forever for other thread's
>       GFP_NOFS / GFP_NOIO allocation requests.
> 
> There is no gfp flag that prevents GFP_KERNEL from consuming memory down to
> the min: watermark. Thus, it is inevitable that GFP_KERNEL allocations
> consume memory down to the min: watermark and invoke the OOM killer. But if
> we allow memory allocations which might block writeback operations to
> utilize memory reserves, it is likely that allocations from workqueue items
> will no longer stall, even without depending on mmap_sem, which is a
> weakness of the OOM reaper.

Depending on memory reserves just shifts the issue to a later moment.
Heavy GFP_NOFS loads would deplete this reserve very easily and we are
back to square one.

> Of course, there is no guarantee that allowing such GFP_NOFS / GFP_NOIO
> allocations to utilize memory reserves always avoids OOM livelock. But
> at least we don't need to give up on GFP_NOFS / GFP_NOIO allocations
> immediately without trying to utilize memory reserves.
> Therefore, I object to this comment
> 
> Michal Hocko wrote:
> > +		/*
> > +		 * XXX: GFP_NOFS allocations should rather fail than rely on
> > +		 * other request to make a forward progress.
> > +		 * We are in an unfortunate situation where out_of_memory cannot
> > +		 * do much for this context but let's try it to at least get
> > +		 * access to memory reserved if the current task is killed (see
> > +		 * out_of_memory). Once filesystems are ready to handle allocation
> > +		 * failures more gracefully we should just bail out here.
> > +		 */
> > +
> 
> that tries to make !__GFP_FS allocations fail.

I do not get what you object to. The comment is clear that we are not
yet there to make this happen. The primary purpose of the comment is to
make it clear where we should back off and fail if we _ever_ consider
this safe to do.

> It is possible that such GFP_NOFS / GFP_NOIO allocations need to select the
> next OOM victim. If we add a guaranteed unlocking mechanism (the simplest
> way is a timeout), such GFP_NOFS / GFP_NOIO allocations will succeed, and
> we can avoid a loss of reliability of async write operations.

This still relies on somebody else to make forward progress, which
is not good. I can imagine a highly theoretical situation where even
selecting another task doesn't lead to any relief because most of the
memory might be pinned for some reason.

> (By the way, can swap in/out work even if GFP_NOIO fails?)

The page would be redirtied and kept around if get_swap_bio failed the
GFP_NOIO allocation.
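
The relevant error path in __swap_writepage() looks roughly like this
(paraphrased from mm/page_io.c of this era, not an exact quote):

        bio = get_swap_bio(GFP_NOIO, page, end_write_func);
        if (bio == NULL) {
                set_page_dirty(page);   /* redirty so it is retried later */
                unlock_page(page);
                ret = -ENOMEM;
                goto out;
        }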

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2016-04-06 12:41 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-29 13:27 [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory Michal Hocko
2016-03-29 13:45 ` Tetsuo Handa
2016-03-29 14:22   ` Michal Hocko
2016-03-29 15:29     ` Tetsuo Handa
2016-03-29 14:14 ` Michal Hocko
2016-03-29 22:13 ` David Rientjes
2016-03-30  9:47   ` Michal Hocko
2016-03-30 11:46     ` Tetsuo Handa
2016-03-30 12:11       ` Michal Hocko
2016-03-31 11:56         ` Tetsuo Handa
2016-03-31 15:11           ` Michal Hocko
2016-04-05 11:12 ` Tetsuo Handa
2016-04-06 10:28   ` Tetsuo Handa
2016-04-06 12:41   ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).