linux-mm.kvack.org archive mirror
* [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
@ 2017-07-10  7:48 Michal Hocko
  2017-07-10 13:16 ` Vlastimil Babka
                   ` (3 more replies)
  0 siblings, 4 replies; 41+ messages in thread
From: Michal Hocko @ 2017-07-10  7:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Tetsuo Handa, Rik van Riel, Johannes Weiner,
	Vlastimil Babka, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Tetsuo Handa has reported [1][2][3] that direct reclaimers might get stuck
in the too_many_isolated loop basically for ever because the last few pages
on the LRU lists are isolated by kswapd, which is stuck on fs locks when
doing pageout or slab reclaim. This in turn means that there is nobody to
actually trigger the oom killer and the system is basically unusable.

too_many_isolated was introduced by commit 35cd78156c49 ("vmscan: throttle
direct reclaim when too many pages are isolated already") to prevent
premature oom killer invocations because back then a lack of reclaim
progress could indeed trigger the OOM killer too early. But since the
oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection"),
the allocation/reclaim retry loop considers all the reclaimable pages
and throttles the allocation at that layer, so we can loosen the direct
reclaim throttling.

Make the shrink_inactive_list loop over too_many_isolated bounded and return
immediately when the situation hasn't resolved after the first sleep.
Replace congestion_wait with a simple schedule_timeout_interruptible because
we are not really waiting on IO congestion in this path.

Please note that this patch can theoretically cause the OOM killer to
trigger earlier while there are many pages isolated for a reclaim which
makes progress only very slowly. This would be obvious from the oom
report as the number of isolated pages is printed there. If we ever hit
this, should_reclaim_retry should consider those numbers in the evaluation
in one way or another.

[1] http://lkml.kernel.org/r/201602092349.ACG81273.OSVtMJQHLOFOFF@I-love.SAKURA.ne.jp
[2] http://lkml.kernel.org/r/201702212335.DJB30777.JOFMHSFtVLQOOF@I-love.SAKURA.ne.jp
[3] http://lkml.kernel.org/r/201706300914.CEH95859.FMQOLVFHJFtOOS@I-love.SAKURA.ne.jp

Acked-by: Mel Gorman <mgorman@suse.de>
Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
Hi,
I am resubmitting this patch previously sent here
http://lkml.kernel.org/r/20170307133057.26182-1-mhocko@kernel.org.

Johannes and Rik had some concerns that this could lead to premature
OOM kills. I agree with them that we need a better throttling
mechanism. Until now we didn't give the issue described above a high
priority because it usually required a really insane workload to
trigger. But it seems that the issue can also be reproduced without
an insane number of competing threads [3].

Moreover, the issue also triggers very often while testing heavy memory
pressure, and so prevents further hardening work in that area
(http://lkml.kernel.org/r/201707061948.ICJ18763.tVFOQFOHMJFSLO@I-love.SAKURA.ne.jp).
Tetsuo hasn't seen any negative effects of this patch in his oom stress
tests, so I think we should go with this simple patch for now and think
about something more robust long term.

That being said, I suggest merging this (after it spends a full release
cycle in linux-next) for the time being, until we come up with a more
clever solution.

 mm/vmscan.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c15b2e4c47ca..4ae069060ae5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1713,9 +1713,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	int file = is_file_lru(lru);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+	bool stalled = false;
 
 	while (unlikely(too_many_isolated(pgdat, file, sc))) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+		if (stalled)
+			return 0;
+
+		/* wait a bit for the reclaimer. */
+		schedule_timeout_interruptible(HZ/10);
+		stalled = true;
 
 		/* We are about to die and free our memory. Return now. */
 		if (fatal_signal_pending(current))
-- 
2.11.0
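
For readers following the control flow, the head of shrink_inactive_list()
with the hunk above applied would read roughly as below (a sketch
reconstructed from the diff context, not the complete function):

	bool stalled = false;

	while (unlikely(too_many_isolated(pgdat, file, sc))) {
		/* Second pass through the loop: give up instead of looping. */
		if (stalled)
			return 0;

		/* wait a bit for the reclaimer. */
		schedule_timeout_interruptible(HZ/10);
		stalled = true;

		/* We are about to die and free our memory. Return now. */
		if (fatal_signal_pending(current))
			return SWAP_CLUSTER_MAX;
	}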


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-07-10  7:48 [PATCH] mm, vmscan: do not loop on too_many_isolated for ever Michal Hocko
@ 2017-07-10 13:16 ` Vlastimil Babka
  2017-07-10 13:58 ` Rik van Riel
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 41+ messages in thread
From: Vlastimil Babka @ 2017-07-10 13:16 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Mel Gorman, Tetsuo Handa, Rik van Riel, Johannes Weiner,
	linux-mm, LKML, Michal Hocko

On 07/10/2017 09:48 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Tetsuo Handa has reported [1][2][3]that direct reclaimers might get stuck
> in too_many_isolated loop basically for ever because the last few pages
> on the LRU lists are isolated by the kswapd which is stuck on fs locks
> when doing the pageout or slab reclaim. This in turn means that there is
> nobody to actually trigger the oom killer and the system is basically
> unusable.
> 
> too_many_isolated has been introduced by 35cd78156c49 ("vmscan: throttle
> direct reclaim when too many pages are isolated already") to prevent
> from pre-mature oom killer invocations because back then no reclaim
> progress could indeed trigger the OOM killer too early. But since the
> oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection")
> the allocation/reclaim retry loop considers all the reclaimable pages
> and throttles the allocation at that layer so we can loosen the direct
> reclaim throttling.
> 
> Make shrink_inactive_list loop over too_many_isolated bounded and returns
> immediately when the situation hasn't resolved after the first sleep.
> Replace congestion_wait by a simple schedule_timeout_interruptible because
> we are not really waiting on the IO congestion in this path.
> 
> Please note that this patch can theoretically cause the OOM killer to
> trigger earlier while there are many pages isolated for the reclaim
> which makes progress only very slowly. This would be obvious from the oom
> report as the number of isolated pages are printed there. If we ever hit
> this should_reclaim_retry should consider those numbers in the evaluation
> in one way or another.
> 
> [1] http://lkml.kernel.org/r/201602092349.ACG81273.OSVtMJQHLOFOFF@I-love.SAKURA.ne.jp
> [2] http://lkml.kernel.org/r/201702212335.DJB30777.JOFMHSFtVLQOOF@I-love.SAKURA.ne.jp
> [3] http://lkml.kernel.org/r/201706300914.CEH95859.FMQOLVFHJFtOOS@I-love.SAKURA.ne.jp
> 
> Acked-by: Mel Gorman <mgorman@suse.de>
> Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Let's hope there won't be premature OOM's then.

Acked-by: Vlastimil Babka <vbabka@suse.cz>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-07-10  7:48 [PATCH] mm, vmscan: do not loop on too_many_isolated for ever Michal Hocko
  2017-07-10 13:16 ` Vlastimil Babka
@ 2017-07-10 13:58 ` Rik van Riel
  2017-07-10 16:58   ` Johannes Weiner
  2017-07-19 22:20 ` Andrew Morton
  2017-07-20  1:54 ` Hugh Dickins
  3 siblings, 1 reply; 41+ messages in thread
From: Rik van Riel @ 2017-07-10 13:58 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Mel Gorman, Tetsuo Handa, Johannes Weiner, Vlastimil Babka,
	linux-mm, LKML, Michal Hocko

On Mon, 2017-07-10 at 09:48 +0200, Michal Hocko wrote:

> Johannes and Rik had some concerns that this could lead to premature
> OOM kills. I agree with them that we need a better throttling
> mechanism. Until now we didn't give the issue described above a high
> priority because it usually required a really insane workload to
> trigger. But it seems that the issue can be reproduced also without
> having an insane number of competing threads [3].

My worries stand, but let's fix the real observed bug, and not worry
too much about the theoretical bug for now.

Acked-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-07-10 13:58 ` Rik van Riel
@ 2017-07-10 16:58   ` Johannes Weiner
  2017-07-10 17:09     ` Michal Hocko
  0 siblings, 1 reply; 41+ messages in thread
From: Johannes Weiner @ 2017-07-10 16:58 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Michal Hocko, Andrew Morton, Mel Gorman, Tetsuo Handa,
	Vlastimil Babka, linux-mm, LKML, Michal Hocko

On Mon, Jul 10, 2017 at 09:58:03AM -0400, Rik van Riel wrote:
> On Mon, 2017-07-10 at 09:48 +0200, Michal Hocko wrote:
> 
> > Johannes and Rik had some concerns that this could lead to premature
> > OOM kills. I agree with them that we need a better throttling
> > mechanism. Until now we didn't give the issue described above a high
> > priority because it usually required a really insane workload to
> > trigger. But it seems that the issue can be reproduced also without
> > having an insane number of competing threads [3].
> 
> My worries stand, but lets fix the real observed bug, and not worry
> too much about the theoretical bug for now.
> 
> Acked-by: Rik van Riel <riel@redhat.com>

I agree with this.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-07-10 16:58   ` Johannes Weiner
@ 2017-07-10 17:09     ` Michal Hocko
  0 siblings, 0 replies; 41+ messages in thread
From: Michal Hocko @ 2017-07-10 17:09 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Rik van Riel, Andrew Morton, Mel Gorman, Tetsuo Handa,
	Vlastimil Babka, linux-mm, LKML

On Mon 10-07-17 12:58:59, Johannes Weiner wrote:
> On Mon, Jul 10, 2017 at 09:58:03AM -0400, Rik van Riel wrote:
> > On Mon, 2017-07-10 at 09:48 +0200, Michal Hocko wrote:
> > 
> > > Johannes and Rik had some concerns that this could lead to premature
> > > OOM kills. I agree with them that we need a better throttling
> > > mechanism. Until now we didn't give the issue described above a high
> > > priority because it usually required a really insane workload to
> > > trigger. But it seems that the issue can be reproduced also without
> > > having an insane number of competing threads [3].
> > 
> > My worries stand, but lets fix the real observed bug, and not worry
> > too much about the theoretical bug for now.
> > 
> > Acked-by: Rik van Riel <riel@redhat.com>
> 
> I agree with this.
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks to both of you. Just to make it clear: I really do want to
address the throttling problem properly long term. I do not have any
great ideas to be honest, and I am busy with other things, so it might
be quite some time before I come up with something.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-07-10  7:48 [PATCH] mm, vmscan: do not loop on too_many_isolated for ever Michal Hocko
  2017-07-10 13:16 ` Vlastimil Babka
  2017-07-10 13:58 ` Rik van Riel
@ 2017-07-19 22:20 ` Andrew Morton
  2017-07-20  6:56   ` Michal Hocko
  2017-07-20  1:54 ` Hugh Dickins
  3 siblings, 1 reply; 41+ messages in thread
From: Andrew Morton @ 2017-07-19 22:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mel Gorman, Tetsuo Handa, Rik van Riel, Johannes Weiner,
	Vlastimil Babka, linux-mm, LKML, Michal Hocko

On Mon, 10 Jul 2017 09:48:42 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> From: Michal Hocko <mhocko@suse.com>
> 
> Tetsuo Handa has reported [1][2][3]that direct reclaimers might get stuck
> in too_many_isolated loop basically for ever because the last few pages
> on the LRU lists are isolated by the kswapd which is stuck on fs locks
> when doing the pageout or slab reclaim. This in turn means that there is
> nobody to actually trigger the oom killer and the system is basically
> unusable.
> 
> too_many_isolated has been introduced by 35cd78156c49 ("vmscan: throttle
> direct reclaim when too many pages are isolated already") to prevent
> from pre-mature oom killer invocations because back then no reclaim
> progress could indeed trigger the OOM killer too early. But since the
> oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection")
> the allocation/reclaim retry loop considers all the reclaimable pages
> and throttles the allocation at that layer so we can loosen the direct
> reclaim throttling.
> 
> Make shrink_inactive_list loop over too_many_isolated bounded and returns
> immediately when the situation hasn't resolved after the first sleep.
> Replace congestion_wait by a simple schedule_timeout_interruptible because
> we are not really waiting on the IO congestion in this path.
> 
> Please note that this patch can theoretically cause the OOM killer to
> trigger earlier while there are many pages isolated for the reclaim
> which makes progress only very slowly. This would be obvious from the oom
> report as the number of isolated pages are printed there. If we ever hit
> this should_reclaim_retry should consider those numbers in the evaluation
> in one way or another.

Need to figure out which kernels to patch.  Maybe just 4.13-rc after a
week or two?

> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1713,9 +1713,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  	int file = is_file_lru(lru);
>  	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>  	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> +	bool stalled = false;
>  
>  	while (unlikely(too_many_isolated(pgdat, file, sc))) {
> -		congestion_wait(BLK_RW_ASYNC, HZ/10);
> +		if (stalled)
> +			return 0;
> +
> +		/* wait a bit for the reclaimer. */
> +		schedule_timeout_interruptible(HZ/10);

a) if this task has signal_pending(), this falls straight through
   and I suspect the code breaks?

b) replacing congestion_wait() with schedule_timeout_interruptible()
   means this task no longer contributes to load average here and it's
   a (slightly) user-visible change.

c) msleep_interruptible() is nicer

d) IOW, methinks we should be using msleep() here?

> +		stalled = true;
>  
>  		/* We are about to die and free our memory. Return now. */
>  		if (fatal_signal_pending(current))

(Gets distracted by the thought that we should do
s/msleep/msleep_uninterruptible/g) 
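
For illustration, point d) would amount to something like the following in
the loop above (a sketch only, not the actual follow-up patch; msleep()
sleeps in TASK_UNINTERRUPTIBLE state, so it ignores pending signals and
still contributes to load average, and HZ/10 jiffies is 100ms):

	while (unlikely(too_many_isolated(pgdat, file, sc))) {
		if (stalled)
			return 0;

		/* wait a bit for the reclaimer. */
		msleep(100);
		stalled = true;

		/* We are about to die and free our memory. Return now. */
		if (fatal_signal_pending(current))
			return SWAP_CLUSTER_MAX;
	}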


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-07-10  7:48 [PATCH] mm, vmscan: do not loop on too_many_isolated for ever Michal Hocko
                   ` (2 preceding siblings ...)
  2017-07-19 22:20 ` Andrew Morton
@ 2017-07-20  1:54 ` Hugh Dickins
  2017-07-20 10:44   ` Tetsuo Handa
  2017-07-20 13:22   ` Michal Hocko
  3 siblings, 2 replies; 41+ messages in thread
From: Hugh Dickins @ 2017-07-20  1:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Rik van Riel,
	Johannes Weiner, Vlastimil Babka, linux-mm, LKML, Michal Hocko

On Mon, 10 Jul 2017, Michal Hocko wrote:

> From: Michal Hocko <mhocko@suse.com>
> 
> Tetsuo Handa has reported [1][2][3]that direct reclaimers might get stuck
> in too_many_isolated loop basically for ever because the last few pages
> on the LRU lists are isolated by the kswapd which is stuck on fs locks
> when doing the pageout or slab reclaim. This in turn means that there is
> nobody to actually trigger the oom killer and the system is basically
> unusable.
> 
> too_many_isolated has been introduced by 35cd78156c49 ("vmscan: throttle
> direct reclaim when too many pages are isolated already") to prevent
> from pre-mature oom killer invocations because back then no reclaim
> progress could indeed trigger the OOM killer too early. But since the
> oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection")
> the allocation/reclaim retry loop considers all the reclaimable pages
> and throttles the allocation at that layer so we can loosen the direct
> reclaim throttling.
> 
> Make shrink_inactive_list loop over too_many_isolated bounded and returns
> immediately when the situation hasn't resolved after the first sleep.
> Replace congestion_wait by a simple schedule_timeout_interruptible because
> we are not really waiting on the IO congestion in this path.
> 
> Please note that this patch can theoretically cause the OOM killer to
> trigger earlier while there are many pages isolated for the reclaim
> which makes progress only very slowly. This would be obvious from the oom
> report as the number of isolated pages are printed there. If we ever hit
> this should_reclaim_retry should consider those numbers in the evaluation
> in one way or another.
> 
> [1] http://lkml.kernel.org/r/201602092349.ACG81273.OSVtMJQHLOFOFF@I-love.SAKURA.ne.jp
> [2] http://lkml.kernel.org/r/201702212335.DJB30777.JOFMHSFtVLQOOF@I-love.SAKURA.ne.jp
> [3] http://lkml.kernel.org/r/201706300914.CEH95859.FMQOLVFHJFtOOS@I-love.SAKURA.ne.jp
> 
> Acked-by: Mel Gorman <mgorman@suse.de>
> Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
> Hi,
> I am resubmitting this patch previously sent here
> http://lkml.kernel.org/r/20170307133057.26182-1-mhocko@kernel.org.
> 
> Johannes and Rik had some concerns that this could lead to premature
> OOM kills. I agree with them that we need a better throttling
> mechanism. Until now we didn't give the issue described above a high
> priority because it usually required a really insane workload to
> trigger. But it seems that the issue can be reproduced also without
> having an insane number of competing threads [3].
> 
> Moreover, the issue also triggers very often while testing heavy memory
> pressure and so prevents further development of hardening of that area
> (http://lkml.kernel.org/r/201707061948.ICJ18763.tVFOQFOHMJFSLO@I-love.SAKURA.ne.jp).
> Tetsuo hasn't seen any negative effect of this patch in his oom stress
> tests so I think we should go with this simple patch for now and think
> about something more robust long term.
> 
> That being said I suggest merging this (after spending the full release
> cycle in linux-next) for the time being until we come up with a more
> clever solution.
> 
>  mm/vmscan.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c15b2e4c47ca..4ae069060ae5 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1713,9 +1713,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  	int file = is_file_lru(lru);
>  	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>  	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> +	bool stalled = false;
>  
>  	while (unlikely(too_many_isolated(pgdat, file, sc))) {
> -		congestion_wait(BLK_RW_ASYNC, HZ/10);
> +		if (stalled)
> +			return 0;
> +
> +		/* wait a bit for the reclaimer. */
> +		schedule_timeout_interruptible(HZ/10);
> +		stalled = true;
>  
>  		/* We are about to die and free our memory. Return now. */
>  		if (fatal_signal_pending(current))
> -- 

You probably won't welcome getting into alternatives at this late stage;
but after hacking around it one way or another because of its pointless
lockups, I lost patience with that too_many_isolated() loop a few months
back (on realizing the enormous number of pages that may be isolated via
migrate_pages(2)), and we've been running nicely since with something like:

	bool got_mutex = false;

	if (unlikely(too_many_isolated(pgdat, file, sc))) {
		if (mutex_lock_killable(&pgdat->too_many_isolated))
			return SWAP_CLUSTER_MAX;
		got_mutex = true;
	}
	...
	if (got_mutex)
		mutex_unlock(&pgdat->too_many_isolated);

Using a mutex to provide the intended throttling, without an infinite
loop or an arbitrary delay; and without having to worry (as we often did)
about whether those numbers in too_many_isolated() are really appropriate.
No premature OOMs complained of yet.
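
For completeness, such an approach would also need a per-node mutex to live
somewhere; a minimal sketch of the supporting pieces (the field name and the
init site are assumptions here, not part of Hugh's preview) might be:

	/* include/linux/mmzone.h */
	typedef struct pglist_data {
		...
		/* throttles direct reclaimers when too many pages are isolated */
		struct mutex too_many_isolated;
		...
	} pg_data_t;

	/* node init path in mm/page_alloc.c, e.g. free_area_init_core() */
	mutex_init(&pgdat->too_many_isolated);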

But that was on a different kernel, and there I did have to make sure
that PF_MEMALLOC always prevented us from nesting: I'm not certain of
that in the current kernel (but do remember Johannes changing the memcg
end to make it use PF_MEMALLOC too).  I offer the preview above, to see
if you're interested in that alternative: if you are, then I'll go ahead
and make it into an actual patch against v4.13-rc.

Hugh


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-07-19 22:20 ` Andrew Morton
@ 2017-07-20  6:56   ` Michal Hocko
  2017-07-21 23:01     ` Andrew Morton
  0 siblings, 1 reply; 41+ messages in thread
From: Michal Hocko @ 2017-07-20  6:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Tetsuo Handa, Rik van Riel, Johannes Weiner,
	Vlastimil Babka, linux-mm, LKML

On Wed 19-07-17 15:20:14, Andrew Morton wrote:
> On Mon, 10 Jul 2017 09:48:42 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > Tetsuo Handa has reported [1][2][3]that direct reclaimers might get stuck
> > in too_many_isolated loop basically for ever because the last few pages
> > on the LRU lists are isolated by the kswapd which is stuck on fs locks
> > when doing the pageout or slab reclaim. This in turn means that there is
> > nobody to actually trigger the oom killer and the system is basically
> > unusable.
> > 
> > too_many_isolated has been introduced by 35cd78156c49 ("vmscan: throttle
> > direct reclaim when too many pages are isolated already") to prevent
> > from pre-mature oom killer invocations because back then no reclaim
> > progress could indeed trigger the OOM killer too early. But since the
> > oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection")
> > the allocation/reclaim retry loop considers all the reclaimable pages
> > and throttles the allocation at that layer so we can loosen the direct
> > reclaim throttling.
> > 
> > Make shrink_inactive_list loop over too_many_isolated bounded and returns
> > immediately when the situation hasn't resolved after the first sleep.
> > Replace congestion_wait by a simple schedule_timeout_interruptible because
> > we are not really waiting on the IO congestion in this path.
> > 
> > Please note that this patch can theoretically cause the OOM killer to
> > trigger earlier while there are many pages isolated for the reclaim
> > which makes progress only very slowly. This would be obvious from the oom
> > report as the number of isolated pages are printed there. If we ever hit
> > this should_reclaim_retry should consider those numbers in the evaluation
> > in one way or another.
> 
> Need to figure out which kernels to patch.  Maybe just 4.13-rc after a
> week or two?

I do not think we need to rush it and the next merge window should be
just OK.

> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1713,9 +1713,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> >  	int file = is_file_lru(lru);
> >  	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> >  	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> > +	bool stalled = false;
> >  
> >  	while (unlikely(too_many_isolated(pgdat, file, sc))) {
> > -		congestion_wait(BLK_RW_ASYNC, HZ/10);
> > +		if (stalled)
> > +			return 0;
> > +
> > +		/* wait a bit for the reclaimer. */
> > +		schedule_timeout_interruptible(HZ/10);
> 
> a) if this task has signal_pending(), this falls straight through
>    and I suspect the code breaks?

It will not break. It will return to the allocation path more quickly
but no over-reclaim will happen and it will/should get throttled there.
So nothing critical.

> b) replacing congestion_wait() with schedule_timeout_interruptible()
>    means this task no longer contributes to load average here and it's
>    a (slightly) user-visible change.

you are right. I am not sure it matters but it might be visible.
 
> c) msleep_interruptible() is nicer
> 
> d) IOW, methinks we should be using msleep() here?

OK, I do not have objections. Are you going to squash this in or want a
separate patch explaining all the above?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-07-20  1:54 ` Hugh Dickins
@ 2017-07-20 10:44   ` Tetsuo Handa
  2017-07-24  7:01     ` Hugh Dickins
  2017-07-20 13:22   ` Michal Hocko
  1 sibling, 1 reply; 41+ messages in thread
From: Tetsuo Handa @ 2017-07-20 10:44 UTC (permalink / raw)
  To: hughd, mhocko
  Cc: akpm, mgorman, riel, hannes, vbabka, linux-mm, linux-kernel, mhocko

Hugh Dickins wrote:
> You probably won't welcome getting into alternatives at this late stage;
> but after hacking around it one way or another because of its pointless
> lockups, I lost patience with that too_many_isolated() loop a few months
> back (on realizing the enormous number of pages that may be isolated via
> migrate_pages(2)), and we've been running nicely since with something like:
> 
> 	bool got_mutex = false;
> 
> 	if (unlikely(too_many_isolated(pgdat, file, sc))) {
> 		if (mutex_lock_killable(&pgdat->too_many_isolated))
> 			return SWAP_CLUSTER_MAX;
> 		got_mutex = true;
> 	}
> 	...
> 	if (got_mutex)
> 		mutex_unlock(&pgdat->too_many_isolated);
> 
> Using a mutex to provide the intended throttling, without an infinite
> loop or an arbitrary delay; and without having to worry (as we often did)
> about whether those numbers in too_many_isolated() are really appropriate.
> No premature OOMs complained of yet.

Roughly speaking, there is a moment where shrink_inactive_list() acts
like the code below.

	bool got_mutex = false;

	if (!current_is_kswapd()) {
		if (mutex_lock_killable(&pgdat->too_many_isolated))
			return SWAP_CLUSTER_MAX;
		got_mutex = true;
	}

	// kswapd is blocked here waiting for !current_is_kswapd().

	if (got_mutex)
		mutex_unlock(&pgdat->too_many_isolated);

> 
> But that was on a different kernel, and there I did have to make sure
> that PF_MEMALLOC always prevented us from nesting: I'm not certain of
> that in the current kernel (but do remember Johannes changing the memcg
> end to make it use PF_MEMALLOC too).  I offer the preview above, to see
> if you're interested in that alternative: if you are, then I'll go ahead
> and make it into an actual patch against v4.13-rc.

I don't know what your actual patch looks like, but the problem is that
pgdat->too_many_isolated waits for kswapd while kswapd waits for
pgdat->too_many_isolated; nobody can unlock pgdat->too_many_isolated
once we hit it.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-07-20  1:54 ` Hugh Dickins
  2017-07-20 10:44   ` Tetsuo Handa
@ 2017-07-20 13:22   ` Michal Hocko
  2017-07-24  7:03     ` Hugh Dickins
  1 sibling, 1 reply; 41+ messages in thread
From: Michal Hocko @ 2017-07-20 13:22 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Rik van Riel,
	Johannes Weiner, Vlastimil Babka, linux-mm, LKML

On Wed 19-07-17 18:54:40, Hugh Dickins wrote:
[...]
> You probably won't welcome getting into alternatives at this late stage;
> but after hacking around it one way or another because of its pointless
> lockups, I lost patience with that too_many_isolated() loop a few months
> back (on realizing the enormous number of pages that may be isolated via
> migrate_pages(2)), and we've been running nicely since with something like:
> 
> 	bool got_mutex = false;
> 
> 	if (unlikely(too_many_isolated(pgdat, file, sc))) {
> 		if (mutex_lock_killable(&pgdat->too_many_isolated))
> 			return SWAP_CLUSTER_MAX;
> 		got_mutex = true;
> 	}
> 	...
> 	if (got_mutex)
> 		mutex_unlock(&pgdat->too_many_isolated);
> 
> Using a mutex to provide the intended throttling, without an infinite
> loop or an arbitrary delay; and without having to worry (as we often did)
> about whether those numbers in too_many_isolated() are really appropriate.
> No premature OOMs complained of yet.
> 
> But that was on a different kernel, and there I did have to make sure
> that PF_MEMALLOC always prevented us from nesting: I'm not certain of
> that in the current kernel (but do remember Johannes changing the memcg
> end to make it use PF_MEMALLOC too).  I offer the preview above, to see
> if you're interested in that alternative: if you are, then I'll go ahead
> and make it into an actual patch against v4.13-rc.

I would rather get rid of any additional locking here; my ultimate
goal is to do the throttling at the page allocator layer rather than
inside the reclaim path.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-07-20  6:56   ` Michal Hocko
@ 2017-07-21 23:01     ` Andrew Morton
  2017-07-24  6:50       ` Michal Hocko
  0 siblings, 1 reply; 41+ messages in thread
From: Andrew Morton @ 2017-07-21 23:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mel Gorman, Tetsuo Handa, Rik van Riel, Johannes Weiner,
	Vlastimil Babka, linux-mm, LKML

On Thu, 20 Jul 2017 08:56:26 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -1713,9 +1713,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> > >  	int file = is_file_lru(lru);
> > >  	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> > >  	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> > > +	bool stalled = false;
> > >  
> > >  	while (unlikely(too_many_isolated(pgdat, file, sc))) {
> > > -		congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > +		if (stalled)
> > > +			return 0;
> > > +
> > > +		/* wait a bit for the reclaimer. */
> > > +		schedule_timeout_interruptible(HZ/10);
> > 
> > a) if this task has signal_pending(), this falls straight through
> >    and I suspect the code breaks?
> 
> It will not break. It will return to the allocation path more quickly
> but no over-reclaim will happen and it will/should get throttled there.
> So nothing critical.
> 
> > b) replacing congestion_wait() with schedule_timeout_interruptible()
> >    means this task no longer contributes to load average here and it's
> >    a (slightly) user-visible change.
> 
> you are right. I am not sure it matters but it might be visible.
>  
> > c) msleep_interruptible() is nicer
> > 
> > d) IOW, methinks we should be using msleep() here?
> 
> OK, I do not have objections. Are you going to squash this in or want a
> separate patch explaining all the above?

I'd prefer to have a comment explaining why interruptible sleep is
being used, because that "what if signal_pending()" case is rather a
red flag.

Is it the case that fall-through-if-signal_pending() is the
*preferred* behaviour?  If so, the comment should explain this.  If it
isn't the preferred behaviour then using uninterruptible sleep sounds
better to me, if only because it saves us from having to test a rather
tricky and rare case.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-07-21 23:01     ` Andrew Morton
@ 2017-07-24  6:50       ` Michal Hocko
  0 siblings, 0 replies; 41+ messages in thread
From: Michal Hocko @ 2017-07-24  6:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Tetsuo Handa, Rik van Riel, Johannes Weiner,
	Vlastimil Babka, linux-mm, LKML

On Fri 21-07-17 16:01:04, Andrew Morton wrote:
> On Thu, 20 Jul 2017 08:56:26 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -1713,9 +1713,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> > > >  	int file = is_file_lru(lru);
> > > >  	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> > > >  	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> > > > +	bool stalled = false;
> > > >  
> > > >  	while (unlikely(too_many_isolated(pgdat, file, sc))) {
> > > > -		congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > > +		if (stalled)
> > > > +			return 0;
> > > > +
> > > > +		/* wait a bit for the reclaimer. */
> > > > +		schedule_timeout_interruptible(HZ/10);
> > > 
> > > a) if this task has signal_pending(), this falls straight through
> > >    and I suspect the code breaks?
> > 
> > It will not break. It will return to the allocation path more quickly
> > but no over-reclaim will happen and it will/should get throttled there.
> > So nothing critical.
> > 
> > > b) replacing congestion_wait() with schedule_timeout_interruptible()
> > >    means this task no longer contributes to load average here and it's
> > >    a (slightly) user-visible change.
> > 
> > you are right. I am not sure it matters but it might be visible.
> >  
> > > c) msleep_interruptible() is nicer
> > > 
> > > d) IOW, methinks we should be using msleep() here?
> > 
> > OK, I do not have objections. Are you going to squash this in or want a
> > separate patch explaining all the above?
> 
> I'd prefer to have a comment explaining why interruptible sleep is
> being used, because that "what if signal_pending()" case is rather a
> red flag.

I didn't really consider interruptible vs. uninterruptible sleep so it
wasn't really a deliberate decision. Now that you have brought up the
above points I am OK with changing it to uninterruptible.

Here is a fix up. I am fine with this either folded in or as a separate
patch.
---

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-07-20 10:44   ` Tetsuo Handa
@ 2017-07-24  7:01     ` Hugh Dickins
  2017-07-24 11:12       ` Tetsuo Handa
  0 siblings, 1 reply; 41+ messages in thread
From: Hugh Dickins @ 2017-07-24  7:01 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: hughd, mhocko, akpm, mgorman, riel, hannes, vbabka, linux-mm,
	linux-kernel, mhocko

On Thu, 20 Jul 2017, Tetsuo Handa wrote:
> Hugh Dickins wrote:
> > You probably won't welcome getting into alternatives at this late stage;
> > but after hacking around it one way or another because of its pointless
> > lockups, I lost patience with that too_many_isolated() loop a few months
> > back (on realizing the enormous number of pages that may be isolated via
> > migrate_pages(2)), and we've been running nicely since with something like:
> > 
> > 	bool got_mutex = false;
> > 
> > 	if (unlikely(too_many_isolated(pgdat, file, sc))) {
> > 		if (mutex_lock_killable(&pgdat->too_many_isolated))
> > 			return SWAP_CLUSTER_MAX;
> > 		got_mutex = true;
> > 	}
> > 	...
> > 	if (got_mutex)
> > 		mutex_unlock(&pgdat->too_many_isolated);
> > 
> > Using a mutex to provide the intended throttling, without an infinite
> > loop or an arbitrary delay; and without having to worry (as we often did)
> > about whether those numbers in too_many_isolated() are really appropriate.
> > No premature OOMs complained of yet.
> 
> Roughly speaking, there is a moment where shrink_inactive_list() acts
> like below.
> 
> 	bool got_mutex = false;
> 
> 	if (!current_is_kswapd()) {
> 		if (mutex_lock_killable(&pgdat->too_many_isolated))
> 			return SWAP_CLUSTER_MAX;
> 		got_mutex = true;
> 	}
> 
> 	// kswapd is blocked here waiting for !current_is_kswapd().

That would be a shame, for kswapd to wait for !current_is_kswapd()!

But seriously, I think I understand what you mean by that, you're
thinking that kswapd would be waiting on some other task to clear
the too_many_isolated() condition?

No, it does not work that way: kswapd (never seeing too_many_isolated()
because that always says false when current_is_kswapd()) never tries to
take the pgdat->too_many_isolated mutex itself: it does not wait there
at all, although other tasks may be waiting there at the time.
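
(For reference, that is because too_many_isolated() in mm/vmscan.c bails out
early for kswapd; roughly:

	static int too_many_isolated(struct pglist_data *pgdat, int file,
			struct scan_control *sc)
	{
		unsigned long inactive, isolated;

		if (current_is_kswapd())
			return 0;
		...
		return isolated > inactive;
	}

so kswapd never even reaches the throttling path, whether that is the
existing congestion_wait() loop or the proposed mutex.)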

Perhaps my naming the mutex "too_many_isolated", same as the function,
is actually confusing, when I had intended it to be helpful.

> 
> 	if (got_mutex)
> 		mutex_unlock(&pgdat->too_many_isolated);
> 
> > 
> > But that was on a different kernel, and there I did have to make sure
> > that PF_MEMALLOC always prevented us from nesting: I'm not certain of
> > that in the current kernel (but do remember Johannes changing the memcg
> > end to make it use PF_MEMALLOC too).  I offer the preview above, to see
> > if you're interested in that alternative: if you are, then I'll go ahead
> > and make it into an actual patch against v4.13-rc.
> 
> I don't know what your actual patch looks like, but the problem is that
> pgdat->too_many_isolated waits for kswapd while kswapd waits for
> pgdat->too_many_isolated; nobody can unlock pgdat->too_many_isolated if
> once we hit it.

Not so (and we'd hardly be finding it a useful patch if that were so).

Hugh


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-07-20 13:22   ` Michal Hocko
@ 2017-07-24  7:03     ` Hugh Dickins
  0 siblings, 0 replies; 41+ messages in thread
From: Hugh Dickins @ 2017-07-24  7:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Andrew Morton, Mel Gorman, Tetsuo Handa,
	Rik van Riel, Johannes Weiner, Vlastimil Babka, linux-mm, LKML

On Thu, 20 Jul 2017, Michal Hocko wrote:
> On Wed 19-07-17 18:54:40, Hugh Dickins wrote:
> [...]
> > You probably won't welcome getting into alternatives at this late stage;
> > but after hacking around it one way or another because of its pointless
> > lockups, I lost patience with that too_many_isolated() loop a few months
> > back (on realizing the enormous number of pages that may be isolated via
> > migrate_pages(2)), and we've been running nicely since with something like:
> > 
> > 	bool got_mutex = false;
> > 
> > 	if (unlikely(too_many_isolated(pgdat, file, sc))) {
> > 		if (mutex_lock_killable(&pgdat->too_many_isolated))
> > 			return SWAP_CLUSTER_MAX;
> > 		got_mutex = true;
> > 	}
> > 	...
> > 	if (got_mutex)
> > 		mutex_unlock(&pgdat->too_many_isolated);
> > 
> > Using a mutex to provide the intended throttling, without an infinite
> > loop or an arbitrary delay; and without having to worry (as we often did)
> > about whether those numbers in too_many_isolated() are really appropriate.
> > No premature OOMs complained of yet.
> > 
> > But that was on a different kernel, and there I did have to make sure
> > that PF_MEMALLOC always prevented us from nesting: I'm not certain of
> > that in the current kernel (but do remember Johannes changing the memcg
> > end to make it use PF_MEMALLOC too).  I offer the preview above, to see
> > if you're interested in that alternative: if you are, then I'll go ahead
> > and make it into an actual patch against v4.13-rc.
> 
> I would rather get rid of any additional locking here and my ultimate
> goal is to make throttling at the page allocator layer rather than
> inside the reclaim.

Fair enough, I'm certainly in no hurry to send the patch,
but thought it worth mentioning.

Hugh


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-07-24  7:01     ` Hugh Dickins
@ 2017-07-24 11:12       ` Tetsuo Handa
  0 siblings, 0 replies; 41+ messages in thread
From: Tetsuo Handa @ 2017-07-24 11:12 UTC (permalink / raw)
  To: hughd
  Cc: mhocko, akpm, mgorman, riel, hannes, vbabka, linux-mm,
	linux-kernel, mhocko

Hugh Dickins wrote:
> On Thu, 20 Jul 2017, Tetsuo Handa wrote:
> > Hugh Dickins wrote:
> > > You probably won't welcome getting into alternatives at this late stage;
> > > but after hacking around it one way or another because of its pointless
> > > lockups, I lost patience with that too_many_isolated() loop a few months
> > > back (on realizing the enormous number of pages that may be isolated via
> > > migrate_pages(2)), and we've been running nicely since with something like:
> > > 
> > > 	bool got_mutex = false;
> > > 
> > > 	if (unlikely(too_many_isolated(pgdat, file, sc))) {
> > > 		if (mutex_lock_killable(&pgdat->too_many_isolated))
> > > 			return SWAP_CLUSTER_MAX;
> > > 		got_mutex = true;
> > > 	}
> > > 	...
> > > 	if (got_mutex)
> > > 		mutex_unlock(&pgdat->too_many_isolated);
> > > 
> > > Using a mutex to provide the intended throttling, without an infinite
> > > loop or an arbitrary delay; and without having to worry (as we often did)
> > > about whether those numbers in too_many_isolated() are really appropriate.
> > > No premature OOMs complained of yet.
> > 
> > Roughly speaking, there is a moment where shrink_inactive_list() acts
> > like below.
> > 
> > 	bool got_mutex = false;
> > 
> > 	if (!current_is_kswapd()) {
> > 		if (mutex_lock_killable(&pgdat->too_many_isolated))
> > 			return SWAP_CLUSTER_MAX;
> > 		got_mutex = true;
> > 	}
> > 
> > 	// kswapd is blocked here waiting for !current_is_kswapd().
> 
> That would be a shame, for kswapd to wait for !current_is_kswapd()!

Yes, but the current code (independent of your patch) does allow kswapd to
wait for memory allocations of !current_is_kswapd() threads to complete.

> 
> But seriously, I think I understand what you mean by that, you're
> thinking that kswapd would be waiting on some other task to clear
> the too_many_isolated() condition?

Yes.

> 
> No, it does not work that way: kswapd (never seeing too_many_isolated()
> because that always says false when current_is_kswapd()) never tries to
> take the pgdat->too_many_isolated mutex itself: it does not wait there
> at all, although other tasks may be waiting there at the time.

I know. I wrote out the behavior of your patch assuming my guess (that your
"..." part corresponds to kswapd doing writepage) is correct.

> 
> Perhaps my naming the mutex "too_many_isolated", same as the function,
> is actually confusing, when I had intended it to be helpful.

Not confusing at all. It is helpful.
I just wanted to confirm what comes in your "..." part.

> 
> > 
> > 	if (got_mutex)
> > 		mutex_unlock(&pgdat->too_many_isolated);
> > 
> > > 
> > > But that was on a different kernel, and there I did have to make sure
> > > that PF_MEMALLOC always prevented us from nesting: I'm not certain of
> > > that in the current kernel (but do remember Johannes changing the memcg
> > > end to make it use PF_MEMALLOC too).  I offer the preview above, to see
> > > if you're interested in that alternative: if you are, then I'll go ahead
> > > and make it into an actual patch against v4.13-rc.
> > 
> > I don't know what your actual patch looks like, but the problem is that
> > pgdat->too_many_isolated waits for kswapd while kswapd waits for
> > pgdat->too_many_isolated; nobody can unlock pgdat->too_many_isolated if
> > once we hit it.
> 
> Not so (and we'd hardly be finding it a useful patch if that were so).

The current code allows kswapd to wait for memory allocations of !current_is_kswapd()
threads, and thus !current_is_kswapd() threads wait for current_is_kswapd() threads
while current_is_kswapd() threads wait for !current_is_kswapd() threads; nobody can
make too_many_isolated() false once we hit it. Hence this patch is proposed.

Thanks.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-07-05  8:20                   ` Michal Hocko
@ 2017-07-06 10:48                     ` Tetsuo Handa
  0 siblings, 0 replies; 41+ messages in thread
From: Tetsuo Handa @ 2017-07-06 10:48 UTC (permalink / raw)
  To: mhocko; +Cc: hannes, riel, akpm, mgorman, vbabka, linux-mm, linux-kernel

Michal Hocko wrote:
> On Sat 01-07-17 20:43:56, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> [...]
> > > It is really hard to pursue this half solution when there is no clear
> > > indication it helps in your testing. So could you try to test with only
> > > this patch on top of the current linux-next tree (or Linus tree) and see
> > > if you can reproduce the problem?
> > 
> > With this patch on top of next-20170630, I no longer hit this problem.
> > (Of course, this is because this patch eliminates the infinite loop.)
> 
> I assume you haven't seen other negative side effects, like unexpected
> OOMs etc... Are you willing to give your Tested-by?

I didn't see other negative side effects.

Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

We need a long time for testing this patch in linux-next.git (and I give up
this handy bug for finding other bugs under near-OOM situations).


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-07-01 11:43                 ` Tetsuo Handa
  2017-07-05  8:19                   ` Michal Hocko
@ 2017-07-05  8:20                   ` Michal Hocko
  2017-07-06 10:48                     ` Tetsuo Handa
  1 sibling, 1 reply; 41+ messages in thread
From: Michal Hocko @ 2017-07-05  8:20 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: hannes, riel, akpm, mgorman, vbabka, linux-mm, linux-kernel

On Sat 01-07-17 20:43:56, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > It is really hard to pursue this half solution when there is no clear
> > indication it helps in your testing. So could you try to test with only
> > this patch on top of the current linux-next tree (or Linus tree) and see
> > if you can reproduce the problem?
> 
> With this patch on top of next-20170630, I no longer hit this problem.
> (Of course, this is because this patch eliminates the infinite loop.)

I assume you haven't seen other negative side effects, like unexpected
OOMs etc... Are you willing to give your Tested-by?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-07-01 11:43                 ` Tetsuo Handa
@ 2017-07-05  8:19                   ` Michal Hocko
  2017-07-05  8:20                   ` Michal Hocko
  1 sibling, 0 replies; 41+ messages in thread
From: Michal Hocko @ 2017-07-05  8:19 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: hannes, riel, akpm, mgorman, vbabka, linux-mm, linux-kernel

[this is getting tangential again and I will not respond any further if
this turns into yet another flame]

On Sat 01-07-17 20:43:56, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > I really do appreciate your testing because it uncovers corner cases
> > most people do not test for and we can actually make the code better in
> > the end.
> 
> That statement does not get to my heart at all. Collision between your
> approach and my approach is wasting both your time and my time.
> 
> I've reported this too_many_isolated() trap three years ago at
> http://lkml.kernel.org/r/201407022140.BFJ13092.QVOSJtFMFHLOFO@I-love.SAKURA.ne.jp .
> Do you know that we already wasted 3 years without any attention?

And how many real bugs have we seen in those three years? Well, zero
AFAIR, except for your corner case testing. So while I never dismissed
the problem, I've been saying this is not that trivial to fix, as my
attempt to address it and the review feedback I've received show.

> You are rejecting serialization under OOM without giving a chance to test
> side effects of serialization under OOM at linux-next.git. I call such attitude
> "speculation" which you never accept.

No, I am rejecting abusing the lock for a purpose it is not aimed for.

> Look at mem_cgroup_out_of_memory(). Memcg OOM does use serialization.
> In the first place, if the system is under global OOM (which is more
> serious situation than memcg OOM), delay caused by serialization will not
> matter. Rather, I consider that making sure that the system does not get
> locked up is more important. I'm reporting that serialization helps
> facilitating the OOM killer/reaper operations, avoiding lockups, and
> solving global OOM situation smoothly. But you are refusing my report without
> giving a chance to test what side effects will pop up at linux-next.git.

You are mixing oranges with apples here. We do synchronize the memcg oom
killer the same way as the global one.

> Knowledge about OOM situation is hardly shared among Linux developers and users,
> and is far from object of concern. Like shown by cgroup-aware OOM killer proposal,
> what will happen if we restrict 0 <= oom_victims <= 1 is not shared among developers.
> 
> How many developers joined to my OOM watchdog proposal? Every time and ever it is
> confrontation between you and me. You, as effectively the only participant, are
> showing negative attitude is effectively Nacked-by: response without alternative
> proposal.

This is something all of us have to fight with. There are only so many
MM developers. You have to justify your changes in order to attract other
developers/users. You are basing your changes on speculation and what-ifs
for workloads that most developers already consider borderline or
misconfigured.
 
> Not everybody can afford testing with absolutely latest upstream kernels.
> Not prepared to obtain information for analysis using distributor kernels makes
> it impossible to compare whether user's problems are already fixed in upstream
> kernels, makes it impossible to identify patches which needs to be backported to
> distributor kernels, and is bad for customers using distributor kernels. Of course,
> it is possible that distributors decide not to allow users to obtain information
> for analysis, but such decision cannot become a reason we can not prepare to obtain
> information for analysis at upstream kernels.

If you have to work with distribution kernels then talk to distribution
people. It is that simple. You are surely not using those systems just
because of a fancy logo...
 
[...]

> > this way of pushing your patch is really annoying. Please do realize
> > that repeating the same thing all around will not make a patch more
> > likely to merge. You have proposed something, nobody has nacked it
> > so it waits for people to actually find it important enough to justify
> > the additional code. So please stop this.
> 
> When will people find time to judge it? We already wasted three years, and
> knowledge about OOM situation is hardly shared among Linux developers and users,
> and will unlikely be object of concern. How many years (or decades) will we waste
> more? MM subsystem will change meanwhile and we will just ignore old kernels.
> 
> If you do want me to stop bringing watchdog here and there, please do show
> alternative approach which I can tolerate. If you cannot afford it, please allow
> me to involve people (e.g. you make calls for joining to my proposals because
> you are asking me to wait until people find time to judge it).
> Please do realize that just repeatedly saying "wait patiently" helps nothing.

You really have to realize that there will hardly be more interest in
your reports when they do not reflect real life situations. I have said
(several times) that those issues should be addressed eventually but
there are more pressing issues which do trigger in real life and they
take precedence.

Should we add a lot of code for something that doesn't bother many
users? I do not think so. As explained earlier (several times), this code
will have a maintenance cost and can also lead to other problems (false
positives etc.; just consider how easy it is to get false positive
lockup splats - I am facing reports of those very often on our
distribution kernels on large boxes).

I said I appreciate your testing regardless, and I really mean it. We
really want to have more robust out-of-memory handling long term. And
as you have surely noticed, quite some changes have been made in that
direction over the last few years. There are still many unaddressed
issues, no question about that. We do not have to jump at the first
approach we come up with for those, though. A cost/benefit evaluation
has to be done every time for each proposal. I am really not sure what
is so hard to understand about this.

[...]
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-06-30 16:19               ` Michal Hocko
@ 2017-07-01 11:43                 ` Tetsuo Handa
  2017-07-05  8:19                   ` Michal Hocko
  2017-07-05  8:20                   ` Michal Hocko
  0 siblings, 2 replies; 41+ messages in thread
From: Tetsuo Handa @ 2017-07-01 11:43 UTC (permalink / raw)
  To: mhocko; +Cc: hannes, riel, akpm, mgorman, vbabka, linux-mm, linux-kernel

Michal Hocko wrote:
> I really do appreciate your testing because it uncovers corner cases
> most people do not test for and we can actually make the code better in
> the end.

That statement does not reach me at all. The collision between your
approach and my approach is wasting both your time and mine.

I reported this too_many_isolated() trap three years ago at
http://lkml.kernel.org/r/201407022140.BFJ13092.QVOSJtFMFHLOFO@I-love.SAKURA.ne.jp .
Do you realize that we have already wasted three years without any attention?

You are rejecting serialization under OOM without giving its side effects a
chance to be tested in linux-next.git. I call such an attitude "speculation",
which is something you yourself never accept.

Look at mem_cgroup_out_of_memory(). Memcg OOM does use serialization.
In the first place, if the system is under global OOM (which is a more
serious situation than memcg OOM), the delay caused by serialization will not
matter. Rather, I consider making sure that the system does not get
locked up to be more important. I am reporting that serialization helps
facilitate the OOM killer/reaper operations, avoids lockups, and
resolves global OOM situations smoothly. But you are refusing my report without
giving it a chance to have its side effects tested in linux-next.git.
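
For illustration, serialization in this sense boils down to something like the
following minimal sketch, which reuses the existing oom_lock (the helper name
below is made up; this is only an illustration, not a patch being proposed here):

----------
#include <linux/mutex.h>
#include <linux/oom.h>

/*
 * Sketch only: every thread that wants to invoke the OOM killer queues
 * behind the global oom_lock, so a single invocation can make progress
 * while the other threads wait killably instead of spinning in reclaim.
 */
static bool oom_serialized_example(struct oom_control *oc)
{
	bool ret;

	if (mutex_lock_killable(&oom_lock))
		return true;	/* got a fatal signal while waiting; back off */

	ret = out_of_memory(oc);	/* only one caller at a time runs this */
	mutex_unlock(&oom_lock);
	return ret;
}
----------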

Knowledge about OOM situations is hardly shared among Linux developers and users,
and is far from being an object of concern. As shown by the cgroup-aware OOM killer
proposal, what will happen if we restrict 0 <= oom_victims <= 1 is not understood
among developers.

How many developers have joined my OOM watchdog proposal? Every single time it is a
confrontation between you and me. You, as effectively the only participant, showing
a negative attitude is effectively a Nacked-by: response without an alternative
proposal.

Not everybody can afford testing with the absolutely latest upstream kernels.
Not being prepared to obtain information for analysis from distributor kernels makes
it impossible to check whether a user's problem is already fixed in upstream
kernels, makes it impossible to identify patches which need to be backported to
distributor kernels, and is bad for customers using distributor kernels. Of course,
it is possible that distributors decide not to allow users to obtain information
for analysis, but such a decision cannot become a reason why we do not prepare to obtain
information for analysis in upstream kernels.

Suppose I take a step back and tolerate the burden of sitting in front of the console
24 hours a day, every day of the year, so that users can press SysRq when something
goes wrong; how nice it would be if all in-flight allocation requests were printed
upon SysRq. The fact that show_workqueue_state() is called upon SysRq-t is to some degree useful.
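
A very rough sketch of what such a dump could be built from is shown below; none
of these names exist in the kernel, they only illustrate the idea of recording
every allocation that enters the slowpath and printing the records from a SysRq
handler:

----------
#include <linux/gfp.h>
#include <linux/jiffies.h>
#include <linux/list.h>
#include <linux/printk.h>
#include <linux/sched.h>
#include <linux/spinlock.h>

struct alloc_inflight {
	struct list_head list;
	struct task_struct *task;
	gfp_t gfp_mask;
	unsigned int order;
	unsigned long since;
};

static LIST_HEAD(alloc_inflight_list);
static DEFINE_SPINLOCK(alloc_inflight_lock);

/* entry to the allocator slowpath; "rec" would live on the caller's stack */
static void alloc_inflight_add(struct alloc_inflight *rec, gfp_t gfp, unsigned int order)
{
	rec->task = current;
	rec->gfp_mask = gfp;
	rec->order = order;
	rec->since = jiffies;
	spin_lock(&alloc_inflight_lock);
	list_add_tail(&rec->list, &alloc_inflight_list);
	spin_unlock(&alloc_inflight_lock);
}

/* the allocation finished (either way); unlink the record */
static void alloc_inflight_del(struct alloc_inflight *rec)
{
	spin_lock(&alloc_inflight_lock);
	list_del(&rec->list);
	spin_unlock(&alloc_inflight_lock);
}

/* what a SysRq handler could print for every in-flight allocation */
static void alloc_inflight_show(void)
{
	struct alloc_inflight *rec;

	spin_lock(&alloc_inflight_lock);
	list_for_each_entry(rec, &alloc_inflight_list, list)
		pr_info("%s/%d gfp=%#x order=%u stalled for %ums\n",
			rec->task->comm, rec->task->pid,
			(unsigned int)rec->gfp_mask, rec->order,
			jiffies_to_msecs(jiffies - rec->since));
	spin_unlock(&alloc_inflight_lock);
}
----------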

In fact, my proposal was such an approach before I started serializing using a kernel thread
(e.g. http://lkml.kernel.org/r/201411231351.HJA17065.VHQSFOJFtLFOMO@I-love.SAKURA.ne.jp
which I proposed two and a half years ago). Though, while my proposal was left ignored,
I learned that showing only the current thread is not sufficient, and I updated my watchdog
to show other threads (e.g. kswapd) using serialization.

A patch at http://lkml.kernel.org/r/201505232339.DAB00557.VFFLHMSOJFOOtQ@I-love.SAKURA.ne.jp
which I posted two years ago also includes a proposal for handling the infinite
shrink_inactive_list() problem. After all, this shrink_inactive_list() problem was
ignored for three years without even getting a chance to be tested in linux-next.git.
Sigh...

I know my proposals might not be the best. But you cannot afford to show alternative proposals
because you are giving higher priority to other problems. And other developers cannot afford
to participate because they are not interested in this problem, or do not share knowledge of it.

My proposals do not constrain future kernels. We can revert them once they are
no longer needed. They are meaningful as an interim approach, but you never accept
approaches which do not match your will (or desire). Without even giving people a chance to
test what side effects will crop up, how can your "I really do appreciate your testing"
statement reach me?

My watchdog allows detecting problems which were previously overlooked unless an unrealistic
burden was accepted (e.g. standing by 24 hours a day, every day of the year). You ask people
to prove that a problem is an MM problem, but I am dissatisfied that you leave alone the very
proposals which would help to judge whether it is an MM problem.

> this way of pushing your patch is really annoying. Please do realize
> that repeating the same thing all around will not make a patch more
> likely to be merged. You have proposed something and nobody has nacked it,
> so it waits for people to actually find it important enough to justify
> the additional code. So please stop this.

When will people find time to judge it? We have already wasted three years;
knowledge about OOM situations is hardly shared among Linux developers and users,
and is unlikely to become an object of concern. How many more years (or decades)
will we waste? The MM subsystem will change in the meantime and we will just ignore old kernels.

If you do want me to stop bringing up the watchdog here and there, please show an
alternative approach which I can tolerate. If you cannot afford that, please allow
me to involve other people (e.g. by calling for participation in my proposals, since
you are asking me to wait until people find time to judge them).
Please do realize that just repeatedly saying "wait patiently" helps nothing.

> It is really hard to pursue this half solution when there is no clear
> indication it helps in your testing. So could you try to test with only
> this patch on top of the current linux-next tree (or Linus tree) and see
> if you can reproduce the problem?

With this patch on top of next-20170630, I no longer hit this problem.
(Of course, this is because this patch eliminates the infinite loop.)
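
For readers following the thread, the behaviour being tested here is roughly the
following bounded wait at the top of shrink_inactive_list() in mm/vmscan.c (a
simplified sketch, not the actual diff):

----------
	/*
	 * Sketch of the bounded wait: sleep once, and if the isolation
	 * pressure has not resolved by then, return so that the allocator
	 * level retry logic can decide what to do next, instead of
	 * looping on too_many_isolated() for ever.
	 */
	bool stalled = false;

	while (unlikely(too_many_isolated(pgdat, file, sc))) {
		if (stalled)
			return 0;	/* give up after the first sleep */

		/* we are not waiting on IO here, so no congestion_wait() */
		stalled = true;
		schedule_timeout_interruptible(HZ / 10);

		/* a fatal signal must still be able to kill the caller */
		if (fatal_signal_pending(current))
			return SWAP_CLUSTER_MAX;
	}
----------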

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-06-30 15:59             ` Tetsuo Handa
@ 2017-06-30 16:19               ` Michal Hocko
  2017-07-01 11:43                 ` Tetsuo Handa
  0 siblings, 1 reply; 41+ messages in thread
From: Michal Hocko @ 2017-06-30 16:19 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: hannes, riel, akpm, mgorman, vbabka, linux-mm, linux-kernel

On Sat 01-07-17 00:59:56, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 30-06-17 09:14:22, Tetsuo Handa wrote:
> > [...]
> > > Ping? Ping? When are we going to apply this patch or the watchdog patch?
> > > This problem occurs with not-so-insane stress like that shown below.
> > > I can't test a near-OOM situation because the test likely falls into either
> > > the printk() vs. oom_lock lockup problem or this too_many_isolated() problem.
> > 
> > So you are saying that the patch fixes this issue. Do I understand you
> > correctly? And you do not see any other negative side effects with it
> > applied?
> 
> I hit this problem using http://lkml.kernel.org/r/20170626130346.26314-1-mhocko@kernel.org
> on next-20170628. We won't be able to test whether the patch fixes this issue, without
> introducing other negative side effects, unless this patch is sent to linux-next.git.
> But at least we know that even if this patch is sent to linux-next.git, we will still see
> bugs like http://lkml.kernel.org/r/201703031948.CHJ81278.VOHSFFFOOLJQMt@I-love.SAKURA.ne.jp .

It is really hard to pursue this half solution when there is no clear
indication it helps in your testing. So could you try to test with only
this patch on top of the current linux-next tree (or Linus tree) and see
if you can reproduce the problem?

It is possible that there are other potential problems but we at least
need to know whether it is worth going with the patch now.
 
[...]
> > Rik, Johannes what do you think? Should we go with the simpler approach
> > for now and think of a better plan long term?
> 
> I am not in a hurry if we can check, using a watchdog, whether this problem is occurring
> in the real world. I have to test corner cases because the watchdog is missing.
> 
> The watchdog does not introduce negative side effects, will avoid soft lockups like
> http://lkml.kernel.org/r/CAM_iQpWuPVGc2ky8M-9yukECtS+zKjiDasNymX7rMcBjBFyM_A@mail.gmail.com ,
> will avoid console_unlock() vs. oom_lock mutex lockups due to warn_alloc(),
> and will catch similar bugs which people are failing to reproduce.

this way of pushing your patch is really annoying. Please do realize
that repeating the same thing all around will not make a patch more
likely to merge. You have proposed something, nobody has nacked it
so it waits for people to actually find it important enough to justify
the additional code. So please stop this.

I really do appreciate your testing because it uncovers corner cases
most people do not test for and we can actually make the code better in
the end.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-06-30 13:32           ` Michal Hocko
@ 2017-06-30 15:59             ` Tetsuo Handa
  2017-06-30 16:19               ` Michal Hocko
  0 siblings, 1 reply; 41+ messages in thread
From: Tetsuo Handa @ 2017-06-30 15:59 UTC (permalink / raw)
  To: mhocko; +Cc: hannes, riel, akpm, mgorman, vbabka, linux-mm, linux-kernel

Michal Hocko wrote:
> On Fri 30-06-17 09:14:22, Tetsuo Handa wrote:
> [...]
> > Ping? Ping? When are we going to apply this patch or the watchdog patch?
> > This problem occurs with not-so-insane stress like that shown below.
> > I can't test a near-OOM situation because the test likely falls into either
> > the printk() vs. oom_lock lockup problem or this too_many_isolated() problem.
> 
> So you are saying that the patch fixes this issue. Do I understand you
> correctly? And you do not see any other negative side effects with it
> applied?

I hit this problem using http://lkml.kernel.org/r/20170626130346.26314-1-mhocko@kernel.org
on next-20170628. We won't be able to test whether the patch fixes this issue, without
introducing other negative side effects, unless this patch is sent to linux-next.git.
But at least we know that even if this patch is sent to linux-next.git, we will still see
bugs like http://lkml.kernel.org/r/201703031948.CHJ81278.VOHSFFFOOLJQMt@I-love.SAKURA.ne.jp .

> 
> I am sorry I didn't have much time to think about feedback from Johannes
> yet. A more robust throttling method is surely due but also not trivial.
> So I am not sure how to proceed. It is true that your last test case
> with only 10 processes fighting resembles reality much better than
> the hundreds (AFAIR) that you were using previously.

Even if hundreds are running, most of them are simply blocked inside open()
at down_write() (like the example from serial-20170423-2.txt.xz shown below).
The actual number of processes fighting for memory is always less than 100.

 ? __schedule+0x1d2/0x5a0
 ? schedule+0x2d/0x80
 ? rwsem_down_write_failed+0x1f9/0x370
 ? walk_component+0x43/0x270
 ? call_rwsem_down_write_failed+0x13/0x20
 ? down_write+0x24/0x40
 ? path_openat+0x670/0x1210
 ? do_filp_open+0x8c/0x100
 ? getname_flags+0x47/0x1e0
 ? do_sys_open+0x121/0x200
 ? do_syscall_64+0x5c/0x140
 ? entry_SYSCALL64_slow_path+0x25/0x25

> 
> Rik, Johannes what do you think? Should we go with the simpler approach
> for now and think of a better plan long term?

I am not in a hurry if we can check, using a watchdog, whether this problem is occurring
in the real world. I have to test corner cases because the watchdog is missing.

The watchdog does not introduce negative side effects, will avoid soft lockups like
http://lkml.kernel.org/r/CAM_iQpWuPVGc2ky8M-9yukECtS+zKjiDasNymX7rMcBjBFyM_A@mail.gmail.com ,
will avoid console_unlock() vs. oom_lock mutex lockups due to warn_alloc(),
and will catch similar bugs which people are failing to reproduce.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-06-30  0:14         ` Tetsuo Handa
@ 2017-06-30 13:32           ` Michal Hocko
  2017-06-30 15:59             ` Tetsuo Handa
  0 siblings, 1 reply; 41+ messages in thread
From: Michal Hocko @ 2017-06-30 13:32 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: hannes, riel, akpm, mgorman, vbabka, linux-mm, linux-kernel

On Fri 30-06-17 09:14:22, Tetsuo Handa wrote:
[...]
> Ping? Ping? When are we going to apply this patch or the watchdog patch?
> This problem occurs with not-so-insane stress like that shown below.
> I can't test a near-OOM situation because the test likely falls into either
> the printk() vs. oom_lock lockup problem or this too_many_isolated() problem.

So you are saying that the patch fixes this issue. Do I understand you
correctly? And you do not see any other negative side effects with it
applied?

I am sorry I didn't have much time to think about feedback from Johannes
yet. A more robust throttling method is surely due but also not trivial.
So I am not sure how to proceed. It is true that your last test case
with only 10 processes fighting resembles reality much better than
the hundreds (AFAIR) that you were using previously.

Rik, Johannes what do you think? Should we go with the simpler approach
for now and think of a better plan long term?
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-03-10 11:44       ` Tetsuo Handa
  2017-03-21 10:37         ` Tetsuo Handa
  2017-04-23 10:24         ` Tetsuo Handa
@ 2017-06-30  0:14         ` Tetsuo Handa
  2017-06-30 13:32           ` Michal Hocko
  2 siblings, 1 reply; 41+ messages in thread
From: Tetsuo Handa @ 2017-06-30  0:14 UTC (permalink / raw)
  To: mhocko, hannes; +Cc: riel, akpm, mgorman, vbabka, linux-mm, linux-kernel

Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 09-03-17 13:05:40, Johannes Weiner wrote:
> > > On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote:
> > > > It only does this to some extent.  If reclaim made
> > > > no progress, for example due to immediately bailing
> > > > out because the number of already isolated pages is
> > > > too high (due to many parallel reclaimers), the code
> > > > could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
> > > > test without ever looking at the number of reclaimable
> > > > pages.
> > > 
> > > Hm, there is no early return there, actually. We bump the loop counter
> > > every time it happens, but then *do* look at the reclaimable pages.
> > > 
> > > > Could that create problems if we have many concurrent
> > > > reclaimers?
> > > 
> > > With increased concurrency, the likelihood of OOM will go up if we
> > > remove the unlimited wait for isolated pages, that much is true.
> > > 
> > > I'm not sure that's a bad thing, however, because we want the OOM
> > > killer to be predictable and timely. So a reasonable wait time in
> > > between 0 and forever before an allocating thread gives up under
> > > extreme concurrency makes sense to me.
> > > 
> > > > It may be OK, I just do not understand all the implications.
> > > > 
> > > > I like the general direction your patch takes the code in,
> > > > but I would like to understand it better...
> > > 
> > > I feel the same way. The throttling logic doesn't seem to be very well
> > > thought out at the moment, making it hard to reason about what happens
> > > in certain scenarios.
> > > 
> > > In that sense, this patch isn't really an overall improvement to the
> > > way things work. It patches a hole that seems to be exploitable only
> > > from an artificial OOM torture test, at the risk of regressing high
> > > concurrency workloads that may or may not be artificial.
> > > 
> > > Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
> > > behind this patch. Can we think about a general model to deal with
> > > allocation concurrency? 
> > 
> > I am definitely not against. There is no reason to rush the patch in.
> 
> I don't hurry if we can check using watchdog whether this problem is occurring
> in the real world. I have to test corner cases because watchdog is missing.
> 
> > My main point behind this patch was to reduce unbound loops from inside
> > the reclaim path and push any throttling up the call chain to the
> > page allocator path because I believe that it is easier to reason
> > about them at that level. The direct reclaim should be as simple as
> > possible without too many side effects otherwise we end up in a highly
> > unpredictable behavior. This was a first step in that direction and my
> > testing so far didn't show any regressions.
> > 
> > > Unlimited parallel direct reclaim is kinda
> > > bonkers in the first place. How about checking for excessive isolation
> > > counts from the page allocator and putting allocations on a waitqueue?
> > 
> > I would be interested in details here.
> 
> That will help implementing __GFP_KILLABLE.
> https://bugzilla.kernel.org/show_bug.cgi?id=192981#c15
> 
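
A very rough sketch of the waitqueue idea quoted above might look like this; the
helper names are hypothetical, the threshold is only illustrative, and
__GFP_KILLABLE is only the proposal referenced in the bugzilla entry, not an
existing flag:

----------
#include <linux/mmzone.h>
#include <linux/vmstat.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(isolation_wq);

/* the same kind of check too_many_isolated() does today, per node */
static bool isolation_below_limit(pg_data_t *pgdat)
{
	unsigned long isolated = node_page_state(pgdat, NR_ISOLATED_FILE) +
				 node_page_state(pgdat, NR_ISOLATED_ANON);
	unsigned long inactive = node_page_state(pgdat, NR_INACTIVE_FILE) +
				 node_page_state(pgdat, NR_INACTIVE_ANON);

	return isolated <= inactive / 2;
}

/*
 * Would be called from the allocator slowpath instead of letting every
 * direct reclaimer poll too_many_isolated().  The killable sleep is what
 * would make a future __GFP_KILLABLE easy to honour here.
 */
static int throttle_on_isolation(pg_data_t *pgdat)
{
	return wait_event_killable(isolation_wq, isolation_below_limit(pgdat));
}

/* reclaim would do wake_up_all(&isolation_wq) after putting pages back */
----------
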
Ping? Ping? When are we going to apply this patch or the watchdog patch?
This problem occurs with not-so-insane stress like that shown below.
I can't test a near-OOM situation because the test likely falls into either
the printk() vs. oom_lock lockup problem or this too_many_isolated() problem.

----------
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(int argc, char *argv[])
{
	static char buffer[4096] = { };
	char *buf = NULL;
	unsigned long size;
	int i;
	for (i = 0; i < 10; i++) {
		if (fork() == 0) {
			int fd = open("/proc/self/oom_score_adj", O_WRONLY);
			write(fd, "1000", 4);
			close(fd);
			sleep(1);
			if (!i)
				pause();
			snprintf(buffer, sizeof(buffer), "/tmp/file.%u", getpid());
			fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600);
			while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer))
				fsync(fd);
			_exit(0);
		}
	}
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	sleep(2);
	/* Will cause OOM due to overcommit */
	for (i = 0; i < size; i += 4096)
		buf[i] = 0;
	return 0;
}
----------

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170629-3.txt.xz .

[  190.924887] a.out           D13296  2191   2172 0x00000080
[  190.927121] Call Trace:
[  190.928304]  __schedule+0x23f/0x5d0
[  190.929843]  schedule+0x31/0x80
[  190.931261]  schedule_timeout+0x189/0x290
[  190.933068]  ? del_timer_sync+0x40/0x40
[  190.934722]  io_schedule_timeout+0x19/0x40
[  190.936467]  ? io_schedule_timeout+0x19/0x40
[  190.938272]  congestion_wait+0x7d/0xd0
[  190.939919]  ? wait_woken+0x80/0x80
[  190.941452]  shrink_inactive_list+0x3e3/0x4d0
[  190.943281]  shrink_node_memcg+0x360/0x780
[  190.945023]  ? check_preempt_curr+0x7d/0x90
[  190.946794]  ? try_to_wake_up+0x23b/0x3c0
[  190.948741]  shrink_node+0xdc/0x310
[  190.950285]  ? shrink_node+0xdc/0x310
[  190.951870]  do_try_to_free_pages+0xea/0x370
[  190.953661]  try_to_free_pages+0xc3/0x100
[  190.955644]  __alloc_pages_slowpath+0x441/0xd50
[  190.957714]  __alloc_pages_nodemask+0x20c/0x250
[  190.959598]  alloc_pages_vma+0x83/0x1e0
[  190.961244]  __handle_mm_fault+0xc2c/0x1030
[  190.963006]  handle_mm_fault+0xf4/0x220
[  190.964871]  __do_page_fault+0x25b/0x4a0
[  190.966611]  do_page_fault+0x30/0x80
[  190.968169]  page_fault+0x28/0x30

[  190.987135] a.out           D11896  2193   2191 0x00000086
[  190.989636] Call Trace:
[  190.990855]  __schedule+0x23f/0x5d0
[  190.992384]  schedule+0x31/0x80
[  190.993797]  schedule_timeout+0x1c1/0x290
[  190.995578]  ? init_object+0x64/0xa0
[  190.997133]  __down+0x85/0xd0
[  190.998476]  ? __down+0x85/0xd0
[  190.999879]  ? deactivate_slab.isra.83+0x160/0x4b0
[  191.001843]  down+0x3c/0x50
[  191.003116]  ? down+0x3c/0x50
[  191.004460]  xfs_buf_lock+0x21/0x50 [xfs]
[  191.006146]  _xfs_buf_find+0x3cd/0x640 [xfs]
[  191.007924]  xfs_buf_get_map+0x25/0x150 [xfs]
[  191.009736]  xfs_buf_read_map+0x25/0xc0 [xfs]
[  191.011891]  xfs_trans_read_buf_map+0xef/0x2f0 [xfs]
[  191.013990]  xfs_read_agf+0x86/0x110 [xfs]
[  191.015758]  xfs_alloc_read_agf+0x3e/0x140 [xfs]
[  191.017675]  xfs_alloc_fix_freelist+0x3e8/0x4e0 [xfs]
[  191.019725]  ? kmem_zone_alloc+0x8a/0x110 [xfs]
[  191.021613]  ? set_track+0x6b/0x140
[  191.023452]  ? init_object+0x64/0xa0
[  191.025049]  ? ___slab_alloc+0x1b6/0x590
[  191.026870]  ? ___slab_alloc+0x1b6/0x590
[  191.028581]  xfs_free_extent_fix_freelist+0x78/0xe0 [xfs]
[  191.030768]  xfs_free_extent+0x6a/0x1d0 [xfs]
[  191.032577]  xfs_trans_free_extent+0x2c/0xb0 [xfs]
[  191.034534]  xfs_extent_free_finish_item+0x21/0x40 [xfs]
[  191.036695]  xfs_defer_finish+0x143/0x2b0 [xfs]
[  191.038622]  xfs_itruncate_extents+0x1a5/0x3d0 [xfs]
[  191.040686]  xfs_free_eofblocks+0x1a8/0x200 [xfs]
[  191.042945]  xfs_release+0x13f/0x160 [xfs]
[  191.044811]  xfs_file_release+0x10/0x20 [xfs]
[  191.046674]  __fput+0xda/0x1e0
[  191.048077]  ____fput+0x9/0x10
[  191.049479]  task_work_run+0x7b/0xa0
[  191.051063]  do_exit+0x2c5/0xb30
[  191.052522]  do_group_exit+0x3e/0xb0
[  191.054103]  get_signal+0x1dd/0x4f0
[  191.055663]  ? __do_fault+0x19/0xf0
[  191.057790]  do_signal+0x32/0x650
[  191.059421]  ? handle_mm_fault+0xf4/0x220
[  191.061108]  ? __do_page_fault+0x25b/0x4a0
[  191.062818]  exit_to_usermode_loop+0x5a/0x90
[  191.064588]  prepare_exit_to_usermode+0x40/0x50
[  191.066468]  retint_user+0x8/0x10

[  191.085459] a.out           D11576  2194   2191 0x00000086
[  191.087652] Call Trace:
[  191.088883]  __schedule+0x23f/0x5d0
[  191.090437]  schedule+0x31/0x80
[  191.091830]  schedule_timeout+0x189/0x290
[  191.093541]  ? del_timer_sync+0x40/0x40
[  191.095166]  io_schedule_timeout+0x19/0x40
[  191.096881]  ? io_schedule_timeout+0x19/0x40
[  191.098657]  congestion_wait+0x7d/0xd0
[  191.100254]  ? wait_woken+0x80/0x80
[  191.101758]  shrink_inactive_list+0x3e3/0x4d0
[  191.103574]  shrink_node_memcg+0x360/0x780
[  191.105599]  ? check_preempt_curr+0x7d/0x90
[  191.107402]  ? try_to_wake_up+0x23b/0x3c0
[  191.109087]  shrink_node+0xdc/0x310
[  191.110590]  ? shrink_node+0xdc/0x310
[  191.112153]  do_try_to_free_pages+0xea/0x370
[  191.113948]  try_to_free_pages+0xc3/0x100
[  191.115639]  __alloc_pages_slowpath+0x441/0xd50
[  191.117508]  __alloc_pages_nodemask+0x20c/0x250
[  191.119374]  alloc_pages_current+0x65/0xd0
[  191.121179]  xfs_buf_allocate_memory+0x172/0x2d0 [xfs]
[  191.123262]  xfs_buf_get_map+0xbe/0x150 [xfs]
[  191.125077]  xfs_buf_read_map+0x25/0xc0 [xfs]
[  191.126909]  xfs_trans_read_buf_map+0xef/0x2f0 [xfs]
[  191.128924]  xfs_btree_read_buf_block.constprop.36+0x6d/0xc0 [xfs]
[  191.131358]  xfs_btree_lookup_get_block+0x85/0x180 [xfs]
[  191.133529]  xfs_btree_lookup+0x125/0x460 [xfs]
[  191.135562]  ? xfs_allocbt_init_cursor+0x43/0x130 [xfs]
[  191.137674]  xfs_free_ag_extent+0x9f/0x870 [xfs]
[  191.139579]  xfs_free_extent+0xb5/0x1d0 [xfs]
[  191.141419]  xfs_trans_free_extent+0x2c/0xb0 [xfs]
[  191.143387]  xfs_extent_free_finish_item+0x21/0x40 [xfs]
[  191.145538]  xfs_defer_finish+0x143/0x2b0 [xfs]
[  191.147446]  xfs_itruncate_extents+0x1a5/0x3d0 [xfs]
[  191.149485]  xfs_free_eofblocks+0x1a8/0x200 [xfs]
[  191.151630]  xfs_release+0x13f/0x160 [xfs]
[  191.153373]  xfs_file_release+0x10/0x20 [xfs]
[  191.155248]  __fput+0xda/0x1e0
[  191.156637]  ____fput+0x9/0x10
[  191.158011]  task_work_run+0x7b/0xa0
[  191.159563]  do_exit+0x2c5/0xb30
[  191.161013]  do_group_exit+0x3e/0xb0
[  191.162557]  get_signal+0x1dd/0x4f0
[  191.164071]  do_signal+0x32/0x650
[  191.165526]  ? handle_mm_fault+0xf4/0x220
[  191.167429]  ? __do_page_fault+0x283/0x4a0
[  191.169254]  exit_to_usermode_loop+0x5a/0x90
[  191.171070]  prepare_exit_to_usermode+0x40/0x50
[  191.172976]  retint_user+0x8/0x10

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-04-24 13:06             ` Tetsuo Handa
@ 2017-04-25  6:33               ` Stanislaw Gruszka
  0 siblings, 0 replies; 41+ messages in thread
From: Stanislaw Gruszka @ 2017-04-25  6:33 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, hannes, riel, akpm, mgorman, vbabka, linux-mm,
	linux-kernel, rientjes

On Mon, Apr 24, 2017 at 10:06:32PM +0900, Tetsuo Handa wrote:
> Stanislaw Gruszka wrote:
> > On Sun, Apr 23, 2017 at 07:24:21PM +0900, Tetsuo Handa wrote:
> > > On 2017/03/10 20:44, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > >> I am definitely not against. There is no reason to rush the patch in.
> > > > 
> > > > I don't hurry if we can check using watchdog whether this problem is occurring
> > > > in the real world. I have to test corner cases because watchdog is missing.
> > > > 
> > > Ping?
> > > 
> > > This problem can occur even immediately after the first invocation of
> > > the OOM killer. I believe this problem can occur in the real world.
> > > When are we going to apply this patch or watchdog patch?
> > > 
> > > ----------------------------------------
> > > [    0.000000] Linux version 4.11.0-rc7-next-20170421+ (root@ccsecurity) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #588 SMP Sun Apr 23 17:38:02 JST 2017
> > > [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.11.0-rc7-next-20170421+ root=UUID=17c3c28f-a70a-4666-95fa-ecf6acd901e4 ro vconsole.keymap=jp106 crashkernel=256M vconsole.font=latarcyrheb-sun16 security=none sysrq_always_enabled console=ttyS0,115200n8 console=tty0 LANG=en_US.UTF-8 debug_guardpage_minorder=1
> > 
> > Are you debugging a memory corruption problem?
> 
> No. Just random testing, trying to find out how we can avoid flooding of
> warn_alloc_stall() warning messages while also avoiding rate limiting.

This is not the right way to stress the mm subsystem; the debug_guardpage_minorder=
option is for _debug_ purposes. Use mem= instead if you want to limit
available memory.

> > FWIW, if you use debug_guardpage_minorder= you can expect all kinds of
> > memory allocation problems. This option is intended for debugging
> > memory corruption bugs and it shrinks the available memory in an
> > artificial way. Given that, I don't think it is reasonable to justify
> > any patch by a problem that happened while debug_guardpage_minorder=
> > was used.
> >  
> > Stanislaw
> 
> This problem occurs without the debug_guardpage_minorder= parameter, and

So please justify your patches by that.

Thanks
Stanislaw

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-04-24 12:39           ` Stanislaw Gruszka
@ 2017-04-24 13:06             ` Tetsuo Handa
  2017-04-25  6:33               ` Stanislaw Gruszka
  0 siblings, 1 reply; 41+ messages in thread
From: Tetsuo Handa @ 2017-04-24 13:06 UTC (permalink / raw)
  To: sgruszka
  Cc: mhocko, hannes, riel, akpm, mgorman, vbabka, linux-mm,
	linux-kernel, rientjes

Stanislaw Gruszka wrote:
> On Sun, Apr 23, 2017 at 07:24:21PM +0900, Tetsuo Handa wrote:
> > On 2017/03/10 20:44, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > >> I am definitely not against. There is no reason to rush the patch in.
> > > 
> > > I don't hurry if we can check using watchdog whether this problem is occurring
> > > in the real world. I have to test corner cases because watchdog is missing.
> > > 
> > Ping?
> > 
> > This problem can occur even immediately after the first invocation of
> > the OOM killer. I believe this problem can occur in the real world.
> > When are we going to apply this patch or watchdog patch?
> > 
> > ----------------------------------------
> > [    0.000000] Linux version 4.11.0-rc7-next-20170421+ (root@ccsecurity) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #588 SMP Sun Apr 23 17:38:02 JST 2017
> > [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.11.0-rc7-next-20170421+ root=UUID=17c3c28f-a70a-4666-95fa-ecf6acd901e4 ro vconsole.keymap=jp106 crashkernel=256M vconsole.font=latarcyrheb-sun16 security=none sysrq_always_enabled console=ttyS0,115200n8 console=tty0 LANG=en_US.UTF-8 debug_guardpage_minorder=1
> 
> Are you debugging a memory corruption problem?

No. Just random testing, trying to find out how we can avoid flooding of
warn_alloc_stall() warning messages while also avoiding rate limiting.

> 
> FWIW, if you use debug_guardpage_minorder= you can expect all kinds of
> memory allocation problems. This option is intended for debugging
> memory corruption bugs and it shrinks the available memory in an
> artificial way. Given that, I don't think it is reasonable to justify
> any patch by a problem that happened while debug_guardpage_minorder=
> was used.
>  
> Stanislaw

This problem occurs without the debug_guardpage_minorder= parameter, and

----------
[    0.000000] Linux version 4.11.0-rc7-next-20170421+ (root@ccsecurity) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #588 SMP Sun Apr 23 17:38:02 JST 2017
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.11.0-rc7-next-20170421+ root=UUID=17c3c28f-a70a-4666-95fa-ecf6acd901e4 ro vconsole.keymap=jp106 vconsole.font=latarcyrheb-sun16 security=none sysrq_always_enabled console=ttyS0,115200n8 console=tty0 LANG=en_US.UTF-8
(...snipped...)
CentOS Linux 7 (Core)
Kernel 4.11.0-rc7-next-20170421+ on an x86_64

ccsecurity login: [   31.882531] ip6_tables: (C) 2000-2006 Netfilter Core Team
[   32.550187] Ebtables v2.0 registered
[   32.730371] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
[   32.926518] IPv6: ADDRCONF(NETDEV_UP): ens32: link is not ready
[   32.928310] e1000: ens32 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[   32.930960] IPv6: ADDRCONF(NETDEV_CHANGE): ens32: link becomes ready
[   33.741378] Netfilter messages via NETLINK v0.30.
[   33.807350] ip_set: protocol 6
[   37.581002] nf_conntrack: default automatic helper assignment has been turned off for security reasons and CT-based  firewall rule not found. Use the iptables CT target to attach helpers instead.
[   38.072689] IPv6: ADDRCONF(NETDEV_UP): ens35: link is not ready
[   38.074419] e1000: ens35 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[   38.077222] IPv6: ADDRCONF(NETDEV_CHANGE): ens35: link becomes ready
[   92.753140] gmain invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null),  order=0, oom_score_adj=0
[   92.763445] gmain cpuset=/ mems_allowed=0
[   92.767634] CPU: 2 PID: 2733 Comm: gmain Not tainted 4.11.0-rc7-next-20170421+ #588
[   92.773624] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[   92.781790] Call Trace:
[   92.782630]  ? dump_stack+0x5c/0x7d
[   92.783902]  ? dump_header+0x97/0x233
[   92.785427]  ? ktime_get+0x30/0x90
[   92.786390]  ? delayacct_end+0x35/0x60
[   92.787433]  ? do_try_to_free_pages+0x2ca/0x370
[   92.789157]  ? oom_kill_process+0x223/0x3e0
[   92.790502]  ? has_capability_noaudit+0x17/0x20
[   92.791761]  ? oom_badness+0xeb/0x160
[   92.792783]  ? out_of_memory+0x10b/0x490
[   92.793872]  ? __alloc_pages_slowpath+0x701/0x8e2
[   92.795603]  ? __alloc_pages_nodemask+0x1ed/0x210
[   92.796902]  ? alloc_pages_current+0x7a/0x100
[   92.798115]  ? filemap_fault+0x2e9/0x5e0
[   92.799204]  ? filemap_map_pages+0x185/0x3a0
[   92.800402]  ? xfs_filemap_fault+0x2f/0x50 [xfs]
[   92.801678]  ? __do_fault+0x15/0x70
[   92.802651]  ? __handle_mm_fault+0xb0f/0x11e0
[   92.805141]  ? handle_mm_fault+0xc5/0x220
[   92.807261]  ? __do_page_fault+0x21e/0x4b0
[   92.809203]  ? do_page_fault+0x2b/0x70
[   92.811018]  ? do_syscall_64+0x137/0x140
[   92.812554]  ? page_fault+0x28/0x30
[   92.813855] Mem-Info:
[   92.815009] active_anon:437483 inactive_anon:2097 isolated_anon:0
[   92.815009]  active_file:0 inactive_file:104 isolated_file:41
[   92.815009]  unevictable:0 dirty:10 writeback:0 unstable:0
[   92.815009]  slab_reclaimable:2439 slab_unreclaimable:11018
[   92.815009]  mapped:405 shmem:2162 pagetables:8704 bounce:0
[   92.815009]  free:13168 free_pcp:58 free_cma:0
[   92.825444] Node 0 active_anon:1749932kB inactive_anon:8388kB active_file:0kB inactive_file:592kB unevictable:0kB isolated(anon):0kB isolated(file):164kB mapped:1620kB dirty:40kB writeback:0kB shmem:8648kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 1519616kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[   92.832175] Node 0 DMA free:8148kB min:352kB low:440kB high:528kB active_anon:7696kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:28kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[   92.840217] lowmem_reserve[]: 0 1952 1952 1952
[   92.841799] Node 0 DMA32 free:45028kB min:44700kB low:55872kB high:67044kB active_anon:1742236kB inactive_anon:8388kB active_file:0kB inactive_file:992kB unevictable:0kB writepending:40kB present:2080640kB managed:2018376kB mlocked:0kB slab_reclaimable:9756kB slab_unreclaimable:44040kB kernel_stack:22192kB pagetables:34788kB bounce:0kB free_pcp:672kB local_pcp:0kB free_cma:0kB
[   92.850458] lowmem_reserve[]: 0 0 0 0
[   92.851881] Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (M) 2*32kB (UM) 2*64kB (UM) 2*128kB (UM) 2*256kB (UM) 0*512kB 1*1024kB (U) 1*2048kB (M) 1*4096kB (M) = 8148kB
[   92.855530] Node 0 DMA32: 1023*4kB (UME) 591*8kB (UME) 220*16kB (UME) 223*32kB (UME) 156*64kB (UME) 38*128kB (UME) 12*256kB (UME) 10*512kB (UME) 2*1024kB (M) 0*2048kB 0*4096kB = 44564kB
[   92.860735] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[   92.863216] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   92.865714] 2994 total pagecache pages
[   92.867201] 0 pages in swap cache
[   92.868575] Swap cache stats: add 0, delete 0, find 0/0
[   92.870309] Free swap  = 0kB
[   92.871579] Total swap = 0kB
[   92.873000] 524157 pages RAM
[   92.874351] 0 pages HighMem/MovableOnly
[   92.875809] 15587 pages reserved
[   92.877151] 0 pages cma reserved
[   92.878513] 0 pages hwpoisoned
[   92.879948] Out of memory: Kill process 2983 (a.out) score 998 or sacrifice child
[   92.882182] Killed process 2983 (a.out) total-vm:4172kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[   92.886190] oom_reaper: reaped process 2983 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[   96.072996] a.out invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
[   96.076683] a.out cpuset=/ mems_allowed=0
[   96.078329] CPU: 3 PID: 2982 Comm: a.out Not tainted 4.11.0-rc7-next-20170421+ #588
[   96.080583] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[   96.083254] Call Trace:
[   96.084404]  ? dump_stack+0x5c/0x7d
[   96.085855]  ? dump_header+0x97/0x233
[   96.087393]  ? oom_kill_process+0x223/0x3e0
[   96.089059]  ? has_capability_noaudit+0x17/0x20
[   96.090567]  ? oom_badness+0xeb/0x160
[   96.092133]  ? out_of_memory+0x10b/0x490
[   96.093920]  ? __alloc_pages_slowpath+0x701/0x8e2
[   96.095732]  ? __alloc_pages_nodemask+0x1ed/0x210
[   96.097544]  ? alloc_pages_vma+0x9f/0x220
[   96.099133]  ? __handle_mm_fault+0xc22/0x11e0
[   96.100668]  ? handle_mm_fault+0xc5/0x220
[   96.102387]  ? __do_page_fault+0x21e/0x4b0
[   96.103824]  ? do_page_fault+0x2b/0x70
[   96.105351]  ? page_fault+0x28/0x30
[   96.106759] Mem-Info:
[   96.107908] active_anon:438003 inactive_anon:2097 isolated_anon:0
[   96.107908]  active_file:91 inactive_file:265 isolated_file:6
[   96.107908]  unevictable:0 dirty:1 writeback:121 unstable:0
[   96.107908]  slab_reclaimable:2439 slab_unreclaimable:11273
[   96.107908]  mapped:382 shmem:2162 pagetables:8698 bounce:0
[   96.107908]  free:13166 free_pcp:0 free_cma:0
[   96.119325] Node 0 active_anon:1752012kB inactive_anon:8388kB active_file:364kB inactive_file:1060kB unevictable:0kB isolated(anon):0kB isolated(file):24kB mapped:1528kB dirty:4kB writeback:484kB shmem:8648kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 1519616kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[   96.125753] Node 0 DMA free:8148kB min:352kB low:440kB high:528kB active_anon:7696kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:28kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[   96.133203] lowmem_reserve[]: 0 1952 1952 1952
[   96.135013] Node 0 DMA32 free:44516kB min:44700kB low:55872kB high:67044kB active_anon:1743720kB inactive_anon:8388kB active_file:336kB inactive_file:792kB unevictable:0kB writepending:488kB present:2080640kB managed:2018376kB mlocked:0kB slab_reclaimable:9756kB slab_unreclaimable:45060kB kernel_stack:22192kB pagetables:34764kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[   96.143814] lowmem_reserve[]: 0 0 0 0
[   96.145371] Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (M) 2*32kB (UM) 2*64kB (UM) 2*128kB (UM) 2*256kB (UM) 0*512kB 1*1024kB (U) 1*2048kB (M) 1*4096kB (M) = 8148kB
[   96.148956] Node 0 DMA32: 1052*4kB (UME) 599*8kB (UME) 212*16kB (UME) 237*32kB (UME) 155*64kB (UME) 39*128kB (UME) 12*256kB (UME) 10*512kB (UME) 2*1024kB (M) 0*2048kB 0*4096kB = 45128kB
[   96.153861] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[   96.156374] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   96.158817] 2598 total pagecache pages
[   96.160434] 0 pages in swap cache
[   96.161904] Swap cache stats: add 0, delete 0, find 0/0
[   96.163762] Free swap  = 0kB
[   96.165142] Total swap = 0kB
[   96.166507] 524157 pages RAM
[   96.167839] 0 pages HighMem/MovableOnly
[   96.169374] 15587 pages reserved
[   96.170834] 0 pages cma reserved
[   96.172247] 0 pages hwpoisoned
[   96.173569] Out of memory: Kill process 2984 (a.out) score 998 or sacrifice child
[   96.176242] Killed process 2984 (a.out) total-vm:4172kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[   96.182342] oom_reaper: reaped process 2984 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  242.498498] sysrq: SysRq : Show State
[  242.503329]   task                        PC stack   pid father
[  242.509822] systemd         D    0     1      0 0x00000000
[  242.515791] Call Trace:
[  242.519807]  ? __schedule+0x1d2/0x5a0
[  242.526263]  ? schedule+0x2d/0x80
[  242.530940]  ? schedule_timeout+0x16d/0x240
[  242.536135]  ? del_timer_sync+0x40/0x40
[  242.541458]  ? io_schedule_timeout+0x14/0x40
[  242.543661]  ? congestion_wait+0x79/0xd0
[  242.545748]  ? prepare_to_wait_event+0xf0/0xf0
[  242.548051]  ? shrink_inactive_list+0x388/0x3d0
[  242.550323]  ? shrink_node_memcg+0x33a/0x740
[  242.552505]  ? _cond_resched+0x10/0x20
[  242.554743]  ? _cond_resched+0x10/0x20
[  242.556952]  ? shrink_node+0xe0/0x320
[  242.558962]  ? do_try_to_free_pages+0xdc/0x370
[  242.561168]  ? try_to_free_pages+0xbe/0x100
[  242.563309]  ? __alloc_pages_slowpath+0x387/0x8e2
[  242.565581]  ? __wake_up_common+0x4c/0x80
[  242.567759]  ? __alloc_pages_nodemask+0x1ed/0x210
[  242.570064]  ? alloc_pages_current+0x7a/0x100
[  242.572092]  ? __do_page_cache_readahead+0xe9/0x250
[  242.573707]  ? radix_tree_lookup_slot+0x1e/0x50
[  242.575081]  ? find_get_entry+0x14/0x100
[  242.576414]  ? pagecache_get_page+0x21/0x200
[  242.577678]  ? filemap_fault+0x23a/0x5e0
[  242.578859]  ? filemap_map_pages+0x185/0x3a0
[  242.580093]  ? xfs_filemap_fault+0x2f/0x50 [xfs]
[  242.581398]  ? __do_fault+0x15/0x70
[  242.582468]  ? __handle_mm_fault+0xb0f/0x11e0
[  242.583665]  ? ep_ptable_queue_proc+0x90/0x90
[  242.584831]  ? handle_mm_fault+0xc5/0x220
[  242.585993]  ? __do_page_fault+0x21e/0x4b0
[  242.587257]  ? do_page_fault+0x2b/0x70
[  242.589145]  ? page_fault+0x28/0x30
(...snipped...)
[  243.105826] kswapd0         D    0    51      2 0x00000000
[  243.107344] Call Trace:
[  243.108113]  ? __schedule+0x1d2/0x5a0
[  243.109114]  ? schedule+0x2d/0x80
[  243.110052]  ? schedule_timeout+0x192/0x240
[  243.111190]  ? check_preempt_curr+0x7f/0x90
[  243.112260]  ? __down_common+0xc0/0x128
[  243.113329]  ? down+0x36/0x40
[  243.114296]  ? xfs_buf_lock+0x1d/0x40 [xfs]
[  243.115473]  ? _xfs_buf_find+0x2ad/0x580 [xfs]
[  243.116785]  ? xfs_buf_get_map+0x1d/0x140 [xfs]
[  243.118052]  ? xfs_buf_read_map+0x23/0xd0 [xfs]
[  243.119310]  ? xfs_trans_read_buf_map+0xe5/0x2f0 [xfs]
[  243.120655]  ? _cond_resched+0x10/0x20
[  243.122831]  ? xfs_read_agf+0x8d/0x120 [xfs]
[  243.124181]  ? xfs_alloc_read_agf+0x39/0x130 [xfs]
[  243.125616]  ? xfs_alloc_fix_freelist+0x369/0x430 [xfs]
[  243.127093]  ? __radix_tree_lookup+0x80/0xf0
[  243.128235]  ? __radix_tree_lookup+0x80/0xf0
[  243.129357]  ? xfs_alloc_vextent+0x148/0x460 [xfs]
[  243.130596]  ? xfs_bmap_btalloc+0x45e/0x8a0 [xfs]
[  243.131804]  ? xfs_bmapi_write+0x768/0x1250 [xfs]
[  243.133032]  ? kmem_cache_alloc+0x11c/0x130
[  243.134160]  ? xfs_iomap_write_allocate+0x175/0x360 [xfs]
[  243.135503]  ? xfs_map_blocks+0x181/0x230 [xfs]
[  243.136802]  ? xfs_do_writepage+0x1db/0x630 [xfs]
[  243.138030]  ? xfs_vm_writepage+0x31/0x70 [xfs]
[  243.139396]  ? pageout.isra.47+0x188/0x280
[  243.140490]  ? shrink_page_list+0x79d/0xbb0
[  243.141619]  ? shrink_inactive_list+0x1c2/0x3d0
[  243.142831]  ? radix_tree_gang_lookup_tag+0xe3/0x160
[  243.144072]  ? shrink_node_memcg+0x33a/0x740
[  243.145188]  ? _cond_resched+0x10/0x20
[  243.146410]  ? _cond_resched+0x10/0x20
[  243.147746]  ? shrink_node+0xe0/0x320
[  243.148754]  ? kswapd+0x2b4/0x660
[  243.149691]  ? kthread+0xf2/0x130
[  243.150690]  ? mem_cgroup_shrink_node+0xb0/0xb0
[  243.151887]  ? kthread_park+0x60/0x60
[  243.152909]  ? ret_from_fork+0x26/0x40
(...snipped...)
[  273.216540] Showing busy workqueues and worker pools:
[  273.218084] workqueue events_freezable_power_: flags=0x84
[  273.219707]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[  273.221259]     in-flight: 381:disk_events_workfn
[  273.222576] workqueue writeback: flags=0x4e
[  273.223721]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[  273.225240]     in-flight: 344:wb_workfn wb_workfn
[  273.227485] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 63 17
[  273.229266] pool 256: cpus=0-127 flags=0x4 nice=0 hung=180s workers=34 idle: 343 342 341 340 339 338 337 336 335 334 333 332 331 329 330 328 327 326 325 324 323 322 321 320 319 318 317 248 280 53 345 5 348
[  340.690056] sysrq: SysRq : Resetting
----------

this problem also occurs with only 4 parallel writers.

----------
[    0.000000] Linux version 4.11.0-rc7-next-20170421+ (root@ccsecurity) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #588 SMP Sun Apr 23 17:38:02 JST 2017
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.11.0-rc7-next-20170421+ root=UUID=17c3c28f-a70a-4666-95fa-ecf6acd901e4 ro vconsole.keymap=jp106 crashkernel=256M vconsole.font=latarcyrheb-sun16 security=none sysrq_always_enabled console=ttyS0,115200n8 console=tty0 LANG=en_US.UTF-8 debug_guardpage_minorder=1
(...snipped...)
[  383.692506] Out of memory: Kill process 3391 (a.out) score 999 or sacrifice child
[  383.694476] Killed process 3391 (a.out) total-vm:4172kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[  383.699008] oom_reaper: reaped process 3391 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  445.711383] sysrq: SysRq : Show State
[  445.718193]   task                        PC stack   pid father
(...snipped...)
[  446.272860] kswapd0         D    0    51      2 0x00000000
[  446.274148] Call Trace:
[  446.274890]  ? __schedule+0x1d2/0x5a0
[  446.275847]  ? schedule+0x2d/0x80
[  446.276736]  ? rwsem_down_read_failed+0x108/0x180
[  446.278223]  ? call_rwsem_down_read_failed+0x14/0x30
[  446.280076]  ? down_read+0x17/0x30
[  446.281297]  ? xfs_map_blocks+0x8f/0x230 [xfs]
[  446.282685]  ? xfs_do_writepage+0x1db/0x630 [xfs]
[  446.283985]  ? xfs_vm_writepage+0x31/0x70 [xfs]
[  446.285124]  ? pageout.isra.47+0x188/0x280
[  446.286192]  ? shrink_page_list+0x79d/0xbb0
[  446.287296]  ? shrink_inactive_list+0x1c2/0x3d0
[  446.288442]  ? radix_tree_gang_lookup_tag+0xe3/0x160
[  446.289808]  ? shrink_node_memcg+0x33a/0x740
[  446.291027]  ? _cond_resched+0x10/0x20
[  446.292038]  ? _cond_resched+0x10/0x20
[  446.293089]  ? shrink_node+0xe0/0x320
[  446.294069]  ? kswapd+0x2b4/0x660
[  446.295036]  ? kthread+0xf2/0x130
[  446.296211]  ? mem_cgroup_shrink_node+0xb0/0xb0
[  446.297367]  ? kthread_park+0x60/0x60
[  446.298353]  ? ret_from_fork+0x26/0x40
(...snipped...)
[  448.285791] a.out           D    0  3387   2847 0x00000080
[  448.287194] Call Trace:
[  448.287975]  ? __schedule+0x1d2/0x5a0
[  448.288975]  ? schedule+0x2d/0x80
[  448.289910]  ? schedule_timeout+0x16d/0x240
[  448.291072]  ? del_timer_sync+0x40/0x40
[  448.292097]  ? io_schedule_timeout+0x14/0x40
[  448.293294]  ? congestion_wait+0x79/0xd0
[  448.294327]  ? prepare_to_wait_event+0xf0/0xf0
[  448.295476]  ? shrink_inactive_list+0x388/0x3d0
[  448.296650]  ? shrink_node_memcg+0x33a/0x740
[  448.298016]  ? _cond_resched+0x10/0x20
[  448.299027]  ? _cond_resched+0x10/0x20
[  448.300032]  ? shrink_node+0xe0/0x320
[  448.301068]  ? do_try_to_free_pages+0xdc/0x370
[  448.302247]  ? try_to_free_pages+0xbe/0x100
[  448.303325]  ? __alloc_pages_slowpath+0x387/0x8e2
[  448.304492]  ? __lock_page_or_retry+0x1b8/0x300
[  448.305628]  ? __alloc_pages_nodemask+0x1ed/0x210
[  448.306809]  ? alloc_pages_vma+0x9f/0x220
[  448.307874]  ? __handle_mm_fault+0xc22/0x11e0
[  448.308984]  ? handle_mm_fault+0xc5/0x220
[  448.310228]  ? __do_page_fault+0x21e/0x4b0
[  448.311500]  ? do_page_fault+0x2b/0x70
[  448.312609]  ? page_fault+0x28/0x30
[  448.313926] a.out           D    0  3388   3387 0x00000086
[  448.315461] Call Trace:
[  448.316339]  ? __schedule+0x1d2/0x5a0
[  448.317348]  ? schedule+0x2d/0x80
[  448.318291]  ? schedule_timeout+0x192/0x240
[  448.319372]  ? sched_clock_cpu+0xc/0xa0
[  448.320417]  ? __down_common+0xc0/0x128
[  448.321583]  ? down+0x36/0x40
[  448.322463]  ? xfs_buf_lock+0x1d/0x40 [xfs]
[  448.323572]  ? _xfs_buf_find+0x2ad/0x580 [xfs]
[  448.324698]  ? xfs_buf_get_map+0x1d/0x140 [xfs]
[  448.325885]  ? xfs_buf_read_map+0x23/0xd0 [xfs]
[  448.327045]  ? xfs_trans_read_buf_map+0xe5/0x2f0 [xfs]
[  448.328303]  ? xfs_read_agf+0x8d/0x120 [xfs]
[  448.329384]  ? xfs_trans_read_buf_map+0x178/0x2f0 [xfs]
[  448.330906]  ? xfs_alloc_read_agf+0x39/0x130 [xfs]
[  448.332401]  ? xfs_alloc_fix_freelist+0x369/0x430 [xfs]
[  448.333738]  ? xfs_btree_rec_addr+0x9/0x10 [xfs]
[  448.335180]  ? _cond_resched+0x10/0x20
[  448.336628]  ? __kmalloc+0x114/0x180
[  448.337783]  ? xfs_buf_rele+0x57/0x3b0 [xfs]
[  448.339143]  ? __radix_tree_lookup+0x80/0xf0
[  448.340406]  ? xfs_free_extent_fix_freelist+0x67/0xc0 [xfs]
[  448.341889]  ? xfs_free_extent+0x6f/0x210 [xfs]
[  448.343210]  ? xfs_trans_free_extent+0x27/0x90 [xfs]
[  448.344565]  ? xfs_extent_free_finish_item+0x1c/0x30 [xfs]
[  448.346042]  ? xfs_defer_finish+0x125/0x280 [xfs]
[  448.348145]  ? xfs_itruncate_extents+0x1a2/0x3c0 [xfs]
[  448.349999]  ? xfs_free_eofblocks+0x1c5/0x230 [xfs]
[  448.351680]  ? xfs_release+0x135/0x160 [xfs]
[  448.353278]  ? __fput+0xc8/0x1c0
[  448.354355]  ? task_work_run+0x6e/0x90
[  448.355646]  ? do_exit+0x2b6/0xab0
[  448.356761]  ? do_group_exit+0x34/0xa0
[  448.357901]  ? get_signal+0x17c/0x4f0
[  448.359039]  ? __do_fault+0x15/0x70
[  448.360139]  ? do_signal+0x31/0x610
[  448.361238]  ? handle_mm_fault+0xc5/0x220
[  448.362487]  ? __do_page_fault+0x21e/0x4b0
[  448.363752]  ? exit_to_usermode_loop+0x35/0x70
[  448.365109]  ? prepare_exit_to_usermode+0x39/0x40
[  448.366475]  ? retint_user+0x8/0x13
[  448.367640] a.out           D    0  3389   3387 0x00000086
[  448.369260] Call Trace:
[  448.370151]  ? __schedule+0x1d2/0x5a0
[  448.371220]  ? schedule+0x2d/0x80
[  448.372181]  ? schedule_timeout+0x16d/0x240
[  448.373277]  ? del_timer_sync+0x40/0x40
[  448.374309]  ? io_schedule_timeout+0x14/0x40
[  448.375414]  ? congestion_wait+0x79/0xd0
[  448.376460]  ? prepare_to_wait_event+0xf0/0xf0
[  448.377590]  ? shrink_inactive_list+0x388/0x3d0
[  448.378788]  ? pick_next_task_fair+0x39c/0x480
[  448.380269]  ? shrink_node_memcg+0x33a/0x740
[  448.381981]  ? mem_cgroup_iter+0x127/0x2b0
[  448.383266]  ? shrink_node+0xe0/0x320
[  448.384342]  ? do_try_to_free_pages+0xdc/0x370
[  448.385569]  ? try_to_free_pages+0xbe/0x100
[  448.386680]  ? __alloc_pages_slowpath+0x387/0x8e2
[  448.387909]  ? __alloc_pages_nodemask+0x1ed/0x210
[  448.389163]  ? alloc_pages_current+0x7a/0x100
[  448.390369]  ? xfs_buf_allocate_memory+0x16a/0x2ad [xfs]
[  448.391731]  ? xfs_buf_get_map+0xeb/0x140 [xfs]
[  448.392931]  ? xfs_buf_read_map+0x23/0xd0 [xfs]
[  448.394114]  ? xfs_trans_read_buf_map+0xe5/0x2f0 [xfs]
[  448.395421]  ? xfs_btree_read_buf_block.constprop.37+0x72/0xc0 [xfs]
[  448.397007]  ? xfs_btree_lookup_get_block+0x7f/0x160 [xfs]
[  448.398671]  ? xfs_btree_lookup+0xc9/0x3f0 [xfs]
[  448.399927]  ? xfs_bmap_del_extent+0x1a0/0xbb0 [xfs]
[  448.401357]  ? __xfs_bunmapi+0x3bb/0xb70 [xfs]
[  448.402679]  ? xfs_bunmapi+0x26/0x40 [xfs]
[  448.403907]  ? xfs_itruncate_extents+0x18a/0x3c0 [xfs]
[  448.405339]  ? xfs_free_eofblocks+0x1c5/0x230 [xfs]
[  448.406688]  ? xfs_release+0x135/0x160 [xfs]
[  448.407911]  ? __fput+0xc8/0x1c0
[  448.408939]  ? task_work_run+0x6e/0x90
[  448.410061]  ? do_exit+0x2b6/0xab0
[  448.411156]  ? do_group_exit+0x34/0xa0
[  448.412301]  ? get_signal+0x17c/0x4f0
[  448.413526]  ? __do_fault+0x15/0x70
[  448.415066]  ? do_signal+0x31/0x610
[  448.416174]  ? handle_mm_fault+0xc5/0x220
[  448.417490]  ? __do_page_fault+0x21e/0x4b0
[  448.418729]  ? exit_to_usermode_loop+0x35/0x70
[  448.419976]  ? prepare_exit_to_usermode+0x39/0x40
[  448.421336]  ? retint_user+0x8/0x13
[  448.422414] a.out           D    0  3391   3387 0x00000086
[  448.423873] Call Trace:
[  448.424755]  ? __schedule+0x1d2/0x5a0
[  448.425857]  ? schedule+0x2d/0x80
[  448.426918]  ? schedule_timeout+0x192/0x240
[  448.428143]  ? mempool_alloc+0x64/0x170
[  448.429318]  ? __down_common+0xc0/0x128
[  448.430401]  ? down+0x36/0x40
[  448.431561]  ? xfs_buf_lock+0x1d/0x40 [xfs]
[  448.432727]  ? _xfs_buf_find+0x2ad/0x580 [xfs]
[  448.433976]  ? xfs_buf_get_map+0x1d/0x140 [xfs]
[  448.435216]  ? xfs_buf_read_map+0x23/0xd0 [xfs]
[  448.436545]  ? xfs_trans_read_buf_map+0xe5/0x2f0 [xfs]
[  448.437901]  ? xfs_read_agf+0x8d/0x120 [xfs]
[  448.439111]  ? xfs_trans_read_buf_map+0x178/0x2f0 [xfs]
[  448.440624]  ? xfs_alloc_read_agf+0x39/0x130 [xfs]
[  448.441958]  ? xfs_alloc_fix_freelist+0x369/0x430 [xfs]
[  448.443524]  ? xfs_btree_rec_addr+0x9/0x10 [xfs]
[  448.444800]  ? _cond_resched+0x10/0x20
[  448.445933]  ? __kmalloc+0x114/0x180
[  448.447319]  ? xfs_buf_rele+0x57/0x3b0 [xfs]
[  448.448657]  ? __radix_tree_lookup+0x80/0xf0
[  448.449934]  ? xfs_free_extent_fix_freelist+0x67/0xc0 [xfs]
[  448.451445]  ? xfs_free_extent+0x6f/0x210 [xfs]
[  448.452608]  ? xfs_trans_free_extent+0x27/0x90 [xfs]
[  448.453874]  ? xfs_extent_free_finish_item+0x1c/0x30 [xfs]
[  448.455203]  ? xfs_defer_finish+0x125/0x280 [xfs]
[  448.456410]  ? xfs_itruncate_extents+0x1a2/0x3c0 [xfs]
[  448.457682]  ? xfs_free_eofblocks+0x1c5/0x230 [xfs]
[  448.458937]  ? xfs_release+0x135/0x160 [xfs]
[  448.460060]  ? __fput+0xc8/0x1c0
[  448.461081]  ? task_work_run+0x6e/0x90
[  448.462103]  ? do_exit+0x2b6/0xab0
[  448.463064]  ? do_group_exit+0x34/0xa0
[  448.464347]  ? get_signal+0x17c/0x4f0
[  448.465402]  ? do_signal+0x31/0x610
[  448.466373]  ? xfs_file_write_iter+0x88/0x120 [xfs]
[  448.467614]  ? __vfs_write+0xe5/0x140
[  448.468613]  ? exit_to_usermode_loop+0x35/0x70
[  448.469747]  ? do_syscall_64+0x12a/0x140
[  448.470827]  ? entry_SYSCALL64_slow_path+0x25/0x25
[  448.472399] a.out           D    0  3392   3387 0x00000080
[  448.473757] Call Trace:
[  448.474567]  ? __schedule+0x1d2/0x5a0
[  448.475598]  ? schedule+0x2d/0x80
[  448.476566]  ? schedule_timeout+0x16d/0x240
[  448.477688]  ? del_timer_sync+0x40/0x40
[  448.478709]  ? io_schedule_timeout+0x14/0x40
[  448.480159]  ? congestion_wait+0x79/0xd0
[  448.481998]  ? prepare_to_wait_event+0xf0/0xf0
[  448.483679]  ? shrink_inactive_list+0x388/0x3d0
[  448.485113]  ? shrink_node_memcg+0x33a/0x740
[  448.486310]  ? xfs_reclaim_inodes_count+0x2d/0x40 [xfs]
[  448.487609]  ? mem_cgroup_iter+0x127/0x2b0
[  448.488719]  ? shrink_node+0xe0/0x320
[  448.489747]  ? do_try_to_free_pages+0xdc/0x370
[  448.490926]  ? try_to_free_pages+0xbe/0x100
[  448.492122]  ? __alloc_pages_slowpath+0x387/0x8e2
[  448.493347]  ? __alloc_pages_nodemask+0x1ed/0x210
[  448.494633]  ? alloc_pages_current+0x7a/0x100
[  448.495800]  ? xfs_buf_allocate_memory+0x16a/0x2ad [xfs]
[  448.497170]  ? xfs_buf_get_map+0xeb/0x140 [xfs]
[  448.498710]  ? xfs_buf_read_map+0x23/0xd0 [xfs]
[  448.499861]  ? xfs_trans_read_buf_map+0xe5/0x2f0 [xfs]
[  448.501200]  ? xfs_btree_read_buf_block.constprop.37+0x72/0xc0 [xfs]
[  448.502698]  ? xfs_btree_lookup_get_block+0x7f/0x160 [xfs]
[  448.504021]  ? xfs_btree_lookup+0xc9/0x3f0 [xfs]
[  448.505242]  ? xfs_iext_remove_direct+0x64/0xd0 [xfs]
[  448.506495]  ? xfs_bmap_add_extent_delay_real+0x4f9/0x18e0 [xfs]
[  448.507930]  ? _cond_resched+0x10/0x20
[  448.508972]  ? kmem_cache_alloc+0x11c/0x130
[  448.510132]  ? kmem_zone_alloc+0x84/0xf0 [xfs]
[  448.511366]  ? xfs_bmapi_write+0x826/0x1250 [xfs]
[  448.512572]  ? kmem_cache_alloc+0x11c/0x130
[  448.514112]  ? xfs_iomap_write_allocate+0x175/0x360 [xfs]
[  448.515920]  ? xfs_map_blocks+0x181/0x230 [xfs]
[  448.517136]  ? xfs_do_writepage+0x1db/0x630 [xfs]
[  448.518381]  ? invalid_page_referenced_vma+0x80/0x80
[  448.519640]  ? write_cache_pages+0x205/0x400
[  448.520831]  ? xfs_vm_set_page_dirty+0x1c0/0x1c0 [xfs]
[  448.522203]  ? iomap_apply+0xe3/0x120
[  448.523271]  ? xfs_vm_writepages+0x5f/0xa0 [xfs]
[  448.524523]  ? __filemap_fdatawrite_range+0xc0/0xf0
[  448.525866]  ? filemap_write_and_wait_range+0x20/0x50
[  448.527157]  ? xfs_file_fsync+0x41/0x160 [xfs]
[  448.528319]  ? do_fsync+0x33/0x60
[  448.529273]  ? SyS_fsync+0x7/0x10
[  448.530267]  ? do_syscall_64+0x5c/0x140
[  448.531609]  ? entry_SYSCALL64_slow_path+0x25/0x25
(...snipped...)
[  580.304479] Showing busy workqueues and worker pools:
[  580.306114] workqueue events_freezable_power_: flags=0x84
[  580.307522]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  580.309059]     in-flight: 99:disk_events_workfn
[  580.310365] workqueue writeback: flags=0x4e
[  580.312273]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[  580.313966]     in-flight: 342:wb_workfn wb_workfn
[  580.316378] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=2s workers=3 idle: 24 3095
[  580.318281] pool 256: cpus=0-127 flags=0x4 nice=0 hung=198s workers=3 idle: 341 340
[  595.909943] sysrq: SysRq : Resetting
----------

This problem is very much dependent on timing, and warn_alloc_stall() cannot
catch this problem.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-04-23 10:24         ` Tetsuo Handa
@ 2017-04-24 12:39           ` Stanislaw Gruszka
  2017-04-24 13:06             ` Tetsuo Handa
  0 siblings, 1 reply; 41+ messages in thread
From: Stanislaw Gruszka @ 2017-04-24 12:39 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, hannes, riel, akpm, mgorman, vbabka, linux-mm,
	linux-kernel, rientjes

On Sun, Apr 23, 2017 at 07:24:21PM +0900, Tetsuo Handa wrote:
> On 2017/03/10 20:44, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> >> On Thu 09-03-17 13:05:40, Johannes Weiner wrote:
> >>> On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote:
> >>>> It only does this to some extent.  If reclaim made
> >>>> no progress, for example due to immediately bailing
> >>>> out because the number of already isolated pages is
> >>>> too high (due to many parallel reclaimers), the code
> >>>> could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
> >>>> test without ever looking at the number of reclaimable
> >>>> pages.
> >>>
> >>> Hm, there is no early return there, actually. We bump the loop counter
> >>> every time it happens, but then *do* look at the reclaimable pages.
> >>>
> >>>> Could that create problems if we have many concurrent
> >>>> reclaimers?
> >>>
> >>> With increased concurrency, the likelihood of OOM will go up if we
> >>> remove the unlimited wait for isolated pages, that much is true.
> >>>
> >>> I'm not sure that's a bad thing, however, because we want the OOM
> >>> killer to be predictable and timely. So a reasonable wait time in
> >>> between 0 and forever before an allocating thread gives up under
> >>> extreme concurrency makes sense to me.
> >>>
> >>>> It may be OK, I just do not understand all the implications.
> >>>>
> >>>> I like the general direction your patch takes the code in,
> >>>> but I would like to understand it better...
> >>>
> >>> I feel the same way. The throttling logic doesn't seem to be very well
> >>> thought out at the moment, making it hard to reason about what happens
> >>> in certain scenarios.
> >>>
> >>> In that sense, this patch isn't really an overall improvement to the
> >>> way things work. It patches a hole that seems to be exploitable only
> >>> from an artificial OOM torture test, at the risk of regressing high
> >>> concurrency workloads that may or may not be artificial.
> >>>
> >>> Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
> >>> behind this patch. Can we think about a general model to deal with
> >>> allocation concurrency? 
> >>
> >> I am definitely not against. There is no reason to rush the patch in.
> > 
> > I am not in a hurry if we can use a watchdog to check whether this problem
> > is occurring in the real world. I have to test corner cases because such a
> > watchdog is missing.
> > 
> Ping?
> 
> This problem can occur even immediately after the first invocation of
> the OOM killer. I believe this problem can occur in the real world.
> When are we going to apply this patch or the watchdog patch?
> 
> ----------------------------------------
> [    0.000000] Linux version 4.11.0-rc7-next-20170421+ (root@ccsecurity) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #588 SMP Sun Apr 23 17:38:02 JST 2017
> [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.11.0-rc7-next-20170421+ root=UUID=17c3c28f-a70a-4666-95fa-ecf6acd901e4 ro vconsole.keymap=jp106 crashkernel=256M vconsole.font=latarcyrheb-sun16 security=none sysrq_always_enabled console=ttyS0,115200n8 console=tty0 LANG=en_US.UTF-8 debug_guardpage_minorder=1

Are you debugging a memory corruption problem?

FWIW, if you use debug_guardpage_minorder= you can expect all sorts of
memory allocation problems. This option is intended for debugging
memory corruption bugs and it shrinks the available memory in an
artificial way. Given that, I don't think it is reasonable to justify
any patch by a problem that happened while debug_guardpage_minorder=
was in use.
 
Stanislaw


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-03-10 11:44       ` Tetsuo Handa
  2017-03-21 10:37         ` Tetsuo Handa
@ 2017-04-23 10:24         ` Tetsuo Handa
  2017-04-24 12:39           ` Stanislaw Gruszka
  2017-06-30  0:14         ` Tetsuo Handa
  2 siblings, 1 reply; 41+ messages in thread
From: Tetsuo Handa @ 2017-04-23 10:24 UTC (permalink / raw)
  To: mhocko, hannes
  Cc: riel, akpm, mgorman, vbabka, linux-mm, linux-kernel, rientjes, sgruszka

On 2017/03/10 20:44, Tetsuo Handa wrote:
> Michal Hocko wrote:
>> On Thu 09-03-17 13:05:40, Johannes Weiner wrote:
>>> On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote:
>>>> It only does this to some extent.  If reclaim made
>>>> no progress, for example due to immediately bailing
>>>> out because the number of already isolated pages is
>>>> too high (due to many parallel reclaimers), the code
>>>> could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
>>>> test without ever looking at the number of reclaimable
>>>> pages.
>>>
>>> Hm, there is no early return there, actually. We bump the loop counter
>>> every time it happens, but then *do* look at the reclaimable pages.
>>>
>>>> Could that create problems if we have many concurrent
>>>> reclaimers?
>>>
>>> With increased concurrency, the likelihood of OOM will go up if we
>>> remove the unlimited wait for isolated pages, that much is true.
>>>
>>> I'm not sure that's a bad thing, however, because we want the OOM
>>> killer to be predictable and timely. So a reasonable wait time in
>>> between 0 and forever before an allocating thread gives up under
>>> extreme concurrency makes sense to me.
>>>
>>>> It may be OK, I just do not understand all the implications.
>>>>
>>>> I like the general direction your patch takes the code in,
>>>> but I would like to understand it better...
>>>
>>> I feel the same way. The throttling logic doesn't seem to be very well
>>> thought out at the moment, making it hard to reason about what happens
>>> in certain scenarios.
>>>
>>> In that sense, this patch isn't really an overall improvement to the
>>> way things work. It patches a hole that seems to be exploitable only
>>> from an artificial OOM torture test, at the risk of regressing high
>>> concurrency workloads that may or may not be artificial.
>>>
>>> Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
>>> behind this patch. Can we think about a general model to deal with
>>> allocation concurrency? 
>>
>> I am definitely not against. There is no reason to rush the patch in.
> 
> I am not in a hurry if we can use a watchdog to check whether this problem
> is occurring in the real world. I have to test corner cases because such a
> watchdog is missing.
> 
Ping?

This problem can occur even immediately after the first invocation of
the OOM killer. I believe this problem can occur in the real world.
When are we going to apply this patch or the watchdog patch?

----------------------------------------
[    0.000000] Linux version 4.11.0-rc7-next-20170421+ (root@ccsecurity) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #588 SMP Sun Apr 23 17:38:02 JST 2017
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.11.0-rc7-next-20170421+ root=UUID=17c3c28f-a70a-4666-95fa-ecf6acd901e4 ro vconsole.keymap=jp106 crashkernel=256M vconsole.font=latarcyrheb-sun16 security=none sysrq_always_enabled console=ttyS0,115200n8 console=tty0 LANG=en_US.UTF-8 debug_guardpage_minorder=1
(...snipped...)
CentOS Linux 7 (Core)
Kernel 4.11.0-rc7-next-20170421+ on an x86_64

ccsecurity login: [   32.406723] ip6_tables: (C) 2000-2006 Netfilter Core Team
[   32.852917] Ebtables v2.0 registered
[   33.034402] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
[   33.467929] e1000: ens32 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[   33.475728] IPv6: ADDRCONF(NETDEV_UP): ens32: link is not ready
[   33.478910] IPv6: ADDRCONF(NETDEV_CHANGE): ens32: link becomes ready
[   33.950365] Netfilter messages via NETLINK v0.30.
[   33.983449] ip_set: protocol 6
[   37.335966] e1000: ens35 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[   37.337587] IPv6: ADDRCONF(NETDEV_UP): ens35: link is not ready
[   37.339925] IPv6: ADDRCONF(NETDEV_CHANGE): ens35: link becomes ready
[   39.940942] nf_conntrack: default automatic helper assignment has been turned off for security reasons and CT-based  firewall rule not found. Use the iptables CT target to attach helpers instead.
[   98.926202] a.out invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
[   98.932977] a.out cpuset=/ mems_allowed=0
[   98.934780] CPU: 1 PID: 2972 Comm: a.out Not tainted 4.11.0-rc7-next-20170421+ #588
[   98.937988] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[   98.942193] Call Trace:
[   98.942942]  ? dump_stack+0x5c/0x7d
[   98.943907]  ? dump_header+0x97/0x233
[   98.945334]  ? ktime_get+0x30/0x90
[   98.946290]  ? delayacct_end+0x35/0x60
[   98.947319]  ? do_try_to_free_pages+0x2ca/0x370
[   98.948554]  ? oom_kill_process+0x223/0x3e0
[   98.949715]  ? has_capability_noaudit+0x17/0x20
[   98.950948]  ? oom_badness+0xeb/0x160
[   98.951962]  ? out_of_memory+0x10b/0x490
[   98.953030]  ? __alloc_pages_slowpath+0x701/0x8e2
[   98.954313]  ? __alloc_pages_nodemask+0x1ed/0x210
[   98.956242]  ? alloc_pages_vma+0x9f/0x220
[   98.957486]  ? __handle_mm_fault+0xc22/0x11e0
[   98.958673]  ? handle_mm_fault+0xc5/0x220
[   98.959766]  ? __do_page_fault+0x21e/0x4b0
[   98.960906]  ? do_page_fault+0x2b/0x70
[   98.961977]  ? page_fault+0x28/0x30
[   98.963861] Mem-Info:
[   98.965330] active_anon:372765 inactive_anon:2097 isolated_anon:0
[   98.965330]  active_file:182 inactive_file:214 isolated_file:32
[   98.965330]  unevictable:0 dirty:6 writeback:6 unstable:0
[   98.965330]  slab_reclaimable:2011 slab_unreclaimable:11291
[   98.965330]  mapped:623 shmem:2162 pagetables:8582 bounce:0
[   98.965330]  free:13278 free_pcp:117 free_cma:0
[   98.978473] Node 0 active_anon:1491060kB inactive_anon:8388kB active_file:728kB inactive_file:856kB unevictable:0kB isolated(anon):0kB isolated(file):128kB mapped:2492kB dirty:24kB writeback:24kB shmem:8648kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 1241088kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[   98.987555] Node 0 DMA free:7176kB min:408kB low:508kB high:608kB active_anon:8672kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:24kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[   98.998904] lowmem_reserve[]: 0 1696 1696 1696
[   99.001205] Node 0 DMA32 free:45936kB min:44644kB low:55804kB high:66964kB active_anon:1482048kB inactive_anon:8388kB active_file:232kB inactive_file:1000kB unevictable:0kB writepending:48kB present:2080640kB managed:1756232kB mlocked:0kB slab_reclaimable:8044kB slab_unreclaimable:45132kB kernel_stack:22128kB pagetables:34304kB bounce:0kB free_pcp:700kB local_pcp:0kB free_cma:0kB
[   99.009428] lowmem_reserve[]: 0 0 0 0
[   99.010816] Node 0 DMA: 0*4kB 1*8kB (U) 0*16kB 2*32kB (UM) 1*64kB (U) 1*128kB (U) 1*256kB (U) 1*512kB (M) 2*1024kB (UM) 0*2048kB 1*4096kB (M) = 7176kB
[   99.014262] Node 0 DMA32: 909*4kB (UE) 548*8kB (UME) 190*16kB (UME) 99*32kB (UME) 37*64kB (UME) 14*128kB (UME) 5*256kB (UME) 3*512kB (E) 2*1024kB (UM) 1*2048kB (M) 5*4096kB (M) = 45780kB
[   99.018848] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[   99.021288] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   99.023758] 2752 total pagecache pages
[   99.025196] 0 pages in swap cache
[   99.026538] Swap cache stats: add 0, delete 0, find 0/0
[   99.028521] Free swap  = 0kB
[   99.029923] Total swap = 0kB
[   99.031212] 524157 pages RAM
[   99.032458] 0 pages HighMem/MovableOnly
[   99.033812] 81123 pages reserved
[   99.035255] 0 pages cma reserved
[   99.036729] 0 pages hwpoisoned
[   99.037898] Out of memory: Kill process 2973 (a.out) score 999 or sacrifice child
[   99.039902] Killed process 2973 (a.out) total-vm:4172kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[   99.043953] oom_reaper: reaped process 2973 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  173.285686] sysrq: SysRq : Show State
(...snipped...)
[  173.899630] kswapd0         D    0    51      2 0x00000000
[  173.900935] Call Trace:
[  173.901706]  ? __schedule+0x1d2/0x5a0
[  173.902906]  ? schedule+0x2d/0x80
[  173.904034]  ? schedule_timeout+0x192/0x240
[  173.905437]  ? __down_common+0xc0/0x128
[  173.906549]  ? down+0x36/0x40
[  173.907433]  ? xfs_buf_lock+0x1d/0x40 [xfs]
[  173.908574]  ? _xfs_buf_find+0x2ad/0x580 [xfs]
[  173.909734]  ? xfs_buf_get_map+0x1d/0x140 [xfs]
[  173.910886]  ? xfs_buf_read_map+0x23/0xd0 [xfs]
[  173.912045]  ? xfs_trans_read_buf_map+0xe5/0x2f0 [xfs]
[  173.913381]  ? xfs_read_agf+0x8d/0x120 [xfs]
[  173.914725]  ? xfs_alloc_read_agf+0x39/0x130 [xfs]
[  173.916225]  ? xfs_alloc_fix_freelist+0x369/0x430 [xfs]
[  173.917491]  ? __radix_tree_lookup+0x80/0xf0
[  173.918593]  ? __radix_tree_lookup+0x80/0xf0
[  173.920091]  ? xfs_alloc_vextent+0x148/0x460 [xfs]
[  173.921549]  ? xfs_bmap_btalloc+0x45e/0x8a0 [xfs]
[  173.922728]  ? xfs_bmapi_write+0x768/0x1250 [xfs]
[  173.923904]  ? kmem_cache_alloc+0x11c/0x130
[  173.925030]  ? xfs_iomap_write_allocate+0x175/0x360 [xfs]
[  173.926592]  ? xfs_map_blocks+0x181/0x230 [xfs]
[  173.927854]  ? xfs_do_writepage+0x1db/0x630 [xfs]
[  173.929046]  ? xfs_setfilesize_trans_alloc.isra.26+0x35/0x80 [xfs]
[  173.930665]  ? xfs_vm_writepage+0x31/0x70 [xfs]
[  173.931915]  ? pageout.isra.47+0x188/0x280
[  173.933005]  ? shrink_page_list+0x79d/0xbb0
[  173.934138]  ? shrink_inactive_list+0x1c2/0x3d0
[  173.935609]  ? radix_tree_gang_lookup_tag+0xe3/0x160
[  173.937100]  ? shrink_node_memcg+0x33a/0x740
[  173.938335]  ? _cond_resched+0x10/0x20
[  173.939443]  ? _cond_resched+0x10/0x20
[  173.940470]  ? shrink_node+0xe0/0x320
[  173.941483]  ? kswapd+0x2b4/0x660
[  173.942424]  ? kthread+0xf2/0x130
[  173.943396]  ? mem_cgroup_shrink_node+0xb0/0xb0
[  173.944578]  ? kthread_park+0x60/0x60
[  173.945613]  ? ret_from_fork+0x26/0x40
(...snipped...)
[  195.183281] Showing busy workqueues and worker pools:
[  195.184626] workqueue events_freezable_power_: flags=0x84
[  195.186013]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  195.187596]     in-flight: 24:disk_events_workfn
[  195.188832] workqueue writeback: flags=0x4e
[  195.189919]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=1/256
[  195.191826]     in-flight: 370:wb_workfn
[  195.194105] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 129 63
[  195.195883] pool 256: cpus=0-127 flags=0x4 nice=0 hung=96s workers=31 idle: 371 369 368 367 366 365 364 363 362 361 360 359 358 357 356 355 354 353 352 351 350 349 348 347 346 249 253 5 53 372
[  243.365293] sysrq: SysRq : Resetting
----------------------------------------


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-03-10 11:44       ` Tetsuo Handa
@ 2017-03-21 10:37         ` Tetsuo Handa
  2017-04-23 10:24         ` Tetsuo Handa
  2017-06-30  0:14         ` Tetsuo Handa
  2 siblings, 0 replies; 41+ messages in thread
From: Tetsuo Handa @ 2017-03-21 10:37 UTC (permalink / raw)
  To: mhocko, hannes; +Cc: riel, akpm, mgorman, vbabka, linux-mm, linux-kernel

On 2017/03/10 20:44, Tetsuo Handa wrote:
> Michal Hocko wrote:
>> On Thu 09-03-17 13:05:40, Johannes Weiner wrote:
>>>> It may be OK, I just do not understand all the implications.
>>>>
>>>> I like the general direction your patch takes the code in,
>>>> but I would like to understand it better...
>>>
>>> I feel the same way. The throttling logic doesn't seem to be very well
>>> thought out at the moment, making it hard to reason about what happens
>>> in certain scenarios.
>>>
>>> In that sense, this patch isn't really an overall improvement to the
>>> way things work. It patches a hole that seems to be exploitable only
>>> from an artificial OOM torture test, at the risk of regressing high
>>> concurrency workloads that may or may not be artificial.
>>>
>>> Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
>>> behind this patch. Can we think about a general model to deal with
>>> allocation concurrency? 
>>
>> I am definitely not against. There is no reason to rush the patch in.
> 
> I am not in a hurry if we can use a watchdog to check whether this problem
> is occurring in the real world. I have to test corner cases because such a
> watchdog is missing.

Today I tested linux-next-20170321 with a not-so-insane stress load, and I again
hit this problem. Thus, I think this problem might occur in the real world.

http://I-love.SAKURA.ne.jp/tmp/serial-20170321.txt.xz (Logs up to before swapoff are eliminated.)
----------
[ 2250.175109] MemAlloc-Info: stalling=16 dying=0 exiting=4 victim=0 oom_count=1155386
[ 2257.535653] MemAlloc-Info: stalling=16 dying=0 exiting=4 victim=0 oom_count=1155386
[ 2319.806880] MemAlloc-Info: stalling=19 dying=0 exiting=4 victim=0 oom_count=1155386
[ 2320.722282] MemAlloc-Info: stalling=19 dying=0 exiting=4 victim=0 oom_count=1155386
[ 2381.243393] MemAlloc-Info: stalling=20 dying=0 exiting=4 victim=0 oom_count=1155386
[ 2389.777052] MemAlloc-Info: stalling=20 dying=0 exiting=4 victim=0 oom_count=1155386
[ 2450.878287] MemAlloc-Info: stalling=20 dying=0 exiting=4 victim=0 oom_count=1155386
[ 2459.386321] MemAlloc-Info: stalling=20 dying=0 exiting=4 victim=0 oom_count=1155386
[ 2520.500633] MemAlloc-Info: stalling=20 dying=0 exiting=4 victim=0 oom_count=1155386
[ 2529.042088] MemAlloc-Info: stalling=20 dying=0 exiting=4 victim=0 oom_count=1155386
----------


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-03-10 10:20     ` Michal Hocko
@ 2017-03-10 11:44       ` Tetsuo Handa
  2017-03-21 10:37         ` Tetsuo Handa
                           ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: Tetsuo Handa @ 2017-03-10 11:44 UTC (permalink / raw)
  To: mhocko, hannes; +Cc: riel, akpm, mgorman, vbabka, linux-mm, linux-kernel

Michal Hocko wrote:
> On Thu 09-03-17 13:05:40, Johannes Weiner wrote:
> > On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote:
> > > It only does this to some extent.  If reclaim made
> > > no progress, for example due to immediately bailing
> > > out because the number of already isolated pages is
> > > too high (due to many parallel reclaimers), the code
> > > could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
> > > test without ever looking at the number of reclaimable
> > > pages.
> > 
> > Hm, there is no early return there, actually. We bump the loop counter
> > every time it happens, but then *do* look at the reclaimable pages.
> > 
> > > Could that create problems if we have many concurrent
> > > reclaimers?
> > 
> > With increased concurrency, the likelihood of OOM will go up if we
> > remove the unlimited wait for isolated pages, that much is true.
> > 
> > I'm not sure that's a bad thing, however, because we want the OOM
> > killer to be predictable and timely. So a reasonable wait time in
> > between 0 and forever before an allocating thread gives up under
> > extreme concurrency makes sense to me.
> > 
> > > It may be OK, I just do not understand all the implications.
> > > 
> > > I like the general direction your patch takes the code in,
> > > but I would like to understand it better...
> > 
> > I feel the same way. The throttling logic doesn't seem to be very well
> > thought out at the moment, making it hard to reason about what happens
> > in certain scenarios.
> > 
> > In that sense, this patch isn't really an overall improvement to the
> > way things work. It patches a hole that seems to be exploitable only
> > from an artificial OOM torture test, at the risk of regressing high
> > concurrency workloads that may or may not be artificial.
> > 
> > Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
> > behind this patch. Can we think about a general model to deal with
> > allocation concurrency? 
> 
> I am definitely not against. There is no reason to rush the patch in.

I am not in a hurry if we can use a watchdog to check whether this problem
is occurring in the real world. I have to test corner cases because such a
watchdog is missing.

> My main point behind this patch was to reduce unbound loops from inside
> the reclaim path and push any throttling up the call chain to the
> page allocator path because I believe that it is easier to reason
> about them at that level. The direct reclaim should be as simple as
> possible without too many side effects otherwise we end up in a highly
> unpredictable behavior. This was a first step in that direction and my
> testing so far didn't show any regressions.
> 
> > Unlimited parallel direct reclaim is kinda
> > bonkers in the first place. How about checking for excessive isolation
> > counts from the page allocator and putting allocations on a waitqueue?
> 
> I would be interested in details here.

That will help implementing __GFP_KILLABLE.
https://bugzilla.kernel.org/show_bug.cgi?id=192981#c15


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-03-09 22:18     ` Rik van Riel
@ 2017-03-10 10:27       ` Michal Hocko
  0 siblings, 0 replies; 41+ messages in thread
From: Michal Hocko @ 2017-03-10 10:27 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Johannes Weiner, Andrew Morton, Mel Gorman, Vlastimil Babka,
	Tetsuo Handa, linux-mm, LKML

On Thu 09-03-17 17:18:00, Rik van Riel wrote:
> On Thu, 2017-03-09 at 13:05 -0500, Johannes Weiner wrote:
> > On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote:
> > > 
> > > It only does this to some extent.  If reclaim made
> > > no progress, for example due to immediately bailing
> > > out because the number of already isolated pages is
> > > too high (due to many parallel reclaimers), the code
> > > could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
> > > test without ever looking at the number of reclaimable
> > > pages.
> > Hm, there is no early return there, actually. We bump the loop
> > counter
> > every time it happens, but then *do* look at the reclaimable pages.
> 
> Am I looking at an old tree?  I see this code
> before we look at the reclaimable pages.
> 
>         /*
>          * Make sure we converge to OOM if we cannot make any progress
>          * several times in the row.
>          */
>         if (*no_progress_loops > MAX_RECLAIM_RETRIES) {
>                 /* Before OOM, exhaust highatomic_reserve */
>                 return unreserve_highatomic_pageblock(ac, true);
>         }

I believe that Johannes meant cases where we do not exhaust all the
reclaim retries and fail early because there are no reclaimable pages
during the watermark check.
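
Roughly, the check in question boils down to something like this (a
simplified sketch only, not the verbatim should_reclaim_retry() code):
the retry is kept alive as long as reclaiming everything that is still
on the LRUs could push the zone back over its min watermark:

        /* simplified sketch; the real code has more details */
        unsigned long reclaimable = zone_reclaimable_pages(zone);
        unsigned long available = reclaimable +
                        zone_page_state_snapshot(zone, NR_FREE_PAGES);

        if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
                                ac_classzone_idx(ac), alloc_flags, available))
                return true;    /* keep retrying the allocation */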

> > > Could that create problems if we have many concurrent
> > > reclaimers?
> > With increased concurrency, the likelihood of OOM will go up if we
> > remove the unlimited wait for isolated pages, that much is true.
> > 
> > I'm not sure that's a bad thing, however, because we want the OOM
> > killer to be predictable and timely. So a reasonable wait time in
> > between 0 and forever before an allocating thread gives up under
> > extreme concurrency makes sense to me.
> 
> That is a fair point, a faster OOM kill is preferable
> to a system that is livelocked.
> 
> > Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
> > behind this patch. Can we think about a general model to deal with
> > allocation concurrency? Unlimited parallel direct reclaim is kinda
> > bonkers in the first place. How about checking for excessive
> > isolation
> > counts from the page allocator and putting allocations on a
> > waitqueue?
> 
> The (limited) number of reclaimers can still do a
> relatively fast OOM kill, if none of them manage
> to make progress.

well, we can estimate how much memory those relatively few reclaimers
can isolate and try to reclaim. Even if we have hundreds of them, which
already sounds like a large number to me, we are at 100*SWAP_CLUSTER_MAX
pages, which is not all that much. And we are effectively OOM if there is
no other reclaimable memory left. All we need is to put some upper
bound in place. We already have throttle_direct_reclaim but it doesn't
really throttle the maximum number of reclaimers.
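
For a rough sense of scale, assuming SWAP_CLUSTER_MAX of 32 pages and
4kB pages (the usual values):

        100 reclaimers * 32 pages * 4kB = 12800kB, i.e. ~12.5MB

so even a few hundred direct reclaimers can only keep a relatively
small amount of memory isolated at any given time.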
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-03-09 18:05   ` Johannes Weiner
  2017-03-09 22:18     ` Rik van Riel
@ 2017-03-10 10:20     ` Michal Hocko
  2017-03-10 11:44       ` Tetsuo Handa
  1 sibling, 1 reply; 41+ messages in thread
From: Michal Hocko @ 2017-03-10 10:20 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Rik van Riel, Andrew Morton, Mel Gorman, Vlastimil Babka,
	Tetsuo Handa, linux-mm, LKML

On Thu 09-03-17 13:05:40, Johannes Weiner wrote:
> On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote:
> > It only does this to some extent.  If reclaim made
> > no progress, for example due to immediately bailing
> > out because the number of already isolated pages is
> > too high (due to many parallel reclaimers), the code
> > could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
> > test without ever looking at the number of reclaimable
> > pages.
> 
> Hm, there is no early return there, actually. We bump the loop counter
> every time it happens, but then *do* look at the reclaimable pages.
> 
> > Could that create problems if we have many concurrent
> > reclaimers?
> 
> With increased concurrency, the likelihood of OOM will go up if we
> remove the unlimited wait for isolated pages, that much is true.
> 
> I'm not sure that's a bad thing, however, because we want the OOM
> killer to be predictable and timely. So a reasonable wait time in
> between 0 and forever before an allocating thread gives up under
> extreme concurrency makes sense to me.
> 
> > It may be OK, I just do not understand all the implications.
> > 
> > I like the general direction your patch takes the code in,
> > but I would like to understand it better...
> 
> I feel the same way. The throttling logic doesn't seem to be very well
> thought out at the moment, making it hard to reason about what happens
> in certain scenarios.
> 
> In that sense, this patch isn't really an overall improvement to the
> way things work. It patches a hole that seems to be exploitable only
> from an artificial OOM torture test, at the risk of regressing high
> concurrency workloads that may or may not be artificial.
> 
> Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
> behind this patch. Can we think about a general model to deal with
> allocation concurrency? 

I am definitely not against. There is no reason to rush the patch in.
My main point behind this patch was to reduce unbound loops from inside
the reclaim path and push any throttling up the call chain to the
page allocator path because I believe that it is easier to reason
about them at that level. The direct reclaim should be as simple as
possible without too many side effects otherwise we end up in a highly
unpredictable behavior. This was a first step in that direction and my
testing so far didn't show any regressions.

> Unlimited parallel direct reclaim is kinda
> bonkers in the first place. How about checking for excessive isolation
> counts from the page allocator and putting allocations on a waitqueue?

I would be interested in details here.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-03-09 18:05   ` Johannes Weiner
@ 2017-03-09 22:18     ` Rik van Riel
  2017-03-10 10:27       ` Michal Hocko
  2017-03-10 10:20     ` Michal Hocko
  1 sibling, 1 reply; 41+ messages in thread
From: Rik van Riel @ 2017-03-09 22:18 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Andrew Morton, Mel Gorman, Vlastimil Babka,
	Tetsuo Handa, linux-mm, LKML, Michal Hocko

[-- Attachment #1: Type: text/plain, Size: 2423 bytes --]

On Thu, 2017-03-09 at 13:05 -0500, Johannes Weiner wrote:
> On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote:
> > 
> > It only does this to some extent.  If reclaim made
> > no progress, for example due to immediately bailing
> > out because the number of already isolated pages is
> > too high (due to many parallel reclaimers), the code
> > could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
> > test without ever looking at the number of reclaimable
> > pages.
> Hm, there is no early return there, actually. We bump the loop
> counter
> every time it happens, but then *do* look at the reclaimable pages.

Am I looking at an old tree?  I see this code
before we look at the reclaimable pages.

        /*
         * Make sure we converge to OOM if we cannot make any progress
         * several times in the row.
         */
        if (*no_progress_loops > MAX_RECLAIM_RETRIES) {
                /* Before OOM, exhaust highatomic_reserve */
                return unreserve_highatomic_pageblock(ac, true);
        }

> > Could that create problems if we have many concurrent
> > reclaimers?
> With increased concurrency, the likelihood of OOM will go up if we
> remove the unlimited wait for isolated pages, that much is true.
> 
> I'm not sure that's a bad thing, however, because we want the OOM
> killer to be predictable and timely. So a reasonable wait time in
> between 0 and forever before an allocating thread gives up under
> extreme concurrency makes sense to me.

That is a fair point, a faster OOM kill is preferable
to a system that is livelocked.

> Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
> behind this patch. Can we think about a general model to deal with
> allocation concurrency? Unlimited parallel direct reclaim is kinda
> bonkers in the first place. How about checking for excessive
> isolation
> counts from the page allocator and putting allocations on a
> waitqueue?

The (limited) number of reclaimers can still do a
relatively fast OOM kill, if none of them manage
to make progress.

That should avoid the potential issue you and I
both pointed out, and, as a bonus, it might actually
be faster than letting all the tasks in the system
into the direct reclaim code simultaneously.

-- 
All rights reversed

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-03-07 19:52 ` Rik van Riel
  2017-03-08  9:21   ` Michal Hocko
@ 2017-03-09 18:05   ` Johannes Weiner
  2017-03-09 22:18     ` Rik van Riel
  2017-03-10 10:20     ` Michal Hocko
  1 sibling, 2 replies; 41+ messages in thread
From: Johannes Weiner @ 2017-03-09 18:05 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Michal Hocko, Andrew Morton, Mel Gorman, Vlastimil Babka,
	Tetsuo Handa, linux-mm, LKML, Michal Hocko

On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote:
> It only does this to some extent.  If reclaim made
> no progress, for example due to immediately bailing
> out because the number of already isolated pages is
> too high (due to many parallel reclaimers), the code
> could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
> test without ever looking at the number of reclaimable
> pages.

Hm, there is no early return there, actually. We bump the loop counter
every time it happens, but then *do* look at the reclaimable pages.

> Could that create problems if we have many concurrent
> reclaimers?

With increased concurrency, the likelihood of OOM will go up if we
remove the unlimited wait for isolated pages, that much is true.

I'm not sure that's a bad thing, however, because we want the OOM
killer to be predictable and timely. So a reasonable wait time in
between 0 and forever before an allocating thread gives up under
extreme concurrency makes sense to me.

> It may be OK, I just do not understand all the implications.
> 
> I like the general direction your patch takes the code in,
> but I would like to understand it better...

I feel the same way. The throttling logic doesn't seem to be very well
thought out at the moment, making it hard to reason about what happens
in certain scenarios.

In that sense, this patch isn't really an overall improvement to the
way things work. It patches a hole that seems to be exploitable only
from an artificial OOM torture test, at the risk of regressing high
concurrency workloads that may or may not be artificial.

Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
behind this patch. Can we think about a general model to deal with
allocation concurrency? Unlimited parallel direct reclaim is kinda
bonkers in the first place. How about checking for excessive isolation
counts from the page allocator and putting allocations on a waitqueue?
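
Just to make the idea concrete, one possible shape for it, purely as an
illustrative sketch with made-up names (nothing like this exists in the
tree): the allocator slow path would sleep on a per-node waitqueue while
the isolated counts are excessive, and reclaimers would wake it up as
they put pages back on the LRUs:

        /* hypothetical sketch only; isolated_wait/isolation_limit are made up */
        static void wait_for_isolation(pg_data_t *pgdat)
        {
                wait_event_killable(pgdat->isolated_wait,
                        node_page_state(pgdat, NR_ISOLATED_FILE) +
                        node_page_state(pgdat, NR_ISOLATED_ANON) <
                        isolation_limit(pgdat));
        }

        /* reclaimers would pair this with putting pages back */
        static void isolation_done(pg_data_t *pgdat)
        {
                if (waitqueue_active(&pgdat->isolated_wait))
                        wake_up(&pgdat->isolated_wait);
        }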


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-03-09 14:16         ` Rik van Riel
@ 2017-03-09 14:59           ` Michal Hocko
  0 siblings, 0 replies; 41+ messages in thread
From: Michal Hocko @ 2017-03-09 14:59 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Vlastimil Babka,
	Tetsuo Handa, linux-mm, LKML

On Thu 09-03-17 09:16:25, Rik van Riel wrote:
> On Thu, 2017-03-09 at 10:12 +0100, Michal Hocko wrote:
> > On Wed 08-03-17 10:54:57, Rik van Riel wrote:
> 
> > > In fact, false OOM kills with that kind of workload is
> > > how we ended up getting the "too many isolated" logic
> > > in the first place.
> > Right, but the retry logic was considerably different than what we
> > have these days. should_reclaim_retry considers amount of reclaimable
> > memory. As I've said earlier if we see a report where the oom hits
> > prematurely with many NR_ISOLATED* we know how to fix that.
> 
> Would it be enough to simply reset no_progress_loops
> in this check inside should_reclaim_retry, if we know
> pageout IO is pending?
> 
>                         if (!did_some_progress) {
>                                 unsigned long write_pending;
> 
>                                 write_pending = zone_page_state_snapshot(zone,
>                                                         NR_ZONE_WRITE_PENDING);
> 
>                                 if (2 * write_pending > reclaimable) {
>                                         congestion_wait(BLK_RW_ASYNC, HZ/10);
>                                         return true;
>                                 }
>                         }

I am not really sure what problem we are trying to solve right now, to be
honest. I would prefer to keep the logic simpler rather than over-engineer
something that is not even needed.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-03-07 13:30 Michal Hocko
  2017-03-07 19:52 ` Rik van Riel
@ 2017-03-09 14:31 ` Mel Gorman
  1 sibling, 0 replies; 41+ messages in thread
From: Mel Gorman @ 2017-03-09 14:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, Vlastimil Babka, Tetsuo Handa,
	Rik van Riel, linux-mm, LKML, Michal Hocko

On Tue, Mar 07, 2017 at 02:30:57PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Tetsuo Handa has reported [1][2] that direct reclaimers might get stuck
> in too_many_isolated loop basically for ever because the last few pages
> on the LRU lists are isolated by the kswapd which is stuck on fs locks
> when doing the pageout or slab reclaim. This in turn means that there is
> nobody to actually trigger the oom killer and the system is basically
> unusable.
> 
> too_many_isolated has been introduced by 35cd78156c49 ("vmscan: throttle
> direct reclaim when too many pages are isolated already") to prevent
> from pre-mature oom killer invocations because back then no reclaim
> progress could indeed trigger the OOM killer too early. But since the
> oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection")
> the allocation/reclaim retry loop considers all the reclaimable pages
> and throttles the allocation at that layer so we can loosen the direct
> reclaim throttling.
> 
> Make shrink_inactive_list loop over too_many_isolated bounded and returns
> immediately when the situation hasn't resolved after the first sleep.
> Replace congestion_wait by a simple schedule_timeout_interruptible because
> we are not really waiting on the IO congestion in this path.
> 
> Please note that this patch can theoretically cause the OOM killer to
> trigger earlier while there are many pages isolated for the reclaim
> which makes progress only very slowly. This would be obvious from the oom
> report as the number of isolated pages are printed there. If we ever hit
> this should_reclaim_retry should consider those numbers in the evaluation
> in one way or another.
> 
> [1] http://lkml.kernel.org/r/201602092349.ACG81273.OSVtMJQHLOFOFF@I-love.SAKURA.ne.jp
> [2] http://lkml.kernel.org/r/201702212335.DJB30777.JOFMHSFtVLQOOF@I-love.SAKURA.ne.jp
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-03-09  9:12       ` Michal Hocko
@ 2017-03-09 14:16         ` Rik van Riel
  2017-03-09 14:59           ` Michal Hocko
  0 siblings, 1 reply; 41+ messages in thread
From: Rik van Riel @ 2017-03-09 14:16 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Vlastimil Babka,
	Tetsuo Handa, linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 1509 bytes --]

On Thu, 2017-03-09 at 10:12 +0100, Michal Hocko wrote:
> On Wed 08-03-17 10:54:57, Rik van Riel wrote:

> > In fact, false OOM kills with that kind of workload is
> > how we ended up getting the "too many isolated" logic
> > in the first place.
> Right, but the retry logic was considerably different than what we
> have these days. should_reclaim_retry considers amount of reclaimable
> memory. As I've said earlier if we see a report where the oom hits
> prematurely with many NR_ISOLATED* we know how to fix that.

Would it be enough to simply reset no_progress_loops
in this check inside should_reclaim_retry, if we know
pageout IO is pending?

                        if (!did_some_progress) {
                                unsigned long write_pending;

                                write_pending = zone_page_state_snapshot(zone,
                                                        NR_ZONE_WRITE_PENDING);

                                if (2 * write_pending > reclaimable) {
                                        congestion_wait(BLK_RW_ASYNC, HZ/10);
                                        return true;
                                }
                        }

-- 
All rights reversed

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-03-08 15:54     ` Rik van Riel
@ 2017-03-09  9:12       ` Michal Hocko
  2017-03-09 14:16         ` Rik van Riel
  0 siblings, 1 reply; 41+ messages in thread
From: Michal Hocko @ 2017-03-09  9:12 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Vlastimil Babka,
	Tetsuo Handa, linux-mm, LKML

On Wed 08-03-17 10:54:57, Rik van Riel wrote:
> On Wed, 2017-03-08 at 10:21 +0100, Michal Hocko wrote:
> 
> > > Could that create problems if we have many concurrent
> > > reclaimers?
> > 
> > As the changelog mentions it might cause a premature oom killer
> > invocation theoretically. We could easily see that from the oom
> > report
> > by checking isolated counters. My testing didn't trigger that though
> > and I was hammering the page allocator path from many threads.
> > 
> > I suspect some artificial tests can trigger that, I am not so sure
> > about
> reasonable workloads. If we see this happening though then the fix
> > would
> > be to resurrect my previous attempt to track NR_ISOLATED* per zone
> > and
> > use them in the allocator retry logic.
> 
> I am not sure the workload in question is "artificial".
> A heavily forking (or multi-threaded) server running out
> of physical memory could easily get hundreds of tasks
> doing direct reclaim simultaneously.

Yes, some of my OOM tests (forking many short-lived processes while there
is strong memory pressure and a lot of IO going on) are doing this and
I haven't hit a premature OOM yet. It is hard to tune those tests to sit
just short of OOM without actually hitting it, though. Usually you either
find a steady state or really run out of memory.

> In fact, false OOM kills with that kind of workload is
> how we ended up getting the "too many isolated" logic
> in the first place.

Right, but the retry logic was considerably different than what we
have these days. should_reclaim_retry considers amount of reclaimable
memory. As I've said earlier if we see a report where the oom hits
prematurely with many NR_ISOLATED* we know how to fix that.

> I am perfectly fine with moving the retry logic up like
> you did, but think it may make sense to check the number
> of reclaimable pages if we have too many isolated pages,
> instead of risking a too-early OOM kill.

Actually that was my initial attempt, but for that we would need per-zone
NR_ISOLATED* counters. Mel was against that and wanted to start with the
simpler approach if it works reasonably well, which it seems to do from
my experience so far (but reality can surprise, as I've seen so many
times already).
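
For reference, the direction such a fix would likely take, as a very
rough sketch only (nothing like this is in the tree, and the NR_ISOLATED*
counters are only tracked per node today, which is why the per-zone
variant would need new counters): let the retry logic treat isolated
pages as still potentially reclaimable, so that heavy parallel isolation
alone cannot make the system look like it is out of reclaimable memory:

        /* rough sketch only, not existing kernel code */
        unsigned long isolated;

        isolated = node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE) +
                   node_page_state(zone->zone_pgdat, NR_ISOLATED_ANON);
        available = reclaimable + isolated +
                    zone_page_state_snapshot(zone, NR_FREE_PAGES);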
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-03-08  9:21   ` Michal Hocko
@ 2017-03-08 15:54     ` Rik van Riel
  2017-03-09  9:12       ` Michal Hocko
  0 siblings, 1 reply; 41+ messages in thread
From: Rik van Riel @ 2017-03-08 15:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Vlastimil Babka,
	Tetsuo Handa, linux-mm, LKML

On Wed, 2017-03-08 at 10:21 +0100, Michal Hocko wrote:

> > Could that create problems if we have many concurrent
> > reclaimers?
> 
> As the changelog mentions it might cause a premature oom killer
> invocation theoretically. We could easily see that from the oom
> report
> by checking isolated counters. My testing didn't trigger that though
> and I was hammering the page allocator path from many threads.
> 
> I suspect some artificial tests can trigger that, I am not so sure
> about
> reasonable workloads. If we see this happening though then the fix
> would
> be to resurrect my previous attempt to track NR_ISOLATED* per zone
> and
> use them in the allocator retry logic.

I am not sure the workload in question is "artificial".
A heavily forking (or multi-threaded) server running out
of physical memory could easily get hundreds of tasks
doing direct reclaim simultaneously.

In fact, false OOM kills with that kind of workload is
how we ended up getting the "too many isolated" logic
in the first place.

I am perfectly fine with moving the retry logic up like
you did, but think it may make sense to check the number
of reclaimable pages if we have too many isolated pages,
instead of risking a too-early OOM kill.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-03-07 19:52 ` Rik van Riel
@ 2017-03-08  9:21   ` Michal Hocko
  2017-03-08 15:54     ` Rik van Riel
  2017-03-09 18:05   ` Johannes Weiner
  1 sibling, 1 reply; 41+ messages in thread
From: Michal Hocko @ 2017-03-08  9:21 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Vlastimil Babka,
	Tetsuo Handa, linux-mm, LKML

On Tue 07-03-17 14:52:36, Rik van Riel wrote:
> On Tue, 2017-03-07 at 14:30 +0100, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > Tetsuo Handa has reported [1][2] that direct reclaimers might get
> > stuck
> > in too_many_isolated loop basically for ever because the last few
> > pages
> > on the LRU lists are isolated by the kswapd which is stuck on fs
> > locks
> > when doing the pageout or slab reclaim. This in turn means that there
> > is
> > nobody to actually trigger the oom killer and the system is basically
> > unusable.
> > 
> > too_many_isolated has been introduced by 35cd78156c49 ("vmscan:
> > throttle
> > direct reclaim when too many pages are isolated already") to prevent
> > from pre-mature oom killer invocations because back then no reclaim
> > progress could indeed trigger the OOM killer too early. But since the
> > oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection")
> > the allocation/reclaim retry loop considers all the reclaimable pages
> > and throttles the allocation at that layer so we can loosen the
> > direct
> > reclaim throttling.
> 
> It only does this to some extent.  If reclaim made
> no progress, for example due to immediately bailing
> out because the number of already isolated pages is
> too high (due to many parallel reclaimers), the code
> could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
> test without ever looking at the number of reclaimable
> pages.
> 
> Could that create problems if we have many concurrent
> reclaimers?

As the changelog mentions it might cause a premature oom killer
invocation theoretically. We could easily see that from the oom report
by checking isolated counters. My testing didn't trigger that though
and I was hammering the page allocator path from many threads.

I suspect some artificial tests can trigger that, I am not so sure about
reasonable workloads. If we see this happening though then the fix would
be to resurrect my previous attempt to track NR_ISOLATED* per zone and
use them in the allocator retry logic.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
  2017-03-07 13:30 Michal Hocko
@ 2017-03-07 19:52 ` Rik van Riel
  2017-03-08  9:21   ` Michal Hocko
  2017-03-09 18:05   ` Johannes Weiner
  2017-03-09 14:31 ` Mel Gorman
  1 sibling, 2 replies; 41+ messages in thread
From: Rik van Riel @ 2017-03-07 19:52 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Mel Gorman, Johannes Weiner, Vlastimil Babka, Tetsuo Handa,
	linux-mm, LKML, Michal Hocko

[-- Attachment #1: Type: text/plain, Size: 1646 bytes --]

On Tue, 2017-03-07 at 14:30 +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Tetsuo Handa has reported [1][2] that direct reclaimers might get
> stuck
> in too_many_isolated loop basically for ever because the last few
> pages
> on the LRU lists are isolated by the kswapd which is stuck on fs
> locks
> when doing the pageout or slab reclaim. This in turn means that there
> is
> nobody to actually trigger the oom killer and the system is basically
> unusable.
> 
> too_many_isolated has been introduced by 35cd78156c49 ("vmscan:
> throttle
> direct reclaim when too many pages are isolated already") to prevent
> from pre-mature oom killer invocations because back then no reclaim
> progress could indeed trigger the OOM killer too early. But since the
> oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection")
> the allocation/reclaim retry loop considers all the reclaimable pages
> and throttles the allocation at that layer so we can loosen the
> direct
> reclaim throttling.

It only does this to some extent.  If reclaim made
no progress, for example due to immediately bailing
out because the number of already isolated pages is
too high (due to many parallel reclaimers), the code
could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
test without ever looking at the number of reclaimable
pages.

Could that create problems if we have many concurrent
reclaimers?

It may be OK, I just do not understand all the implications.

I like the general direction your patch takes the code in,
but I would like to understand it better...

-- 
All rights reversed

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
@ 2017-03-07 13:30 Michal Hocko
  2017-03-07 19:52 ` Rik van Riel
  2017-03-09 14:31 ` Mel Gorman
  0 siblings, 2 replies; 41+ messages in thread
From: Michal Hocko @ 2017-03-07 13:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Johannes Weiner, Vlastimil Babka, Tetsuo Handa,
	Rik van Riel, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Tetsuo Handa has reported [1][2] that direct reclaimers might get stuck
in too_many_isolated loop basically for ever because the last few pages
on the LRU lists are isolated by the kswapd which is stuck on fs locks
when doing the pageout or slab reclaim. This in turn means that there is
nobody to actually trigger the oom killer and the system is basically
unusable.

too_many_isolated has been introduced by 35cd78156c49 ("vmscan: throttle
direct reclaim when too many pages are isolated already") to prevent
from pre-mature oom killer invocations because back then no reclaim
progress could indeed trigger the OOM killer too early. But since the
oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection")
the allocation/reclaim retry loop considers all the reclaimable pages
and throttles the allocation at that layer so we can loosen the direct
reclaim throttling.

Make shrink_inactive_list loop over too_many_isolated bounded and returns
immediately when the situation hasn't resolved after the first sleep.
Replace congestion_wait by a simple schedule_timeout_interruptible because
we are not really waiting on the IO congestion in this path.

Please note that this patch can theoretically cause the OOM killer to
trigger earlier while there are many pages isolated for the reclaim
which makes progress only very slowly. This would be obvious from the oom
report as the number of isolated pages are printed there. If we ever hit
this should_reclaim_retry should consider those numbers in the evaluation
in one way or another.

[1] http://lkml.kernel.org/r/201602092349.ACG81273.OSVtMJQHLOFOFF@I-love.SAKURA.ne.jp
[2] http://lkml.kernel.org/r/201702212335.DJB30777.JOFMHSFtVLQOOF@I-love.SAKURA.ne.jp

Signed-off-by: Michal Hocko <mhocko@suse.com>
---

Hi,
Tetsuo helped to test this patch [3] and couldn't reproduce the hang
inside the page allocator anymore. Thanks! He was able to see a
different lockup though. This time this is more related to how XFS is
doing the inode reclaim from the WQ context. This is being discussed [4]
and I believe it is unrelated to this change.

I believe this change is still an improvement because it reduces chances
of an unbound loop inside the reclaim path so we have a) more reliable
detection of the lockup from the allocator path and b) more deterministic
retry loop logic.

Thoughts/complaints/suggestions?

[3] http://lkml.kernel.org/r/201702261530.JDD56292.OFOLFHQtVMJSOF@I-love.SAKURA.ne.jp
[4] http://lkml.kernel.org/r/20170303133950.GD31582@dhcp22.suse.cz

 mm/vmscan.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c15b2e4c47ca..4ae069060ae5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1713,9 +1713,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	int file = is_file_lru(lru);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+	bool stalled = false;
 
 	while (unlikely(too_many_isolated(pgdat, file, sc))) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+		if (stalled)
+			return 0;
+
+		/* wait a bit for the reclaimer. */
+		schedule_timeout_interruptible(HZ/10);
+		stalled = true;
 
 		/* We are about to die and free our memory. Return now. */
 		if (fatal_signal_pending(current))
-- 
2.11.0
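
For reference, the function whose loop is being bounded above looks
roughly like this (a simplified sketch; the real mm/vmscan.c code also
lets kswapd and legacy memcg reclaim bypass the check):

        static int too_many_isolated(struct pglist_data *pgdat, int file,
                                     struct scan_control *sc)
        {
                unsigned long inactive, isolated;

                if (file) {
                        inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
                        isolated = node_page_state(pgdat, NR_ISOLATED_FILE);
                } else {
                        inactive = node_page_state(pgdat, NR_INACTIVE_ANON);
                        isolated = node_page_state(pgdat, NR_ISOLATED_ANON);
                }

                /*
                 * GFP_NOIO/GFP_NOFS callers are allowed to isolate more
                 * pages before they get throttled here.
                 */
                if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
                        inactive >>= 3;

                return isolated > inactive;
        }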


^ permalink raw reply related	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2017-07-24 11:12 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-10  7:48 [PATCH] mm, vmscan: do not loop on too_many_isolated for ever Michal Hocko
2017-07-10 13:16 ` Vlastimil Babka
2017-07-10 13:58 ` Rik van Riel
2017-07-10 16:58   ` Johannes Weiner
2017-07-10 17:09     ` Michal Hocko
2017-07-19 22:20 ` Andrew Morton
2017-07-20  6:56   ` Michal Hocko
2017-07-21 23:01     ` Andrew Morton
2017-07-24  6:50       ` Michal Hocko
2017-07-20  1:54 ` Hugh Dickins
2017-07-20 10:44   ` Tetsuo Handa
2017-07-24  7:01     ` Hugh Dickins
2017-07-24 11:12       ` Tetsuo Handa
2017-07-20 13:22   ` Michal Hocko
2017-07-24  7:03     ` Hugh Dickins
  -- strict thread matches above, loose matches on Subject: below --
2017-03-07 13:30 Michal Hocko
2017-03-07 19:52 ` Rik van Riel
2017-03-08  9:21   ` Michal Hocko
2017-03-08 15:54     ` Rik van Riel
2017-03-09  9:12       ` Michal Hocko
2017-03-09 14:16         ` Rik van Riel
2017-03-09 14:59           ` Michal Hocko
2017-03-09 18:05   ` Johannes Weiner
2017-03-09 22:18     ` Rik van Riel
2017-03-10 10:27       ` Michal Hocko
2017-03-10 10:20     ` Michal Hocko
2017-03-10 11:44       ` Tetsuo Handa
2017-03-21 10:37         ` Tetsuo Handa
2017-04-23 10:24         ` Tetsuo Handa
2017-04-24 12:39           ` Stanislaw Gruszka
2017-04-24 13:06             ` Tetsuo Handa
2017-04-25  6:33               ` Stanislaw Gruszka
2017-06-30  0:14         ` Tetsuo Handa
2017-06-30 13:32           ` Michal Hocko
2017-06-30 15:59             ` Tetsuo Handa
2017-06-30 16:19               ` Michal Hocko
2017-07-01 11:43                 ` Tetsuo Handa
2017-07-05  8:19                   ` Michal Hocko
2017-07-05  8:20                   ` Michal Hocko
2017-07-06 10:48                     ` Tetsuo Handa
2017-03-09 14:31 ` Mel Gorman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).