linux-mm.kvack.org archive mirror
* [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
@ 2017-03-07 13:30 Michal Hocko
  2017-03-07 19:52 ` Rik van Riel
  2017-03-09 14:31 ` Mel Gorman
  0 siblings, 2 replies; 41+ messages in thread
From: Michal Hocko @ 2017-03-07 13:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Johannes Weiner, Vlastimil Babka, Tetsuo Handa,
	Rik van Riel, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Tetsuo Handa has reported [1][2] that direct reclaimers might get stuck
in the too_many_isolated loop basically forever, because the last few
pages on the LRU lists are isolated by kswapd, which is itself stuck on
fs locks while doing pageout or slab reclaim. This in turn means that
there is nobody left to actually trigger the OOM killer and the system
is basically unusable.
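
For context, the loop in question looks roughly like this before the
patch (abridged from shrink_inactive_list() in mm/vmscan.c of this
era):

	while (unlikely(too_many_isolated(pgdat, file, sc))) {
		/* kswapd is expected to put isolated pages back; if it
		 * is itself blocked, nothing ever decrements the
		 * isolated counters and this loop never terminates. */
		congestion_wait(BLK_RW_ASYNC, HZ/10);

		/* We are about to die and free our memory. Return now. */
		if (fatal_signal_pending(current))
			return SWAP_CLUSTER_MAX;
	}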

too_many_isolated was introduced by commit 35cd78156c49 ("vmscan:
throttle direct reclaim when too many pages are isolated already") to
prevent premature OOM killer invocations, because back then a lack of
reclaim progress could indeed trigger the OOM killer too early. But
since the OOM detection rework in commit 0a0337e0d1d1 ("mm, oom: rework
oom detection"), the allocation/reclaim retry loop considers all the
reclaimable pages and throttles the allocation at that layer, so we can
loosen the direct reclaim throttling.
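
For reference, the check itself compares the per-node isolated counters
against the size of the inactive list; roughly (simplified from
mm/vmscan.c of this period, omitting the GFP and memcg special cases):

	static int too_many_isolated(struct pglist_data *pgdat, int file,
			struct scan_control *sc)
	{
		unsigned long inactive, isolated;

		/* kswapd itself is never throttled here. */
		if (current_is_kswapd())
			return 0;

		if (file) {
			inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
			isolated = node_page_state(pgdat, NR_ISOLATED_FILE);
		} else {
			inactive = node_page_state(pgdat, NR_INACTIVE_ANON);
			isolated = node_page_state(pgdat, NR_ISOLATED_ANON);
		}

		/* Throttle once isolated pages outnumber inactive ones. */
		return isolated > inactive;
	}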

Make the shrink_inactive_list loop over too_many_isolated bounded:
return immediately once the situation hasn't resolved after the first
sleep. Replace congestion_wait with a simple
schedule_timeout_interruptible, because we are not really waiting on IO
congestion in this path.
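
The two sleeps have the same nominal length; the difference is what the
task appears to be waiting on:

	/* Before: an uninterruptible sleep on the block layer's async
	 * congestion waitqueue, although this path issued no IO. */
	congestion_wait(BLK_RW_ASYNC, HZ/10);

	/* After: a plain interruptible sleep of the same ~100ms. */
	schedule_timeout_interruptible(HZ/10);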

Please note that this patch can theoretically cause the OOM killer to
trigger earlier while there are many pages isolated for a reclaim that
is making progress only very slowly. This would be obvious from the OOM
report, as the number of isolated pages is printed there. If we ever hit
this, should_reclaim_retry should consider those numbers in its
evaluation in one way or another.
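
One hypothetical way to do that (purely illustrative, not part of this
patch) would be to fold the isolated counters into the estimate of
still-reclaimable memory in should_reclaim_retry:

	/* Hypothetical: treat pages currently isolated from the node's
	 * LRUs as still reclaimable when deciding whether to retry, so
	 * slow-but-live reclaim postpones the OOM verdict. */
	available += node_page_state(zone->zone_pgdat, NR_ISOLATED_ANON) +
		     node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE);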

[1] http://lkml.kernel.org/r/201602092349.ACG81273.OSVtMJQHLOFOFF@I-love.SAKURA.ne.jp
[2] http://lkml.kernel.org/r/201702212335.DJB30777.JOFMHSFtVLQOOF@I-love.SAKURA.ne.jp

Signed-off-by: Michal Hocko <mhocko@suse.com>
---

Hi,
Tetsuo helped to test this patch [3] and couldn't reproduce the hang
inside the page allocator anymore. Thanks! He was able to see a
different lockup, though. That one is more related to how XFS does
inode reclaim from the WQ context; it is being discussed [4] and I
believe it is unrelated to this change.

I believe this change is still an improvement because it reduces the
chances of an unbounded loop inside the reclaim path, giving us a) more
reliable detection of the lockup from the allocator path and b) more
deterministic retry loop logic.

Thoughts/complaints/suggestions?

[3] http://lkml.kernel.org/r/201702261530.JDD56292.OFOLFHQtVMJSOF@I-love.SAKURA.ne.jp
[4] http://lkml.kernel.org/r/20170303133950.GD31582@dhcp22.suse.cz

 mm/vmscan.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c15b2e4c47ca..4ae069060ae5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1713,9 +1713,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	int file = is_file_lru(lru);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+	bool stalled = false;
 
 	while (unlikely(too_many_isolated(pgdat, file, sc))) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+		if (stalled)
+			return 0;
+
+		/* wait a bit for the reclaimer. */
+		schedule_timeout_interruptible(HZ/10);
+		stalled = true;
 
 		/* We are about to die and free our memory. Return now. */
 		if (fatal_signal_pending(current))
-- 
2.11.0


* [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
@ 2017-07-10  7:48 Michal Hocko
  2017-07-10 13:16 ` Vlastimil Babka
                   ` (3 more replies)
  0 siblings, 4 replies; 41+ messages in thread
From: Michal Hocko @ 2017-07-10  7:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Tetsuo Handa, Rik van Riel, Johannes Weiner,
	Vlastimil Babka, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Tetsuo Handa has reported [1][2][3] that direct reclaimers might get
stuck in the too_many_isolated loop basically forever, because the last
few pages on the LRU lists are isolated by kswapd, which is itself stuck
on fs locks while doing pageout or slab reclaim. This in turn means that
there is nobody left to actually trigger the OOM killer and the system
is basically unusable.

too_many_isolated was introduced by commit 35cd78156c49 ("vmscan:
throttle direct reclaim when too many pages are isolated already") to
prevent premature OOM killer invocations, because back then a lack of
reclaim progress could indeed trigger the OOM killer too early. But
since the OOM detection rework in commit 0a0337e0d1d1 ("mm, oom: rework
oom detection"), the allocation/reclaim retry loop considers all the
reclaimable pages and throttles the allocation at that layer, so we can
loosen the direct reclaim throttling.

Make the shrink_inactive_list loop over too_many_isolated bounded:
return immediately once the situation hasn't resolved after the first
sleep. Replace congestion_wait with a simple
schedule_timeout_interruptible, because we are not really waiting on IO
congestion in this path.

Please note that this patch can theoretically cause the OOM killer to
trigger earlier while there are many pages isolated for a reclaim that
is making progress only very slowly. This would be obvious from the OOM
report, as the number of isolated pages is printed there. If we ever hit
this, should_reclaim_retry should consider those numbers in its
evaluation in one way or another.

[1] http://lkml.kernel.org/r/201602092349.ACG81273.OSVtMJQHLOFOFF@I-love.SAKURA.ne.jp
[2] http://lkml.kernel.org/r/201702212335.DJB30777.JOFMHSFtVLQOOF@I-love.SAKURA.ne.jp
[3] http://lkml.kernel.org/r/201706300914.CEH95859.FMQOLVFHJFtOOS@I-love.SAKURA.ne.jp

Acked-by: Mel Gorman <mgorman@suse.de>
Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
Hi,
I am resubmitting this patch previously sent here
http://lkml.kernel.org/r/20170307133057.26182-1-mhocko@kernel.org.

Johannes and Rik had some concerns that this could lead to premature
OOM kills. I agree with them that we need a better throttling
mechanism. Until now we didn't give the issue described above a high
priority, because it usually required a really insane workload to
trigger. But it seems that the issue can also be reproduced without an
insane number of competing threads [3].

Moreover, the issue also triggers very often while testing heavy memory
pressure, and so prevents further hardening work in that area
(http://lkml.kernel.org/r/201707061948.ICJ18763.tVFOQFOHMJFSLO@I-love.SAKURA.ne.jp).
Tetsuo hasn't seen any negative effect of this patch in his OOM stress
tests, so I think we should go with this simple patch for now and think
about something more robust long term.

That being said, I suggest merging this (after it spends a full release
cycle in linux-next) for the time being, until we come up with a more
clever solution.

 mm/vmscan.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c15b2e4c47ca..4ae069060ae5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1713,9 +1713,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	int file = is_file_lru(lru);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+	bool stalled = false;
 
 	while (unlikely(too_many_isolated(pgdat, file, sc))) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+		if (stalled)
+			return 0;
+
+		/* wait a bit for the reclaimer. */
+		schedule_timeout_interruptible(HZ/10);
+		stalled = true;
 
 		/* We are about to die and free our memory. Return now. */
 		if (fatal_signal_pending(current))
-- 
2.11.0




Thread overview: 41+ messages
2017-03-07 13:30 [PATCH] mm, vmscan: do not loop on too_many_isolated for ever Michal Hocko
2017-03-07 19:52 ` Rik van Riel
2017-03-08  9:21   ` Michal Hocko
2017-03-08 15:54     ` Rik van Riel
2017-03-09  9:12       ` Michal Hocko
2017-03-09 14:16         ` Rik van Riel
2017-03-09 14:59           ` Michal Hocko
2017-03-09 18:05   ` Johannes Weiner
2017-03-09 22:18     ` Rik van Riel
2017-03-10 10:27       ` Michal Hocko
2017-03-10 10:20     ` Michal Hocko
2017-03-10 11:44       ` Tetsuo Handa
2017-03-21 10:37         ` Tetsuo Handa
2017-04-23 10:24         ` Tetsuo Handa
2017-04-24 12:39           ` Stanislaw Gruszka
2017-04-24 13:06             ` Tetsuo Handa
2017-04-25  6:33               ` Stanislaw Gruszka
2017-06-30  0:14         ` Tetsuo Handa
2017-06-30 13:32           ` Michal Hocko
2017-06-30 15:59             ` Tetsuo Handa
2017-06-30 16:19               ` Michal Hocko
2017-07-01 11:43                 ` Tetsuo Handa
2017-07-05  8:19                   ` Michal Hocko
2017-07-05  8:20                   ` Michal Hocko
2017-07-06 10:48                     ` Tetsuo Handa
2017-03-09 14:31 ` Mel Gorman
2017-07-10  7:48 Michal Hocko
2017-07-10 13:16 ` Vlastimil Babka
2017-07-10 13:58 ` Rik van Riel
2017-07-10 16:58   ` Johannes Weiner
2017-07-10 17:09     ` Michal Hocko
2017-07-19 22:20 ` Andrew Morton
2017-07-20  6:56   ` Michal Hocko
2017-07-21 23:01     ` Andrew Morton
2017-07-24  6:50       ` Michal Hocko
2017-07-20  1:54 ` Hugh Dickins
2017-07-20 10:44   ` Tetsuo Handa
2017-07-24  7:01     ` Hugh Dickins
2017-07-24 11:12       ` Tetsuo Handa
2017-07-20 13:22   ` Michal Hocko
2017-07-24  7:03     ` Hugh Dickins
