Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

From: Michal Hocko <mhocko@kernel.org>
To: Mel Gorman <mgorman@suse.de>
Cc: linux-mm@kvack.org, Johannes Weiner <hannes@cmpxchg.org>,
	Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
Date: Thu, 19 Jan 2017 12:23:36 +0100	[thread overview]
Message-ID: <20170119112336.GN30786@dhcp22.suse.cz> (raw)
In-Reply-To: <20170119100755.rs6erdiz5u5by2pu@suse.de>

On Thu 19-01-17 10:07:55, Mel Gorman wrote:
[...]
> mm, vmscan: Wait on a waitqueue when too many pages are isolated
> 
> When too many pages are isolated, direct reclaim waits on congestion to clear
> for up to a tenth of a second. There is no reason to believe that too many
> pages are isolated due to dirty pages, reclaim efficiency or congestion.
> It may simply be because an extremely large number of processes have entered
> direct reclaim at the same time. However, it is possible for the situation
> to persist forever and never reach OOM.
> 
> This patch queues processes a waitqueue when too many pages are isolated.
> When parallel reclaimers finish shrink_page_list, they wake the waiters
> to recheck whether too many pages are isolated.
> 
> The wait on the queue has a timeout as not all sites that isolate pages
> will do the wakeup. Depending on every isolation of LRU pages to be perfect
> forever is potentially fragile. The specific wakeups occur for page reclaim
> and compaction. If too many pages are isolated due to memory failure,
> hotplug or directly calling migration from a syscall then the waiting
> processes may wait the full timeout.
> 
> Note that the timeout allows the use of waitqueue_active() on the basis
> that a race will cause the full timeout to be reached due to a missed
> wakeup. This is relatively harmless and still a massive improvement over
> unconditionally calling congestion_wait.
> 
> Direct reclaimers that cannot isolate pages within the timeout will consider
> return to the caller. This is somewhat clunky as it won't return immediately
> and make go through the other priorities and slab shrinking. Eventually,
> it'll go through a few iterations of should_reclaim_retry and reach the
> MAX_RECLAIM_RETRIES limit and consider going OOM.

I cannot really say I would like this. It's just much more complex than
necessary. I definitely agree that congestion_wait while waiting for
too_many_isolated is a crude hack. This patch doesn't really resolve
my biggest worry, though, that we go OOM with too many pages isolated
as your patch doesn't alter zone_reclaimable_pages to reflect those
numbers.

Anyway, I think both of us are probably overcomplicating things a bit.
Your waitqueue approach is definitely better semantically than the
congestion_wait because we are waiting for a different event than the
API is intended for. On the other hand a mere
schedule_timeout_interruptible might work equally well in the real life.
On the other side I might really over emphasise the role of NR_ISOLATED*
counts. It might really turn out that we can safely ignore them and it
won't be the end of the world. So what do you think about the following
as a starting point. If we ever see oom reports with high number of
NR_ISOLATED* which are part of the oom report then we know we have to do
something about that. Those changes would at least be driven by a real
usecase rather than theoretical scenarios.

So what do you think about the following? Tetsuo, would you be willing
to run this patch through your torture testing please?
---
>From 47cba23b5b50260b533d7ad57a4c9e6a800d9b20 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 19 Jan 2017 12:11:56 +0100
Subject: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

Tetsuo Handa has reported [1] that direct reclaimers might get stuck in
too_many_isolated loop basically for ever because the last few pages on
the LRU lists are isolated by the kswapd which is stuck on fs locks when
doing the pageout. This in turn means that there is nobody to actually
trigger the oom killer and the system is basically unusable.

too_many_isolated has been introduced by 35cd78156c49 ("vmscan: throttle
direct reclaim when too many pages are isolated already") to prevent
from pre-mature oom killer invocations because back then no reclaim
progress could indeed trigger the OOM killer too early. But since the
oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection")
the allocation/reclaim retry loop considers all the reclaimable pages
and throttles the allocation at that layer so we can loosen the direct
reclaim throttling.

Make shrink_inactive_list loop over too_many_isolated bounded and returns
immediately when the situation hasn't resolved after the first sleep.
Replace congestion_wait by a simple schedule_timeout_interruptible because
we are not really waiting on the IO congestion in this path.

Please note that this patch can theoretically cause the OOM killer to
trigger earlier while there are many pages isolated for the reclaim
which makes progress only very slowly. This would be obvious from the oom
report as the number of isolated pages are printed there. If we ever hit
this should_reclaim_retry should consider those numbers in the evaluation
in one way or another.

[1] http://lkml.kernel.org/r/201602092349.ACG81273.OSVtMJQHLOFOFF@I-love.SAKURA.ne.jp
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/vmscan.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a60066d4521b..d07380ba1f9e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1718,9 +1718,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	int file = is_file_lru(lru);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+	bool stalled = false;
 
 	while (unlikely(too_many_isolated(pgdat, file, sc))) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+		if (stalled)
+			return 0;
+
+		/* wait a bit for the reclaimer. */
+		schedule_timeout_interruptible(HZ/10);
+		stalled = true;
 
 		/* We are about to die and free our memory. Return now. */
 		if (fatal_signal_pending(current))
-- 
2.11.0


-- 
Michal Hocko
SUSE Labs