Re: workingset transition detection corner case

From: Johannes Weiner <hannes@cmpxchg.org>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: linux-mm <linux-mm@kvack.org>, Michal Hocko <mhocko@kernel.org>,
	Mel Gorman <mgorman@techsingularity.net>,
	Vladimir Davydov <vdavydov.dev@gmail.com>,
	Rik van Riel <riel@surriel.com>, Roman Gushchin <guro@fb.com>
Subject: Re: workingset transition detection corner case
Date: Wed, 18 Dec 2019 17:14:52 -0500
Message-ID: <20191218221452.GA232409@cmpxchg.org>
In-Reply-To: <b7f5e356-1f0a-98be-4a32-09a766c3949b@suse.cz>

Hi Vlastimil,

My apologies for the delay.

On Fri, Dec 13, 2019 at 04:38:38PM +0100, Vlastimil Babka wrote:
> Hi Johannes,
> 
> we have been debugging an issue reported against our 4.12-based kernel,
> where a DB-based workload would start thrashing badly at some point,
> making the system unusable. This didn't happen when replacing the kernel
> with an older 4.4-based one (and keeping everything else the same).
> 
> Unfortunately we don't have the reproducer in-house and the conditions
> might also be configuration specific (rootfs is on NFS), but we provided
> vmstat monitoring instructions and later tracing and from the data we
> got we found that the workload at some point fills almost the whole
> memory with anonymous pages (namely shmem), pushing almost the whole
> page cache out, and filling part of the swap. The 4.4-based kernel then
> recovers quickly without excessive anon swapping, which suggests the
> shmem pages stop being frequently accessed. However the 4.12-based
> kernel is unable to recover and grow the page cache back (both active
> and inactive) and keeps thrashing on it.
> 
> We have considered the large upstream changes between 4.4 and 4.12 which
> include memcg awareness (but there's a single memcg and disabling memcg
> makes no difference) and node-based reclaim (there's no
> disproportionally sized zone). Then we suspected 4.12 commit
> 2a2e48854d70 ("mm: vmscan: fix IO/refault regression in cache workingset
> transition") and how it affects inactive_list_is_low() when called from
> shrink_list() - the theory was that we decide to shrink file active list
> too much (by setting inactive_ratio=0) due to refault detection, which
> in turn means we shrink file pages too much. This was confirmed by
> removing the inactive_ratio=0 part, after which the 4.12-based kernel
> stopped thrashing with the workload.
> 
> Then we investigated what leads to the main condition of the logic -
> "lruvec->refaults != refaults", by adding some more tracing to
> inactive_list_is_low() and snapshot_refaults(). We suspected bad
> interactions due to multiple direct reclaimers, but what I mostly see is
> the following pattern of kswapd activity:
> 
> - kswapd finishes balancing, makes a snapshot of lruvec->refaults
> - after a while (can be up to a few seconds) kswapd is woken up again and
> the number of refaults meanwhile is changed by some relatively small
> number (tens or hundreds) since the snapshot, so the condition
> "lruvec->refaults != refaults" becomes true.
> - inactive_list_is_low() keeps being called as part of kswapd operation,
> and the condition is always true as the snapshot didn't change. During that
> time, the refaults counter is either unchanged or changes only by a few
> refaults. Thus, the whole kswapd activity on the file lru is focused on
> the active lru.
> 
> Since the intention of commit 2a2e48854d70 is to detect workingset
> transitions, it seems to me it's not working well in this case, as
> there's no such transition - the workload just cannot keep its page
> cache working set in memory, because it's excessively reclaimed instead
> of anonymous memory. The '!=' condition is perhaps too coarse and static
> and doesn't reflect how many refaults there were or if refaults keep
> happening during kswapd operation - a single refault between two kswapd
> runs can affect the whole second run. I wonder if there shouldn't be at
> least some kind of decay - when the condition triggers, update the
> snapshot to a value between the old snapshot and current value, so if
> refaults do not keep occurring, after some number of calls the condition
> will stop being true? What do you think?

Thanks for the detailed report.
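
For reference, the decay suggested above could look roughly like the
sketch below. It is a userspace toy model, not a patch: struct
toy_lruvec and toy_refaults_detected() are made-up names, and moving
the snapshot halfway towards the current counter is only one possible
choice of "a value between the old snapshot and current value".

```c
#include <stdbool.h>

/* Toy stand-in for per-lruvec state; NOT the kernel's struct lruvec. */
struct toy_lruvec {
	unsigned long refaults;	/* snapshot from the last reclaim cycle */
};

/*
 * Hypothetical helper: report whether refaults happened since the
 * snapshot, and pull the snapshot halfway towards the current counter.
 * If no further refaults occur, the condition expires after a number
 * of calls that grows with log2 of the original gap.
 *
 * Assumes the refault counter only ever grows, so that
 * refaults >= lruvec->refaults on entry.
 */
static bool toy_refaults_detected(struct toy_lruvec *lruvec,
				  unsigned long refaults)
{
	unsigned long delta;

	if (lruvec->refaults == refaults)
		return false;

	delta = refaults - lruvec->refaults;
	/* Round up so a gap of 1 still closes and the snapshot converges. */
	lruvec->refaults += (delta + 1) / 2;
	return true;
}
```

With a gap of 100 refaults and no new ones arriving, the check above
fires on 7 consecutive calls and then stops; with the plain '!='
comparison it would fire until the next snapshot is taken.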

I think the problem here is that we entangle two separate things: on
one hand whether to protect active cache from refaulting cache; on the
other hand whether to protect anonymous from cache. We should be able
to open up the page cache to transition without automatically reducing
pressure on anonymous.

If we did that, we wouldn't have to worry about exactly how many
refaults are occurring - it should always be safe to open up
the active set for re-testing.

[ We *could* be more graceful and instead of dissolving the active
  protection entirely simply restrict its size to a balance we have
  targeted historically, e.g. 50:50. But in the interest of keeping
  magic numbers out of the code, I would not lead with that. ]
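
As a rough illustration of keeping the two questions apart, here is a
userspace sketch. Everything in it is hypothetical (the struct, the
function, and the inputs are invented for this example), and the
policy shown is only one possible arrangement, not existing kernel
code.

```c
#include <stdbool.h>

/* Hypothetical result type: two decisions, kept as separate fields. */
struct toy_reclaim_decision {
	bool deactivate_file;	/* open up the active cache for re-testing */
	bool file_only;		/* leave anonymous pages alone this round */
};

/*
 * Illustrative policy only.
 *
 * Question 1 (active cache vs. refaulting cache): challenge the active
 * file list whenever refaults were seen or the inactive list is low.
 *
 * Question 2 (anon vs. cache): only skip anon when the cache is not
 * refaulting and has enough inactive pages, or when there is no swap
 * to put anon pages in. Deactivating cache does not, by itself,
 * reduce pressure on anon.
 */
static struct toy_reclaim_decision
toy_decide(bool file_refaults, bool inactive_file_is_low, bool swap_available)
{
	struct toy_reclaim_decision d;

	d.deactivate_file = file_refaults || inactive_file_is_low;
	d.file_only = !swap_available ||
		      (!file_refaults && !inactive_file_is_low);
	return d;
}
```

In this model, refaulting cache with swap available gives
deactivate_file = true and file_only = false, i.e. the active set is
opened up while anon remains eligible for scanning.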

> I should also mention that we don't have the relatively recent commit
> 2c012a4ad1a2 ("mm: vmscan: scan anonymous pages on file refaults") in
> the 4.12-based kernel. It could in theory make the problem also go away,
> as the "excessively true" condition would now also be considered when
> inactive_list_is_low() is called from get_scan_count() (in v5.4; I know
> there were big reorganizations in the last merge window), and perhaps change
> some SCAN_FILE outcomes to SCAN_FRACT. But I think it would be better to
> do something with the root cause first.

That patch should address the issue you are seeing in the interim.

My longer-term goal is still to implement pressure-based balancing
between the LRU types. A lot of prep work on the cgroup side was
necessary to make that patch set really work for cgrouped reclaim -
the new cgroup stat infrastructure, the recursive inactive:active
balancing etc. I'm hoping to dust off those patches early next year.

Those patches separate anon/cache balancing from active/inactive
balancing, which I think will universally make better decisions.

Does that sound reasonable?