Re: [PATCH 05/14] mm: workingset: let cache workingset challenge anon

From: Joonsoo Kim <js1304@gmail.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Linux Memory Management List <linux-mm@kvack.org>,
	Rik van Riel <riel@surriel.com>,
	 Minchan Kim <minchan.kim@gmail.com>,
	Michal Hocko <mhocko@suse.com>,
	 Andrew Morton <akpm@linux-foundation.org>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	 LKML <linux-kernel@vger.kernel.org>,
	kernel-team@fb.com
Subject: Re: [PATCH 05/14] mm: workingset: let cache workingset challenge anon
Date: Wed, 3 Jun 2020 14:40:40 +0900	[thread overview]
Message-ID: <CAAmzW4P0HZw-K1jHogpCevDCfsSfm2SRSXYwUqznoDBNWa_NUQ@mail.gmail.com> (raw)
In-Reply-To: <20200602164726.GA225032@cmpxchg.org>

2020년 6월 3일 (수) 오전 1:48, Johannes Weiner <hannes@cmpxchg.org>님이 작성:
>
> On Tue, Jun 02, 2020 at 11:34:17AM +0900, Joonsoo Kim wrote:
> > 2020년 6월 2일 (화) 오전 12:56, Johannes Weiner <hannes@cmpxchg.org>님이 작성:
> > > On Mon, Jun 01, 2020 at 03:14:24PM +0900, Joonsoo Kim wrote:
> > > > But, I still think that modified refault activation equation isn't
> > > > safe. The next
> > > > problem I found is related to the scan ratio limit patch ("limit the range of
> > > > LRU type balancing") on this series. See the below example.
> > > >
> > > > anon: Hot (X M)
> > > > file: Hot (200 M) / dummy (200 M)
> > > > P: 1200 M (3 parts, each one 400 M, P1, P2, P3)
> > > > Access Pattern: A -> F(H) -> P1 -> A -> F(H) -> P2 -> ... ->
> > > >
> > > > Without this patch, A and F(H) are kept on the memory and look like
> > > > it's correct.
> > > >
> > > > With this patch and below fix, refault equation for Pn would be:
> > > >
> > > > Refault dist of Pn = 1200 (from file non-resident) + 1200 * anon scan
> > > > ratio (from anon non-resident)
> > > > anon + active file = X + 200
> > > > 1200 + 1200 * anon scan ratio (0.5 ~ 2) < X + 200
> > >
> > > That doesn't look quite right to me. The anon part of the refault
> > > distance is driven by X, so the left-hand of this formula contains X
> > > as well.
> > >
> > > 1000 file (1200M reuse distance, 200M in-core size) + F(H) reactivations + X * scan ratio < X + 1000
> >
> > As I said before, there is no X on left-hand of this formula. To
> > access all Pn and
> > re-access P1, we need 1200M file list scan and reclaim. More scan isn't needed.
> > With your patch "limit the range of LRU type balancing", scan ratio
> > between file/anon
> > list is limited to 0.5 ~ 2.0, so, maximum anon scan would be 1200 M *
> > 2.0, that is,
> > 2400 M and not bounded by X. That means that file list cannot be
> > stable with some X.
>
> Oh, no X on the left because you're talking about the number of pages
> scanned until the first refaults, which is fixed - so why are we still
> interpreting the refault distance against a variable anon size X?

Looks like I was confused again. Your formula is correct and mine is
wrong. My mistake is I thought that your patch "limit the range of LRU
type balancing"
which makes scan *ratio* 2:1 leads to actual scan *count* ratio
between anon/file to 2:1.
But, now I realized that 2:1 is just scan ratio and actual scan
*count* ratio could be far
larger with certain list size. It would be X * scan ratio in above example so my
explanation is wrong and you are right.

Sorry for making a trouble.

> Well, that's misleading. We compare against anon because part of the
> cache is already encoded in the refault distance. What we're really
> checking is access distance against total amount of available RAM.
>
> Consider this. We want to activate pages where
>
>         access_distance <= RAM
>
> and our measure of access distance is:
>
>         access_distance = refault_distance + inactive_file
>
> So the comparison becomes:
>
>         refault_distance + inactive_file < RAM
>
> which we simplify to:
>
>         refault_distance < active_file + anon
>
> There is a certain threshold for X simply because there is a certain
> threshold for RAM beyond which we need to start activating. X cannot
> be arbitrary, it must be X + cache filling up memory - after all we
> have page reclaim evicting pages.
>
> Again, this isn't new. In the current code, we activate when:
>
>         refault_distance < active_file
>
> which is
>
>         access_distance <= RAM - anon
>
> You can see, whether things are stable or not always depends on the
> existing workingset size. It's just a proxy for how much total RAM we
> have potentially available to the refaulting page.
>
> > If my lastly found example is a correct example (your confirm is required),
> > it is also related to the correctness issue since cold pages causes
> > eviction of the hot pages repeatedly.
>
> I think your example is correct, but it benefits from the VM
> arbitrarily making an assumption that has a 50/50 shot of being true.
>
> You and I know which pages are hot and which are cold because you
> designed the example.
>
> All the VM sees is this:
>
> - We have an established workingset that has previously shown an
>   access distance <= RAM and therefor was activated.
>
> - We now have another set that also appears to have an access distance
>   <= RAM. The only way to know for sure, however, is sample the
>   established workingset and compare the relative access frequencies.
>
> Currently, we just assume the incoming pages are colder. Clearly
> that's beneficial when it's true. Clearly that can be totally wrong.
>
> We must allow a fair comparison between these two sets.
>
> For cache, that's already the case - that's why I brought up the
> cache-only example: if refault distances are 50M and you have 60M of
> active cache, we activate all refaults and force an even competition
> between the established workingset and the new pages.
>
> Whether we can protect active file when anon needs to shrink first and
> can't (the activate/iocost split) that's a different question. But I'm
> no longer so sure after looking into it further.
>
> First, we would need two different refault distances: either we
> consider anon age and need to compare to active_file + anon, or we
> don't and compare to active_file only. We cannot mix willy nilly,
> because the metrics wouldn't be comparable. We don't have the space to
> store two different eviction timestamps, nor could we afford to cut
> the precision in half.
>
> Second, the additional page flag required to implement it.
>
> Third, it's somewhat moot because we still have the same behavior when
> active_file would need to shrink and can't. There can't be a stable
> state as long as refault distances <= active_file.
>
> > In this case, they (without patch, with patch) all have some correctness
> > issue so we need to judge which one is better in terms of overall impact.
> > I don't have strong opinion about it so it's up to you to decide the way to go.
>
> If my patch was simply changing the default assumption on which pages
> are hot and which are cold, I would agree with you - the pros would be
> equal to the cons, one way wouldn't be more correct than the other.
>
> But that isn't what my patch is doing. What it does is get rid of the
> assumption, to actually sample and compare the access frequencies when
> there isn't enough data to make an informed decision.
>
> That's a net improvement.

Okay. Now I think that this patch has a net improvement.

Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Thanks.