From: David Rientjes <rientjes@google.com>
To: Chris Down <chris@chrisdown.name>
Cc: Andrew Morton <akpm@linux-foundation.org>, Yang Shi <shy828301@gmail.com>,
    Michal Hocko <mhocko@kernel.org>, Shakeel Butt <shakeelb@google.com>,
    Yang Shi <yang.shi@linux.alibaba.com>, Roman Gushchin <guro@fb.com>,
    Greg Thelen <gthelen@google.com>, Johannes Weiner <hannes@cmpxchg.org>,
    Vladimir Davydov <vdavydov.dev@gmail.com>,
    cgroups@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [patch] mm, memcg: provide a stat to describe reclaimable memory
Date: Fri, 17 Jul 2020 12:37:57 -0700 (PDT)
Message-ID: <alpine.DEB.2.23.453.2007171226310.3398972@chino.kir.corp.google.com>
In-Reply-To: <20200717121750.GA367633@chrisdown.name>

On Fri, 17 Jul 2020, Chris Down wrote:

> > With the proposed anon_reclaimable, do you have any reliability concerns?
> > This would be the amount of lazy freeable memory and memory that can be
> > uncharged if compound pages from the deferred split queue are split under
> > memory pressure. It seems to be a very precise value (as slab_reclaimable
> > already in memory.stat is), so I'm not sure why there is a reliability
> > concern. Maybe you can elaborate?
>
> Ability to reclaim a page is largely about context at the time of reclaim.
> For example, if you are running at the edge of swap, a metric that truly
> describes "reclaimable memory" will contain vastly different numbers from
> one second to the next as cluster and page availability increases and
> decreases. We may also have to do things like look for youngness at
> reclaim time, so I'm not convinced metrics like this make sense in the
> general case.

...

> Again, I'm curious why this can't be solved by artificial workingset
> pressurisation and monitoring. Generally, the most reliable reclaim
> metrics come from operating reclaim itself.
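(The "artificial workingset pressurisation" quoted above could be sketched roughly as below; the cgroup v2 path and the halving factor are illustrative assumptions, not something either participant specified.)

```shell
# Rough sketch of artificial workingset pressurisation via cgroup v2:
# briefly clamp memory.high below current usage, let reclaim operate,
# then restore the limit and observe how much usage actually dropped.
# CG defaults to a hypothetical group; override it for your setup.
CG="${CG:-/sys/fs/cgroup/mygroup}"

cur=$(cat "$CG/memory.current")
echo $((cur / 2)) > "$CG/memory.high"   # push reclaim toward half of usage
sleep 1                                 # give reclaim a moment to run
echo max > "$CG/memory.high"            # lift the artificial limit
echo "usage after pressure: $(cat "$CG/memory.current")"
```

How far memory.current falls under such a probe is the "metric from operating reclaim itself" that the quote refers to.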
Perhaps this is best discussed in the context I gave in the earlier thread:
imagine a thp-backed heap of 64MB and then a malloc implementation doing
MADV_DONTNEED over all but one page in every one of these pageblocks.

On a 4.3 kernel, for example, memory.current for the heap segment is now
(64MB / 2MB) * 4KB = 128KB because we have synchronous splitting and
uncharging of the underlying hugepage. On a 4.15 kernel, memory.current is
still 64MB because the underlying hugepages remain charged to the memcg due
to deferred split queues.

For any application that monitors this, pressurization is not going to help:
the memory will be reclaimed under memcg pressure, but we aren't facing that
pressure yet. Userspace could identify this as a memory leak unless we
describe what anon memory is actually reclaimable in this context (including
on systems without swap).

For any entity that uses this information to infer whether new work can be
scheduled in this memcg (the reason MemAvailable exists in /proc/meminfo at
the system level), this is now dramatically skewed. At worst, on a swapless
system, this memory is seen from userspace as unreclaimable because it's
charged as anon.

Do you have other suggestions for how userspace can understand what anon is
reclaimable in this context before encountering memory pressure? If so, it
may be a great alternative to this: I haven't been able to think of such a
way other than an anon_reclaimable stat.