Re: [RFC 2/2] mm, slub: add shrinker to reclaim cached slabs

From: Roman Gushchin <guro@fb.com>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: <akpm@linux-foundation.org>, <bigeasy@linutronix.de>,
	<cl@linux.com>, <hannes@cmpxchg.org>, <iamjoonsoo.kim@lge.com>,
	<jannh@google.com>, <linux-kernel@vger.kernel.org>,
	<linux-mm@kvack.org>, <mhocko@kernel.org>, <minchan@kernel.org>,
	<penberg@kernel.org>, <rientjes@google.com>,
	<shakeelb@google.com>, <surenb@google.com>, <tglx@linutronix.de>
Subject: Re: [RFC 2/2] mm, slub: add shrinker to reclaim cached slabs
Date: Thu, 21 Jan 2021 16:48:47 -0800	[thread overview]
Message-ID: <20210122004847.GA25567@carbon.dhcp.thefacebook.com> (raw)
In-Reply-To: <20210121172154.27580-2-vbabka@suse.cz>

On Thu, Jan 21, 2021 at 06:21:54PM +0100, Vlastimil Babka wrote:
> For performance reasons, SLUB doesn't keep all slabs on shared lists and
> doesn't always free slabs immediately after all objects are freed. Namely:
> 
> - for each cache and cpu, there might be a "CPU slab" page, partially or fully
>   free
> - with SLUB_CPU_PARTIAL enabled (default y), there might be a number of "percpu
>   partial slabs" for each cache and cpu, also partially or fully free
> - for each cache and numa node, there are caches on per-node partial list, up
>   to 10 of those may be empty
> 
> As Jann reports [1], the number of percpu partial slabs should be limited by
> number of free objects (up to 30), but due to imprecise accounting, this can
> deterioriate so that there are up to 30 free slabs. He notes:
> 
> > Even on an old-ish Android phone (Pixel 2), with normal-ish usage, I
> > see something like 1.5MiB of pages with zero inuse objects stuck in
> > percpu lists.
> 
> My observations match Jann's, and we've seen e.g. cases with 10 free slabs per
> cpu. We can also confirm Jann's theory that on kernels pre-kmemcg rewrite (in
> v5.9), this issue is amplified as there are separate sets of kmem caches with
> cpu caches, per-cpu partial and per-node partial lists for each memcg and cache
> that deals with kmemcg-accounted objects.
> 
> The cached free slabs can therefore become a memory waste, making memory
> pressure higher, causing more reclaim of actually used LRU pages, and even
> cause OOM (global, or memcg on older kernels).
> 
> SLUB provides __kmem_cache_shrink() that can flush all the abovementioned
> slabs, but is currently called only in rare situations, or from a sysfs
> handler. The standard way to cooperate with reclaim is to provide a shrinker,
> and so this patch adds such shrinker to call __kmem_cache_shrink()
> systematically.
> 
> The shrinker design is however atypical. The usual design assumes that a
> shrinker can easily count how many objects can be reclaimed, and then reclaim
> given number of objects. For SLUB, determining the number of the various cached
> slabs would be a lot of work, and controlling how many to shrink precisely
> would be impractical. Instead, the shrinker is based on reclaim priority, and
> on lowest priority shrinks a single kmem cache, while on highest it shrinks all
> of them. To do that effectively, there's a new list caches_to_shrink where
> caches are taken from its head and then moved to tail. Existing slab_caches
> list is unaffected so that e.g. /proc/slabinfo order is not disrupted.
> 
> This approach should not cause excessive shrinking and IPI storms:
> 
> - If there are multiple reclaimers in parallel, only one can proceed, thanks to
>   mutex_trylock(&slab_mutex). After unlocking, caches that were just shrinked
>   are at the tail of the list.
> - in flush_all(), we actually check if there's anything to flush by a CPU
>   (has_cpu_slab()) before sending an IPI
> - CPU slab deactivation became more efficient with "mm, slub: splice cpu and
>   page freelists in deactivate_slab()
> 
> The result is that SLUB's per-cpu and per-node caches are trimmed of free
> pages, and partially used pages have higher chance of being either reused of
> freed. The trimming effort is controlled by reclaim activity and thus memory
> pressure. Before an OOM, a reclaim attempt at highest priority ensures
> shrinking all caches. Also being a proper slab shrinker, the shrinking is
> now also called as part of the drop_caches sysctl operation.

Hi Vlastimil!

This makes a lot of sense, however it looks a bit as an overkill to me (on 5.9+).
Isn't limiting a number of pages (instead of number of objects) sufficient on 5.9+?

If not, maybe we can limit the shrinking to the pre-OOM condition?
Do we really need to trip it constantly?

Thanks!