From: Roman Gushchin <guro@fb.com>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: <akpm@linux-foundation.org>, <bigeasy@linutronix.de>,
<cl@linux.com>, <hannes@cmpxchg.org>, <iamjoonsoo.kim@lge.com>,
<jannh@google.com>, <linux-kernel@vger.kernel.org>,
<linux-mm@kvack.org>, <mhocko@kernel.org>, <minchan@kernel.org>,
<penberg@kernel.org>, <rientjes@google.com>,
<shakeelb@google.com>, <surenb@google.com>, <tglx@linutronix.de>
Subject: Re: [RFC 2/2] mm, slub: add shrinker to reclaim cached slabs
Date: Thu, 21 Jan 2021 16:48:47 -0800
Message-ID: <20210122004847.GA25567@carbon.dhcp.thefacebook.com>
In-Reply-To: <20210121172154.27580-2-vbabka@suse.cz>
On Thu, Jan 21, 2021 at 06:21:54PM +0100, Vlastimil Babka wrote:
> For performance reasons, SLUB doesn't keep all slabs on shared lists and
> doesn't always free slabs immediately after all objects are freed. Namely:
>
> - for each cache and cpu, there might be a "CPU slab" page, partially or fully
> free
> - with SLUB_CPU_PARTIAL enabled (default y), there might be a number of "percpu
> partial slabs" for each cache and cpu, also partially or fully free
> - for each cache and numa node, there are slabs on per-node partial list, up
> to 10 of those may be empty
>
> As Jann reports [1], the number of percpu partial slabs should be limited by
> number of free objects (up to 30), but due to imprecise accounting, this can
> deteriorate so that there are up to 30 free slabs. He notes:
>
> > Even on an old-ish Android phone (Pixel 2), with normal-ish usage, I
> > see something like 1.5MiB of pages with zero inuse objects stuck in
> > percpu lists.
>
> My observations match Jann's, and we've seen e.g. cases with 10 free slabs per
> cpu. We can also confirm Jann's theory that on kernels pre-kmemcg rewrite (in
> v5.9), this issue is amplified as there are separate sets of kmem caches with
> cpu caches, per-cpu partial and per-node partial lists for each memcg and cache
> that deals with kmemcg-accounted objects.
>
> The cached free slabs can therefore become a memory waste, making memory
> pressure higher, causing more reclaim of actually used LRU pages, and even
> cause OOM (global, or memcg on older kernels).
>
> SLUB provides __kmem_cache_shrink() that can flush all the abovementioned
> slabs, but is currently called only in rare situations, or from a sysfs
> handler. The standard way to cooperate with reclaim is to provide a shrinker,
> and so this patch adds such a shrinker to call __kmem_cache_shrink()
> systematically.
>
> The shrinker design is however atypical. The usual design assumes that a
> shrinker can easily count how many objects can be reclaimed, and then reclaim
> given number of objects. For SLUB, determining the number of the various cached
> slabs would be a lot of work, and controlling how many to shrink precisely
> would be impractical. Instead, the shrinker is based on reclaim priority, and
> on lowest priority shrinks a single kmem cache, while on highest it shrinks all
> of them. To do that effectively, there's a new list caches_to_shrink where
> caches are taken from its head and then moved to tail. Existing slab_caches
> list is unaffected so that e.g. /proc/slabinfo order is not disrupted.
>
> This approach should not cause excessive shrinking and IPI storms:
>
> - If there are multiple reclaimers in parallel, only one can proceed, thanks to
> mutex_trylock(&slab_mutex). After unlocking, caches that were just shrunk
> are at the tail of the list.
> - in flush_all(), we actually check if there's anything to flush by a CPU
> (has_cpu_slab()) before sending an IPI
> - CPU slab deactivation became more efficient with "mm, slub: splice cpu and
> page freelists in deactivate_slab()"
>
> The result is that SLUB's per-cpu and per-node caches are trimmed of free
> pages, and partially used pages have higher chance of being either reused or
> freed. The trimming effort is controlled by reclaim activity and thus memory
> pressure. Before an OOM, a reclaim attempt at highest priority ensures
> shrinking all caches. Also being a proper slab shrinker, the shrinking is
> now also called as part of the drop_caches sysctl operation.
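For readers following along, the priority-driven rotation described in the quoted text can be sketched in plain userspace C. This is only an illustration under assumptions, not the code from the patch: struct cache, struct shrink_list, nr_to_shrink() and shrink_pass() are hypothetical names, a circular index stands in for moving list heads to the tail, and the linear mapping between the two endpoints (one cache at the lowest priority, all caches at the highest) is a guess. The only kernel fact relied on is that reclaim priority runs from DEF_PRIORITY (12, least urgent) down to 0 (most urgent).

```c
#include <assert.h>
#include <stddef.h>

#define DEF_PRIORITY 12 /* kernel's least urgent reclaim priority */

/* Hypothetical stand-in for a kmem cache on the caches_to_shrink list. */
struct cache {
	const char *name;
	unsigned int shrunk; /* how many times this cache was shrunk */
};

/* Hypothetical rotating list: 'head' is the next cache to shrink. */
struct shrink_list {
	struct cache **slots;
	size_t nr;
	size_t head;
};

/*
 * Map reclaim priority to a number of caches to shrink.
 * Assumption: linear interpolation between "one cache" at DEF_PRIORITY
 * and "all caches" at priority 0; the patch may use another mapping.
 */
static size_t nr_to_shrink(int priority, size_t total)
{
	size_t n;

	assert(total > 0);
	if (priority >= DEF_PRIORITY)
		return 1;
	if (priority <= 0)
		return total;
	n = total * (size_t)(DEF_PRIORITY - priority) / DEF_PRIORITY;
	return n ? n : 1;
}

/*
 * Shrink caches starting at the head; advancing the circular index has
 * the same effect as moving each shrunk cache to the tail of the list.
 */
static size_t shrink_pass(struct shrink_list *l, int priority)
{
	size_t n = nr_to_shrink(priority, l->nr);

	for (size_t i = 0; i < n; i++) {
		l->slots[l->head]->shrunk++; /* stand-in for __kmem_cache_shrink() */
		l->head = (l->head + 1) % l->nr;
	}
	return n;
}
```

With five caches, two passes at DEF_PRIORITY touch the first and then the second cache, and a pass at priority 0 touches all five, so no cache is shrunk repeatedly while others are skipped.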
Hi Vlastimil!
This makes a lot of sense, however it looks like a bit of an overkill to me (on 5.9+).
Isn't limiting the number of pages (instead of the number of objects) sufficient on 5.9+?
If not, maybe we can limit the shrinking to the pre-OOM condition?
Do we really need to trip it constantly?
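To make the difference between the two limits concrete, here is a rough userspace sketch. Everything in it is hypothetical (struct slab, struct cpu_partial, add_by_objects(), add_by_slabs(), and the limit values); it is not SLUB code. It also simplifies by refusing a slab once the limit is reached, where the kernel handles overflow differently. It models only the one property discussed in the thread: free objects are counted when a slab is added to the percpu partial list and the count is not updated when more objects are freed later.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TRACKED 64 /* arbitrary capacity for this sketch */

/* Hypothetical slab: total object slots and how many are in use. */
struct slab {
	unsigned int objects;
	unsigned int inuse;
};

/* Hypothetical percpu partial list. */
struct cpu_partial {
	struct slab *slabs[MAX_TRACKED];
	size_t nr_slabs;
	unsigned int pobjects; /* free objects, counted at insertion only */
};

/* Object-based limit: the count is a snapshot and goes stale later. */
static bool add_by_objects(struct cpu_partial *p, struct slab *s,
			   unsigned int max_objects)
{
	assert(s->inuse <= s->objects);
	if (p->nr_slabs == MAX_TRACKED || p->pobjects >= max_objects)
		return false;
	p->pobjects += s->objects - s->inuse;
	p->slabs[p->nr_slabs++] = s;
	return true;
}

/* Slab (page) based limit: independent of later frees. */
static bool add_by_slabs(struct cpu_partial *p, struct slab *s,
			 size_t max_slabs)
{
	if (p->nr_slabs == MAX_TRACKED || p->nr_slabs >= max_slabs)
		return false;
	p->slabs[p->nr_slabs++] = s;
	return true;
}

/* Count slabs on the list that currently have no objects in use. */
static size_t count_empty(const struct cpu_partial *p)
{
	size_t n = 0;

	for (size_t i = 0; i < p->nr_slabs; i++)
		if (p->slabs[i]->inuse == 0)
			n++;
	return n;
}
```

If slabs are added while each has a single free object and the remaining objects are freed afterwards, the object-based limit of 30 ends up holding 30 entirely free slabs, while a slab-based limit holds at most the configured number of slabs regardless of later frees.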
Thanks!