From mboxrd@z Thu Jan 1 00:00:00 1970
From: Vlastimil Babka <vbabka@suse.cz>
To: vbabka@suse.cz
Cc: akpm@linux-foundation.org, bigeasy@linutronix.de, cl@linux.com,
	guro@fb.com, hannes@cmpxchg.org, iamjoonsoo.kim@lge.com,
	jannh@google.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	mhocko@kernel.org, minchan@kernel.org, penberg@kernel.org,
	rientjes@google.com, shakeelb@google.com, surenb@google.com,
	tglx@linutronix.de
Subject: [RFC 2/2] mm, slub: add shrinker to reclaim cached slabs
Date: Thu, 21 Jan 2021 18:21:54 +0100
Message-Id: <20210121172154.27580-2-vbabka@suse.cz>
X-Mailer: git-send-email 2.30.0
In-Reply-To: <20210121172154.27580-1-vbabka@suse.cz>
References: <20210121172154.27580-1-vbabka@suse.cz>
MIME-Version: 1.0

For performance reasons, SLUB doesn't keep all slabs on shared lists and
doesn't always free slabs immediately after all objects are freed.
Namely:

- for each cache and cpu, there might be a "CPU slab" page, partially or
  fully free
- with SLUB_CPU_PARTIAL enabled (default y), there might be a number of
  "percpu partial slabs" for each cache and cpu, also partially or fully
  free
- for each cache and numa node, there are slabs on the per-node partial
  list, up to 10 of which may be empty

As Jann reports [1], the number of percpu partial slabs should be limited
by the number of free objects (up to 30), but due to imprecise accounting,
this can deteriorate so that there are up to 30 free slabs. He notes:

> Even on an old-ish Android phone (Pixel 2), with normal-ish usage, I
> see something like 1.5MiB of pages with zero inuse objects stuck in
> percpu lists.

My observations match Jann's, and we've seen e.g. cases with 10 free slabs
per cpu. We can also confirm Jann's theory that on kernels before the
kmemcg rewrite (in v5.9), this issue is amplified, as there are separate
sets of kmem caches with cpu caches, per-cpu partial and per-node partial
lists for each memcg and each cache that deals with kmemcg-accounted
objects.

The cached free slabs can therefore become a memory waste, making memory
pressure higher, causing more reclaim of actually used LRU pages, and even
causing OOM (global, or memcg on older kernels).

SLUB provides __kmem_cache_shrink() that can flush all the abovementioned
slabs, but it is currently called only in rare situations, or from a sysfs
handler. The standard way to cooperate with reclaim is to provide a
shrinker, and so this patch adds such a shrinker to call
__kmem_cache_shrink() systematically.

The shrinker design is however atypical. The usual design assumes that a
shrinker can easily count how many objects can be reclaimed, and then
reclaim a given number of objects. For SLUB, determining the number of the
various cached slabs would be a lot of work, and controlling how many to
shrink precisely would be impractical. Instead, the shrinker is based on
reclaim priority: at the lowest priority it shrinks a single kmem cache,
while at the highest it shrinks all of them. To do that effectively,
there's a new list caches_to_shrink, where caches are taken from its head
and then moved to its tail. The existing slab_caches list is unaffected so
that e.g. /proc/slabinfo order is not disrupted.

This approach should not cause excessive shrinking and IPI storms:

- If there are multiple reclaimers in parallel, only one can proceed,
  thanks to mutex_trylock(&slab_mutex). After unlocking, caches that were
  just shrunk are at the tail of the list.
- In flush_all(), we actually check whether a CPU has anything to flush
  (has_cpu_slab()) before sending an IPI.
- CPU slab deactivation became more efficient with "mm, slub: splice cpu
  and page freelists in deactivate_slab()".

The result is that SLUB's per-cpu and per-node caches are trimmed of free
pages, and partially used pages have a higher chance of being either
reused or freed. The trimming effort is controlled by reclaim activity and
thus memory pressure. Before an OOM, a reclaim attempt at the highest
priority ensures all caches are shrunk. Also, being a proper slab
shrinker, the shrinking is now invoked as part of the drop_caches sysctl
operation as well.
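To make the priority scaling described above concrete, here is a minimal
standalone sketch (illustration only, not part of the patch) of how
sc->priority maps to the number of caches shrunk per scan_objects() call.
It assumes DEF_PRIORITY is 12, its current mainline value.

#include <limits.h>
#include <stdio.h>

#define DEF_PRIORITY 12			/* assumed mainline value */

/* How many caches one scan_objects() call shrinks at a given priority. */
static int caches_per_scan(int priority)
{
	int shift = DEF_PRIORITY - priority;

	if (priority == 0)		/* highest priority: shrink every cache */
		return INT_MAX;
	if (shift < 0)			/* priorities above DEF_PRIORITY: one cache */
		shift = 0;
	return 1 << shift;		/* 12 -> 1, 11 -> 2, ..., 1 -> 2048 */
}

int main(void)
{
	for (int prio = DEF_PRIORITY; prio >= 0; prio--)
		printf("priority %2d -> %d cache(s)\n", prio,
		       caches_per_scan(prio));
	return 0;
}

On a typical system with at most a few hundred kmem caches, the lower
priorities already cover the whole caches_to_shrink list; INT_MAX at
priority 0 just makes that explicit.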
[1] https://lore.kernel.org/linux-mm/CAG48ez2Qx5K1Cab-m8BdSibp6wLTip6ro4=-umR7BLsEgjEYzA@mail.gmail.com/

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/slub_def.h |  1 +
 mm/slub.c                | 76 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 76 insertions(+), 1 deletion(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index dcde82a4434c..6c4eeb30764d 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -107,6 +107,7 @@ struct kmem_cache {
 	unsigned int red_left_pad;	/* Left redzone padding size */
 	const char *name;	/* Name (only for display!) */
 	struct list_head list;	/* List of slab caches */
+	struct list_head shrink_list;	/* List ordered for shrinking */
 #ifdef CONFIG_SYSFS
 	struct kobject kobj;	/* For sysfs */
 #endif
diff --git a/mm/slub.c b/mm/slub.c
index c3141aa962be..bba05bd9287a 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -123,6 +123,8 @@ DEFINE_STATIC_KEY_FALSE(slub_debug_enabled);
 #endif
 #endif
 
+static LIST_HEAD(caches_to_shrink);
+
 static inline bool kmem_cache_debug(struct kmem_cache *s)
 {
 	return kmem_cache_debug_flags(s, SLAB_DEBUG_FLAGS);
@@ -3933,6 +3935,8 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
 	int node;
 	struct kmem_cache_node *n;
 
+	list_del(&s->shrink_list);
+
 	flush_all(s);
 	/* Attempt to free all objects */
 	for_each_kmem_cache_node(s, node, n) {
@@ -3985,6 +3989,69 @@ void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct page *page)
 }
 #endif
 
+static unsigned long count_shrinkable_caches(struct shrinker *shrink,
+		struct shrink_control *sc)
+{
+	/*
+	 * Determining how much there is to shrink would be so complex, it's
+	 * better to just pretend there always is and scale the actual effort
+	 * based on sc->priority.
+	 */
+	return shrink->batch;
+}
+
+static unsigned long shrink_caches(struct shrinker *shrink,
+		struct shrink_control *sc)
+{
+	struct kmem_cache *s;
+	int nr_to_shrink;
+	int ret = sc->nr_to_scan / 2;
+
+	nr_to_shrink = DEF_PRIORITY - sc->priority;
+	if (nr_to_shrink < 0)
+		nr_to_shrink = 0;
+
+	nr_to_shrink = 1 << nr_to_shrink;
+	if (sc->priority == 0) {
+		nr_to_shrink = INT_MAX;
+		ret = 0;
+	}
+
+	if (!mutex_trylock(&slab_mutex))
+		return SHRINK_STOP;
+
+	list_for_each_entry(s, &caches_to_shrink, shrink_list) {
+		__kmem_cache_shrink(s);
+		if (--nr_to_shrink == 0) {
+			list_bulk_move_tail(&caches_to_shrink,
+					    caches_to_shrink.next,
+					    &s->shrink_list);
+			break;
+		}
+	}
+
+	mutex_unlock(&slab_mutex);
+
+	/*
+	 * As long as we are not at the highest priority, pretend we freed
+	 * something as we might not have processed all caches. This should
+	 * signal that it's worth retrying. Once we are at the highest
+	 * priority and shrink the whole list, pretend we didn't free anything,
+	 * because there's no point in trying again.
+	 *
+	 * Note the value is currently ultimately ignored in "normal" reclaim,
+	 * but drop_slab_node(), which handles the drop_caches sysctl, works
+	 * like this.
+	 */
+	return ret;
+}
+
+static struct shrinker slub_cache_shrinker = {
+	.count_objects = count_shrinkable_caches,
+	.scan_objects = shrink_caches,
+	.batch = 128,
+	.seeks = 0,
+};
+
 /********************************************************************
  *			Kmalloc subsystem
  *******************************************************************/
@@ -4424,6 +4491,8 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
 #endif
 	}
 	list_add(&s->list, &slab_caches);
+	list_del(&static_cache->shrink_list);
+	list_add(&s->shrink_list, &caches_to_shrink);
 	return s;
 }
 
@@ -4480,6 +4549,8 @@ void __init kmem_cache_init(void)
 
 void __init kmem_cache_init_late(void)
 {
+	if (register_shrinker(&slub_cache_shrinker))
+		pr_err("SLUB: failed to register shrinker\n");
 }
 
 struct kmem_cache *
@@ -4518,11 +4589,14 @@ int __kmem_cache_create(struct kmem_cache *s, slab_flags_t flags)
 
 	/* Mutex is not taken during early boot */
 	if (slab_state <= UP)
-		return 0;
+		goto out;
 
 	err = sysfs_slab_add(s);
 	if (err)
 		__kmem_cache_release(s);
+out:
+	if (!err)
+		list_add(&s->shrink_list, &caches_to_shrink);
 
 	return err;
 }
-- 
2.30.0
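
For completeness, a small userspace sketch (not part of the patch) of
exercising the shrinker via the drop_caches interface mentioned in the
changelog: writing "2" to /proc/sys/vm/drop_caches asks the kernel to drop
reclaimable slab objects, which with this patch includes SLUB's per-cpu
and per-node cached slabs. Requires root; equivalent to running
"echo 2 > /proc/sys/vm/drop_caches" as root.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/proc/sys/vm/drop_caches", O_WRONLY);

	if (fd < 0) {
		perror("open /proc/sys/vm/drop_caches");
		return 1;
	}
	/*
	 * "2" = drop reclaimable slab objects, see
	 * Documentation/admin-guide/sysctl/vm.rst
	 */
	if (write(fd, "2\n", 2) != 2)
		perror("write");
	close(fd);
	return 0;
}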