From mboxrd@z Thu Jan 1 00:00:00 1970
From: Vlastimil Babka <vbabka@suse.cz>
To: vbabka@suse.cz
Cc: akpm@linux-foundation.org, bigeasy@linutronix.de, cl@linux.com,
	guro@fb.com, hannes@cmpxchg.org, iamjoonsoo.kim@lge.com,
	jannh@google.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	mhocko@kernel.org, minchan@kernel.org, penberg@kernel.org,
	rientjes@google.com, shakeelb@google.com, surenb@google.com,
	tglx@linutronix.de
Subject: [RFC 2/2] mm, slub: add shrinker to reclaim cached slabs
Date: Thu, 21 Jan 2021 18:21:54 +0100
Message-Id: <20210121172154.27580-2-vbabka@suse.cz>
X-Mailer: git-send-email 2.30.0
In-Reply-To: <20210121172154.27580-1-vbabka@suse.cz>
References: <20210121172154.27580-1-vbabka@suse.cz>
MIME-Version: 1.0

For performance reasons, SLUB doesn't keep all slabs on shared lists and
doesn't always free slabs immediately after all objects are freed.
Namely:

- for each cache and cpu, there might be a "CPU slab" page, partially or
  fully free
- with SLUB_CPU_PARTIAL enabled (default y), there might be a number of
  "percpu partial slabs" for each cache and cpu, also partially or fully
  free
- for each cache and numa node, there are slabs on the per-node partial
  list, up to 10 of which may be empty

As Jann reports [1], the number of percpu partial slabs should be limited
by the number of free objects (up to 30), but due to imprecise accounting,
this can deteriorate so that there are up to 30 free slabs. He notes:

> Even on an old-ish Android phone (Pixel 2), with normal-ish usage, I
> see something like 1.5MiB of pages with zero inuse objects stuck in
> percpu lists.

My observations match Jann's, and we've seen e.g. cases with 10 free slabs
per cpu. We can also confirm Jann's theory that on kernels before the
kmemcg rewrite (in v5.9), this issue is amplified, as there are separate
sets of kmem caches with cpu caches, per-cpu partial and per-node partial
lists for each memcg and each cache that deals with kmemcg-accounted
objects.

The cached free slabs can therefore become a memory waste, making memory
pressure higher, causing more reclaim of actually used LRU pages, and even
causing OOM (global, or memcg on older kernels).

SLUB provides __kmem_cache_shrink() that can flush all the abovementioned
slabs, but it is currently called only in rare situations, or from a sysfs
handler. The standard way to cooperate with reclaim is to provide a
shrinker, and so this patch adds such a shrinker to call
__kmem_cache_shrink() systematically.

The shrinker design is however atypical. The usual design assumes that a
shrinker can easily count how many objects can be reclaimed, and then
reclaim a given number of objects. For SLUB, determining the number of the
various cached slabs would be a lot of work, and controlling how many to
shrink precisely would be impractical. Instead, the shrinker is based on
reclaim priority: at the lowest priority it shrinks a single kmem cache,
while at the highest it shrinks all of them. To do that effectively,
there's a new list caches_to_shrink, where caches are taken from its head
and then moved to its tail. The existing slab_caches list is unaffected so
that e.g. /proc/slabinfo order is not disrupted.

This approach should not cause excessive shrinking and IPI storms:

- If there are multiple reclaimers in parallel, only one can proceed,
  thanks to mutex_trylock(&slab_mutex). After unlocking, caches that were
  just shrunk are at the tail of the list.
- In flush_all(), we actually check whether a CPU has anything to flush
  (has_cpu_slab()) before sending an IPI.
- CPU slab deactivation became more efficient with "mm, slub: splice cpu
  and page freelists in deactivate_slab()".

The result is that SLUB's per-cpu and per-node caches are trimmed of free
pages, and partially used pages have a higher chance of being either
reused or freed. The trimming effort is controlled by reclaim activity and
thus memory pressure. Before an OOM, a reclaim attempt at the highest
priority ensures all caches are shrunk. Also, being a proper slab
shrinker, the shrinking is now invoked as part of the drop_caches sysctl
operation as well.
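To make the priority scaling described above concrete, here is a minimal
standalone sketch (illustration only, not part of the patch) of how
sc->priority maps to the number of caches shrunk per scan_objects() call.
It assumes DEF_PRIORITY is 12, its current mainline value.

#include <limits.h>
#include <stdio.h>

#define DEF_PRIORITY 12			/* assumed mainline value */

/* How many caches one scan_objects() call shrinks at a given priority. */
static int caches_per_scan(int priority)
{
	int shift = DEF_PRIORITY - priority;

	if (priority == 0)		/* highest priority: shrink every cache */
		return INT_MAX;
	if (shift < 0)			/* priorities above DEF_PRIORITY: one cache */
		shift = 0;
	return 1 << shift;		/* 12 -> 1, 11 -> 2, ..., 1 -> 2048 */
}

int main(void)
{
	for (int prio = DEF_PRIORITY; prio >= 0; prio--)
		printf("priority %2d -> %d cache(s)\n", prio,
		       caches_per_scan(prio));
	return 0;
}

On a typical system with at most a few hundred kmem caches, the lower
priorities already cover the whole caches_to_shrink list; INT_MAX at
priority 0 just makes that explicit.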
[1] https://lore.kernel.org/linux-mm/CAG48ez2Qx5K1Cab-m8BdSibp6wLTip6ro4=-umR7BLsEgjEYzA@mail.gmail.com/

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/slub_def.h |  1 +
 mm/slub.c                | 76 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 76 insertions(+), 1 deletion(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index dcde82a4434c..6c4eeb30764d 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -107,6 +107,7 @@ struct kmem_cache {
 	unsigned int red_left_pad;	/* Left redzone padding size */
 	const char *name;	/* Name (only for display!) */
 	struct list_head list;	/* List of slab caches */
+	struct list_head shrink_list;	/* List ordered for shrinking */
 #ifdef CONFIG_SYSFS
 	struct kobject kobj;	/* For sysfs */
 #endif
diff --git a/mm/slub.c b/mm/slub.c
index c3141aa962be..bba05bd9287a 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -123,6 +123,8 @@ DEFINE_STATIC_KEY_FALSE(slub_debug_enabled);
 #endif
 #endif
 
+static LIST_HEAD(caches_to_shrink);
+
 static inline bool kmem_cache_debug(struct kmem_cache *s)
 {
 	return kmem_cache_debug_flags(s, SLAB_DEBUG_FLAGS);
@@ -3933,6 +3935,8 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
 	int node;
 	struct kmem_cache_node *n;
 
+	list_del(&s->shrink_list);
+
 	flush_all(s);
 	/* Attempt to free all objects */
 	for_each_kmem_cache_node(s, node, n) {
@@ -3985,6 +3989,69 @@ void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct page *page)
 }
 #endif
 
+static unsigned long count_shrinkable_caches(struct shrinker *shrink,
+		struct shrink_control *sc)
+{
+	/*
+	 * Determining how much there is to shrink would be so complex, it's
+	 * better to just pretend there always is and scale the actual effort
+	 * based on sc->priority.
+	 */
+	return shrink->batch;
+}
+
+static unsigned long shrink_caches(struct shrinker *shrink,
+		struct shrink_control *sc)
+{
+	struct kmem_cache *s;
+	int nr_to_shrink;
+	int ret = sc->nr_to_scan / 2;
+
+	nr_to_shrink = DEF_PRIORITY - sc->priority;
+	if (nr_to_shrink < 0)
+		nr_to_shrink = 0;
+
+	nr_to_shrink = 1 << nr_to_shrink;
+	if (sc->priority == 0) {
+		nr_to_shrink = INT_MAX;
+		ret = 0;
+	}
+
+	if (!mutex_trylock(&slab_mutex))
+		return SHRINK_STOP;
+
+	list_for_each_entry(s, &caches_to_shrink, shrink_list) {
+		__kmem_cache_shrink(s);
+		if (--nr_to_shrink == 0) {
+			list_bulk_move_tail(&caches_to_shrink,
+					    caches_to_shrink.next,
+					    &s->shrink_list);
+			break;
+		}
+	}
+
+	mutex_unlock(&slab_mutex);
+
+	/*
+	 * As long as we are not at the highest priority, pretend we freed
+	 * something as we might not have processed all caches. This should
+	 * signal that it's worth retrying. Once we are at the highest
+	 * priority and shrink the whole list, pretend we didn't free anything,
+	 * because there's no point in trying again.
+	 *
+	 * Note the value is currently ultimately ignored in "normal" reclaim,
+	 * but drop_slab_node(), which handles the drop_caches sysctl, works
+	 * like this.
+	 */
+	return ret;
+}
+
+static struct shrinker slub_cache_shrinker = {
+	.count_objects = count_shrinkable_caches,
+	.scan_objects = shrink_caches,
+	.batch = 128,
+	.seeks = 0,
+};
+
 /********************************************************************
  *			Kmalloc subsystem
  *******************************************************************/
@@ -4424,6 +4491,8 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
 #endif
 	}
 	list_add(&s->list, &slab_caches);
+	list_del(&static_cache->shrink_list);
+	list_add(&s->shrink_list, &caches_to_shrink);
 	return s;
 }
 
@@ -4480,6 +4549,8 @@ void __init kmem_cache_init(void)
 
 void __init kmem_cache_init_late(void)
 {
+	if (register_shrinker(&slub_cache_shrinker))
+		pr_err("SLUB: failed to register shrinker\n");
 }
 
 struct kmem_cache *
@@ -4518,11 +4589,14 @@ int __kmem_cache_create(struct kmem_cache *s, slab_flags_t flags)
 
 	/* Mutex is not taken during early boot */
 	if (slab_state <= UP)
-		return 0;
+		goto out;
 
 	err = sysfs_slab_add(s);
 	if (err)
 		__kmem_cache_release(s);
+out:
+	if (!err)
+		list_add(&s->shrink_list, &caches_to_shrink);
 
 	return err;
 }
-- 
2.30.0
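
For completeness, a small userspace sketch (not part of the patch) of
exercising the shrinker via the drop_caches interface mentioned in the
changelog: writing "2" to /proc/sys/vm/drop_caches asks the kernel to drop
reclaimable slab objects, which with this patch includes SLUB's per-cpu
and per-node cached slabs. Requires root; equivalent to running
"echo 2 > /proc/sys/vm/drop_caches" as root.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/proc/sys/vm/drop_caches", O_WRONLY);

	if (fd < 0) {
		perror("open /proc/sys/vm/drop_caches");
		return 1;
	}
	/*
	 * "2" = drop reclaimable slab objects, see
	 * Documentation/admin-guide/sysctl/vm.rst
	 */
	if (write(fd, "2\n", 2) != 2)
		perror("write");
	close(fd);
	return 0;
}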