All of lore.kernel.org
* [PATCH RFC v3 0/9] SLUB percpu array caches and maple tree nodes
@ 2023-11-29  9:53 Vlastimil Babka
  2023-11-29  9:53 ` [PATCH RFC v3 1/9] mm/slub: fix bulk alloc and free stats Vlastimil Babka
                   ` (9 more replies)
  0 siblings, 10 replies; 18+ messages in thread
From: Vlastimil Babka @ 2023-11-29  9:53 UTC (permalink / raw)
  To: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Matthew Wilcox, Liam R. Howlett
  Cc: Andrew Morton, Roman Gushchin, Hyeonggon Yoo,
	Alexander Potapenko, Marco Elver, Dmitry Vyukov, linux-mm,
	linux-kernel, maple-tree, kasan-dev, Vlastimil Babka

Also in git [1]. Changes since v2 [2]:

- empty cache refill/full cache flush using internal bulk operations
- bulk alloc/free operations also use the cache
- memcg, KASAN etc hooks processed when the cache is used for the
  operation - now fully transparent
- NUMA node-specific allocations now explicitly bypass the cache

[1] https://git.kernel.org/vbabka/l/slub-percpu-caches-v3r2
[2] https://lore.kernel.org/all/20230810163627.6206-9-vbabka@suse.cz/

----

At LSF/MM I've mentioned that I see several use cases for introducing
opt-in percpu arrays for caching alloc/free objects in SLUB. This is my
first exploration of this idea, specifically for the use case of maple
tree nodes. The assumptions are:

- percpu arrays will be faster than bulk alloc/free, which needs
  relatively long freelists to work well. Especially in the freeing case
  we need the nodes to come from the same slab (or a small set of slabs)

- preallocation for the worst case of needed nodes for a tree operation
  that can't reclaim due to locks is wasteful. We could instead expect
  that most of the time percpu arrays would satisfy the constrained
  allocations, and in the rare cases they do not, we can dip into
  GFP_ATOMIC reserves temporarily. So instead of preallocation, just
  prefill the arrays (see the sketch below this list).

- NUMA locality of the nodes is not a concern as the nodes of a
  process's VMA tree end up all over the place anyway.
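
As a rough sketch of the prefill idea (not verbatim from the series; the
real conversion is in patches 8-9, the API in patch 5):

	/* before: allocate the worst-case number of nodes up front */
	mas_node_count_gfp(mas, request, gfp);

	/*
	 * after: only make it likely that the percpu array can satisfy
	 * the request; a rare later miss falls back to GFP_ATOMIC
	 */
	kmem_cache_prefill_percpu_array(maple_node_cache, request, gfp);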

Patches 1-4 are preparatory, but should also work as standalone fixes
and cleanups, so I would like to add them for 6.8 after review, probably
rebased on top of the current series in slab/for-next (mainly the SLAB
removal), as that should be easier to follow than the necessary conflict
resolutions.

Patch 5 adds the per-cpu array cache support. The locking is stolen from
Mel's recent pcplists implementation in the page allocator, so it can
avoid disabling IRQs and only disables preemption, but the trylocks can
fail in rare situations - in most cases the locks are uncontended so the
locking should be cheap.

The maple tree is then modified in patches 6-9 to benefit from this. Of
those, only Liam's patches make sense on their own and the rest are my
crude hacks. Liam is already working on a better solution for the maple
tree side. I'm including this only so the bots have something that
exercises the new code. The stats below thus likely don't reflect the
full benefits that can be achieved from cache prefill vs preallocation.

I've briefly tested this with virtme VM boot and checking the stats from
CONFIG_SLUB_STATS in sysfs.

Patch 5:

SLUB per-cpu array caches are implemented, including the new counters,
but the maple tree doesn't use them yet

/sys/kernel/slab/maple_node # grep . alloc_cpu_cache alloc_*path free_cpu_cache free_*path cpu_cache* | cut -d' ' -f1
alloc_cpu_cache:0
alloc_fastpath:20213
alloc_slowpath:1741
free_cpu_cache:0
free_fastpath:10754
free_slowpath:9232
cpu_cache_flush:0
cpu_cache_refill:0

Patch 7:

the maple node cache creates a percpu array with 32 entries; nothing
else is changed

The majority of alloc/free operations are satisfied by the array; the
number of flushed/refilled objects is about 1/3 of the cached operations,
so the hit ratio is roughly 2/3. Note that the flush/refill operations
also increase the fastpath/slowpath counters, so the majority of those
counts indeed come from the flushes and refills.

alloc_cpu_cache:11880
alloc_fastpath:4131
alloc_slowpath:587
free_cpu_cache:13075
free_fastpath:437
free_slowpath:2216
cpu_cache_flush:4336
cpu_cache_refill:3216
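
(To spell out the 2/3 estimate above: cached operations are
11880 + 13075 = 24955, while refilled/flushed objects are
3216 + 4336 = 7552, i.e. roughly 1/3 of the cached operations, leaving a
hit ratio of about 2/3.)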

Patch 9:

This tries to replace maple tree's preallocation with the cache prefill.
It should reduce all of the counters, as many of the preallocations for
the worst-case scenarios are not needed in the end. But according to
Liam it's not the full solution, which probably explains why the
reduction is only modest.

alloc_cpu_cache:11540
alloc_fastpath:3756
alloc_slowpath:512
free_cpu_cache:12775
free_fastpath:388
free_slowpath:1944
cpu_cache_flush:3904
cpu_cache_refill:2742

---
Liam R. Howlett (2):
      tools: Add SLUB percpu array functions for testing
      maple_tree: Remove MA_STATE_PREALLOC

Vlastimil Babka (7):
      mm/slub: fix bulk alloc and free stats
      mm/slub: introduce __kmem_cache_free_bulk() without free hooks
      mm/slub: handle bulk and single object freeing separately
      mm/slub: free KFENCE objects in slab_free_hook()
      mm/slub: add opt-in percpu array cache of objects
      maple_tree: use slub percpu array
      maple_tree: replace preallocation with slub percpu array prefill

 include/linux/slab.h                    |   4 +
 include/linux/slub_def.h                |  12 +
 lib/maple_tree.c                        |  46 ++-
 mm/Kconfig                              |   1 +
 mm/slub.c                               | 561 +++++++++++++++++++++++++++++---
 tools/include/linux/slab.h              |   4 +
 tools/testing/radix-tree/linux.c        |  14 +
 tools/testing/radix-tree/linux/kernel.h |   1 +
 8 files changed, 578 insertions(+), 65 deletions(-)
---
base-commit: b85ea95d086471afb4ad062012a4d73cd328fa86
change-id: 20231128-slub-percpu-caches-9441892011d7

Best regards,
-- 
Vlastimil Babka <vbabka@suse.cz>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH RFC v3 1/9] mm/slub: fix bulk alloc and free stats
  2023-11-29  9:53 [PATCH RFC v3 0/9] SLUB percpu array caches and maple tree nodes Vlastimil Babka
@ 2023-11-29  9:53 ` Vlastimil Babka
  2023-11-29  9:53 ` [PATCH RFC v3 2/9] mm/slub: introduce __kmem_cache_free_bulk() without free hooks Vlastimil Babka
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Vlastimil Babka @ 2023-11-29  9:53 UTC (permalink / raw)
  To: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Matthew Wilcox, Liam R. Howlett
  Cc: Andrew Morton, Roman Gushchin, Hyeonggon Yoo,
	Alexander Potapenko, Marco Elver, Dmitry Vyukov, linux-mm,
	linux-kernel, maple-tree, kasan-dev, Vlastimil Babka

The SLUB sysfs stats enabled by CONFIG_SLUB_STATS have two deficiencies
identified wrt bulk alloc/free operations:

- Bulk allocations from cpu freelist are not counted. Add the
  ALLOC_FASTPATH counter there.

- Bulk fastpath freeing will count a list of multiple objects with a
  single FREE_FASTPATH inc. Add a stat_add() variant to count them all.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/slub.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index 63d281dfacdb..f0cd55bb4e11 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -341,6 +341,14 @@ static inline void stat(const struct kmem_cache *s, enum stat_item si)
 #endif
 }
 
+static inline void stat_add(const struct kmem_cache *s, enum stat_item si, int v)
+{
+#ifdef CONFIG_SLUB_STATS
+	raw_cpu_add(s->cpu_slab->stat[si], v);
+#endif
+}
+
+
 /*
  * Tracks for which NUMA nodes we have kmem_cache_nodes allocated.
  * Corresponds to node_state[N_NORMAL_MEMORY], but can temporarily
@@ -3784,7 +3792,7 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
 
 		local_unlock(&s->cpu_slab->lock);
 	}
-	stat(s, FREE_FASTPATH);
+	stat_add(s, FREE_FASTPATH, cnt);
 }
 #else /* CONFIG_SLUB_TINY */
 static void do_slab_free(struct kmem_cache *s,
@@ -3986,6 +3994,7 @@ static inline int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
 		c->freelist = get_freepointer(s, object);
 		p[i] = object;
 		maybe_wipe_obj_freeptr(s, p[i]);
+		stat(s, ALLOC_FASTPATH);
 	}
 	c->tid = next_tid(c->tid);
 	local_unlock_irqrestore(&s->cpu_slab->lock, irqflags);

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH RFC v3 2/9] mm/slub: introduce __kmem_cache_free_bulk() without free hooks
  2023-11-29  9:53 [PATCH RFC v3 0/9] SLUB percpu array caches and maple tree nodes Vlastimil Babka
  2023-11-29  9:53 ` [PATCH RFC v3 1/9] mm/slub: fix bulk alloc and free stats Vlastimil Babka
@ 2023-11-29  9:53 ` Vlastimil Babka
  2023-11-29  9:53 ` [PATCH RFC v3 3/9] mm/slub: handle bulk and single object freeing separately Vlastimil Babka
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Vlastimil Babka @ 2023-11-29  9:53 UTC (permalink / raw)
  To: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Matthew Wilcox, Liam R. Howlett
  Cc: Andrew Morton, Roman Gushchin, Hyeonggon Yoo,
	Alexander Potapenko, Marco Elver, Dmitry Vyukov, linux-mm,
	linux-kernel, maple-tree, kasan-dev, Vlastimil Babka

Currently, when __kmem_cache_alloc_bulk() fails, it frees back the
objects that were allocated before the failure, using
kmem_cache_free_bulk(). Because kmem_cache_free_bulk() calls the free
hooks (kasan etc.) and those expect objects processed by the post alloc
hooks, slab_post_alloc_hook() is called before kmem_cache_free_bulk().

This is wasteful, although not a big concern in practice for the very
rare error path. But in order to efficiently handle percpu array batch
refill and free in the following patch, we will also need a variant of
kmem_cache_free_bulk() that avoids the free hooks. So introduce it first
and use it in the error path too.

As a consequence, __kmem_cache_alloc_bulk() no longer needs the objcg
parameter, so remove it.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/slub.c | 33 ++++++++++++++++++++++++++-------
 1 file changed, 26 insertions(+), 7 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index f0cd55bb4e11..16748aeada8f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3919,6 +3919,27 @@ int build_detached_freelist(struct kmem_cache *s, size_t size,
 	return same;
 }
 
+/*
+ * Internal bulk free of objects that were not initialised by the post alloc
+ * hooks and thus should not be processed by the free hooks
+ */
+static void __kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
+{
+	if (!size)
+		return;
+
+	do {
+		struct detached_freelist df;
+
+		size = build_detached_freelist(s, size, p, &df);
+		if (!df.slab)
+			continue;
+
+		do_slab_free(df.s, df.slab, df.freelist, df.tail, df.cnt,
+			     _RET_IP_);
+	} while (likely(size));
+}
+
 /* Note that interrupts must be enabled when calling this function. */
 void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
 {
@@ -3940,7 +3961,7 @@ EXPORT_SYMBOL(kmem_cache_free_bulk);
 
 #ifndef CONFIG_SLUB_TINY
 static inline int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
-			size_t size, void **p, struct obj_cgroup *objcg)
+					  size_t size, void **p)
 {
 	struct kmem_cache_cpu *c;
 	unsigned long irqflags;
@@ -4004,14 +4025,13 @@ static inline int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
 
 error:
 	slub_put_cpu_ptr(s->cpu_slab);
-	slab_post_alloc_hook(s, objcg, flags, i, p, false, s->object_size);
-	kmem_cache_free_bulk(s, i, p);
+	__kmem_cache_free_bulk(s, i, p);
 	return 0;
 
 }
 #else /* CONFIG_SLUB_TINY */
 static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
-			size_t size, void **p, struct obj_cgroup *objcg)
+				   size_t size, void **p)
 {
 	int i;
 
@@ -4034,8 +4054,7 @@ static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
 	return i;
 
 error:
-	slab_post_alloc_hook(s, objcg, flags, i, p, false, s->object_size);
-	kmem_cache_free_bulk(s, i, p);
+	__kmem_cache_free_bulk(s, i, p);
 	return 0;
 }
 #endif /* CONFIG_SLUB_TINY */
@@ -4055,7 +4074,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 	if (unlikely(!s))
 		return 0;
 
-	i = __kmem_cache_alloc_bulk(s, flags, size, p, objcg);
+	i = __kmem_cache_alloc_bulk(s, flags, size, p);
 
 	/*
 	 * memcg and kmem_cache debug support and memory initialization.

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH RFC v3 3/9] mm/slub: handle bulk and single object freeing separately
  2023-11-29  9:53 [PATCH RFC v3 0/9] SLUB percpu array caches and maple tree nodes Vlastimil Babka
  2023-11-29  9:53 ` [PATCH RFC v3 1/9] mm/slub: fix bulk alloc and free stats Vlastimil Babka
  2023-11-29  9:53 ` [PATCH RFC v3 2/9] mm/slub: introduce __kmem_cache_free_bulk() without free hooks Vlastimil Babka
@ 2023-11-29  9:53 ` Vlastimil Babka
  2023-11-29  9:53 ` [PATCH RFC v3 4/9] mm/slub: free KFENCE objects in slab_free_hook() Vlastimil Babka
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Vlastimil Babka @ 2023-11-29  9:53 UTC (permalink / raw)
  To: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Matthew Wilcox, Liam R. Howlett
  Cc: Andrew Morton, Roman Gushchin, Hyeonggon Yoo,
	Alexander Potapenko, Marco Elver, Dmitry Vyukov, linux-mm,
	linux-kernel, maple-tree, kasan-dev, Vlastimil Babka

Until now we have a single function slab_free() handling both single
object freeing and bulk freeing with the necessary hooks, the latter
case requiring slab_free_freelist_hook(). It is however better to
distinguish the two scenarios for the following reasons:

- code simpler to follow for the single object case

- better code generation - although inlining should eliminate the
  slab_free_freelist_hook() in case no debugging options are enabled, it
  seems it's not perfect. When e.g. KASAN is enabled, we're imposing
  additional unnecessary overhead for single object freeing.

- preparation to add percpu array caches in later patches

Therefore, simplify slab_free() for the single object case by dropping
unnecessary parameters and calling only slab_free_hook() instead of
slab_free_freelist_hook(). Rename the bulk variant to slab_free_bulk()
and adjust callers accordingly.

While at it, flip (and document) slab_free_hook() return value so that
it returns true when the freeing can proceed, which matches the logic of
slab_free_freelist_hook() and is not confusingly the opposite.

Additionally we can simplify a bit by changing the tail parameter of
do_slab_free() when freeing a single object - instead of NULL we can set
it equal to head.

bloat-o-meter shows small code reduction with a .config that has KASAN
etc disabled:

add/remove: 0/0 grow/shrink: 0/4 up/down: 0/-118 (-118)
Function                                     old     new   delta
kmem_cache_alloc_bulk                       1203    1196      -7
kmem_cache_free                              861     835     -26
__kmem_cache_free                            741     704     -37
kmem_cache_free_bulk                         911     863     -48

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/slub.c | 57 ++++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 34 insertions(+), 23 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 16748aeada8f..7d23f10d42e6 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1770,9 +1770,12 @@ static bool freelist_corrupted(struct kmem_cache *s, struct slab *slab,
 /*
  * Hooks for other subsystems that check memory allocations. In a typical
  * production configuration these hooks all should produce no code at all.
+ *
+ * Returns true if freeing of the object can proceed, false if its reuse
+ * was delayed by KASAN quarantine.
  */
-static __always_inline bool slab_free_hook(struct kmem_cache *s,
-						void *x, bool init)
+static __always_inline
+bool slab_free_hook(struct kmem_cache *s, void *x, bool init)
 {
 	kmemleak_free_recursive(x, s->flags);
 	kmsan_slab_free(s, x);
@@ -1805,7 +1808,7 @@ static __always_inline bool slab_free_hook(struct kmem_cache *s,
 		       s->size - s->inuse - rsize);
 	}
 	/* KASAN might put x into memory quarantine, delaying its reuse. */
-	return kasan_slab_free(s, x, init);
+	return !kasan_slab_free(s, x, init);
 }
 
 static inline bool slab_free_freelist_hook(struct kmem_cache *s,
@@ -1815,7 +1818,7 @@ static inline bool slab_free_freelist_hook(struct kmem_cache *s,
 
 	void *object;
 	void *next = *head;
-	void *old_tail = *tail ? *tail : *head;
+	void *old_tail = *tail;
 
 	if (is_kfence_address(next)) {
 		slab_free_hook(s, next, false);
@@ -1831,7 +1834,7 @@ static inline bool slab_free_freelist_hook(struct kmem_cache *s,
 		next = get_freepointer(s, object);
 
 		/* If object's reuse doesn't have to be delayed */
-		if (!slab_free_hook(s, object, slab_want_init_on_free(s))) {
+		if (slab_free_hook(s, object, slab_want_init_on_free(s))) {
 			/* Move object to the new freelist */
 			set_freepointer(s, object, *head);
 			*head = object;
@@ -1846,9 +1849,6 @@ static inline bool slab_free_freelist_hook(struct kmem_cache *s,
 		}
 	} while (object != old_tail);
 
-	if (*head == *tail)
-		*tail = NULL;
-
 	return *head != NULL;
 }
 
@@ -3743,7 +3743,6 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
 				struct slab *slab, void *head, void *tail,
 				int cnt, unsigned long addr)
 {
-	void *tail_obj = tail ? : head;
 	struct kmem_cache_cpu *c;
 	unsigned long tid;
 	void **freelist;
@@ -3762,14 +3761,14 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
 	barrier();
 
 	if (unlikely(slab != c->slab)) {
-		__slab_free(s, slab, head, tail_obj, cnt, addr);
+		__slab_free(s, slab, head, tail, cnt, addr);
 		return;
 	}
 
 	if (USE_LOCKLESS_FAST_PATH()) {
 		freelist = READ_ONCE(c->freelist);
 
-		set_freepointer(s, tail_obj, freelist);
+		set_freepointer(s, tail, freelist);
 
 		if (unlikely(!__update_cpu_freelist_fast(s, freelist, head, tid))) {
 			note_cmpxchg_failure("slab_free", s, tid);
@@ -3786,7 +3785,7 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
 		tid = c->tid;
 		freelist = c->freelist;
 
-		set_freepointer(s, tail_obj, freelist);
+		set_freepointer(s, tail, freelist);
 		c->freelist = head;
 		c->tid = next_tid(tid);
 
@@ -3799,15 +3798,27 @@ static void do_slab_free(struct kmem_cache *s,
 				struct slab *slab, void *head, void *tail,
 				int cnt, unsigned long addr)
 {
-	void *tail_obj = tail ? : head;
-
-	__slab_free(s, slab, head, tail_obj, cnt, addr);
+	__slab_free(s, slab, head, tail, cnt, addr);
 }
 #endif /* CONFIG_SLUB_TINY */
 
-static __fastpath_inline void slab_free(struct kmem_cache *s, struct slab *slab,
-				      void *head, void *tail, void **p, int cnt,
-				      unsigned long addr)
+static __fastpath_inline
+void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
+	       unsigned long addr)
+{
+	bool init;
+
+	memcg_slab_free_hook(s, slab, &object, 1);
+
+	init = !is_kfence_address(object) && slab_want_init_on_free(s);
+
+	if (likely(slab_free_hook(s, object, init)))
+		do_slab_free(s, slab, object, object, 1, addr);
+}
+
+static __fastpath_inline
+void slab_free_bulk(struct kmem_cache *s, struct slab *slab, void *head,
+		    void *tail, void **p, int cnt, unsigned long addr)
 {
 	memcg_slab_free_hook(s, slab, p, cnt);
 	/*
@@ -3821,13 +3832,13 @@ static __fastpath_inline void slab_free(struct kmem_cache *s, struct slab *slab,
 #ifdef CONFIG_KASAN_GENERIC
 void ___cache_free(struct kmem_cache *cache, void *x, unsigned long addr)
 {
-	do_slab_free(cache, virt_to_slab(x), x, NULL, 1, addr);
+	do_slab_free(cache, virt_to_slab(x), x, x, 1, addr);
 }
 #endif
 
 void __kmem_cache_free(struct kmem_cache *s, void *x, unsigned long caller)
 {
-	slab_free(s, virt_to_slab(x), x, NULL, &x, 1, caller);
+	slab_free(s, virt_to_slab(x), x, caller);
 }
 
 void kmem_cache_free(struct kmem_cache *s, void *x)
@@ -3836,7 +3847,7 @@ void kmem_cache_free(struct kmem_cache *s, void *x)
 	if (!s)
 		return;
 	trace_kmem_cache_free(_RET_IP_, x, s);
-	slab_free(s, virt_to_slab(x), x, NULL, &x, 1, _RET_IP_);
+	slab_free(s, virt_to_slab(x), x, _RET_IP_);
 }
 EXPORT_SYMBOL(kmem_cache_free);
 
@@ -3953,8 +3964,8 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
 		if (!df.slab)
 			continue;
 
-		slab_free(df.s, df.slab, df.freelist, df.tail, &p[size], df.cnt,
-			  _RET_IP_);
+		slab_free_bulk(df.s, df.slab, df.freelist, df.tail, &p[size],
+				df.cnt, _RET_IP_);
 	} while (likely(size));
 }
 EXPORT_SYMBOL(kmem_cache_free_bulk);

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH RFC v3 4/9] mm/slub: free KFENCE objects in slab_free_hook()
  2023-11-29  9:53 [PATCH RFC v3 0/9] SLUB percpu array caches and maple tree nodes Vlastimil Babka
                   ` (2 preceding siblings ...)
  2023-11-29  9:53 ` [PATCH RFC v3 3/9] mm/slub: handle bulk and single object freeing separately Vlastimil Babka
@ 2023-11-29  9:53 ` Vlastimil Babka
  2023-11-29 12:00   ` Marco Elver
  2023-11-29  9:53 ` [PATCH RFC v3 5/9] mm/slub: add opt-in percpu array cache of objects Vlastimil Babka
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 18+ messages in thread
From: Vlastimil Babka @ 2023-11-29  9:53 UTC (permalink / raw)
  To: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Matthew Wilcox, Liam R. Howlett
  Cc: Andrew Morton, Roman Gushchin, Hyeonggon Yoo,
	Alexander Potapenko, Marco Elver, Dmitry Vyukov, linux-mm,
	linux-kernel, maple-tree, kasan-dev, Vlastimil Babka

When freeing an object that was allocated from KFENCE, we do that in the
slowpath __slab_free(), relying on the fact that a KFENCE "slab" cannot
be the cpu slab, so the fastpath has to fall back to the slowpath.

This optimization doesn't help much though, because is_kfence_address()
is checked earlier anyway during the free hook processing or detached
freelist building. Thus we can simplify the code by making the
slab_free_hook() free the KFENCE object immediately, similarly to KASAN
quarantine.

In slab_free_hook() we can place kfence_free() above the init processing,
as callers have been making sure to set init to false for KFENCE objects.
This simplifies slab_free(). It also places kfence_free() above
kasan_slab_free(), which is OK as that skips KFENCE objects anyway.

While at it also determine the init value in slab_free_freelist_hook()
outside of the loop.

This change will also make introducing per cpu array caches easier.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/slub.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 7d23f10d42e6..59912a376c6d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1772,7 +1772,7 @@ static bool freelist_corrupted(struct kmem_cache *s, struct slab *slab,
  * production configuration these hooks all should produce no code at all.
  *
  * Returns true if freeing of the object can proceed, false if its reuse
- * was delayed by KASAN quarantine.
+ * was delayed by KASAN quarantine, or it was returned to KFENCE.
  */
 static __always_inline
 bool slab_free_hook(struct kmem_cache *s, void *x, bool init)
@@ -1790,6 +1790,9 @@ bool slab_free_hook(struct kmem_cache *s, void *x, bool init)
 		__kcsan_check_access(x, s->object_size,
 				     KCSAN_ACCESS_WRITE | KCSAN_ACCESS_ASSERT);
 
+	if (kfence_free(kasan_reset_tag(x)))
+		return false;
+
 	/*
 	 * As memory initialization might be integrated into KASAN,
 	 * kasan_slab_free and initialization memset's must be
@@ -1819,22 +1822,25 @@ static inline bool slab_free_freelist_hook(struct kmem_cache *s,
 	void *object;
 	void *next = *head;
 	void *old_tail = *tail;
+	bool init;
 
 	if (is_kfence_address(next)) {
 		slab_free_hook(s, next, false);
-		return true;
+		return false;
 	}
 
 	/* Head and tail of the reconstructed freelist */
 	*head = NULL;
 	*tail = NULL;
 
+	init = slab_want_init_on_free(s);
+
 	do {
 		object = next;
 		next = get_freepointer(s, object);
 
 		/* If object's reuse doesn't have to be delayed */
-		if (slab_free_hook(s, object, slab_want_init_on_free(s))) {
+		if (slab_free_hook(s, object, init)) {
 			/* Move object to the new freelist */
 			set_freepointer(s, object, *head);
 			*head = object;
@@ -3619,9 +3625,6 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
 
 	stat(s, FREE_SLOWPATH);
 
-	if (kfence_free(head))
-		return;
-
 	if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
 		free_to_partial_list(s, slab, head, tail, cnt, addr);
 		return;
@@ -3806,13 +3809,9 @@ static __fastpath_inline
 void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
 	       unsigned long addr)
 {
-	bool init;
-
 	memcg_slab_free_hook(s, slab, &object, 1);
 
-	init = !is_kfence_address(object) && slab_want_init_on_free(s);
-
-	if (likely(slab_free_hook(s, object, init)))
+	if (likely(slab_free_hook(s, object, slab_want_init_on_free(s))))
 		do_slab_free(s, slab, object, object, 1, addr);
 }
 

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH RFC v3 5/9] mm/slub: add opt-in percpu array cache of objects
  2023-11-29  9:53 [PATCH RFC v3 0/9] SLUB percpu array caches and maple tree nodes Vlastimil Babka
                   ` (3 preceding siblings ...)
  2023-11-29  9:53 ` [PATCH RFC v3 4/9] mm/slub: free KFENCE objects in slab_free_hook() Vlastimil Babka
@ 2023-11-29  9:53 ` Vlastimil Babka
  2023-11-29 10:35   ` Marco Elver
  2023-12-15 18:28   ` Suren Baghdasaryan
  2023-11-29  9:53 ` [PATCH RFC v3 6/9] tools: Add SLUB percpu array functions for testing Vlastimil Babka
                   ` (4 subsequent siblings)
  9 siblings, 2 replies; 18+ messages in thread
From: Vlastimil Babka @ 2023-11-29  9:53 UTC (permalink / raw)
  To: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Matthew Wilcox, Liam R. Howlett
  Cc: Andrew Morton, Roman Gushchin, Hyeonggon Yoo,
	Alexander Potapenko, Marco Elver, Dmitry Vyukov, linux-mm,
	linux-kernel, maple-tree, kasan-dev, Vlastimil Babka

kmem_cache_setup_percpu_array() will allocate a per-cpu array for
caching alloc/free objects of given size for the cache. The cache
has to be created with SLAB_NO_MERGE flag.

When empty, half of the array is filled by an internal bulk alloc
operation. When full, half of the array is flushed by an internal bulk
free operation.

The array does not distinguish NUMA locality of the cached objects. If
an allocation is requested with kmem_cache_alloc_node() with numa node
not equal to NUMA_NO_NODE, the array is bypassed.

The bulk operations exposed to slab users also try to utilize the array
when possible, but leave the array empty or full and use the bulk
alloc/free only to finish the operation itself. If kmemcg is enabled and
active, bulk freeing skips the array completely as it would be less
efficient to use it.

The locking scheme is copied from the page allocator's pcplists, based
on embedded spin locks. Interrupts are not disabled, only preemption
(cpu migration on RT). Trylock is attempted to avoid deadlock due to an
interrupt; trylock failure means the array is bypassed.

Sysfs stat counters alloc_cpu_cache and free_cpu_cache count objects
allocated or freed using the percpu array; counters cpu_cache_refill and
cpu_cache_flush count objects refilled or flushed from the array.

kmem_cache_prefill_percpu_array() can be called to fill the array on
the current cpu to at least the given number of objects. However this is
only opportunistic as there's no cpu pinning between the prefill and
usage, and trylocks may fail when the usage is in an irq handler.
Therefore allocations cannot rely on the array for success even after
the prefill. But misses should be rare enough that e.g. GFP_ATOMIC
allocations should be acceptable after the prefill.
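
A minimal usage sketch of the opt-in API (the cache "foo_cache" and type
"struct foo" are hypothetical; the maple tree patches later in this
series are the real user):

	/* hypothetical cache, set up e.g. from an init function */
	static struct kmem_cache *foo_cache;

	foo_cache = kmem_cache_create("foo", sizeof(struct foo), 0,
				      SLAB_NO_MERGE, NULL);
	if (!foo_cache)
		return -ENOMEM;

	/* opt in to a 32-entry per-cpu array cache */
	ret = kmem_cache_setup_percpu_array(foo_cache, 32);
	if (ret)
		return ret;

	/* before entering a context that cannot reclaim */
	ret = kmem_cache_prefill_percpu_array(foo_cache, 8, GFP_KERNEL);
	if (ret)
		return ret;

	/* in that context, a rare array miss dips into atomic reserves */
	obj = kmem_cache_alloc(foo_cache, GFP_ATOMIC);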

When slub_debug is enabled for a cache with percpu array, the objects in
the array are considered as allocated from the slub_debug perspective,
and the alloc/free debugging hooks occur when moving the objects between
the array and slab pages. This means that e.g. a use-after-free that
occurs for an object cached in the array is undetected. Collected
alloc/free stacktraces might also be less useful. This limitation could
be changed in the future.

On the other hand, KASAN, kmemcg and other hooks are executed on actual
allocations and frees by kmem_cache users even if those use the array,
so their debugging or accounting accuracy should be unaffected.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/slab.h     |   4 +
 include/linux/slub_def.h |  12 ++
 mm/Kconfig               |   1 +
 mm/slub.c                | 457 ++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 468 insertions(+), 6 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index d6d6ffeeb9a2..fe0c0981be59 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -197,6 +197,8 @@ struct kmem_cache *kmem_cache_create_usercopy(const char *name,
 void kmem_cache_destroy(struct kmem_cache *s);
 int kmem_cache_shrink(struct kmem_cache *s);
 
+int kmem_cache_setup_percpu_array(struct kmem_cache *s, unsigned int count);
+
 /*
  * Please use this macro to create slab caches. Simply specify the
  * name of the structure and maybe some flags that are listed above.
@@ -512,6 +514,8 @@ void kmem_cache_free(struct kmem_cache *s, void *objp);
 void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p);
 int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, void **p);
 
+int kmem_cache_prefill_percpu_array(struct kmem_cache *s, unsigned int count, gfp_t gfp);
+
 static __always_inline void kfree_bulk(size_t size, void **p)
 {
 	kmem_cache_free_bulk(NULL, size, p);
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index deb90cf4bffb..2083aa849766 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -13,8 +13,10 @@
 #include <linux/local_lock.h>
 
 enum stat_item {
+	ALLOC_PCA,		/* Allocation from percpu array cache */
 	ALLOC_FASTPATH,		/* Allocation from cpu slab */
 	ALLOC_SLOWPATH,		/* Allocation by getting a new cpu slab */
+	FREE_PCA,		/* Free to percpu array cache */
 	FREE_FASTPATH,		/* Free to cpu slab */
 	FREE_SLOWPATH,		/* Freeing not to cpu slab */
 	FREE_FROZEN,		/* Freeing to frozen slab */
@@ -39,6 +41,8 @@ enum stat_item {
 	CPU_PARTIAL_FREE,	/* Refill cpu partial on free */
 	CPU_PARTIAL_NODE,	/* Refill cpu partial from node partial */
 	CPU_PARTIAL_DRAIN,	/* Drain cpu partial to node partial */
+	PCA_REFILL,		/* Refilling empty percpu array cache */
+	PCA_FLUSH,		/* Flushing full percpu array cache */
 	NR_SLUB_STAT_ITEMS
 };
 
@@ -66,6 +70,13 @@ struct kmem_cache_cpu {
 };
 #endif /* CONFIG_SLUB_TINY */
 
+struct slub_percpu_array {
+	spinlock_t lock;
+	unsigned int count;
+	unsigned int used;
+	void * objects[];
+};
+
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 #define slub_percpu_partial(c)		((c)->partial)
 
@@ -99,6 +110,7 @@ struct kmem_cache {
 #ifndef CONFIG_SLUB_TINY
 	struct kmem_cache_cpu __percpu *cpu_slab;
 #endif
+	struct slub_percpu_array __percpu *cpu_array;
 	/* Used for retrieving partial slabs, etc. */
 	slab_flags_t flags;
 	unsigned long min_partial;
diff --git a/mm/Kconfig b/mm/Kconfig
index 89971a894b60..aa53c51bb4a6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -237,6 +237,7 @@ choice
 config SLAB_DEPRECATED
 	bool "SLAB (DEPRECATED)"
 	depends on !PREEMPT_RT
+	depends on BROKEN
 	help
 	  Deprecated and scheduled for removal in a few cycles. Replaced by
 	  SLUB.
diff --git a/mm/slub.c b/mm/slub.c
index 59912a376c6d..f08bd71c244f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -188,6 +188,79 @@ do {					\
 #define USE_LOCKLESS_FAST_PATH()	(false)
 #endif
 
+/* copy/pasted  from mm/page_alloc.c */
+
+#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT)
+/*
+ * On SMP, spin_trylock is sufficient protection.
+ * On PREEMPT_RT, spin_trylock is equivalent on both SMP and UP.
+ */
+#define pcp_trylock_prepare(flags)	do { } while (0)
+#define pcp_trylock_finish(flag)	do { } while (0)
+#else
+
+/* UP spin_trylock always succeeds so disable IRQs to prevent re-entrancy. */
+#define pcp_trylock_prepare(flags)	local_irq_save(flags)
+#define pcp_trylock_finish(flags)	local_irq_restore(flags)
+#endif
+
+/*
+ * Locking a pcp requires a PCP lookup followed by a spinlock. To avoid
+ * a migration causing the wrong PCP to be locked and remote memory being
+ * potentially allocated, pin the task to the CPU for the lookup+lock.
+ * preempt_disable is used on !RT because it is faster than migrate_disable.
+ * migrate_disable is used on RT because otherwise RT spinlock usage is
+ * interfered with and a high priority task cannot preempt the allocator.
+ */
+#ifndef CONFIG_PREEMPT_RT
+#define pcpu_task_pin()		preempt_disable()
+#define pcpu_task_unpin()	preempt_enable()
+#else
+#define pcpu_task_pin()		migrate_disable()
+#define pcpu_task_unpin()	migrate_enable()
+#endif
+
+/*
+ * Generic helper to lookup and a per-cpu variable with an embedded spinlock.
+ * Return value should be used with equivalent unlock helper.
+ */
+#define pcpu_spin_lock(type, member, ptr)				\
+({									\
+	type *_ret;							\
+	pcpu_task_pin();						\
+	_ret = this_cpu_ptr(ptr);					\
+	spin_lock(&_ret->member);					\
+	_ret;								\
+})
+
+#define pcpu_spin_trylock(type, member, ptr)				\
+({									\
+	type *_ret;							\
+	pcpu_task_pin();						\
+	_ret = this_cpu_ptr(ptr);					\
+	if (!spin_trylock(&_ret->member)) {				\
+		pcpu_task_unpin();					\
+		_ret = NULL;						\
+	}								\
+	_ret;								\
+})
+
+#define pcpu_spin_unlock(member, ptr)					\
+({									\
+	spin_unlock(&ptr->member);					\
+	pcpu_task_unpin();						\
+})
+
+/* struct slub_percpu_array specific helpers. */
+#define pca_spin_lock(ptr)						\
+	pcpu_spin_lock(struct slub_percpu_array, lock, ptr)
+
+#define pca_spin_trylock(ptr)						\
+	pcpu_spin_trylock(struct slub_percpu_array, lock, ptr)
+
+#define pca_spin_unlock(ptr)						\
+	pcpu_spin_unlock(lock, ptr)
+
 #ifndef CONFIG_SLUB_TINY
 #define __fastpath_inline __always_inline
 #else
@@ -3454,6 +3527,78 @@ static __always_inline void maybe_wipe_obj_freeptr(struct kmem_cache *s,
 			0, sizeof(void *));
 }
 
+static bool refill_pca(struct kmem_cache *s, unsigned int count, gfp_t gfp);
+
+static __fastpath_inline
+void *alloc_from_pca(struct kmem_cache *s, gfp_t gfp)
+{
+	unsigned long __maybe_unused UP_flags;
+	struct slub_percpu_array *pca;
+	void *object;
+
+retry:
+	pcp_trylock_prepare(UP_flags);
+	pca = pca_spin_trylock(s->cpu_array);
+
+	if (unlikely(!pca)) {
+		pcp_trylock_finish(UP_flags);
+		return NULL;
+	}
+
+	if (unlikely(pca->used == 0)) {
+		unsigned int batch = pca->count / 2;
+
+		pca_spin_unlock(pca);
+		pcp_trylock_finish(UP_flags);
+
+		if (!gfpflags_allow_blocking(gfp) || in_irq())
+			return NULL;
+
+		if (refill_pca(s, batch, gfp))
+			goto retry;
+
+		return NULL;
+	}
+
+	object = pca->objects[--pca->used];
+
+	pca_spin_unlock(pca);
+	pcp_trylock_finish(UP_flags);
+
+	stat(s, ALLOC_PCA);
+
+	return object;
+}
+
+static __fastpath_inline
+int alloc_from_pca_bulk(struct kmem_cache *s, size_t size, void **p)
+{
+	unsigned long __maybe_unused UP_flags;
+	struct slub_percpu_array *pca;
+
+	pcp_trylock_prepare(UP_flags);
+	pca = pca_spin_trylock(s->cpu_array);
+
+	if (unlikely(!pca)) {
+		size = 0;
+		goto failed;
+	}
+
+	if (pca->used < size)
+		size = pca->used;
+
+	for (int i = size; i > 0;) {
+		p[--i] = pca->objects[--pca->used];
+	}
+
+	pca_spin_unlock(pca);
+	stat_add(s, ALLOC_PCA, size);
+
+failed:
+	pcp_trylock_finish(UP_flags);
+	return size;
+}
+
 /*
  * Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc)
  * have the fastpath folded into their functions. So no function call
@@ -3479,7 +3624,11 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
 	if (unlikely(object))
 		goto out;
 
-	object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
+	if (s->cpu_array && (node == NUMA_NO_NODE))
+		object = alloc_from_pca(s, gfpflags);
+
+	if (!object)
+		object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
 
 	maybe_wipe_obj_freeptr(s, object);
 	init = slab_want_init_on_alloc(gfpflags, s);
@@ -3726,6 +3875,81 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
 	discard_slab(s, slab);
 }
 
+static bool flush_pca(struct kmem_cache *s, unsigned int count);
+
+static __fastpath_inline
+bool free_to_pca(struct kmem_cache *s, void *object)
+{
+	unsigned long __maybe_unused UP_flags;
+	struct slub_percpu_array *pca;
+
+retry:
+	pcp_trylock_prepare(UP_flags);
+	pca = pca_spin_trylock(s->cpu_array);
+
+	if (!pca) {
+		pcp_trylock_finish(UP_flags);
+		return false;
+	}
+
+	if (pca->used == pca->count) {
+		unsigned int batch = pca->count / 2;
+
+		pca_spin_unlock(pca);
+		pcp_trylock_finish(UP_flags);
+
+		if (in_irq())
+			return false;
+
+		if (!flush_pca(s, batch))
+			return false;
+
+		goto retry;
+	}
+
+	pca->objects[pca->used++] = object;
+
+	pca_spin_unlock(pca);
+	pcp_trylock_finish(UP_flags);
+
+	stat(s, FREE_PCA);
+
+	return true;
+}
+
+static __fastpath_inline
+size_t free_to_pca_bulk(struct kmem_cache *s, size_t size, void **p)
+{
+	unsigned long __maybe_unused UP_flags;
+	struct slub_percpu_array *pca;
+	bool init;
+
+	pcp_trylock_prepare(UP_flags);
+	pca = pca_spin_trylock(s->cpu_array);
+
+	if (unlikely(!pca)) {
+		size = 0;
+		goto failed;
+	}
+
+	if (pca->count - pca->used < size)
+		size = pca->count - pca->used;
+
+	init = slab_want_init_on_free(s);
+
+	for (size_t i = 0; i < size; i++) {
+		if (likely(slab_free_hook(s, p[i], init)))
+			pca->objects[pca->used++] = p[i];
+	}
+
+	pca_spin_unlock(pca);
+	stat_add(s, FREE_PCA, size);
+
+failed:
+	pcp_trylock_finish(UP_flags);
+	return size;
+}
+
 #ifndef CONFIG_SLUB_TINY
 /*
  * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
@@ -3811,7 +4035,12 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
 {
 	memcg_slab_free_hook(s, slab, &object, 1);
 
-	if (likely(slab_free_hook(s, object, slab_want_init_on_free(s))))
+	if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s))))
+		return;
+
+	if (s->cpu_array)
+		free_to_pca(s, object);
+	else
 		do_slab_free(s, slab, object, object, 1, addr);
 }
 
@@ -3956,6 +4185,26 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
 	if (!size)
 		return;
 
+	/*
+	 * In case the objects might need memcg_slab_free_hook(), skip the array
+	 * because the hook is not effective with single objects and benefits
+	 * from groups of objects from a single slab that the detached freelist
+	 * builds. But once we build the detached freelist, it's wasteful to
+	 * throw it away and put the objects into the array.
+	 *
+	 * XXX: This test could be cache-specific if it was not possible to use
+	 * __GFP_ACCOUNT with caches that are not SLAB_ACCOUNT
+	 */
+	if (s && s->cpu_array && !memcg_kmem_online()) {
+		size_t pca_freed = free_to_pca_bulk(s, size, p);
+
+		if (pca_freed == size)
+			return;
+
+		p += pca_freed;
+		size -= pca_freed;
+	}
+
 	do {
 		struct detached_freelist df;
 
@@ -4073,7 +4322,8 @@ static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
 int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 			  void **p)
 {
-	int i;
+	int from_pca = 0;
+	int allocated = 0;
 	struct obj_cgroup *objcg = NULL;
 
 	if (!size)
@@ -4084,19 +4334,147 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 	if (unlikely(!s))
 		return 0;
 
-	i = __kmem_cache_alloc_bulk(s, flags, size, p);
+	if (s->cpu_array)
+		from_pca = alloc_from_pca_bulk(s, size, p);
+
+	if (from_pca < size) {
+		allocated = __kmem_cache_alloc_bulk(s, flags, size-from_pca,
+						    p+from_pca);
+		if (allocated == 0 && from_pca > 0) {
+			__kmem_cache_free_bulk(s, from_pca, p);
+		}
+	}
+
+	allocated += from_pca;
 
 	/*
 	 * memcg and kmem_cache debug support and memory initialization.
 	 * Done outside of the IRQ disabled fastpath loop.
 	 */
-	if (i != 0)
+	if (allocated != 0)
 		slab_post_alloc_hook(s, objcg, flags, size, p,
 			slab_want_init_on_alloc(flags, s), s->object_size);
-	return i;
+	return allocated;
 }
 EXPORT_SYMBOL(kmem_cache_alloc_bulk);
 
+static bool refill_pca(struct kmem_cache *s, unsigned int count, gfp_t gfp)
+{
+	void *objects[32];
+	unsigned int batch, allocated;
+	unsigned long __maybe_unused UP_flags;
+	struct slub_percpu_array *pca;
+
+bulk_alloc:
+	batch = min(count, 32U);
+
+	allocated = __kmem_cache_alloc_bulk(s, gfp, batch, &objects[0]);
+	if (!allocated)
+		return false;
+
+	pcp_trylock_prepare(UP_flags);
+	pca = pca_spin_trylock(s->cpu_array);
+	if (!pca) {
+		pcp_trylock_finish(UP_flags);
+		return false;
+	}
+
+	batch = min(allocated, pca->count - pca->used);
+
+	for (unsigned int i = 0; i < batch; i++) {
+		pca->objects[pca->used++] = objects[i];
+	}
+
+	pca_spin_unlock(pca);
+	pcp_trylock_finish(UP_flags);
+
+	stat_add(s, PCA_REFILL, batch);
+
+	/*
+	 * We could have migrated to a different cpu or somebody else freed to the
+	 * pca while we were bulk allocating, and now we have too many objects
+	 */
+	if (batch < allocated) {
+		__kmem_cache_free_bulk(s, allocated - batch, &objects[batch]);
+	} else {
+		count -= batch;
+		if (count > 0)
+			goto bulk_alloc;
+	}
+
+	return true;
+}
+
+static bool flush_pca(struct kmem_cache *s, unsigned int count)
+{
+	void *objects[32];
+	unsigned int batch, remaining;
+	unsigned long __maybe_unused UP_flags;
+	struct slub_percpu_array *pca;
+
+next_batch:
+	batch = min(count, 32);
+
+	pcp_trylock_prepare(UP_flags);
+	pca = pca_spin_trylock(s->cpu_array);
+	if (!pca) {
+		pcp_trylock_finish(UP_flags);
+		return false;
+	}
+
+	batch = min(batch, pca->used);
+
+	for (unsigned int i = 0; i < batch; i++) {
+		objects[i] = pca->objects[--pca->used];
+	}
+
+	remaining = pca->used;
+
+	pca_spin_unlock(pca);
+	pcp_trylock_finish(UP_flags);
+
+	__kmem_cache_free_bulk(s, batch, &objects[0]);
+
+	stat_add(s, PCA_FLUSH, batch);
+
+	if (batch < count && remaining > 0) {
+		count -= batch;
+		goto next_batch;
+	}
+
+	return true;
+}
+
+/* Do not call from irq handler nor with irqs disabled */
+int kmem_cache_prefill_percpu_array(struct kmem_cache *s, unsigned int count,
+				    gfp_t gfp)
+{
+	struct slub_percpu_array *pca;
+	unsigned int used;
+
+	lockdep_assert_no_hardirq();
+
+	if (!s->cpu_array)
+		return -EINVAL;
+
+	/* racy but we don't care */
+	pca = raw_cpu_ptr(s->cpu_array);
+
+	used = READ_ONCE(pca->used);
+
+	if (used >= count)
+		return 0;
+
+	if (pca->count < count)
+		return -EINVAL;
+
+	count -= used;
+
+	if (!refill_pca(s, count, gfp))
+		return -ENOMEM;
+
+	return 0;
+}
 
 /*
  * Object placement in a slab is made very easy because we always start at
@@ -5167,6 +5545,65 @@ int __kmem_cache_create(struct kmem_cache *s, slab_flags_t flags)
 	return 0;
 }
 
+/**
+ * kmem_cache_setup_percpu_array - Create a per-cpu array cache for the cache
+ * @s: The cache to add per-cpu array. Must be created with SLAB_NO_MERGE flag.
+ * @count: Size of the per-cpu array.
+ *
+ * After this call, allocations from the cache go through a percpu array. When
+ * it becomes empty, half is refilled with a bulk allocation. When it becomes
+ * full, half is flushed with a bulk free operation.
+ *
+ * Using the array cache is not guaranteed, i.e. it can be bypassed if its lock
+ * cannot be obtained. The array cache also does not distinguish NUMA nodes, so
+ * allocations via kmem_cache_alloc_node() with a node specified other than
+ * NUMA_NO_NODE will bypass the cache.
+ *
+ * Bulk allocation and free operations also try to use the array.
+ *
+ * kmem_cache_prefill_percpu_array() can be used to pre-fill the array cache
+ * before e.g. entering a restricted context. It is however not guaranteed that
+ * the caller will be able to subsequently consume the prefilled cache. Such
+ * failures should be however sufficiently rare so after the prefill,
+ * allocations using GFP_ATOMIC | __GFP_NOFAIL are acceptable for objects up to
+ * the prefilled amount.
+ *
+ * Limitations: when slub_debug is enabled for the cache, all relevant actions
+ * (i.e. poisoning, obtaining stacktraces) and checks happen when objects move
+ * between the array cache and slab pages, which may result in e.g. not
+ * detecting a use-after-free while the object is in the array cache, and the
+ * stacktraces may be less useful.
+ *
+ * Return: 0 if OK, -EINVAL on caches without SLAB_NO_MERGE or with the array
+ * already created, -ENOMEM when the per-cpu array creation fails.
+ */
+int kmem_cache_setup_percpu_array(struct kmem_cache *s, unsigned int count)
+{
+	int cpu;
+
+	if (WARN_ON_ONCE(!(s->flags & SLAB_NO_MERGE)))
+		return -EINVAL;
+
+	if (s->cpu_array)
+		return -EINVAL;
+
+	s->cpu_array = __alloc_percpu(struct_size(s->cpu_array, objects, count),
+					sizeof(void *));
+
+	if (!s->cpu_array)
+		return -ENOMEM;
+
+	for_each_possible_cpu(cpu) {
+		struct slub_percpu_array *pca = per_cpu_ptr(s->cpu_array, cpu);
+
+		spin_lock_init(&pca->lock);
+		pca->count = count;
+		pca->used = 0;
+	}
+
+	return 0;
+}
+
 #ifdef SLAB_SUPPORTS_SYSFS
 static int count_inuse(struct slab *slab)
 {
@@ -5944,8 +6381,10 @@ static ssize_t text##_store(struct kmem_cache *s,		\
 }								\
 SLAB_ATTR(text);						\
 
+STAT_ATTR(ALLOC_PCA, alloc_cpu_cache);
 STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath);
 STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
+STAT_ATTR(FREE_PCA, free_cpu_cache);
 STAT_ATTR(FREE_FASTPATH, free_fastpath);
 STAT_ATTR(FREE_SLOWPATH, free_slowpath);
 STAT_ATTR(FREE_FROZEN, free_frozen);
@@ -5970,6 +6409,8 @@ STAT_ATTR(CPU_PARTIAL_ALLOC, cpu_partial_alloc);
 STAT_ATTR(CPU_PARTIAL_FREE, cpu_partial_free);
 STAT_ATTR(CPU_PARTIAL_NODE, cpu_partial_node);
 STAT_ATTR(CPU_PARTIAL_DRAIN, cpu_partial_drain);
+STAT_ATTR(PCA_REFILL, cpu_cache_refill);
+STAT_ATTR(PCA_FLUSH, cpu_cache_flush);
 #endif	/* CONFIG_SLUB_STATS */
 
 #ifdef CONFIG_KFENCE
@@ -6031,8 +6472,10 @@ static struct attribute *slab_attrs[] = {
 	&remote_node_defrag_ratio_attr.attr,
 #endif
 #ifdef CONFIG_SLUB_STATS
+	&alloc_cpu_cache_attr.attr,
 	&alloc_fastpath_attr.attr,
 	&alloc_slowpath_attr.attr,
+	&free_cpu_cache_attr.attr,
 	&free_fastpath_attr.attr,
 	&free_slowpath_attr.attr,
 	&free_frozen_attr.attr,
@@ -6057,6 +6500,8 @@ static struct attribute *slab_attrs[] = {
 	&cpu_partial_free_attr.attr,
 	&cpu_partial_node_attr.attr,
 	&cpu_partial_drain_attr.attr,
+	&cpu_cache_refill_attr.attr,
+	&cpu_cache_flush_attr.attr,
 #endif
 #ifdef CONFIG_FAILSLAB
 	&failslab_attr.attr,

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH RFC v3 6/9] tools: Add SLUB percpu array functions for testing
  2023-11-29  9:53 [PATCH RFC v3 0/9] SLUB percpu array caches and maple tree nodes Vlastimil Babka
                   ` (4 preceding siblings ...)
  2023-11-29  9:53 ` [PATCH RFC v3 5/9] mm/slub: add opt-in percpu array cache of objects Vlastimil Babka
@ 2023-11-29  9:53 ` Vlastimil Babka
  2023-11-29  9:53 ` [PATCH RFC v3 7/9] maple_tree: use slub percpu array Vlastimil Babka
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Vlastimil Babka @ 2023-11-29  9:53 UTC (permalink / raw)
  To: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Matthew Wilcox, Liam R. Howlett
  Cc: Andrew Morton, Roman Gushchin, Hyeonggon Yoo,
	Alexander Potapenko, Marco Elver, Dmitry Vyukov, linux-mm,
	linux-kernel, maple-tree, kasan-dev, Vlastimil Babka

From: "Liam R. Howlett" <Liam.Howlett@oracle.com>

Add support for the new percpu array functions to the test code so they
can be used in the maple tree testing.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 tools/include/linux/slab.h              |  4 ++++
 tools/testing/radix-tree/linux.c        | 14 ++++++++++++++
 tools/testing/radix-tree/linux/kernel.h |  1 +
 3 files changed, 19 insertions(+)

diff --git a/tools/include/linux/slab.h b/tools/include/linux/slab.h
index 311759ea25e9..1043f9c5ef4e 100644
--- a/tools/include/linux/slab.h
+++ b/tools/include/linux/slab.h
@@ -7,6 +7,7 @@
 
 #define SLAB_PANIC 2
 #define SLAB_RECLAIM_ACCOUNT    0x00020000UL            /* Objects are reclaimable */
+#define SLAB_NO_MERGE		0x01000000UL		/* Prevent merging with compatible kmem caches */
 
 #define kzalloc_node(size, flags, node) kmalloc(size, flags)
 
@@ -45,4 +46,7 @@ void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t size, void **list);
 int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
 			  void **list);
 
+int kmem_cache_setup_percpu_array(struct kmem_cache *s, unsigned int count);
+int kmem_cache_prefill_percpu_array(struct kmem_cache *s, unsigned int count,
+		gfp_t gfp);
 #endif		/* _TOOLS_SLAB_H */
diff --git a/tools/testing/radix-tree/linux.c b/tools/testing/radix-tree/linux.c
index 61fe2601cb3a..3c9372afe9bc 100644
--- a/tools/testing/radix-tree/linux.c
+++ b/tools/testing/radix-tree/linux.c
@@ -187,6 +187,20 @@ int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
 	return size;
 }
 
+int kmem_cache_setup_percpu_array(struct kmem_cache *s, unsigned int count)
+{
+	return 0;
+}
+
+int kmem_cache_prefill_percpu_array(struct kmem_cache *s, unsigned int count,
+		gfp_t gfp)
+{
+	if (count > s->non_kernel)
+		return s->non_kernel;
+
+	return count;
+}
+
 struct kmem_cache *
 kmem_cache_create(const char *name, unsigned int size, unsigned int align,
 		unsigned int flags, void (*ctor)(void *))
diff --git a/tools/testing/radix-tree/linux/kernel.h b/tools/testing/radix-tree/linux/kernel.h
index c5c9d05f29da..fc75018974de 100644
--- a/tools/testing/radix-tree/linux/kernel.h
+++ b/tools/testing/radix-tree/linux/kernel.h
@@ -15,6 +15,7 @@
 
 #define printk printf
 #define pr_err printk
+#define pr_warn printk
 #define pr_info printk
 #define pr_debug printk
 #define pr_cont printk

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH RFC v3 7/9] maple_tree: use slub percpu array
  2023-11-29  9:53 [PATCH RFC v3 0/9] SLUB percpu array caches and maple tree nodes Vlastimil Babka
                   ` (5 preceding siblings ...)
  2023-11-29  9:53 ` [PATCH RFC v3 6/9] tools: Add SLUB percpu array functions for testing Vlastimil Babka
@ 2023-11-29  9:53 ` Vlastimil Babka
  2023-11-29  9:53 ` [PATCH RFC v3 8/9] maple_tree: Remove MA_STATE_PREALLOC Vlastimil Babka
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Vlastimil Babka @ 2023-11-29  9:53 UTC (permalink / raw)
  To: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Matthew Wilcox, Liam R. Howlett
  Cc: Andrew Morton, Roman Gushchin, Hyeonggon Yoo,
	Alexander Potapenko, Marco Elver, Dmitry Vyukov, linux-mm,
	linux-kernel, maple-tree, kasan-dev

Just make sure the maple_node_cache has a percpu array of size 32.

Will break with CONFIG_SLAB.
---
 lib/maple_tree.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index bb24d84a4922..d9e7088fd9a7 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -6213,9 +6213,16 @@ bool mas_nomem(struct ma_state *mas, gfp_t gfp)
 
 void __init maple_tree_init(void)
 {
+	int ret;
+
 	maple_node_cache = kmem_cache_create("maple_node",
 			sizeof(struct maple_node), sizeof(struct maple_node),
-			SLAB_PANIC, NULL);
+			SLAB_PANIC | SLAB_NO_MERGE, NULL);
+
+	ret = kmem_cache_setup_percpu_array(maple_node_cache, 32);
+
+	if (ret)
+		pr_warn("error %d creating percpu_array for maple_node_cache\n", ret);
 }
 
 /**

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH RFC v3 8/9] maple_tree: Remove MA_STATE_PREALLOC
  2023-11-29  9:53 [PATCH RFC v3 0/9] SLUB percpu array caches and maple tree nodes Vlastimil Babka
                   ` (6 preceding siblings ...)
  2023-11-29  9:53 ` [PATCH RFC v3 7/9] maple_tree: use slub percpu array Vlastimil Babka
@ 2023-11-29  9:53 ` Vlastimil Babka
  2023-11-29  9:53 ` [PATCH RFC v3 9/9] maple_tree: replace preallocation with slub percpu array prefill Vlastimil Babka
  2023-11-29 20:16 ` [PATCH RFC v3 0/9] SLUB percpu array caches and maple tree nodes Christoph Lameter (Ampere)
  9 siblings, 0 replies; 18+ messages in thread
From: Vlastimil Babka @ 2023-11-29  9:53 UTC (permalink / raw)
  To: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Matthew Wilcox, Liam R. Howlett
  Cc: Andrew Morton, Roman Gushchin, Hyeonggon Yoo,
	Alexander Potapenko, Marco Elver, Dmitry Vyukov, linux-mm,
	linux-kernel, maple-tree, kasan-dev, Vlastimil Babka

From: "Liam R. Howlett" <Liam.Howlett@oracle.com>

MA_STATE_PREALLOC was added to catch any writes that try to allocate when
the maple state is being used in preallocation mode.  This can safely be
removed in favour of the percpu array of nodes.

Note that mas_expected_entries() still expects no allocations during
operation and so MA_STATE_BULK can be used in place of preallocations
for this case, which is primarily used for forking.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 lib/maple_tree.c | 20 ++++++--------------
 1 file changed, 6 insertions(+), 14 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index d9e7088fd9a7..f5c0bca2c5d7 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -68,11 +68,9 @@
  * Maple state flags
  * * MA_STATE_BULK		- Bulk insert mode
  * * MA_STATE_REBALANCE		- Indicate a rebalance during bulk insert
- * * MA_STATE_PREALLOC		- Preallocated nodes, WARN_ON allocation
  */
 #define MA_STATE_BULK		1
 #define MA_STATE_REBALANCE	2
-#define MA_STATE_PREALLOC	4
 
 #define ma_parent_ptr(x) ((struct maple_pnode *)(x))
 #define mas_tree_parent(x) ((unsigned long)(x->tree) | MA_ROOT_PARENT)
@@ -1255,11 +1253,8 @@ static inline void mas_alloc_nodes(struct ma_state *mas, gfp_t gfp)
 		return;
 
 	mas_set_alloc_req(mas, 0);
-	if (mas->mas_flags & MA_STATE_PREALLOC) {
-		if (allocated)
-			return;
-		WARN_ON(!allocated);
-	}
+	if (mas->mas_flags & MA_STATE_BULK)
+		return;
 
 	if (!allocated || mas->alloc->node_count == MAPLE_ALLOC_SLOTS) {
 		node = (struct maple_alloc *)mt_alloc_one(gfp);
@@ -5518,7 +5513,6 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
 	/* node store, slot store needs one node */
 ask_now:
 	mas_node_count_gfp(mas, request, gfp);
-	mas->mas_flags |= MA_STATE_PREALLOC;
 	if (likely(!mas_is_err(mas)))
 		return 0;
 
@@ -5561,7 +5555,7 @@ void mas_destroy(struct ma_state *mas)
 
 		mas->mas_flags &= ~MA_STATE_REBALANCE;
 	}
-	mas->mas_flags &= ~(MA_STATE_BULK|MA_STATE_PREALLOC);
+	mas->mas_flags &= ~MA_STATE_BULK;
 
 	total = mas_allocated(mas);
 	while (total) {
@@ -5610,9 +5604,6 @@ int mas_expected_entries(struct ma_state *mas, unsigned long nr_entries)
 	 * of nodes during the operation.
 	 */
 
-	/* Optimize splitting for bulk insert in-order */
-	mas->mas_flags |= MA_STATE_BULK;
-
 	/*
 	 * Avoid overflow, assume a gap between each entry and a trailing null.
 	 * If this is wrong, it just means allocation can happen during
@@ -5629,8 +5620,9 @@ int mas_expected_entries(struct ma_state *mas, unsigned long nr_entries)
 	/* Add working room for split (2 nodes) + new parents */
 	mas_node_count_gfp(mas, nr_nodes + 3, GFP_KERNEL);
 
-	/* Detect if allocations run out */
-	mas->mas_flags |= MA_STATE_PREALLOC;
+	/* Optimize splitting for bulk insert in-order */
+	mas->mas_flags |= MA_STATE_BULK;
+
 
 	if (!mas_is_err(mas))
 		return 0;

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH RFC v3 9/9] maple_tree: replace preallocation with slub percpu array prefill
  2023-11-29  9:53 [PATCH RFC v3 0/9] SLUB percpu array caches and maple tree nodes Vlastimil Babka
                   ` (7 preceding siblings ...)
  2023-11-29  9:53 ` [PATCH RFC v3 8/9] maple_tree: Remove MA_STATE_PREALLOC Vlastimil Babka
@ 2023-11-29  9:53 ` Vlastimil Babka
  2023-11-29 20:16 ` [PATCH RFC v3 0/9] SLUB percpu array caches and maple tree nodes Christoph Lameter (Ampere)
  9 siblings, 0 replies; 18+ messages in thread
From: Vlastimil Babka @ 2023-11-29  9:53 UTC (permalink / raw)
  To: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Matthew Wilcox, Liam R. Howlett
  Cc: Andrew Morton, Roman Gushchin, Hyeonggon Yoo,
	Alexander Potapenko, Marco Elver, Dmitry Vyukov, linux-mm,
	linux-kernel, maple-tree, kasan-dev

With the percpu array we can try not doing the preallocations in the
maple tree, and instead make sure the percpu array is prefilled, using
GFP_ATOMIC in places that relied on the preallocation (in case we miss
or fail the trylock on the array), i.e. mas_store_prealloc(). For now
simply add __GFP_NOFAIL there as well.
---
 lib/maple_tree.c | 17 ++++++-----------
 1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index f5c0bca2c5d7..d84a0c0fe83b 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -5452,7 +5452,12 @@ void mas_store_prealloc(struct ma_state *mas, void *entry)
 
 	mas_wr_store_setup(&wr_mas);
 	trace_ma_write(__func__, mas, 0, entry);
+
+retry:
 	mas_wr_store_entry(&wr_mas);
+	if (unlikely(mas_nomem(mas, GFP_ATOMIC | __GFP_NOFAIL)))
+		goto retry;
+
 	MAS_WR_BUG_ON(&wr_mas, mas_is_err(mas));
 	mas_destroy(mas);
 }
@@ -5471,8 +5476,6 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
 	MA_WR_STATE(wr_mas, mas, entry);
 	unsigned char node_size;
 	int request = 1;
-	int ret;
-
 
 	if (unlikely(!mas->index && mas->last == ULONG_MAX))
 		goto ask_now;
@@ -5512,16 +5515,8 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
 
 	/* node store, slot store needs one node */
 ask_now:
-	mas_node_count_gfp(mas, request, gfp);
-	if (likely(!mas_is_err(mas)))
-		return 0;
+	return kmem_cache_prefill_percpu_array(maple_node_cache, request, gfp);
 
-	mas_set_alloc_req(mas, 0);
-	ret = xa_err(mas->node);
-	mas_reset(mas);
-	mas_destroy(mas);
-	mas_reset(mas);
-	return ret;
 }
 EXPORT_SYMBOL_GPL(mas_preallocate);
 

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC v3 5/9] mm/slub: add opt-in percpu array cache of objects
  2023-11-29  9:53 ` [PATCH RFC v3 5/9] mm/slub: add opt-in percpu array cache of objects Vlastimil Babka
@ 2023-11-29 10:35   ` Marco Elver
  2023-12-15 18:28   ` Suren Baghdasaryan
  1 sibling, 0 replies; 18+ messages in thread
From: Marco Elver @ 2023-11-29 10:35 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Matthew Wilcox, Liam R. Howlett, Andrew Morton, Roman Gushchin,
	Hyeonggon Yoo, Alexander Potapenko, Dmitry Vyukov, linux-mm,
	linux-kernel, maple-tree, kasan-dev

On Wed, 29 Nov 2023 at 10:53, Vlastimil Babka <vbabka@suse.cz> wrote:
>
> kmem_cache_setup_percpu_array() will allocate a per-cpu array for
> caching alloc/free objects of given size for the cache. The cache
> has to be created with SLAB_NO_MERGE flag.
>
> When empty, half of the array is filled by an internal bulk alloc
> operation. When full, half of the array is flushed by an internal bulk
> free operation.
>
> The array does not distinguish NUMA locality of the cached objects. If
> an allocation is requested with kmem_cache_alloc_node() with numa node
> not equal to NUMA_NO_NODE, the array is bypassed.
>
> The bulk operations exposed to slab users also try to utilize the array
> when possible, but leave the array empty or full and use the bulk
> alloc/free only to finish the operation itself. If kmemcg is enabled and
> active, bulk freeing skips the array completely as it would be less
> efficient to use it.
>
> The locking scheme is copied from the page allocator's pcplists, based
> on embedded spin locks. Interrupts are not disabled, only preemption
> (cpu migration on RT). Trylock is attempted to avoid deadlock due to an
> interrupt; trylock failure means the array is bypassed.
>
> Sysfs stat counters alloc_cpu_cache and free_cpu_cache count objects
> allocated or freed using the percpu array; counters cpu_cache_refill and
> cpu_cache_flush count objects refilled or flushed from the array.
>
> kmem_cache_prefill_percpu_array() can be called to fill the array on the
> current cpu to at least the given number of objects. However this is
> only opportunistic as there's no cpu pinning between the prefill and
> usage, and trylocks may fail when the usage is in an irq handler.
> Therefore allocations cannot rely on the array for success even after
> the prefill. But misses should be rare enough that e.g. GFP_ATOMIC
> allocations should be acceptable after the refill.
>
> When slub_debug is enabled for a cache with percpu array, the objects in
> the array are considered as allocated from the slub_debug perspective,
> and the alloc/free debugging hooks occur when moving the objects between
> the array and slab pages. This means that e.g. a use-after-free that
> occurs for an object cached in the array is undetected. Collected
> alloc/free stacktraces might also be less useful. This limitation could
> be changed in the future.
>
> On the other hand, KASAN, kmemcg and other hooks are executed on actual
> allocations and frees by kmem_cache users even if those use the array,
> so their debugging or accounting accuracy should be unaffected.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  include/linux/slab.h     |   4 +
>  include/linux/slub_def.h |  12 ++
>  mm/Kconfig               |   1 +
>  mm/slub.c                | 457 ++++++++++++++++++++++++++++++++++++++++++++++-
>  4 files changed, 468 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index d6d6ffeeb9a2..fe0c0981be59 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -197,6 +197,8 @@ struct kmem_cache *kmem_cache_create_usercopy(const char *name,
>  void kmem_cache_destroy(struct kmem_cache *s);
>  int kmem_cache_shrink(struct kmem_cache *s);
>
> +int kmem_cache_setup_percpu_array(struct kmem_cache *s, unsigned int count);
> +
>  /*
>   * Please use this macro to create slab caches. Simply specify the
>   * name of the structure and maybe some flags that are listed above.
> @@ -512,6 +514,8 @@ void kmem_cache_free(struct kmem_cache *s, void *objp);
>  void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p);
>  int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, void **p);
>
> +int kmem_cache_prefill_percpu_array(struct kmem_cache *s, unsigned int count, gfp_t gfp);
> +
>  static __always_inline void kfree_bulk(size_t size, void **p)
>  {
>         kmem_cache_free_bulk(NULL, size, p);
> diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
> index deb90cf4bffb..2083aa849766 100644
> --- a/include/linux/slub_def.h
> +++ b/include/linux/slub_def.h
> @@ -13,8 +13,10 @@
>  #include <linux/local_lock.h>
>
>  enum stat_item {
> +       ALLOC_PCA,              /* Allocation from percpu array cache */
>         ALLOC_FASTPATH,         /* Allocation from cpu slab */
>         ALLOC_SLOWPATH,         /* Allocation by getting a new cpu slab */
> +       FREE_PCA,               /* Free to percpu array cache */
>         FREE_FASTPATH,          /* Free to cpu slab */
>         FREE_SLOWPATH,          /* Freeing not to cpu slab */
>         FREE_FROZEN,            /* Freeing to frozen slab */
> @@ -39,6 +41,8 @@ enum stat_item {
>         CPU_PARTIAL_FREE,       /* Refill cpu partial on free */
>         CPU_PARTIAL_NODE,       /* Refill cpu partial from node partial */
>         CPU_PARTIAL_DRAIN,      /* Drain cpu partial to node partial */
> +       PCA_REFILL,             /* Refilling empty percpu array cache */
> +       PCA_FLUSH,              /* Flushing full percpu array cache */
>         NR_SLUB_STAT_ITEMS
>  };
>
> @@ -66,6 +70,13 @@ struct kmem_cache_cpu {
>  };
>  #endif /* CONFIG_SLUB_TINY */
>
> +struct slub_percpu_array {
> +       spinlock_t lock;
> +       unsigned int count;
> +       unsigned int used;
> +       void * objects[];

checkpatch complains: "foo * bar" should be "foo *bar"

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC v3 4/9] mm/slub: free KFENCE objects in slab_free_hook()
  2023-11-29  9:53 ` [PATCH RFC v3 4/9] mm/slub: free KFENCE objects in slab_free_hook() Vlastimil Babka
@ 2023-11-29 12:00   ` Marco Elver
  0 siblings, 0 replies; 18+ messages in thread
From: Marco Elver @ 2023-11-29 12:00 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Matthew Wilcox, Liam R. Howlett, Andrew Morton, Roman Gushchin,
	Hyeonggon Yoo, Alexander Potapenko, Dmitry Vyukov, linux-mm,
	linux-kernel, maple-tree, kasan-dev

On Wed, 29 Nov 2023 at 10:53, Vlastimil Babka <vbabka@suse.cz> wrote:
>
> When freeing an object that was allocated from KFENCE, we do that in the
> slowpath __slab_free(), relying on the fact that KFENCE "slab" cannot be
> the cpu slab, so the fastpath has to fallback to the slowpath.
>
> This optimization doesn't help much though, because is_kfence_address()
> is checked earlier anyway during the free hook processing or detached
> freelist building. Thus we can simplify the code by making the
> slab_free_hook() free the KFENCE object immediately, similarly to KASAN
> quarantine.
>
> In slab_free_hook() we can place kfence_free() above init processing, as
> callers have been making sure to set init to false for KFENCE objects.
> This simplifies slab_free(). This places it also above kasan_slab_free()
> which is ok as that skips KFENCE objects anyway.
>
> While at it also determine the init value in slab_free_freelist_hook()
> outside of the loop.
>
> This change will also make introducing per cpu array caches easier.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Tested-by: Marco Elver <elver@google.com>

> ---
>  mm/slub.c | 21 ++++++++++-----------
>  1 file changed, 10 insertions(+), 11 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 7d23f10d42e6..59912a376c6d 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1772,7 +1772,7 @@ static bool freelist_corrupted(struct kmem_cache *s, struct slab *slab,
>   * production configuration these hooks all should produce no code at all.
>   *
>   * Returns true if freeing of the object can proceed, false if its reuse
> - * was delayed by KASAN quarantine.
> + * was delayed by KASAN quarantine, or it was returned to KFENCE.
>   */
>  static __always_inline
>  bool slab_free_hook(struct kmem_cache *s, void *x, bool init)
> @@ -1790,6 +1790,9 @@ bool slab_free_hook(struct kmem_cache *s, void *x, bool init)
>                 __kcsan_check_access(x, s->object_size,
>                                      KCSAN_ACCESS_WRITE | KCSAN_ACCESS_ASSERT);
>
> +       if (kfence_free(kasan_reset_tag(x)))
> +               return false;
> +
>         /*
>          * As memory initialization might be integrated into KASAN,
>          * kasan_slab_free and initialization memset's must be
> @@ -1819,22 +1822,25 @@ static inline bool slab_free_freelist_hook(struct kmem_cache *s,
>         void *object;
>         void *next = *head;
>         void *old_tail = *tail;
> +       bool init;
>
>         if (is_kfence_address(next)) {
>                 slab_free_hook(s, next, false);
> -               return true;
> +               return false;
>         }
>
>         /* Head and tail of the reconstructed freelist */
>         *head = NULL;
>         *tail = NULL;
>
> +       init = slab_want_init_on_free(s);
> +
>         do {
>                 object = next;
>                 next = get_freepointer(s, object);
>
>                 /* If object's reuse doesn't have to be delayed */
> -               if (slab_free_hook(s, object, slab_want_init_on_free(s))) {
> +               if (slab_free_hook(s, object, init)) {
>                         /* Move object to the new freelist */
>                         set_freepointer(s, object, *head);
>                         *head = object;
> @@ -3619,9 +3625,6 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>
>         stat(s, FREE_SLOWPATH);
>
> -       if (kfence_free(head))
> -               return;
> -
>         if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
>                 free_to_partial_list(s, slab, head, tail, cnt, addr);
>                 return;
> @@ -3806,13 +3809,9 @@ static __fastpath_inline
>  void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
>                unsigned long addr)
>  {
> -       bool init;
> -
>         memcg_slab_free_hook(s, slab, &object, 1);
>
> -       init = !is_kfence_address(object) && slab_want_init_on_free(s);
> -
> -       if (likely(slab_free_hook(s, object, init)))
> +       if (likely(slab_free_hook(s, object, slab_want_init_on_free(s))))
>                 do_slab_free(s, slab, object, object, 1, addr);
>  }
>
>
> --
> 2.43.0
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC v3 0/9] SLUB percpu array caches and maple tree nodes
  2023-11-29  9:53 [PATCH RFC v3 0/9] SLUB percpu array caches and maple tree nodes Vlastimil Babka
                   ` (8 preceding siblings ...)
  2023-11-29  9:53 ` [PATCH RFC v3 9/9] maple_tree: replace preallocation with slub percpu array prefill Vlastimil Babka
@ 2023-11-29 20:16 ` Christoph Lameter (Ampere)
  2023-11-29 21:20   ` Matthew Wilcox
  2023-11-30  9:14   ` Vlastimil Babka
  9 siblings, 2 replies; 18+ messages in thread
From: Christoph Lameter (Ampere) @ 2023-11-29 20:16 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Pekka Enberg, David Rientjes, Joonsoo Kim, Matthew Wilcox,
	Liam R. Howlett, Andrew Morton, Roman Gushchin, Hyeonggon Yoo,
	Alexander Potapenko, Marco Elver, Dmitry Vyukov, linux-mm,
	linux-kernel, maple-tree, kasan-dev

On Wed, 29 Nov 2023, Vlastimil Babka wrote:

> At LSF/MM I've mentioned that I see several use cases for introducing
> opt-in percpu arrays for caching alloc/free objects in SLUB. This is my
> first exploration of this idea, speficially for the use case of maple
> tree nodes. The assumptions are:

Hohumm... So we are not really removing SLAB but merging SLAB features 
into SLUB. In addition to per cpu slabs, we now have per cpu queues.

> - percpu arrays will be faster thank bulk alloc/free which needs
>  relatively long freelists to work well. Especially in the freeing case
>  we need the nodes to come from the same slab (or small set of those)

Percpu arrays require the code to handle individual objects. Handling 
freelists in partial SLABS means that numerous objects can be handled at 
once by handling the pointer to the list of objects.

In order to make the SLUB in page freelists work better you need to have 
larger freelist and that comes with larger page sizes. I.e. boot with
slub_min_order=5 or so to increase performance.

Also this means increasing TLB pressure. The in page freelists of SLUB 
cause objects from the same page to be served. The SLAB queueing approach
results in objects being mixed from any address and thus neighboring 
objects may require more TLB entries.

> - preallocation for the worst case of needed nodes for a tree operation
>  that can't reclaim due to locks is wasteful. We could instead expect
>  that most of the time percpu arrays would satisfy the constained
>  allocations, and in the rare cases it does not we can dip into
>  GFP_ATOMIC reserves temporarily. So instead of preallocation just
>  prefill the arrays.

The partial percpu slabs could already do the same.

> - NUMA locality of the nodes is not a concern as the nodes of a
>  process's VMA tree end up all over the place anyway.

NUMA locality is already controlled by the user through the node 
specification for percpu slabs. All objects coming from the same in page 
freelist of SLUB have the same NUMA locality which simplifies things.

If you would consider NUMA locality for the percpu array then you'd be
back to my beloved alien caches. We were not able to avoid that when we 
tuned SLAB for maximum performance.

> Patch 5 adds the per-cpu array caches support. Locking is stolen from
> Mel's recent page allocator's pcplists implementation so it can avoid
> disabling IRQs and just disable preemption, but the trylocks can fail in
> rare situations - in most cases the locks are uncontended so the locking
> should be cheap.

Ok the locking is new but the design follows basic SLAB queue handling.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC v3 0/9] SLUB percpu array caches and maple tree nodes
  2023-11-29 20:16 ` [PATCH RFC v3 0/9] SLUB percpu array caches and maple tree nodes Christoph Lameter (Ampere)
@ 2023-11-29 21:20   ` Matthew Wilcox
  2023-12-14 20:14     ` Christoph Lameter (Ampere)
  2023-11-30  9:14   ` Vlastimil Babka
  1 sibling, 1 reply; 18+ messages in thread
From: Matthew Wilcox @ 2023-11-29 21:20 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: Vlastimil Babka, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Liam R. Howlett, Andrew Morton, Roman Gushchin, Hyeonggon Yoo,
	Alexander Potapenko, Marco Elver, Dmitry Vyukov, linux-mm,
	linux-kernel, maple-tree, kasan-dev

On Wed, Nov 29, 2023 at 12:16:17PM -0800, Christoph Lameter (Ampere) wrote:
> Percpu arrays require the code to handle individual objects. Handling
> freelists in partial SLABS means that numerous objects can be handled at
> once by handling the pointer to the list of objects.

That works great until you hit degenerate cases like having one or two free
objects per slab.  Users have hit these cases and complained about them.
Arrays are much cheaper than lists, around 10x in my testing.

> In order to make the SLUB in page freelists work better you need to have
> larger freelist and that comes with larger page sizes. I.e. boot with
> slub_min_order=5 or so to increase performance.

That comes with its own problems, of course.

> Also this means increasing TLB pressure. The in page freelists of SLUB cause
> objects from the same page to be served. The SLAB queueing approach
> results in objects being mixed from any address and thus neighboring objects
> may require more TLB entries.

Is that still a concern for modern CPUs?  We're using 1GB TLB entries
these days, and there are usually thousands of TLB entries.  This feels
like more of a concern for a 90s era CPU.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC v3 0/9] SLUB percpu array caches and maple tree nodes
  2023-11-29 20:16 ` [PATCH RFC v3 0/9] SLUB percpu array caches and maple tree nodes Christoph Lameter (Ampere)
  2023-11-29 21:20   ` Matthew Wilcox
@ 2023-11-30  9:14   ` Vlastimil Babka
  1 sibling, 0 replies; 18+ messages in thread
From: Vlastimil Babka @ 2023-11-30  9:14 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: Pekka Enberg, David Rientjes, Joonsoo Kim, Matthew Wilcox,
	Liam R. Howlett, Andrew Morton, Roman Gushchin, Hyeonggon Yoo,
	Alexander Potapenko, Marco Elver, Dmitry Vyukov, linux-mm,
	linux-kernel, maple-tree, kasan-dev

On 11/29/23 21:16, Christoph Lameter (Ampere) wrote:
> On Wed, 29 Nov 2023, Vlastimil Babka wrote:
> 
>> At LSF/MM I've mentioned that I see several use cases for introducing
>> opt-in percpu arrays for caching alloc/free objects in SLUB. This is my
>> first exploration of this idea, speficially for the use case of maple
>> tree nodes. The assumptions are:
> 
> Hohumm... So we are not really removing SLAB but merging SLAB features 
> into SLUB.

Hey, you've tried a similar thing back in 2010 too :)
https://lore.kernel.org/all/20100521211541.003062117@quilx.com/

> In addition to per cpu slabs, we now have per cpu queues.

But importantly, it's very consciously opt-in. Whether the caches using
percpu arrays can also skip per cpu (partial) slabs remains to be seen.

>> - percpu arrays will be faster thank bulk alloc/free which needs
>>  relatively long freelists to work well. Especially in the freeing case
>>  we need the nodes to come from the same slab (or small set of those)
> 
> Percpu arrays require the code to handle individual objects. Handling 
> freelists in partial SLABS means that numerous objects can be handled at 
> once by handling the pointer to the list of objects.
> 
> In order to make the SLUB in page freelists work better you need to have 
> larger freelist and that comes with larger page sizes. I.e. boot with
> slub_min_order=5 or so to increase performance.

In the freeing case, you might still end up with objects mixed from
different slab pages, so the detached freelist building will be inefficient.

> Also this means increasing TLB pressure. The in page freelists of SLUB 
> cause objects from the same page to be served. The SLAB queueing approach
> results in objects being mixed from any address and thus neighboring 
> objects may require more TLB entries.

As Willy noted, we have 1GB entries in the directmap. Also we found out that
even if there are actions that cause it to fragment, it's not worth trying
to minimize the fragmentation - https://lwn.net/Articles/931406/

>> - preallocation for the worst case of needed nodes for a tree operation
>>  that can't reclaim due to locks is wasteful. We could instead expect
>>  that most of the time percpu arrays would satisfy the constained
>>  allocations, and in the rare cases it does not we can dip into
>>  GFP_ATOMIC reserves temporarily. So instead of preallocation just
>>  prefill the arrays.
> 
> The partial percpu slabs could already do the same.

Possibly for the prefill, but efficient freeing will always be an issue.

>> - NUMA locality of the nodes is not a concern as the nodes of a
>>  process's VMA tree end up all over the place anyway.
> 
> NUMA locality is already controlled by the user through the node 
> specification for percpu slabs. All objects coming from the same in page 
> freelist of SLUB have the same NUMA locality which simplifies things.
> 
> If you would consider NUMA locality for the percpu array then you'd be
> back to my beloved alien caches. We were not able to avoid that when we 
> tuned SLAB for maximum performance.

True, it's easier not to support NUMA locality.

>> Patch 5 adds the per-cpu array caches support. Locking is stolen from
>> Mel's recent page allocator's pcplists implementation so it can avoid
>> disabling IRQs and just disable preemption, but the trylocks can fail in
>> rare situations - in most cases the locks are uncontended so the locking
>> should be cheap.
> 
> Ok the locking is new but the design follows basic SLAB queue handling.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC v3 0/9] SLUB percpu array caches and maple tree nodes
  2023-11-29 21:20   ` Matthew Wilcox
@ 2023-12-14 20:14     ` Christoph Lameter (Ampere)
  0 siblings, 0 replies; 18+ messages in thread
From: Christoph Lameter (Ampere) @ 2023-12-14 20:14 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Vlastimil Babka, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Liam R. Howlett, Andrew Morton, Roman Gushchin, Hyeonggon Yoo,
	Alexander Potapenko, Marco Elver, Dmitry Vyukov, linux-mm,
	linux-kernel, maple-tree, kasan-dev

On Wed, 29 Nov 2023, Matthew Wilcox wrote:

>> In order to make the SLUB in page freelists work better you need to have
>> larger freelist and that comes with larger page sizes. I.e. boot with
>> slub_min_order=5 or so to increase performance.
>
> That comes with its own problems, of course.

Well I thought you were solving those with the folios?

>> Also this means increasing TLB pressure. The in page freelists of SLUB cause
>> objects from the same page to be served. The SLAB queueing approach
>> results in objects being mixed from any address and thus neighboring objects
>> may require more TLB entries.
>
> Is that still a concern for modern CPUs?  We're using 1GB TLB entries
> these days, and there are usually thousands of TLB entries.  This feels
> like more of a concern for a 90s era CPU.

ARM kernel memory is mapped by 4K entries by default since rodata=full is 
the default. Security concerns screw it up.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC v3 5/9] mm/slub: add opt-in percpu array cache of objects
  2023-11-29  9:53 ` [PATCH RFC v3 5/9] mm/slub: add opt-in percpu array cache of objects Vlastimil Babka
  2023-11-29 10:35   ` Marco Elver
@ 2023-12-15 18:28   ` Suren Baghdasaryan
  2023-12-15 21:17     ` Suren Baghdasaryan
  1 sibling, 1 reply; 18+ messages in thread
From: Suren Baghdasaryan @ 2023-12-15 18:28 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Matthew Wilcox, Liam R. Howlett, Andrew Morton, Roman Gushchin,
	Hyeonggon Yoo, Alexander Potapenko, Marco Elver, Dmitry Vyukov,
	linux-mm, linux-kernel, maple-tree, kasan-dev

On Wed, Nov 29, 2023 at 1:53 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> kmem_cache_setup_percpu_array() will allocate a per-cpu array for
> caching alloc/free objects of given size for the cache. The cache
> has to be created with SLAB_NO_MERGE flag.
>
> When empty, half of the array is filled by an internal bulk alloc
> operation. When full, half of the array is flushed by an internal bulk
> free operation.
>
> The array does not distinguish NUMA locality of the cached objects. If
> an allocation is requested with kmem_cache_alloc_node() with numa node
> not equal to NUMA_NO_NODE, the array is bypassed.
>
> The bulk operations exposed to slab users also try to utilize the array
> when possible, but leave the array empty or full and use the bulk
> alloc/free only to finish the operation itself. If kmemcg is enabled and
> active, bulk freeing skips the array completely as it would be less
> efficient to use it.
>
> The locking scheme is copied from the page allocator's pcplists, based
> on embedded spin locks. Interrupts are not disabled, only preemption
> (cpu migration on RT). Trylock is attempted to avoid deadlock due to an
> interrupt; trylock failure means the array is bypassed.
>
> Sysfs stat counters alloc_cpu_cache and free_cpu_cache count objects
> allocated or freed using the percpu array; counters cpu_cache_refill and
> cpu_cache_flush count objects refilled or flushed from the array.
>
> kmem_cache_prefill_percpu_array() can be called to fill the array on the
> current cpu to at least the given number of objects. However this is
> only opportunistic as there's no cpu pinning between the prefill and
> usage, and trylocks may fail when the usage is in an irq handler.
> Therefore allocations cannot rely on the array for success even after
> the prefill. But misses should be rare enough that e.g. GFP_ATOMIC
> allocations should be acceptable after the refill.
>
> When slub_debug is enabled for a cache with percpu array, the objects in
> the array are considered as allocated from the slub_debug perspective,
> and the alloc/free debugging hooks occur when moving the objects between
> the array and slab pages. This means that e.g. a use-after-free that
> occurs for an object cached in the array is undetected. Collected
> alloc/free stacktraces might also be less useful. This limitation could
> be changed in the future.
>
> On the other hand, KASAN, kmemcg and other hooks are executed on actual
> allocations and frees by kmem_cache users even if those use the array,
> so their debugging or accounting accuracy should be unaffected.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  include/linux/slab.h     |   4 +
>  include/linux/slub_def.h |  12 ++
>  mm/Kconfig               |   1 +
>  mm/slub.c                | 457 ++++++++++++++++++++++++++++++++++++++++++++++-
>  4 files changed, 468 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index d6d6ffeeb9a2..fe0c0981be59 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -197,6 +197,8 @@ struct kmem_cache *kmem_cache_create_usercopy(const char *name,
>  void kmem_cache_destroy(struct kmem_cache *s);
>  int kmem_cache_shrink(struct kmem_cache *s);
>
> +int kmem_cache_setup_percpu_array(struct kmem_cache *s, unsigned int count);
> +
>  /*
>   * Please use this macro to create slab caches. Simply specify the
>   * name of the structure and maybe some flags that are listed above.
> @@ -512,6 +514,8 @@ void kmem_cache_free(struct kmem_cache *s, void *objp);
>  void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p);
>  int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, void **p);
>
> +int kmem_cache_prefill_percpu_array(struct kmem_cache *s, unsigned int count, gfp_t gfp);
> +
>  static __always_inline void kfree_bulk(size_t size, void **p)
>  {
>         kmem_cache_free_bulk(NULL, size, p);
> diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
> index deb90cf4bffb..2083aa849766 100644
> --- a/include/linux/slub_def.h
> +++ b/include/linux/slub_def.h
> @@ -13,8 +13,10 @@
>  #include <linux/local_lock.h>
>
>  enum stat_item {
> +       ALLOC_PCA,              /* Allocation from percpu array cache */
>         ALLOC_FASTPATH,         /* Allocation from cpu slab */
>         ALLOC_SLOWPATH,         /* Allocation by getting a new cpu slab */
> +       FREE_PCA,               /* Free to percpu array cache */
>         FREE_FASTPATH,          /* Free to cpu slab */
>         FREE_SLOWPATH,          /* Freeing not to cpu slab */
>         FREE_FROZEN,            /* Freeing to frozen slab */
> @@ -39,6 +41,8 @@ enum stat_item {
>         CPU_PARTIAL_FREE,       /* Refill cpu partial on free */
>         CPU_PARTIAL_NODE,       /* Refill cpu partial from node partial */
>         CPU_PARTIAL_DRAIN,      /* Drain cpu partial to node partial */
> +       PCA_REFILL,             /* Refilling empty percpu array cache */
> +       PCA_FLUSH,              /* Flushing full percpu array cache */
>         NR_SLUB_STAT_ITEMS
>  };
>
> @@ -66,6 +70,13 @@ struct kmem_cache_cpu {
>  };
>  #endif /* CONFIG_SLUB_TINY */
>
> +struct slub_percpu_array {
> +       spinlock_t lock;
> +       unsigned int count;
> +       unsigned int used;
> +       void * objects[];
> +};
> +
>  #ifdef CONFIG_SLUB_CPU_PARTIAL
>  #define slub_percpu_partial(c)         ((c)->partial)
>
> @@ -99,6 +110,7 @@ struct kmem_cache {
>  #ifndef CONFIG_SLUB_TINY
>         struct kmem_cache_cpu __percpu *cpu_slab;
>  #endif
> +       struct slub_percpu_array __percpu *cpu_array;
>         /* Used for retrieving partial slabs, etc. */
>         slab_flags_t flags;
>         unsigned long min_partial;
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 89971a894b60..aa53c51bb4a6 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -237,6 +237,7 @@ choice
>  config SLAB_DEPRECATED
>         bool "SLAB (DEPRECATED)"
>         depends on !PREEMPT_RT
> +       depends on BROKEN
>         help
>           Deprecated and scheduled for removal in a few cycles. Replaced by
>           SLUB.
> diff --git a/mm/slub.c b/mm/slub.c
> index 59912a376c6d..f08bd71c244f 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -188,6 +188,79 @@ do {                                       \
>  #define USE_LOCKLESS_FAST_PATH()       (false)
>  #endif
>
> +/* copy/pasted  from mm/page_alloc.c */
> +
> +#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT)
> +/*
> + * On SMP, spin_trylock is sufficient protection.
> + * On PREEMPT_RT, spin_trylock is equivalent on both SMP and UP.
> + */
> +#define pcp_trylock_prepare(flags)     do { } while (0)
> +#define pcp_trylock_finish(flag)       do { } while (0)
> +#else
> +
> +/* UP spin_trylock always succeeds so disable IRQs to prevent re-entrancy. */
> +#define pcp_trylock_prepare(flags)     local_irq_save(flags)
> +#define pcp_trylock_finish(flags)      local_irq_restore(flags)
> +#endif
> +
> +/*
> + * Locking a pcp requires a PCP lookup followed by a spinlock. To avoid
> + * a migration causing the wrong PCP to be locked and remote memory being
> + * potentially allocated, pin the task to the CPU for the lookup+lock.
> + * preempt_disable is used on !RT because it is faster than migrate_disable.
> + * migrate_disable is used on RT because otherwise RT spinlock usage is
> + * interfered with and a high priority task cannot preempt the allocator.
> + */
> +#ifndef CONFIG_PREEMPT_RT
> +#define pcpu_task_pin()                preempt_disable()
> +#define pcpu_task_unpin()      preempt_enable()
> +#else
> +#define pcpu_task_pin()                migrate_disable()
> +#define pcpu_task_unpin()      migrate_enable()
> +#endif
> +
> +/*
> + * Generic helper to lookup and a per-cpu variable with an embedded spinlock.
> + * Return value should be used with equivalent unlock helper.
> + */
> +#define pcpu_spin_lock(type, member, ptr)                              \
> +({                                                                     \
> +       type *_ret;                                                     \
> +       pcpu_task_pin();                                                \
> +       _ret = this_cpu_ptr(ptr);                                       \
> +       spin_lock(&_ret->member);                                       \
> +       _ret;                                                           \
> +})
> +
> +#define pcpu_spin_trylock(type, member, ptr)                           \
> +({                                                                     \
> +       type *_ret;                                                     \
> +       pcpu_task_pin();                                                \
> +       _ret = this_cpu_ptr(ptr);                                       \
> +       if (!spin_trylock(&_ret->member)) {                             \
> +               pcpu_task_unpin();                                      \
> +               _ret = NULL;                                            \
> +       }                                                               \
> +       _ret;                                                           \
> +})
> +
> +#define pcpu_spin_unlock(member, ptr)                                  \
> +({                                                                     \
> +       spin_unlock(&ptr->member);                                      \
> +       pcpu_task_unpin();                                              \
> +})
> +
> +/* struct slub_percpu_array specific helpers. */
> +#define pca_spin_lock(ptr)                                             \
> +       pcpu_spin_lock(struct slub_percpu_array, lock, ptr)
> +
> +#define pca_spin_trylock(ptr)                                          \
> +       pcpu_spin_trylock(struct slub_percpu_array, lock, ptr)
> +
> +#define pca_spin_unlock(ptr)                                           \
> +       pcpu_spin_unlock(lock, ptr)
> +
>  #ifndef CONFIG_SLUB_TINY
>  #define __fastpath_inline __always_inline
>  #else
> @@ -3454,6 +3527,78 @@ static __always_inline void maybe_wipe_obj_freeptr(struct kmem_cache *s,
>                         0, sizeof(void *));
>  }
>
> +static bool refill_pca(struct kmem_cache *s, unsigned int count, gfp_t gfp);
> +
> +static __fastpath_inline
> +void *alloc_from_pca(struct kmem_cache *s, gfp_t gfp)
> +{
> +       unsigned long __maybe_unused UP_flags;
> +       struct slub_percpu_array *pca;
> +       void *object;
> +
> +retry:
> +       pcp_trylock_prepare(UP_flags);
> +       pca = pca_spin_trylock(s->cpu_array);
> +
> +       if (unlikely(!pca)) {
> +               pcp_trylock_finish(UP_flags);
> +               return NULL;
> +       }
> +
> +       if (unlikely(pca->used == 0)) {
> +               unsigned int batch = pca->count / 2;
> +
> +               pca_spin_unlock(pca);
> +               pcp_trylock_finish(UP_flags);
> +
> +               if (!gfpflags_allow_blocking(gfp) || in_irq())
> +                       return NULL;
> +
> +               if (refill_pca(s, batch, gfp))
> +                       goto retry;
> +
> +               return NULL;
> +       }
> +
> +       object = pca->objects[--pca->used];
> +
> +       pca_spin_unlock(pca);
> +       pcp_trylock_finish(UP_flags);
> +
> +       stat(s, ALLOC_PCA);
> +
> +       return object;
> +}
> +
> +static __fastpath_inline
> +int alloc_from_pca_bulk(struct kmem_cache *s, size_t size, void **p)
> +{
> +       unsigned long __maybe_unused UP_flags;
> +       struct slub_percpu_array *pca;
> +
> +       pcp_trylock_prepare(UP_flags);
> +       pca = pca_spin_trylock(s->cpu_array);
> +
> +       if (unlikely(!pca)) {
> +               size = 0;
> +               goto failed;
> +       }
> +
> +       if (pca->used < size)
> +               size = pca->used;
> +
> +       for (int i = size; i > 0;) {
> +               p[--i] = pca->objects[--pca->used];
> +       }
> +
> +       pca_spin_unlock(pca);
> +       stat_add(s, ALLOC_PCA, size);
> +
> +failed:
> +       pcp_trylock_finish(UP_flags);
> +       return size;
> +}
> +
>  /*
>   * Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc)
>   * have the fastpath folded into their functions. So no function call
> @@ -3479,7 +3624,11 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
>         if (unlikely(object))
>                 goto out;
>
> -       object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
> +       if (s->cpu_array && (node == NUMA_NO_NODE))
> +               object = alloc_from_pca(s, gfpflags);
> +
> +       if (!object)
> +               object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
>
>         maybe_wipe_obj_freeptr(s, object);
>         init = slab_want_init_on_alloc(gfpflags, s);
> @@ -3726,6 +3875,81 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>         discard_slab(s, slab);
>  }
>
> +static bool flush_pca(struct kmem_cache *s, unsigned int count);
> +
> +static __fastpath_inline
> +bool free_to_pca(struct kmem_cache *s, void *object)
> +{
> +       unsigned long __maybe_unused UP_flags;
> +       struct slub_percpu_array *pca;
> +
> +retry:
> +       pcp_trylock_prepare(UP_flags);
> +       pca = pca_spin_trylock(s->cpu_array);
> +
> +       if (!pca) {
> +               pcp_trylock_finish(UP_flags);
> +               return false;
> +       }
> +
> +       if (pca->used == pca->count) {
> +               unsigned int batch = pca->count / 2;
> +
> +               pca_spin_unlock(pca);
> +               pcp_trylock_finish(UP_flags);
> +
> +               if (in_irq())
> +                       return false;
> +
> +               if (!flush_pca(s, batch))
> +                       return false;
> +
> +               goto retry;
> +       }
> +
> +       pca->objects[pca->used++] = object;
> +
> +       pca_spin_unlock(pca);
> +       pcp_trylock_finish(UP_flags);
> +
> +       stat(s, FREE_PCA);
> +
> +       return true;
> +}
> +
> +static __fastpath_inline
> +size_t free_to_pca_bulk(struct kmem_cache *s, size_t size, void **p)
> +{
> +       unsigned long __maybe_unused UP_flags;
> +       struct slub_percpu_array *pca;
> +       bool init;
> +
> +       pcp_trylock_prepare(UP_flags);
> +       pca = pca_spin_trylock(s->cpu_array);
> +
> +       if (unlikely(!pca)) {
> +               size = 0;
> +               goto failed;
> +       }
> +
> +       if (pca->count - pca->used < size)
> +               size = pca->count - pca->used;
> +
> +       init = slab_want_init_on_free(s);
> +
> +       for (size_t i = 0; i < size; i++) {
> +               if (likely(slab_free_hook(s, p[i], init)))
> +                       pca->objects[pca->used++] = p[i];
> +       }
> +
> +       pca_spin_unlock(pca);
> +       stat_add(s, FREE_PCA, size);
> +
> +failed:
> +       pcp_trylock_finish(UP_flags);
> +       return size;
> +}
> +
>  #ifndef CONFIG_SLUB_TINY
>  /*
>   * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
> @@ -3811,7 +4035,12 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
>  {
>         memcg_slab_free_hook(s, slab, &object, 1);
>
> -       if (likely(slab_free_hook(s, object, slab_want_init_on_free(s))))
> +       if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s))))
> +               return;
> +
> +       if (s->cpu_array)
> +               free_to_pca(s, object);

free_to_pca() can return false and leave the object alive. I think you
need to handle the failure case here to avoid leaks.
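
Something like this (untested) would keep the existing slowpath as the
fallback when the array is full or its trylock fails:

	if (!s->cpu_array || !free_to_pca(s, object))
		do_slab_free(s, slab, object, object, 1, addr);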

> +       else
>                 do_slab_free(s, slab, object, object, 1, addr);
>  }
>
> @@ -3956,6 +4185,26 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
>         if (!size)
>                 return;
>
> +       /*
> +        * In case the objects might need memcg_slab_free_hook(), skip the array
> +        * because the hook is not effective with single objects and benefits
> +        * from groups of objects from a single slab that the detached freelist
> +        * builds. But once we build the detached freelist, it's wasteful to
> +        * throw it away and put the objects into the array.
> +        *
> +        * XXX: This test could be cache-specific if it was not possible to use
> +        * __GFP_ACCOUNT with caches that are not SLAB_ACCOUNT
> +        */
> +       if (s && s->cpu_array && !memcg_kmem_online()) {
> +               size_t pca_freed = free_to_pca_bulk(s, size, p);
> +
> +               if (pca_freed == size)
> +                       return;
> +
> +               p += pca_freed;
> +               size -= pca_freed;
> +       }
> +
>         do {
>                 struct detached_freelist df;
>
> @@ -4073,7 +4322,8 @@ static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
>  int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
>                           void **p)
>  {
> -       int i;
> +       int from_pca = 0;
> +       int allocated = 0;
>         struct obj_cgroup *objcg = NULL;
>
>         if (!size)
> @@ -4084,19 +4334,147 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
>         if (unlikely(!s))
>                 return 0;
>
> -       i = __kmem_cache_alloc_bulk(s, flags, size, p);
> +       if (s->cpu_array)
> +               from_pca = alloc_from_pca_bulk(s, size, p);
> +
> +       if (from_pca < size) {
> +               allocated = __kmem_cache_alloc_bulk(s, flags, size-from_pca,
> +                                                   p+from_pca);
> +               if (allocated == 0 && from_pca > 0) {
> +                       __kmem_cache_free_bulk(s, from_pca, p);
> +               }
> +       }
> +
> +       allocated += from_pca;
>
>         /*
>          * memcg and kmem_cache debug support and memory initialization.
>          * Done outside of the IRQ disabled fastpath loop.
>          */
> -       if (i != 0)
> +       if (allocated != 0)
>                 slab_post_alloc_hook(s, objcg, flags, size, p,
>                         slab_want_init_on_alloc(flags, s), s->object_size);
> -       return i;
> +       return allocated;
>  }
>  EXPORT_SYMBOL(kmem_cache_alloc_bulk);
>
> +static bool refill_pca(struct kmem_cache *s, unsigned int count, gfp_t gfp)
> +{
> +       void *objects[32];
> +       unsigned int batch, allocated;
> +       unsigned long __maybe_unused UP_flags;
> +       struct slub_percpu_array *pca;
> +
> +bulk_alloc:
> +       batch = min(count, 32U);

Do you cap each batch at 32 to avoid overshooting too much (same in
flush_pca())? If so, it would be good to have a comment here. Also,
maybe this hardcoded 32 should be a function of pca->count instead? If
we set up a pca array with pca->count larger than 64 then the refill
count of pca->count/2 will always end up higher than 32, so at the end
we will have to loop back (goto bulk_alloc) to allocate more objects.
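
Something like the following comment (plus maybe a named constant instead
of the open-coded 32) would make the intent explicit:

	/*
	 * Move at most 32 objects per internal bulk call so the on-stack
	 * buffer stays small; larger refills loop back via bulk_alloc.
	 */
	batch = min(count, 32U);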

> +
> +       allocated = __kmem_cache_alloc_bulk(s, gfp, batch, &objects[0]);
> +       if (!allocated)
> +               return false;
> +
> +       pcp_trylock_prepare(UP_flags);
> +       pca = pca_spin_trylock(s->cpu_array);
> +       if (!pca) {
> +               pcp_trylock_finish(UP_flags);
> +               return false;
> +       }
> +
> +       batch = min(allocated, pca->count - pca->used);
> +
> +       for (unsigned int i = 0; i < batch; i++) {
> +               pca->objects[pca->used++] = objects[i];
> +       }
> +
> +       pca_spin_unlock(pca);
> +       pcp_trylock_finish(UP_flags);
> +
> +       stat_add(s, PCA_REFILL, batch);
> +
> +       /*
> +        * We could have migrated to a different cpu or somebody else freed to the
> +        * pca while we were bulk allocating, and now we have too many objects
> +        */
> +       if (batch < allocated) {
> +               __kmem_cache_free_bulk(s, allocated - batch, &objects[batch]);
> +       } else {
> +               count -= batch;
> +               if (count > 0)
> +                       goto bulk_alloc;
> +       }
> +
> +       return true;
> +}
> +
> +static bool flush_pca(struct kmem_cache *s, unsigned int count)
> +{
> +       void *objects[32];
> +       unsigned int batch, remaining;
> +       unsigned long __maybe_unused UP_flags;
> +       struct slub_percpu_array *pca;
> +
> +next_batch:
> +       batch = min(count, 32);
> +
> +       pcp_trylock_prepare(UP_flags);
> +       pca = pca_spin_trylock(s->cpu_array);
> +       if (!pca) {
> +               pcp_trylock_finish(UP_flags);
> +               return false;
> +       }
> +
> +       batch = min(batch, pca->used);
> +
> +       for (unsigned int i = 0; i < batch; i++) {
> +               objects[i] = pca->objects[--pca->used];
> +       }
> +
> +       remaining = pca->used;
> +
> +       pca_spin_unlock(pca);
> +       pcp_trylock_finish(UP_flags);
> +
> +       __kmem_cache_free_bulk(s, batch, &objects[0]);
> +
> +       stat_add(s, PCA_FLUSH, batch);
> +
> +       if (batch < count && remaining > 0) {
> +               count -= batch;
> +               goto next_batch;
> +       }
> +
> +       return true;
> +}
> +
> +/* Do not call from irq handler nor with irqs disabled */
> +int kmem_cache_prefill_percpu_array(struct kmem_cache *s, unsigned int count,
> +                                   gfp_t gfp)
> +{
> +       struct slub_percpu_array *pca;
> +       unsigned int used;
> +
> +       lockdep_assert_no_hardirq();
> +
> +       if (!s->cpu_array)
> +               return -EINVAL;
> +
> +       /* racy but we don't care */
> +       pca = raw_cpu_ptr(s->cpu_array);
> +
> +       used = READ_ONCE(pca->used);
> +
> +       if (used >= count)
> +               return 0;
> +
> +       if (pca->count < count)
> +               return -EINVAL;
> +
> +       count -= used;
> +
> +       if (!refill_pca(s, count, gfp))
> +               return -ENOMEM;
> +
> +       return 0;
> +}
>
>  /*
>   * Object placement in a slab is made very easy because we always start at
> @@ -5167,6 +5545,65 @@ int __kmem_cache_create(struct kmem_cache *s, slab_flags_t flags)
>         return 0;
>  }
>
> +/**
> + * kmem_cache_setup_percpu_array - Create a per-cpu array cache for the cache
> + * @s: The cache to add per-cpu array. Must be created with SLAB_NO_MERGE flag.
> + * @count: Size of the per-cpu array.
> + *
> + * After this call, allocations from the cache go through a percpu array. When
> + * it becomes empty, half is refilled with a bulk allocation. When it becomes
> + * full, half is flushed with a bulk free operation.
> + *
> + * Using the array cache is not guaranteed, i.e. it can be bypassed if its lock
> + * cannot be obtained. The array cache also does not distinguish NUMA nodes, so
> + * allocations via kmem_cache_alloc_node() with a node specified other than
> + * NUMA_NO_NODE will bypass the cache.
> + *
> + * Bulk allocation and free operations also try to use the array.
> + *
> + * kmem_cache_prefill_percpu_array() can be used to pre-fill the array cache
> + * before e.g. entering a restricted context. It is however not guaranteed that
> + * the caller will be able to subsequently consume the prefilled cache. Such
> + * failures should be however sufficiently rare so after the prefill,
> + * allocations using GFP_ATOMIC | __GFP_NOFAIL are acceptable for objects up to
> + * the prefilled amount.
> + *
> + * Limitations: when slub_debug is enabled for the cache, all relevant actions
> + * (i.e. poisoning, obtaining stacktraces) and checks happen when objects move
> + * between the array cache and slab pages, which may result in e.g. not
> + * detecting a use-after-free while the object is in the array cache, and the
> + * stacktraces may be less useful.
> + *
> + * Return: 0 if OK, -EINVAL on caches without SLAB_NO_MERGE or with the array
> + * already created, -ENOMEM when the per-cpu array creation fails.
> + */
> +int kmem_cache_setup_percpu_array(struct kmem_cache *s, unsigned int count)
> +{
> +       int cpu;
> +
> +       if (WARN_ON_ONCE(!(s->flags & SLAB_NO_MERGE)))
> +               return -EINVAL;
> +
> +       if (s->cpu_array)
> +               return -EINVAL;
> +
> +       s->cpu_array = __alloc_percpu(struct_size(s->cpu_array, objects, count),
> +                                       sizeof(void *));

Maybe I missed it, but where do you free s->cpu_array? I see
__kmem_cache_release() freeing s->cpu_slab but s->cpu_array seems to
be left alive...
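
e.g. (untested), next to where s->cpu_slab is freed:

	/* in __kmem_cache_release() */
	free_percpu(s->cpu_array);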

> +
> +       if (!s->cpu_array)
> +               return -ENOMEM;
> +
> +       for_each_possible_cpu(cpu) {
> +               struct slub_percpu_array *pca = per_cpu_ptr(s->cpu_array, cpu);
> +
> +               spin_lock_init(&pca->lock);
> +               pca->count = count;
> +               pca->used = 0;
> +       }
> +
> +       return 0;
> +}
> +
>  #ifdef SLAB_SUPPORTS_SYSFS
>  static int count_inuse(struct slab *slab)
>  {
> @@ -5944,8 +6381,10 @@ static ssize_t text##_store(struct kmem_cache *s,                \
>  }                                                              \
>  SLAB_ATTR(text);                                               \
>
> +STAT_ATTR(ALLOC_PCA, alloc_cpu_cache);
>  STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath);
>  STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
> +STAT_ATTR(FREE_PCA, free_cpu_cache);
>  STAT_ATTR(FREE_FASTPATH, free_fastpath);
>  STAT_ATTR(FREE_SLOWPATH, free_slowpath);
>  STAT_ATTR(FREE_FROZEN, free_frozen);
> @@ -5970,6 +6409,8 @@ STAT_ATTR(CPU_PARTIAL_ALLOC, cpu_partial_alloc);
>  STAT_ATTR(CPU_PARTIAL_FREE, cpu_partial_free);
>  STAT_ATTR(CPU_PARTIAL_NODE, cpu_partial_node);
>  STAT_ATTR(CPU_PARTIAL_DRAIN, cpu_partial_drain);
> +STAT_ATTR(PCA_REFILL, cpu_cache_refill);
> +STAT_ATTR(PCA_FLUSH, cpu_cache_flush);
>  #endif /* CONFIG_SLUB_STATS */
>
>  #ifdef CONFIG_KFENCE
> @@ -6031,8 +6472,10 @@ static struct attribute *slab_attrs[] = {
>         &remote_node_defrag_ratio_attr.attr,
>  #endif
>  #ifdef CONFIG_SLUB_STATS
> +       &alloc_cpu_cache_attr.attr,
>         &alloc_fastpath_attr.attr,
>         &alloc_slowpath_attr.attr,
> +       &free_cpu_cache_attr.attr,
>         &free_fastpath_attr.attr,
>         &free_slowpath_attr.attr,
>         &free_frozen_attr.attr,
> @@ -6057,6 +6500,8 @@ static struct attribute *slab_attrs[] = {
>         &cpu_partial_free_attr.attr,
>         &cpu_partial_node_attr.attr,
>         &cpu_partial_drain_attr.attr,
> +       &cpu_cache_refill_attr.attr,
> +       &cpu_cache_flush_attr.attr,
>  #endif
>  #ifdef CONFIG_FAILSLAB
>         &failslab_attr.attr,
>
> --
> 2.43.0
>
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC v3 5/9] mm/slub: add opt-in percpu array cache of objects
  2023-12-15 18:28   ` Suren Baghdasaryan
@ 2023-12-15 21:17     ` Suren Baghdasaryan
  0 siblings, 0 replies; 18+ messages in thread
From: Suren Baghdasaryan @ 2023-12-15 21:17 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Matthew Wilcox, Liam R. Howlett, Andrew Morton, Roman Gushchin,
	Hyeonggon Yoo, Alexander Potapenko, Marco Elver, Dmitry Vyukov,
	linux-mm, linux-kernel, maple-tree, kasan-dev

On Fri, Dec 15, 2023 at 10:28 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Nov 29, 2023 at 1:53 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > kmem_cache_setup_percpu_array() will allocate a per-cpu array for
> > caching alloc/free objects of given size for the cache. The cache
> > has to be created with SLAB_NO_MERGE flag.
> >
> > When empty, half of the array is filled by an internal bulk alloc
> > operation. When full, half of the array is flushed by an internal bulk
> > free operation.
> >
> > The array does not distinguish NUMA locality of the cached objects. If
> > an allocation is requested with kmem_cache_alloc_node() with numa node
> > not equal to NUMA_NO_NODE, the array is bypassed.
> >
> > The bulk operations exposed to slab users also try to utilize the array
> > when possible, but leave the array empty or full and use the bulk
> > alloc/free only to finish the operation itself. If kmemcg is enabled and
> > active, bulk freeing skips the array completely as it would be less
> > efficient to use it.
> >
> > The locking scheme is copied from the page allocator's pcplists, based
> > on embedded spin locks. Interrupts are not disabled, only preemption
> > (cpu migration on RT). Trylock is attempted to avoid deadlock due to an
> > interrupt; trylock failure means the array is bypassed.
> >
> > Sysfs stat counters alloc_cpu_cache and free_cpu_cache count objects
> > allocated or freed using the percpu array; counters cpu_cache_refill and
> > cpu_cache_flush count objects refilled or flushed from the array.
> >
> > kmem_cache_prefill_percpu_array() can be called to fill the array on the
> > current cpu to at least the given number of objects. However this is
> > only opportunistic as there's no cpu pinning between the prefill and
> > usage, and trylocks may fail when the usage is in an irq handler.
> > Therefore allocations cannot rely on the array for success even after
> > the prefill. But misses should be rare enough that e.g. GFP_ATOMIC
> > allocations should be acceptable after the refill.
> >
> > When slub_debug is enabled for a cache with percpu array, the objects in
> > the array are considered as allocated from the slub_debug perspective,
> > and the alloc/free debugging hooks occur when moving the objects between
> > the array and slab pages. This means that e.g. a use-after-free that
> > occurs for an object cached in the array is undetected. Collected
> > alloc/free stacktraces might also be less useful. This limitation could
> > be changed in the future.
> >
> > On the other hand, KASAN, kmemcg and other hooks are executed on actual
> > allocations and frees by kmem_cache users even if those use the array,
> > so their debugging or accounting accuracy should be unaffected.
> >
> > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> > ---
> >  include/linux/slab.h     |   4 +
> >  include/linux/slub_def.h |  12 ++
> >  mm/Kconfig               |   1 +
> >  mm/slub.c                | 457 ++++++++++++++++++++++++++++++++++++++++++++++-
> >  4 files changed, 468 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/slab.h b/include/linux/slab.h
> > index d6d6ffeeb9a2..fe0c0981be59 100644
> > --- a/include/linux/slab.h
> > +++ b/include/linux/slab.h
> > @@ -197,6 +197,8 @@ struct kmem_cache *kmem_cache_create_usercopy(const char *name,
> >  void kmem_cache_destroy(struct kmem_cache *s);
> >  int kmem_cache_shrink(struct kmem_cache *s);
> >
> > +int kmem_cache_setup_percpu_array(struct kmem_cache *s, unsigned int count);
> > +
> >  /*
> >   * Please use this macro to create slab caches. Simply specify the
> >   * name of the structure and maybe some flags that are listed above.
> > @@ -512,6 +514,8 @@ void kmem_cache_free(struct kmem_cache *s, void *objp);
> >  void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p);
> >  int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, void **p);
> >
> > +int kmem_cache_prefill_percpu_array(struct kmem_cache *s, unsigned int count, gfp_t gfp);
> > +
> >  static __always_inline void kfree_bulk(size_t size, void **p)
> >  {
> >         kmem_cache_free_bulk(NULL, size, p);
> > diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
> > index deb90cf4bffb..2083aa849766 100644
> > --- a/include/linux/slub_def.h
> > +++ b/include/linux/slub_def.h
> > @@ -13,8 +13,10 @@
> >  #include <linux/local_lock.h>
> >
> >  enum stat_item {
> > +       ALLOC_PCA,              /* Allocation from percpu array cache */
> >         ALLOC_FASTPATH,         /* Allocation from cpu slab */
> >         ALLOC_SLOWPATH,         /* Allocation by getting a new cpu slab */
> > +       FREE_PCA,               /* Free to percpu array cache */
> >         FREE_FASTPATH,          /* Free to cpu slab */
> >         FREE_SLOWPATH,          /* Freeing not to cpu slab */
> >         FREE_FROZEN,            /* Freeing to frozen slab */
> > @@ -39,6 +41,8 @@ enum stat_item {
> >         CPU_PARTIAL_FREE,       /* Refill cpu partial on free */
> >         CPU_PARTIAL_NODE,       /* Refill cpu partial from node partial */
> >         CPU_PARTIAL_DRAIN,      /* Drain cpu partial to node partial */
> > +       PCA_REFILL,             /* Refilling empty percpu array cache */
> > +       PCA_FLUSH,              /* Flushing full percpu array cache */
> >         NR_SLUB_STAT_ITEMS
> >  };
> >
> > @@ -66,6 +70,13 @@ struct kmem_cache_cpu {
> >  };
> >  #endif /* CONFIG_SLUB_TINY */
> >
> > +struct slub_percpu_array {
> > +       spinlock_t lock;
> > +       unsigned int count;
> > +       unsigned int used;
> > +       void * objects[];
> > +};
> > +
> >  #ifdef CONFIG_SLUB_CPU_PARTIAL
> >  #define slub_percpu_partial(c)         ((c)->partial)
> >
> > @@ -99,6 +110,7 @@ struct kmem_cache {
> >  #ifndef CONFIG_SLUB_TINY
> >         struct kmem_cache_cpu __percpu *cpu_slab;
> >  #endif
> > +       struct slub_percpu_array __percpu *cpu_array;
> >         /* Used for retrieving partial slabs, etc. */
> >         slab_flags_t flags;
> >         unsigned long min_partial;
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 89971a894b60..aa53c51bb4a6 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -237,6 +237,7 @@ choice
> >  config SLAB_DEPRECATED
> >         bool "SLAB (DEPRECATED)"
> >         depends on !PREEMPT_RT
> > +       depends on BROKEN
> >         help
> >           Deprecated and scheduled for removal in a few cycles. Replaced by
> >           SLUB.
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 59912a376c6d..f08bd71c244f 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -188,6 +188,79 @@ do {                                       \
> >  #define USE_LOCKLESS_FAST_PATH()       (false)
> >  #endif
> >
> > +/* copy/pasted  from mm/page_alloc.c */
> > +
> > +#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT)
> > +/*
> > + * On SMP, spin_trylock is sufficient protection.
> > + * On PREEMPT_RT, spin_trylock is equivalent on both SMP and UP.
> > + */
> > +#define pcp_trylock_prepare(flags)     do { } while (0)
> > +#define pcp_trylock_finish(flag)       do { } while (0)
> > +#else
> > +
> > +/* UP spin_trylock always succeeds so disable IRQs to prevent re-entrancy. */
> > +#define pcp_trylock_prepare(flags)     local_irq_save(flags)
> > +#define pcp_trylock_finish(flags)      local_irq_restore(flags)
> > +#endif
> > +
> > +/*
> > + * Locking a pcp requires a PCP lookup followed by a spinlock. To avoid
> > + * a migration causing the wrong PCP to be locked and remote memory being
> > + * potentially allocated, pin the task to the CPU for the lookup+lock.
> > + * preempt_disable is used on !RT because it is faster than migrate_disable.
> > + * migrate_disable is used on RT because otherwise RT spinlock usage is
> > + * interfered with and a high priority task cannot preempt the allocator.
> > + */
> > +#ifndef CONFIG_PREEMPT_RT
> > +#define pcpu_task_pin()                preempt_disable()
> > +#define pcpu_task_unpin()      preempt_enable()
> > +#else
> > +#define pcpu_task_pin()                migrate_disable()
> > +#define pcpu_task_unpin()      migrate_enable()
> > +#endif
> > +
> > +/*
> > + * Generic helper to lookup and a per-cpu variable with an embedded spinlock.
> > + * Return value should be used with equivalent unlock helper.
> > + */
> > +#define pcpu_spin_lock(type, member, ptr)                              \
> > +({                                                                     \
> > +       type *_ret;                                                     \
> > +       pcpu_task_pin();                                                \
> > +       _ret = this_cpu_ptr(ptr);                                       \
> > +       spin_lock(&_ret->member);                                       \
> > +       _ret;                                                           \
> > +})
> > +
> > +#define pcpu_spin_trylock(type, member, ptr)                           \
> > +({                                                                     \
> > +       type *_ret;                                                     \
> > +       pcpu_task_pin();                                                \
> > +       _ret = this_cpu_ptr(ptr);                                       \
> > +       if (!spin_trylock(&_ret->member)) {                             \
> > +               pcpu_task_unpin();                                      \
> > +               _ret = NULL;                                            \
> > +       }                                                               \
> > +       _ret;                                                           \
> > +})
> > +
> > +#define pcpu_spin_unlock(member, ptr)                                  \
> > +({                                                                     \
> > +       spin_unlock(&ptr->member);                                      \
> > +       pcpu_task_unpin();                                              \
> > +})
> > +
> > +/* struct slub_percpu_array specific helpers. */
> > +#define pca_spin_lock(ptr)                                             \
> > +       pcpu_spin_lock(struct slub_percpu_array, lock, ptr)
> > +
> > +#define pca_spin_trylock(ptr)                                          \
> > +       pcpu_spin_trylock(struct slub_percpu_array, lock, ptr)
> > +
> > +#define pca_spin_unlock(ptr)                                           \
> > +       pcpu_spin_unlock(lock, ptr)
> > +
> >  #ifndef CONFIG_SLUB_TINY
> >  #define __fastpath_inline __always_inline
> >  #else
> > @@ -3454,6 +3527,78 @@ static __always_inline void maybe_wipe_obj_freeptr(struct kmem_cache *s,
> >                         0, sizeof(void *));
> >  }
> >
> > +static bool refill_pca(struct kmem_cache *s, unsigned int count, gfp_t gfp);
> > +
> > +static __fastpath_inline
> > +void *alloc_from_pca(struct kmem_cache *s, gfp_t gfp)
> > +{
> > +       unsigned long __maybe_unused UP_flags;
> > +       struct slub_percpu_array *pca;
> > +       void *object;
> > +
> > +retry:
> > +       pcp_trylock_prepare(UP_flags);
> > +       pca = pca_spin_trylock(s->cpu_array);
> > +
> > +       if (unlikely(!pca)) {
> > +               pcp_trylock_finish(UP_flags);
> > +               return NULL;
> > +       }
> > +
> > +       if (unlikely(pca->used == 0)) {
> > +               unsigned int batch = pca->count / 2;
> > +
> > +               pca_spin_unlock(pca);
> > +               pcp_trylock_finish(UP_flags);
> > +
> > +               if (!gfpflags_allow_blocking(gfp) || in_irq())
> > +                       return NULL;
> > +
> > +               if (refill_pca(s, batch, gfp))
> > +                       goto retry;
> > +
> > +               return NULL;
> > +       }
> > +
> > +       object = pca->objects[--pca->used];
> > +
> > +       pca_spin_unlock(pca);
> > +       pcp_trylock_finish(UP_flags);
> > +
> > +       stat(s, ALLOC_PCA);
> > +
> > +       return object;
> > +}
> > +
> > +static __fastpath_inline
> > +int alloc_from_pca_bulk(struct kmem_cache *s, size_t size, void **p)
> > +{
> > +       unsigned long __maybe_unused UP_flags;
> > +       struct slub_percpu_array *pca;
> > +
> > +       pcp_trylock_prepare(UP_flags);
> > +       pca = pca_spin_trylock(s->cpu_array);
> > +
> > +       if (unlikely(!pca)) {
> > +               size = 0;
> > +               goto failed;
> > +       }
> > +
> > +       if (pca->used < size)
> > +               size = pca->used;
> > +
> > +       for (int i = size; i > 0;) {
> > +               p[--i] = pca->objects[--pca->used];
> > +       }
> > +
> > +       pca_spin_unlock(pca);
> > +       stat_add(s, ALLOC_PCA, size);
> > +
> > +failed:
> > +       pcp_trylock_finish(UP_flags);
> > +       return size;
> > +}
> > +
> >  /*
> >   * Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc)
> >   * have the fastpath folded into their functions. So no function call
> > @@ -3479,7 +3624,11 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
> >         if (unlikely(object))
> >                 goto out;
> >
> > -       object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
> > +       if (s->cpu_array && (node == NUMA_NO_NODE))
> > +               object = alloc_from_pca(s, gfpflags);
> > +
> > +       if (!object)
> > +               object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
> >
> >         maybe_wipe_obj_freeptr(s, object);
> >         init = slab_want_init_on_alloc(gfpflags, s);
> > @@ -3726,6 +3875,81 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
> >         discard_slab(s, slab);
> >  }
> >
> > +static bool flush_pca(struct kmem_cache *s, unsigned int count);
> > +
> > +static __fastpath_inline
> > +bool free_to_pca(struct kmem_cache *s, void *object)
> > +{
> > +       unsigned long __maybe_unused UP_flags;
> > +       struct slub_percpu_array *pca;
> > +
> > +retry:
> > +       pcp_trylock_prepare(UP_flags);
> > +       pca = pca_spin_trylock(s->cpu_array);
> > +
> > +       if (!pca) {
> > +               pcp_trylock_finish(UP_flags);
> > +               return false;
> > +       }
> > +
> > +       if (pca->used == pca->count) {
> > +               unsigned int batch = pca->count / 2;
> > +
> > +               pca_spin_unlock(pca);
> > +               pcp_trylock_finish(UP_flags);
> > +
> > +               if (in_irq())
> > +                       return false;
> > +
> > +               if (!flush_pca(s, batch))
> > +                       return false;
> > +
> > +               goto retry;
> > +       }
> > +
> > +       pca->objects[pca->used++] = object;
> > +
> > +       pca_spin_unlock(pca);
> > +       pcp_trylock_finish(UP_flags);
> > +
> > +       stat(s, FREE_PCA);
> > +
> > +       return true;
> > +}
> > +
> > +static __fastpath_inline
> > +size_t free_to_pca_bulk(struct kmem_cache *s, size_t size, void **p)
> > +{
> > +       unsigned long __maybe_unused UP_flags;
> > +       struct slub_percpu_array *pca;
> > +       bool init;
> > +
> > +       pcp_trylock_prepare(UP_flags);
> > +       pca = pca_spin_trylock(s->cpu_array);
> > +
> > +       if (unlikely(!pca)) {
> > +               size = 0;
> > +               goto failed;
> > +       }
> > +
> > +       if (pca->count - pca->used < size)
> > +               size = pca->count - pca->used;
> > +
> > +       init = slab_want_init_on_free(s);
> > +
> > +       for (size_t i = 0; i < size; i++) {
> > +               if (likely(slab_free_hook(s, p[i], init)))
> > +                       pca->objects[pca->used++] = p[i];
> > +       }
> > +
> > +       pca_spin_unlock(pca);
> > +       stat_add(s, FREE_PCA, size);
> > +
> > +failed:
> > +       pcp_trylock_finish(UP_flags);
> > +       return size;
> > +}
> > +
> >  #ifndef CONFIG_SLUB_TINY
> >  /*
> >   * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
> > @@ -3811,7 +4035,12 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
> >  {
> >         memcg_slab_free_hook(s, slab, &object, 1);
> >
> > -       if (likely(slab_free_hook(s, object, slab_want_init_on_free(s))))
> > +       if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s))))
> > +               return;
> > +
> > +       if (s->cpu_array)
> > +               free_to_pca(s, object);
>
> free_to_pca() can return false and leave the object alive. I think you
> need to handle the failure case here to avoid leaks.
>
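To make it concrete, an untested sketch of the fallback I had in mind here
(keeping free_to_pca()'s current return value), so anything the array
refuses still goes through the regular freeing path:

	/* anything the pca cannot take falls back to the regular free path */
	if (!s->cpu_array || !free_to_pca(s, object))
		do_slab_free(s, slab, object, object, 1, addr);
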
> > +       else
> >                 do_slab_free(s, slab, object, object, 1, addr);
> >  }
> >
> > @@ -3956,6 +4185,26 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
> >         if (!size)
> >                 return;
> >
> > +       /*
> > +        * In case the objects might need memcg_slab_free_hook(), skip the array
> > +        * because the hook is not effective with single objects and benefits
> > +        * from groups of objects from a single slab that the detached freelist
> > +        * builds. But once we build the detached freelist, it's wasteful to
> > +        * throw it away and put the objects into the array.
> > +        *
> > +        * XXX: This test could be cache-specific if it was not possible to use
> > +        * __GFP_ACCOUNT with caches that are not SLAB_ACCOUNT
> > +        */
> > +       if (s && s->cpu_array && !memcg_kmem_online()) {
> > +               size_t pca_freed = free_to_pca_bulk(s, size, p);
> > +
> > +               if (pca_freed == size)
> > +                       return;
> > +
> > +               p += pca_freed;
> > +               size -= pca_freed;
> > +       }
> > +
> >         do {
> >                 struct detached_freelist df;
> >
> > @@ -4073,7 +4322,8 @@ static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
> >  int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
> >                           void **p)
> >  {
> > -       int i;
> > +       int from_pca = 0;
> > +       int allocated = 0;
> >         struct obj_cgroup *objcg = NULL;
> >
> >         if (!size)
> > @@ -4084,19 +4334,147 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
> >         if (unlikely(!s))
> >                 return 0;
> >
> > -       i = __kmem_cache_alloc_bulk(s, flags, size, p);
> > +       if (s->cpu_array)
> > +               from_pca = alloc_from_pca_bulk(s, size, p);
> > +
> > +       if (from_pca < size) {
> > +               allocated = __kmem_cache_alloc_bulk(s, flags, size-from_pca,
> > +                                                   p+from_pca);
> > +               if (allocated == 0 && from_pca > 0) {
> > +                       __kmem_cache_free_bulk(s, from_pca, p);
> > +               }
> > +       }
> > +
> > +       allocated += from_pca;
> >
> >         /*
> >          * memcg and kmem_cache debug support and memory initialization.
> >          * Done outside of the IRQ disabled fastpath loop.
> >          */
> > -       if (i != 0)
> > +       if (allocated != 0)
> >                 slab_post_alloc_hook(s, objcg, flags, size, p,
> >                         slab_want_init_on_alloc(flags, s), s->object_size);
> > -       return i;
> > +       return allocated;
> >  }
> >  EXPORT_SYMBOL(kmem_cache_alloc_bulk);
> >
> > +static bool refill_pca(struct kmem_cache *s, unsigned int count, gfp_t gfp)
> > +{
> > +       void *objects[32];
> > +       unsigned int batch, allocated;
> > +       unsigned long __maybe_unused UP_flags;
> > +       struct slub_percpu_array *pca;
> > +
> > +bulk_alloc:
> > +       batch = min(count, 32U);
>
> Do you cap each batch at 32 to avoid overshooting too much (same in
> flush_pca())? If so, it would be good to have a comment here. Also,
> maybe this hardcoded 32 should be a function of pca->count instead? If
> we set up a pca array with pca->count larger than 64 then the refill
> count of pca->count/2 will always end up higher than 32, so at the end
> we will have to loop back (goto bulk_alloc) to allocate more objects.

Ah, I just noticed that you are using objects[32] and that's forcing
this limitation. Please ignore my previous comment.
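
That said, deriving the cap from the on-stack array itself would make the
connection explicit, e.g. something like:

	void *objects[32];
	...
	/* batch is bounded by the on-stack buffer, not a magic constant */
	batch = min_t(unsigned int, count, ARRAY_SIZE(objects));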

>
> > +
> > +       allocated = __kmem_cache_alloc_bulk(s, gfp, batch, &objects[0]);
> > +       if (!allocated)
> > +               return false;
> > +
> > +       pcp_trylock_prepare(UP_flags);
> > +       pca = pca_spin_trylock(s->cpu_array);
> > +       if (!pca) {
> > +               pcp_trylock_finish(UP_flags);
> > +               return false;
> > +       }
> > +
> > +       batch = min(allocated, pca->count - pca->used);
> > +
> > +       for (unsigned int i = 0; i < batch; i++) {
> > +               pca->objects[pca->used++] = objects[i];
> > +       }
> > +
> > +       pca_spin_unlock(pca);
> > +       pcp_trylock_finish(UP_flags);
> > +
> > +       stat_add(s, PCA_REFILL, batch);
> > +
> > +       /*
> > +        * We could have migrated to a different cpu or somebody else freed to the
> > +        * pca while we were bulk allocating, and now we have too many objects
> > +        */
> > +       if (batch < allocated) {
> > +               __kmem_cache_free_bulk(s, allocated - batch, &objects[batch]);
> > +       } else {
> > +               count -= batch;
> > +               if (count > 0)
> > +                       goto bulk_alloc;
> > +       }
> > +
> > +       return true;
> > +}
> > +
> > +static bool flush_pca(struct kmem_cache *s, unsigned int count)
> > +{
> > +       void *objects[32];
> > +       unsigned int batch, remaining;
> > +       unsigned long __maybe_unused UP_flags;
> > +       struct slub_percpu_array *pca;
> > +
> > +next_batch:
> > +       batch = min(count, 32);
> > +
> > +       pcp_trylock_prepare(UP_flags);
> > +       pca = pca_spin_trylock(s->cpu_array);
> > +       if (!pca) {
> > +               pcp_trylock_finish(UP_flags);
> > +               return false;
> > +       }
> > +
> > +       batch = min(batch, pca->used);
> > +
> > +       for (unsigned int i = 0; i < batch; i++) {
> > +               objects[i] = pca->objects[--pca->used];
> > +       }
> > +
> > +       remaining = pca->used;
> > +
> > +       pca_spin_unlock(pca);
> > +       pcp_trylock_finish(UP_flags);
> > +
> > +       __kmem_cache_free_bulk(s, batch, &objects[0]);
> > +
> > +       stat_add(s, PCA_FLUSH, batch);
> > +
> > +       if (batch < count && remaining > 0) {
> > +               count -= batch;
> > +               goto next_batch;
> > +       }
> > +
> > +       return true;
> > +}
> > +
> > +/* Do not call from irq handler nor with irqs disabled */
> > +int kmem_cache_prefill_percpu_array(struct kmem_cache *s, unsigned int count,
> > +                                   gfp_t gfp)
> > +{
> > +       struct slub_percpu_array *pca;
> > +       unsigned int used;
> > +
> > +       lockdep_assert_no_hardirq();
> > +
> > +       if (!s->cpu_array)
> > +               return -EINVAL;
> > +
> > +       /* racy but we don't care */
> > +       pca = raw_cpu_ptr(s->cpu_array);
> > +
> > +       used = READ_ONCE(pca->used);
> > +
> > +       if (used >= count)
> > +               return 0;
> > +
> > +       if (pca->count < count)
> > +               return -EINVAL;
> > +
> > +       count -= used;
> > +
> > +       if (!refill_pca(s, count, gfp))
> > +               return -ENOMEM;
> > +
> > +       return 0;
> > +}
> >
> >  /*
> >   * Object placement in a slab is made very easy because we always start at
> > @@ -5167,6 +5545,65 @@ int __kmem_cache_create(struct kmem_cache *s, slab_flags_t flags)
> >         return 0;
> >  }
> >
> > +/**
> > + * kmem_cache_setup_percpu_array - Create a per-cpu array cache for the cache
> > + * @s: The cache to add per-cpu array. Must be created with SLAB_NO_MERGE flag.
> > + * @count: Size of the per-cpu array.
> > + *
> > + * After this call, allocations from the cache go through a percpu array. When
> > + * it becomes empty, half is refilled with a bulk allocation. When it becomes
> > + * full, half is flushed with a bulk free operation.
> > + *
> > + * Using the array cache is not guaranteed, i.e. it can be bypassed if its lock
> > + * cannot be obtained. The array cache also does not distinguish NUMA nodes, so
> > + * allocations via kmem_cache_alloc_node() with a node specified other than
> > + * NUMA_NO_NODE will bypass the cache.
> > + *
> > + * Bulk allocation and free operations also try to use the array.
> > + *
> > + * kmem_cache_prefill_percpu_array() can be used to pre-fill the array cache
> > + * before e.g. entering a restricted context. It is however not guaranteed that
> > + * the caller will be able to subsequently consume the prefilled cache. Such
> > + * failures should however be sufficiently rare, so after the prefill,
> > + * allocations using GFP_ATOMIC | __GFP_NOFAIL are acceptable for objects up to
> > + * the prefilled amount.
> > + *
> > + * Limitations: when slub_debug is enabled for the cache, all relevant actions
> > + * (i.e. poisoning, obtaining stacktraces) and checks happen when objects move
> > + * between the array cache and slab pages, which may result in e.g. not
> > + * detecting a use-after-free while the object is in the array cache, and the
> > + * stacktraces may be less useful.
> > + *
> > + * Return: 0 if OK, -EINVAL on caches without SLAB_NO_MERGE or with the array
> > + * already created, -ENOMEM when the per-cpu array creation fails.
> > + */
> > +int kmem_cache_setup_percpu_array(struct kmem_cache *s, unsigned int count)
> > +{
> > +       int cpu;
> > +
> > +       if (WARN_ON_ONCE(!(s->flags & SLAB_NO_MERGE)))
> > +               return -EINVAL;
> > +
> > +       if (s->cpu_array)
> > +               return -EINVAL;
> > +
> > +       s->cpu_array = __alloc_percpu(struct_size(s->cpu_array, objects, count),
> > +                                       sizeof(void *));
>
> Maybe I missed it, but where do you free s->cpu_array? I see
> __kmem_cache_release() freeing s->cpu_slab but s->cpu_array seems to
> be left alive...
>
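Presumably a free_percpu() next to the existing s->cpu_slab cleanup in
__kmem_cache_release() would be enough (untested sketch; free_percpu()
tolerates a NULL pointer, so caches without an array are fine):

	/* in __kmem_cache_release(), next to free_percpu(s->cpu_slab) */
	free_percpu(s->cpu_array);
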
> > +
> > +       if (!s->cpu_array)
> > +               return -ENOMEM;
> > +
> > +       for_each_possible_cpu(cpu) {
> > +               struct slub_percpu_array *pca = per_cpu_ptr(s->cpu_array, cpu);
> > +
> > +               spin_lock_init(&pca->lock);
> > +               pca->count = count;
> > +               pca->used = 0;
> > +       }
> > +
> > +       return 0;
> > +}
> > +
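Also, to check my understanding of the intended usage, I assume a user would
do roughly the following (the cache name, struct and sizes below are made up,
not from this series):

	struct kmem_cache *s;
	struct my_node *node;

	s = kmem_cache_create("my_nodes", sizeof(struct my_node), 0,
			      SLAB_NO_MERGE, NULL);
	kmem_cache_setup_percpu_array(s, 32);

	/* before taking the locks that prevent reclaim: */
	if (kmem_cache_prefill_percpu_array(s, 8, GFP_KERNEL))
		return -ENOMEM;

	/* then up to 8 allocations like this should not fail: */
	node = kmem_cache_alloc(s, GFP_ATOMIC | __GFP_NOFAIL);

If that matches the intent, it might be worth spelling out as an example in
the kerneldoc, but no strong opinion.
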
> >  #ifdef SLAB_SUPPORTS_SYSFS
> >  static int count_inuse(struct slab *slab)
> >  {
> > @@ -5944,8 +6381,10 @@ static ssize_t text##_store(struct kmem_cache *s,                \
> >  }                                                              \
> >  SLAB_ATTR(text);                                               \
> >
> > +STAT_ATTR(ALLOC_PCA, alloc_cpu_cache);
> >  STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath);
> >  STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
> > +STAT_ATTR(FREE_PCA, free_cpu_cache);
> >  STAT_ATTR(FREE_FASTPATH, free_fastpath);
> >  STAT_ATTR(FREE_SLOWPATH, free_slowpath);
> >  STAT_ATTR(FREE_FROZEN, free_frozen);
> > @@ -5970,6 +6409,8 @@ STAT_ATTR(CPU_PARTIAL_ALLOC, cpu_partial_alloc);
> >  STAT_ATTR(CPU_PARTIAL_FREE, cpu_partial_free);
> >  STAT_ATTR(CPU_PARTIAL_NODE, cpu_partial_node);
> >  STAT_ATTR(CPU_PARTIAL_DRAIN, cpu_partial_drain);
> > +STAT_ATTR(PCA_REFILL, cpu_cache_refill);
> > +STAT_ATTR(PCA_FLUSH, cpu_cache_flush);
> >  #endif /* CONFIG_SLUB_STATS */
> >
> >  #ifdef CONFIG_KFENCE
> > @@ -6031,8 +6472,10 @@ static struct attribute *slab_attrs[] = {
> >         &remote_node_defrag_ratio_attr.attr,
> >  #endif
> >  #ifdef CONFIG_SLUB_STATS
> > +       &alloc_cpu_cache_attr.attr,
> >         &alloc_fastpath_attr.attr,
> >         &alloc_slowpath_attr.attr,
> > +       &free_cpu_cache_attr.attr,
> >         &free_fastpath_attr.attr,
> >         &free_slowpath_attr.attr,
> >         &free_frozen_attr.attr,
> > @@ -6057,6 +6500,8 @@ static struct attribute *slab_attrs[] = {
> >         &cpu_partial_free_attr.attr,
> >         &cpu_partial_node_attr.attr,
> >         &cpu_partial_drain_attr.attr,
> > +       &cpu_cache_refill_attr.attr,
> > +       &cpu_cache_flush_attr.attr,
> >  #endif
> >  #ifdef CONFIG_FAILSLAB
> >         &failslab_attr.attr,
> >
> > --
> > 2.43.0
> >
> >


Thread overview: 18+ messages
2023-11-29  9:53 [PATCH RFC v3 0/9] SLUB percpu array caches and maple tree nodes Vlastimil Babka
2023-11-29  9:53 ` [PATCH RFC v3 1/9] mm/slub: fix bulk alloc and free stats Vlastimil Babka
2023-11-29  9:53 ` [PATCH RFC v3 2/9] mm/slub: introduce __kmem_cache_free_bulk() without free hooks Vlastimil Babka
2023-11-29  9:53 ` [PATCH RFC v3 3/9] mm/slub: handle bulk and single object freeing separately Vlastimil Babka
2023-11-29  9:53 ` [PATCH RFC v3 4/9] mm/slub: free KFENCE objects in slab_free_hook() Vlastimil Babka
2023-11-29 12:00   ` Marco Elver
2023-11-29  9:53 ` [PATCH RFC v3 5/9] mm/slub: add opt-in percpu array cache of objects Vlastimil Babka
2023-11-29 10:35   ` Marco Elver
2023-12-15 18:28   ` Suren Baghdasaryan
2023-12-15 21:17     ` Suren Baghdasaryan
2023-11-29  9:53 ` [PATCH RFC v3 6/9] tools: Add SLUB percpu array functions for testing Vlastimil Babka
2023-11-29  9:53 ` [PATCH RFC v3 7/9] maple_tree: use slub percpu array Vlastimil Babka
2023-11-29  9:53 ` [PATCH RFC v3 8/9] maple_tree: Remove MA_STATE_PREALLOC Vlastimil Babka
2023-11-29  9:53 ` [PATCH RFC v3 9/9] maple_tree: replace preallocation with slub percpu array prefill Vlastimil Babka
2023-11-29 20:16 ` [PATCH RFC v3 0/9] SLUB percpu array caches and maple tree nodes Christoph Lameter (Ampere)
2023-11-29 21:20   ` Matthew Wilcox
2023-12-14 20:14     ` Christoph Lameter (Ampere)
2023-11-30  9:14   ` Vlastimil Babka
