* [RFC v1 0/5] SLUB percpu array caches and maple tree nodes
@ 2023-08-08  9:53 Vlastimil Babka
  2023-08-08  9:53 ` [RFC v1 1/5] mm, slub: fix bulk alloc and free stats Vlastimil Babka
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: Vlastimil Babka @ 2023-08-08  9:53 UTC (permalink / raw)
  To: Liam R. Howlett, Matthew Wilcox, Christoph Lameter,
	David Rientjes, Pekka Enberg, Joonsoo Kim
  Cc: Hyeonggon Yoo, Roman Gushchin, linux-mm, linux-kernel, patches,
	Vlastimil Babka

Also available in git, based on v6.5-rc5:

https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-percpu-caches-v1

At LSF/MM I mentioned that I see several use cases for introducing
opt-in percpu arrays for caching alloc/free objects in SLUB. This is my
first exploration of the idea, specifically for the use case of maple
tree nodes. We brainstormed this use case on IRC last week with Liam
and Matthew, and this is how I understood the requirements:

- percpu arrays will be faster than bulk alloc/free, which needs
  relatively long freelists to work well. Especially in the freeing case
  we need the nodes to come from the same slab (or a small set of those)

- preallocating for the worst-case number of nodes needed by a tree
  operation that can't reclaim due to locks is wasteful. We could
  instead expect that most of the time the percpu arrays will satisfy
  the constrained allocations, and in the rare cases they do not we can
  dip into GFP_ATOMIC reserves temporarily. Instead of preallocating,
  just prefill the arrays (see the sketch after this list).

- NUMA locality is not a concern as the nodes of a process's VMA tree
  end up all over the place anyway.
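
A rough sketch of the intended pattern, using the interface added later
in this series (the node count and the exact call sites here are only
illustrative):

	/* outside the lock, where reclaim is still possible */
	kmem_cache_prefill_percpu_array(maple_node_cache,
					1 + mas_mt_height(mas) * 3, GFP_KERNEL);

	/* under the lock: allocations should mostly hit the percpu array;
	 * a rare miss falls back to the GFP_ATOMIC reserves */
	node = kmem_cache_alloc(maple_node_cache, GFP_ATOMIC);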

So this RFC patchset adds such a percpu array in Patch 2. The locking is
stolen from Mel's recent page allocator pcplists implementation, so it
can avoid disabling IRQs and only disable preemption, but the trylocks
can fail in rare situations.

The maple tree is then modified in patches 3-5 to benefit from this.
This is done in a very crude way as I'm not so familiar with the code.

I've briefly tested this by booting a virtme VM and checking the stats
from CONFIG_SLUB_STATS in sysfs.

Patch 2:

SLUB changes implemented, including the new counters alloc_cpu_cache
and free_cpu_cache, but the maple tree doesn't use the array yet

(none):/sys/kernel/slab/maple_node # grep . alloc_cpu_cache alloc_*path free_cpu_cache free_*path | cut -d' ' -f1
alloc_cpu_cache:0
alloc_fastpath:56604
alloc_slowpath:7279
free_cpu_cache:0
free_fastpath:35087
free_slowpath:22403

Patch 3:

the maple node cache now creates a percpu array with 32 entries;
nothing else is changed

-> some allocs/free satisfied by the array

alloc_cpu_cache:11950
alloc_fastpath:39955
alloc_slowpath:7989
free_cpu_cache:12076
free_fastpath:22878
free_slowpath:18677

Patch 4:

maple tree node bulk alloc/free converted to a loop of normal alloc/free
to use the percpu array more, because bulk alloc bypasses it

-> majority alloc/free now satisfied by percpu array

alloc_cpu_cache:54178
alloc_fastpath:4959
alloc_slowpath:727
free_cpu_cache:54244
free_fastpath:354
free_slowpath:5159

Patch 5:

mas_preallocate() now just prefills the percpu array and actually preallocates only a single node;
mas_store_prealloc() gains a retry loop with mas_nomem(mas, GFP_ATOMIC | __GFP_NOFAIL)

-> major drop of actual alloc/free

alloc_cpu_cache:17031
alloc_fastpath:5324
alloc_slowpath:631
free_cpu_cache:17099
free_fastpath:277
free_slowpath:5503

It would be interesting to see how this affects the workloads that saw
regressions from the maple tree introduction, as the slab operations
were suspected to be a major factor.

Vlastimil Babka (5):
  mm, slub: fix bulk alloc and free stats
  mm, slub: add opt-in slub_percpu_array
  maple_tree: use slub percpu array
  maple_tree: avoid bulk alloc/free to use percpu array more
  maple_tree: replace preallocation with slub percpu array prefill

 include/linux/slab.h     |   4 +
 include/linux/slub_def.h |  10 ++
 lib/maple_tree.c         |  30 +++++-
 mm/slub.c                | 221 ++++++++++++++++++++++++++++++++++++++-
 4 files changed, 258 insertions(+), 7 deletions(-)

-- 
2.41.0



* [RFC v1 1/5] mm, slub: fix bulk alloc and free stats
  2023-08-08  9:53 [RFC v1 0/5] SLUB percpu array caches and maple tree nodes Vlastimil Babka
@ 2023-08-08  9:53 ` Vlastimil Babka
  2023-08-18 11:47   ` Hyeonggon Yoo
  2023-08-08  9:53 ` [RFC v1 2/5] mm, slub: add opt-in slub_percpu_array Vlastimil Babka
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Vlastimil Babka @ 2023-08-08  9:53 UTC (permalink / raw)
  To: Liam R. Howlett, Matthew Wilcox, Christoph Lameter,
	David Rientjes, Pekka Enberg, Joonsoo Kim
  Cc: Hyeonggon Yoo, Roman Gushchin, linux-mm, linux-kernel, patches,
	Vlastimil Babka

The SLUB sysfs stats enabled by CONFIG_SLUB_STATS have two deficiencies
identified wrt bulk alloc/free operations:

- Bulk allocations from cpu freelist are not counted. Add the
  ALLOC_FASTPATH counter there.

- Bulk fastpath freeing will count a list of multiple objects with a
  single FREE_FASTPATH inc. Add a stat_add() variant to count them all.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/slub.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index e3b5d5c0eb3a..a9437d48840c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -341,6 +341,14 @@ static inline void stat(const struct kmem_cache *s, enum stat_item si)
 #endif
 }
 
+static inline void stat_add(const struct kmem_cache *s, enum stat_item si, int v)
+{
+#ifdef CONFIG_SLUB_STATS
+	raw_cpu_add(s->cpu_slab->stat[si], v);
+#endif
+}
+
+
 /*
  * Tracks for which NUMA nodes we have kmem_cache_nodes allocated.
  * Corresponds to node_state[N_NORMAL_MEMORY], but can temporarily
@@ -3776,7 +3784,7 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
 
 		local_unlock(&s->cpu_slab->lock);
 	}
-	stat(s, FREE_FASTPATH);
+	stat_add(s, FREE_FASTPATH, cnt);
 }
 #else /* CONFIG_SLUB_TINY */
 static void do_slab_free(struct kmem_cache *s,
@@ -3978,6 +3986,7 @@ static inline int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
 		c->freelist = get_freepointer(s, object);
 		p[i] = object;
 		maybe_wipe_obj_freeptr(s, p[i]);
+		stat(s, ALLOC_FASTPATH);
 	}
 	c->tid = next_tid(c->tid);
 	local_unlock_irqrestore(&s->cpu_slab->lock, irqflags);
-- 
2.41.0



* [RFC v1 2/5] mm, slub: add opt-in slub_percpu_array
  2023-08-08  9:53 [RFC v1 0/5] SLUB percpu array caches and maple tree nodes Vlastimil Babka
  2023-08-08  9:53 ` [RFC v1 1/5] mm, slub: fix bulk alloc and free stats Vlastimil Babka
@ 2023-08-08  9:53 ` Vlastimil Babka
  2023-08-08 12:06   ` Pedro Falcato
  2023-08-08  9:53 ` [RFC v1 3/5] maple_tree: use slub percpu array Vlastimil Babka
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Vlastimil Babka @ 2023-08-08  9:53 UTC (permalink / raw)
  To: Liam R. Howlett, Matthew Wilcox, Christoph Lameter,
	David Rientjes, Pekka Enberg, Joonsoo Kim
  Cc: Hyeonggon Yoo, Roman Gushchin, linux-mm, linux-kernel, patches,
	Vlastimil Babka

kmem_cache_setup_percpu_array() will allocate a per-cpu array of the
given size for caching alloc/free objects of the cache. The cache
has to be created with the SLAB_NO_MERGE flag.

The array is filled by freeing. When it is empty for an allocation, or
full for a free, the operation simply bypasses it; there is currently
no batched refilling or flushing of the array.

The locking is copied from the page allocator's pcplists, based on
embedded spin locks. Interrupts are not disabled, only preemption (cpu
migration on RT). Trylock is attempted to avoid deadlock due to
an interrupt; a trylock failure means the array is bypassed.

Sysfs stat counters alloc_cpu_cache and free_cpu_cache count operations
that used the percpu array.

Bulk allocation bypasses the array, bulk freeing does not.

kmem_cache_prefill_percpu_array() can be called to fill the array on
the current cpu to at least the given number of objects. However, this
is only opportunistic as there's no cpu pinning and the trylocks may
always fail. Therefore allocations cannot rely on the array for success
even after the prefill. But misses should be rare enough that e.g.
GFP_ATOMIC allocations should be acceptable after the prefill.
The operation is currently not optimized.
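
A minimal opt-in sketch (the cache name, the object type and the sizes
are only illustrative, not part of this patch):

	struct kmem_cache *s;

	s = kmem_cache_create("my_cache", sizeof(struct my_obj), 0,
			      SLAB_NO_MERGE, NULL);
	if (kmem_cache_setup_percpu_array(s, 32))
		pr_warn("could not set up percpu array, regular paths will be used\n");

	/* before entering a section that must not reclaim */
	kmem_cache_prefill_percpu_array(s, 8, GFP_KERNEL);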

More TODO/FIXMEs:

- NUMA awareness - preferred node currently ignored, __GFP_THISNODE not
  honored
- slub_debug - will not work for allocations from the array. Normally in
  the SLUB implementation slub_debug kills all fast paths, but doing
  that here could lead to depleting the reserves if we ignore the
  prefill and use GFP_ATOMIC. Needs more thought.
---
 include/linux/slab.h     |   4 +
 include/linux/slub_def.h |  10 ++
 mm/slub.c                | 210 ++++++++++++++++++++++++++++++++++++++-
 3 files changed, 223 insertions(+), 1 deletion(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 848c7c82ad5a..f6c91cbc1544 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -196,6 +196,8 @@ struct kmem_cache *kmem_cache_create_usercopy(const char *name,
 void kmem_cache_destroy(struct kmem_cache *s);
 int kmem_cache_shrink(struct kmem_cache *s);
 
+int kmem_cache_setup_percpu_array(struct kmem_cache *s, unsigned int count);
+
 /*
  * Please use this macro to create slab caches. Simply specify the
  * name of the structure and maybe some flags that are listed above.
@@ -494,6 +496,8 @@ void kmem_cache_free(struct kmem_cache *s, void *objp);
 void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p);
 int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, void **p);
 
+int kmem_cache_prefill_percpu_array(struct kmem_cache *s, unsigned int count, gfp_t gfp);
+
 static __always_inline void kfree_bulk(size_t size, void **p)
 {
 	kmem_cache_free_bulk(NULL, size, p);
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index deb90cf4bffb..c85434668419 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -13,8 +13,10 @@
 #include <linux/local_lock.h>
 
 enum stat_item {
+	ALLOC_PERCPU_CACHE,	/* Allocation from percpu array cache */
 	ALLOC_FASTPATH,		/* Allocation from cpu slab */
 	ALLOC_SLOWPATH,		/* Allocation by getting a new cpu slab */
+	FREE_PERCPU_CACHE,	/* Free to percpu array cache */
 	FREE_FASTPATH,		/* Free to cpu slab */
 	FREE_SLOWPATH,		/* Freeing not to cpu slab */
 	FREE_FROZEN,		/* Freeing to frozen slab */
@@ -66,6 +68,13 @@ struct kmem_cache_cpu {
 };
 #endif /* CONFIG_SLUB_TINY */
 
+struct slub_percpu_array {
+	spinlock_t lock;
+	unsigned int count;
+	unsigned int used;
+	void * objects[];
+};
+
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 #define slub_percpu_partial(c)		((c)->partial)
 
@@ -99,6 +108,7 @@ struct kmem_cache {
 #ifndef CONFIG_SLUB_TINY
 	struct kmem_cache_cpu __percpu *cpu_slab;
 #endif
+	struct slub_percpu_array __percpu *cpu_array;
 	/* Used for retrieving partial slabs, etc. */
 	slab_flags_t flags;
 	unsigned long min_partial;
diff --git a/mm/slub.c b/mm/slub.c
index a9437d48840c..7fc9f7c124eb 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -188,6 +188,79 @@ do {					\
 #define USE_LOCKLESS_FAST_PATH()	(false)
 #endif
 
+/* copy/pasted  from mm/page_alloc.c */
+
+#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT)
+/*
+ * On SMP, spin_trylock is sufficient protection.
+ * On PREEMPT_RT, spin_trylock is equivalent on both SMP and UP.
+ */
+#define pcp_trylock_prepare(flags)	do { } while (0)
+#define pcp_trylock_finish(flag)	do { } while (0)
+#else
+
+/* UP spin_trylock always succeeds so disable IRQs to prevent re-entrancy. */
+#define pcp_trylock_prepare(flags)	local_irq_save(flags)
+#define pcp_trylock_finish(flags)	local_irq_restore(flags)
+#endif
+
+/*
+ * Locking a pcp requires a PCP lookup followed by a spinlock. To avoid
+ * a migration causing the wrong PCP to be locked and remote memory being
+ * potentially allocated, pin the task to the CPU for the lookup+lock.
+ * preempt_disable is used on !RT because it is faster than migrate_disable.
+ * migrate_disable is used on RT because otherwise RT spinlock usage is
+ * interfered with and a high priority task cannot preempt the allocator.
+ */
+#ifndef CONFIG_PREEMPT_RT
+#define pcpu_task_pin()		preempt_disable()
+#define pcpu_task_unpin()	preempt_enable()
+#else
+#define pcpu_task_pin()		migrate_disable()
+#define pcpu_task_unpin()	migrate_enable()
+#endif
+
+/*
+ * Generic helper to lookup and a per-cpu variable with an embedded spinlock.
+ * Return value should be used with equivalent unlock helper.
+ */
+#define pcpu_spin_lock(type, member, ptr)				\
+({									\
+	type *_ret;							\
+	pcpu_task_pin();						\
+	_ret = this_cpu_ptr(ptr);					\
+	spin_lock(&_ret->member);					\
+	_ret;								\
+})
+
+#define pcpu_spin_trylock(type, member, ptr)				\
+({									\
+	type *_ret;							\
+	pcpu_task_pin();						\
+	_ret = this_cpu_ptr(ptr);					\
+	if (!spin_trylock(&_ret->member)) {				\
+		pcpu_task_unpin();					\
+		_ret = NULL;						\
+	}								\
+	_ret;								\
+})
+
+#define pcpu_spin_unlock(member, ptr)					\
+({									\
+	spin_unlock(&ptr->member);					\
+	pcpu_task_unpin();						\
+})
+
+/* struct slub_percpu_array specific helpers. */
+#define pca_spin_lock(ptr)						\
+	pcpu_spin_lock(struct slub_percpu_array, lock, ptr)
+
+#define pca_spin_trylock(ptr)						\
+	pcpu_spin_trylock(struct slub_percpu_array, lock, ptr)
+
+#define pca_spin_unlock(ptr)						\
+	pcpu_spin_unlock(lock, ptr)
+
 #ifndef CONFIG_SLUB_TINY
 #define __fastpath_inline __always_inline
 #else
@@ -3326,6 +3399,32 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	return p;
 }
 
+static inline void *alloc_from_pca(struct kmem_cache *s)
+{
+	unsigned long __maybe_unused UP_flags;
+	struct slub_percpu_array *pca;
+	void *object = NULL;
+
+	pcp_trylock_prepare(UP_flags);
+	pca = pca_spin_trylock(s->cpu_array);
+
+	if (unlikely(!pca))
+		goto failed;
+
+	if (likely(pca->used > 0)) {
+		object = pca->objects[--pca->used];
+		pca_spin_unlock(pca);
+		pcp_trylock_finish(UP_flags);
+		stat(s, ALLOC_PERCPU_CACHE);
+		return object;
+	}
+	pca_spin_unlock(pca);
+
+failed:
+	pcp_trylock_finish(UP_flags);
+	return NULL;
+}
+
 static __always_inline void *__slab_alloc_node(struct kmem_cache *s,
 		gfp_t gfpflags, int node, unsigned long addr, size_t orig_size)
 {
@@ -3465,7 +3564,11 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
 	if (unlikely(object))
 		goto out;
 
-	object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
+	if (s->cpu_array)
+		object = alloc_from_pca(s);
+
+	if (!object)
+		object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
 
 	maybe_wipe_obj_freeptr(s, object);
 	init = slab_want_init_on_alloc(gfpflags, s);
@@ -3715,6 +3818,34 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
 	discard_slab(s, slab);
 }
 
+static inline bool free_to_pca(struct kmem_cache *s, void *object)
+{
+	unsigned long __maybe_unused UP_flags;
+	struct slub_percpu_array *pca;
+	bool ret = false;
+
+	pcp_trylock_prepare(UP_flags);
+	pca = pca_spin_trylock(s->cpu_array);
+
+	if (!pca) {
+		pcp_trylock_finish(UP_flags);
+		return false;
+	}
+
+	if (pca->used < pca->count) {
+		pca->objects[pca->used++] = object;
+		ret = true;
+	}
+
+	pca_spin_unlock(pca);
+	pcp_trylock_finish(UP_flags);
+
+	if (ret)
+		stat(s, FREE_PERCPU_CACHE);
+
+	return ret;
+}
+
 #ifndef CONFIG_SLUB_TINY
 /*
  * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
@@ -3740,6 +3871,11 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
 	unsigned long tid;
 	void **freelist;
 
+	if (s->cpu_array && cnt == 1) {
+		if (free_to_pca(s, head))
+			return;
+	}
+
 redo:
 	/*
 	 * Determine the currently cpus per cpu slab.
@@ -3793,6 +3929,11 @@ static void do_slab_free(struct kmem_cache *s,
 {
 	void *tail_obj = tail ? : head;
 
+	if (s->cpu_array && cnt == 1) {
+		if (free_to_pca(s, head))
+			return;
+	}
+
 	__slab_free(s, slab, head, tail_obj, cnt, addr);
 }
 #endif /* CONFIG_SLUB_TINY */
@@ -4060,6 +4201,45 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 }
 EXPORT_SYMBOL(kmem_cache_alloc_bulk);
 
+int kmem_cache_prefill_percpu_array(struct kmem_cache *s, unsigned int count,
+		gfp_t gfp)
+{
+	struct slub_percpu_array *pca;
+	void *objects[32];
+	unsigned int used;
+	unsigned int allocated;
+
+	if (!s->cpu_array)
+		return -EINVAL;
+
+	/* racy but we don't care */
+	pca = raw_cpu_ptr(s->cpu_array);
+
+	used = READ_ONCE(pca->used);
+
+	if (used >= count)
+		return 0;
+
+	if (pca->count < count)
+		return -EINVAL;
+
+	count -= used;
+
+	/* TODO fix later */
+	if (count > 32)
+		count = 32;
+
+	for (int i = 0; i < count; i++)
+		objects[i] = NULL;
+	allocated = kmem_cache_alloc_bulk(s, gfp, count, &objects[0]);
+
+	for (int i = 0; i < count; i++) {
+		if (objects[i]) {
+			kmem_cache_free(s, objects[i]);
+		}
+	}
+	return allocated;
+}
 
 /*
  * Object placement in a slab is made very easy because we always start at
@@ -5131,6 +5311,30 @@ int __kmem_cache_create(struct kmem_cache *s, slab_flags_t flags)
 	return 0;
 }
 
+int kmem_cache_setup_percpu_array(struct kmem_cache *s, unsigned int count)
+{
+	int cpu;
+
+	if (WARN_ON_ONCE(!(s->flags & SLAB_NO_MERGE)))
+		return -EINVAL;
+
+	s->cpu_array = __alloc_percpu(struct_size(s->cpu_array, objects, count),
+					sizeof(void *));
+
+	if (!s->cpu_array)
+		return -ENOMEM;
+
+	for_each_possible_cpu(cpu) {
+		struct slub_percpu_array *pca = per_cpu_ptr(s->cpu_array, cpu);
+
+		spin_lock_init(&pca->lock);
+		pca->count = count;
+		pca->used = 0;
+	}
+
+	return 0;
+}
+
 #ifdef SLAB_SUPPORTS_SYSFS
 static int count_inuse(struct slab *slab)
 {
@@ -5908,8 +6112,10 @@ static ssize_t text##_store(struct kmem_cache *s,		\
 }								\
 SLAB_ATTR(text);						\
 
+STAT_ATTR(ALLOC_PERCPU_CACHE, alloc_cpu_cache);
 STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath);
 STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
+STAT_ATTR(FREE_PERCPU_CACHE, free_cpu_cache);
 STAT_ATTR(FREE_FASTPATH, free_fastpath);
 STAT_ATTR(FREE_SLOWPATH, free_slowpath);
 STAT_ATTR(FREE_FROZEN, free_frozen);
@@ -5995,8 +6201,10 @@ static struct attribute *slab_attrs[] = {
 	&remote_node_defrag_ratio_attr.attr,
 #endif
 #ifdef CONFIG_SLUB_STATS
+	&alloc_cpu_cache_attr.attr,
 	&alloc_fastpath_attr.attr,
 	&alloc_slowpath_attr.attr,
+	&free_cpu_cache_attr.attr,
 	&free_fastpath_attr.attr,
 	&free_slowpath_attr.attr,
 	&free_frozen_attr.attr,
-- 
2.41.0



* [RFC v1 3/5] maple_tree: use slub percpu array
  2023-08-08  9:53 [RFC v1 0/5] SLUB percpu array caches and maple tree nodes Vlastimil Babka
  2023-08-08  9:53 ` [RFC v1 1/5] mm, slub: fix bulk alloc and free stats Vlastimil Babka
  2023-08-08  9:53 ` [RFC v1 2/5] mm, slub: add opt-in slub_percpu_array Vlastimil Babka
@ 2023-08-08  9:53 ` Vlastimil Babka
  2023-08-08  9:53 ` [RFC v1 4/5] maple_tree: avoid bulk alloc/free to use percpu array more Vlastimil Babka
  2023-08-08  9:53 ` [RFC v1 5/5] maple_tree: replace preallocation with slub percpu array prefill Vlastimil Babka
  4 siblings, 0 replies; 14+ messages in thread
From: Vlastimil Babka @ 2023-08-08  9:53 UTC (permalink / raw)
  To: Liam R. Howlett, Matthew Wilcox, Christoph Lameter,
	David Rientjes, Pekka Enberg, Joonsoo Kim
  Cc: Hyeonggon Yoo, Roman Gushchin, linux-mm, linux-kernel, patches,
	Vlastimil Babka

Just make sure the maple_node_cache has a percpu array of size 32.

Will break with CONFIG_SLAB.
---
 lib/maple_tree.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 4dd73cf936a6..1196d0a17f03 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -6180,9 +6180,16 @@ bool mas_nomem(struct ma_state *mas, gfp_t gfp)
 
 void __init maple_tree_init(void)
 {
+	int ret;
+
 	maple_node_cache = kmem_cache_create("maple_node",
 			sizeof(struct maple_node), sizeof(struct maple_node),
-			SLAB_PANIC, NULL);
+			SLAB_PANIC | SLAB_NO_MERGE, NULL);
+
+	ret = kmem_cache_setup_percpu_array(maple_node_cache, 32);
+
+	if (ret)
+		pr_warn("error %d creating percpu_array for maple_node_cache\n", ret);
 }
 
 /**
-- 
2.41.0



* [RFC v1 4/5] maple_tree: avoid bulk alloc/free to use percpu array more
  2023-08-08  9:53 [RFC v1 0/5] SLUB percpu array caches and maple tree nodes Vlastimil Babka
                   ` (2 preceding siblings ...)
  2023-08-08  9:53 ` [RFC v1 3/5] maple_tree: use slub percpu array Vlastimil Babka
@ 2023-08-08  9:53 ` Vlastimil Babka
  2023-08-08 11:17   ` Peng Zhang
  2023-08-08  9:53 ` [RFC v1 5/5] maple_tree: replace preallocation with slub percpu array prefill Vlastimil Babka
  4 siblings, 1 reply; 14+ messages in thread
From: Vlastimil Babka @ 2023-08-08  9:53 UTC (permalink / raw)
  To: Liam R. Howlett, Matthew Wilcox, Christoph Lameter,
	David Rientjes, Pekka Enberg, Joonsoo Kim
  Cc: Hyeonggon Yoo, Roman Gushchin, linux-mm, linux-kernel, patches,
	Vlastimil Babka

Using bulk alloc/free on a cache with percpu array should not be
necessary and the bulk alloc actually bypasses the array (the prefill
functionality currently relies on this).

The simplest change is just to convert the respective maple tree
wrappers to do a loop of normal alloc/free.
---
 lib/maple_tree.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 1196d0a17f03..7a8e7c467d7c 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -161,12 +161,19 @@ static inline struct maple_node *mt_alloc_one(gfp_t gfp)
 
 static inline int mt_alloc_bulk(gfp_t gfp, size_t size, void **nodes)
 {
-	return kmem_cache_alloc_bulk(maple_node_cache, gfp, size, nodes);
+	int allocated = 0;
+	for (size_t i = 0; i < size; i++) {
+		nodes[i] = kmem_cache_alloc(maple_node_cache, gfp);
+		if (nodes[i])
+			allocated++;
+	}
+	return allocated;
 }
 
 static inline void mt_free_bulk(size_t size, void __rcu **nodes)
 {
-	kmem_cache_free_bulk(maple_node_cache, size, (void **)nodes);
+	for (size_t i = 0; i < size; i++)
+		kmem_cache_free(maple_node_cache, nodes[i]);
 }
 
 static void mt_free_rcu(struct rcu_head *head)
-- 
2.41.0



* [RFC v1 5/5] maple_tree: replace preallocation with slub percpu array prefill
  2023-08-08  9:53 [RFC v1 0/5] SLUB percpu array caches and maple tree nodes Vlastimil Babka
                   ` (3 preceding siblings ...)
  2023-08-08  9:53 ` [RFC v1 4/5] maple_tree: avoid bulk alloc/free to use percpu array more Vlastimil Babka
@ 2023-08-08  9:53 ` Vlastimil Babka
  2023-08-08 14:37   ` Liam R. Howlett
  2023-08-08 19:03   ` Liam R. Howlett
  4 siblings, 2 replies; 14+ messages in thread
From: Vlastimil Babka @ 2023-08-08  9:53 UTC (permalink / raw)
  To: Liam R. Howlett, Matthew Wilcox, Christoph Lameter,
	David Rientjes, Pekka Enberg, Joonsoo Kim
  Cc: Hyeonggon Yoo, Roman Gushchin, linux-mm, linux-kernel, patches,
	Vlastimil Babka

With the percpu array we can try not doing the preallocations in the
maple tree, and instead make sure the percpu array is prefilled, and use
GFP_ATOMIC in places that relied on the preallocation (in case we miss
or fail trylock on the array), i.e. mas_store_prealloc(). For now simply
add __GFP_NOFAIL there as well.

First I tried to change mas_node_count_gfp() to not preallocate anything
anywhere, but that led to warns and panics, even though the other
caller mas_node_count() uses GFP_NOWAIT | __GFP_NOWARN so it has no
guarantees... So I changed just mas_preallocate(). I let it still truly
preallocate a single node, but maybe that's not necessary?
---
 lib/maple_tree.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 7a8e7c467d7c..5a209d88c318 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -5534,7 +5534,12 @@ void mas_store_prealloc(struct ma_state *mas, void *entry)
 
 	mas_wr_store_setup(&wr_mas);
 	trace_ma_write(__func__, mas, 0, entry);
+
+retry:
 	mas_wr_store_entry(&wr_mas);
+	if (unlikely(mas_nomem(mas, GFP_ATOMIC | __GFP_NOFAIL)))
+		goto retry;
+
 	MAS_WR_BUG_ON(&wr_mas, mas_is_err(mas));
 	mas_destroy(mas);
 }
@@ -5550,9 +5555,10 @@ EXPORT_SYMBOL_GPL(mas_store_prealloc);
 int mas_preallocate(struct ma_state *mas, gfp_t gfp)
 {
 	int ret;
+	int count = 1 + mas_mt_height(mas) * 3;
 
-	mas_node_count_gfp(mas, 1 + mas_mt_height(mas) * 3, gfp);
-	mas->mas_flags |= MA_STATE_PREALLOC;
+	mas_node_count_gfp(mas, 1, gfp);
+	kmem_cache_prefill_percpu_array(maple_node_cache, count, gfp);
 	if (likely(!mas_is_err(mas)))
 		return 0;
 
-- 
2.41.0



* Re: [RFC v1 4/5] maple_tree: avoid bulk alloc/free to use percpu array more
  2023-08-08  9:53 ` [RFC v1 4/5] maple_tree: avoid bulk alloc/free to use percpu array more Vlastimil Babka
@ 2023-08-08 11:17   ` Peng Zhang
  2023-08-08 14:29     ` Liam R. Howlett
  0 siblings, 1 reply; 14+ messages in thread
From: Peng Zhang @ 2023-08-08 11:17 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Hyeonggon Yoo, Joonsoo Kim, Pekka Enberg, David Rientjes,
	Christoph Lameter, Liam R. Howlett, Roman Gushchin, linux-mm,
	linux-kernel, patches, Matthew Wilcox



On 2023/8/8 17:53, Vlastimil Babka wrote:
> Using bulk alloc/free on a cache with percpu array should not be
> necessary and the bulk alloc actually bypasses the array (the prefill
> functionality currently relies on this).
> 
> The simplest change is just to convert the respective maple tree
> wrappers to do a loop of normal alloc/free.
> ---
>   lib/maple_tree.c | 11 +++++++++--
>   1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index 1196d0a17f03..7a8e7c467d7c 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -161,12 +161,19 @@ static inline struct maple_node *mt_alloc_one(gfp_t gfp)
>   
>   static inline int mt_alloc_bulk(gfp_t gfp, size_t size, void **nodes)
>   {
> -	return kmem_cache_alloc_bulk(maple_node_cache, gfp, size, nodes);
> +	int allocated = 0;
> +	for (size_t i = 0; i < size; i++) {
> +		nodes[i] = kmem_cache_alloc(maple_node_cache, gfp);
> +		if (nodes[i])
If the i-th allocation fails, node[i] will be NULL. This is wrong. We'd
better guarantee that mt_alloc_bulk() allocates completely successfully,
or returns 0. The following cases are not allowed:
nodes: [addr1][addr2][NULL][addr3].
> +			allocated++;
> +	}
> +	return allocated;
>   }
>   
>   static inline void mt_free_bulk(size_t size, void __rcu **nodes)
>   {
> -	kmem_cache_free_bulk(maple_node_cache, size, (void **)nodes);
> +	for (size_t i = 0; i < size; i++)
> +		kmem_cache_free(maple_node_cache, nodes[i]);
>   }
>   
>   static void mt_free_rcu(struct rcu_head *head)


* Re: [RFC v1 2/5] mm, slub: add opt-in slub_percpu_array
  2023-08-08  9:53 ` [RFC v1 2/5] mm, slub: add opt-in slub_percpu_array Vlastimil Babka
@ 2023-08-08 12:06   ` Pedro Falcato
  0 siblings, 0 replies; 14+ messages in thread
From: Pedro Falcato @ 2023-08-08 12:06 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Matthew Wilcox, Christoph Lameter,
	David Rientjes, Pekka Enberg, Joonsoo Kim, Hyeonggon Yoo,
	Roman Gushchin, linux-mm, linux-kernel, patches

On Tue, Aug 8, 2023 at 10:54 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> kmem_cache_setup_percpu_array() will allocate a per-cpu array for
> caching alloc/free objects of given size for the cache. The cache
> has to be created with SLAB_NO_MERGE flag.
>
> The array is filled by freeing. When empty for alloc or full for
> freeing, it's simply bypassed by the operation, there's currently no
> batch freeing/allocations.
>
> The locking is copied from the page allocator's pcplists, based on
> embedded spin locks. Interrupts are not disabled, only preemption (cpu
> migration on RT). Trylock is attempted to avoid deadlock due to
> an intnerrupt, trylock failure means the array is bypassed.
>
> Sysfs stat counters alloc_cpu_cache and free_cpu_cache count operations
> that used the percpu array.
>
> Bulk allocation bypasses the array, bulk freeing does not.
>
> kmem_cache_prefill_percpu_array() can be called to ensure the array on
> the current cpu to at least the given number of objects. However this is
> only opportunistic as there's no cpu pinning and the trylocks may always
> fail. Therefore allocations cannot rely on the array for success even
> after the prefill. But misses should be rare enough that e.g. GFP_ATOMIC
> allocations should be acceptable after the refill.
> The operation is currently not optimized.

As I asked on IRC, I'm curious about three questions:

1) How does this affect SLUB's anti-queueing ideas?

2) Since this is so similar to SLAB's caching, is it realistic to make
this opt-out instead?

3) What performance difference do you expect/see from benchmarks?

> More TODO/FIXMEs:
>
> - NUMA awareness - preferred node currently ignored, __GFP_THISNODE not
>   honored
> - slub_debug - will not work for allocations from the array. Normally in
>   SLUB implementation the slub_debug kills all fast paths, but that
>   could lead to depleting the reserves if we ignore the prefill and use
>   GFP_ATOMIC. Needs more thought.
> ---
>  include/linux/slab.h     |   4 +
>  include/linux/slub_def.h |  10 ++
>  mm/slub.c                | 210 ++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 223 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 848c7c82ad5a..f6c91cbc1544 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -196,6 +196,8 @@ struct kmem_cache *kmem_cache_create_usercopy(const char *name,
>  void kmem_cache_destroy(struct kmem_cache *s);
>  int kmem_cache_shrink(struct kmem_cache *s);
>
> +int kmem_cache_setup_percpu_array(struct kmem_cache *s, unsigned int count);
> +
>  /*
>   * Please use this macro to create slab caches. Simply specify the
>   * name of the structure and maybe some flags that are listed above.
> @@ -494,6 +496,8 @@ void kmem_cache_free(struct kmem_cache *s, void *objp);
>  void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p);
>  int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, void **p);
>
> +int kmem_cache_prefill_percpu_array(struct kmem_cache *s, unsigned int count, gfp_t gfp);
> +
>  static __always_inline void kfree_bulk(size_t size, void **p)
>  {
>         kmem_cache_free_bulk(NULL, size, p);
> diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
> index deb90cf4bffb..c85434668419 100644
> --- a/include/linux/slub_def.h
> +++ b/include/linux/slub_def.h
> @@ -13,8 +13,10 @@
>  #include <linux/local_lock.h>
>
>  enum stat_item {
> +       ALLOC_PERCPU_CACHE,     /* Allocation from percpu array cache */
>         ALLOC_FASTPATH,         /* Allocation from cpu slab */
>         ALLOC_SLOWPATH,         /* Allocation by getting a new cpu slab */
> +       FREE_PERCPU_CACHE,      /* Free to percpu array cache */
>         FREE_FASTPATH,          /* Free to cpu slab */
>         FREE_SLOWPATH,          /* Freeing not to cpu slab */
>         FREE_FROZEN,            /* Freeing to frozen slab */
> @@ -66,6 +68,13 @@ struct kmem_cache_cpu {
>  };
>  #endif /* CONFIG_SLUB_TINY */
>
> +struct slub_percpu_array {
> +       spinlock_t lock;

Since this is a percpu array, you probably want to avoid a lock here.
An idea would be to have some sort of bool accessing;
and then doing:

preempt_disable();
WRITE_ONCE(accessing, 1);

/* doing pcpu array stuff */
WRITE_ONCE(accessing, 0);
preempt_enable();

which would avoid the atomic in a fast path while still giving you
safety on IRQ paths. Although reclamation gets harder as you stop
being able to reclaim these pcpu arrays from other CPUs.

-- 
Pedro


* Re: [RFC v1 4/5] maple_tree: avoid bulk alloc/free to use percpu array more
  2023-08-08 11:17   ` Peng Zhang
@ 2023-08-08 14:29     ` Liam R. Howlett
  2023-08-08 15:08       ` Vlastimil Babka
  0 siblings, 1 reply; 14+ messages in thread
From: Liam R. Howlett @ 2023-08-08 14:29 UTC (permalink / raw)
  To: Peng Zhang
  Cc: Vlastimil Babka, Hyeonggon Yoo, Joonsoo Kim, Pekka Enberg,
	David Rientjes, Christoph Lameter, Roman Gushchin, linux-mm,
	linux-kernel, patches, Matthew Wilcox

* Peng Zhang <zhangpeng.00@bytedance.com> [230808 07:17]:
> 
> 
> On 2023/8/8 17:53, Vlastimil Babka wrote:
> > Using bulk alloc/free on a cache with percpu array should not be
> > necessary and the bulk alloc actually bypasses the array (the prefill
> > functionality currently relies on this).
> > 
> > The simplest change is just to convert the respective maple tree
> > wrappers to do a loop of normal alloc/free.
> > ---
> >   lib/maple_tree.c | 11 +++++++++--
> >   1 file changed, 9 insertions(+), 2 deletions(-)
> > 
> > diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> > index 1196d0a17f03..7a8e7c467d7c 100644
> > --- a/lib/maple_tree.c
> > +++ b/lib/maple_tree.c
> > @@ -161,12 +161,19 @@ static inline struct maple_node *mt_alloc_one(gfp_t gfp)
> >   static inline int mt_alloc_bulk(gfp_t gfp, size_t size, void **nodes)
> >   {
> > -	return kmem_cache_alloc_bulk(maple_node_cache, gfp, size, nodes);
> > +	int allocated = 0;
> > +	for (size_t i = 0; i < size; i++) {
> > +		nodes[i] = kmem_cache_alloc(maple_node_cache, gfp);
> > +		if (nodes[i])
> If the i-th allocation fails, node[i] will be NULL. This is wrong. We'd
> better guarantee that mt_alloc_bulk() allocates completely successfully,
> or returns 0. The following cases are not allowed:
> nodes: [addr1][addr2][NULL][addr3].

Thanks for pointing this out Peng.

We can handle a lower number than requested being returned, but we
cannot handle the sparse data.

The kmem_cache_alloc_bulk() can return a failure today - leaving the
array to be cleaned by the caller, so if this is changed to a full
success or full fail, then we will also have to change the caller to
handle whatever state is returned if it differs from
kmem_cache_alloc_bulk().

It might be best to return the size already allocated when a failure is
encountered. This will make the caller, mas_alloc_nodes(), request more
nodes.  Only in the case of zero allocations would this be seen as an
OOM event.

Vlastimil, Is the first kmem_cache_alloc() call failing a possibility?
If so, what should be the corrective action?

> > +			allocated++;
> > +	}
> > +	return allocated;
> >   }
> >   static inline void mt_free_bulk(size_t size, void __rcu **nodes)
> >   {
> > -	kmem_cache_free_bulk(maple_node_cache, size, (void **)nodes);
> > +	for (size_t i = 0; i < size; i++)
> > +		kmem_cache_free(maple_node_cache, nodes[i]);
> >   }
> >   static void mt_free_rcu(struct rcu_head *head)


* Re: [RFC v1 5/5] maple_tree: replace preallocation with slub percpu array prefill
  2023-08-08  9:53 ` [RFC v1 5/5] maple_tree: replace preallocation with slub percpu array prefill Vlastimil Babka
@ 2023-08-08 14:37   ` Liam R. Howlett
  2023-08-08 19:01     ` Liam R. Howlett
  2023-08-08 19:03   ` Liam R. Howlett
  1 sibling, 1 reply; 14+ messages in thread
From: Liam R. Howlett @ 2023-08-08 14:37 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Matthew Wilcox, Christoph Lameter, David Rientjes, Pekka Enberg,
	Joonsoo Kim, Hyeonggon Yoo, Roman Gushchin, linux-mm,
	linux-kernel, patches

* Vlastimil Babka <vbabka@suse.cz> [230808 05:53]:
> With the percpu array we can try not doing the preallocations in maple
> tree, and instead make sure the percpu array is prefilled, and using
> GFP_ATOMIC in places that relied on the preallocation (in case we miss
> or fail trylock on the array), i.e. mas_store_prealloc(). For now simply
> add __GFP_NOFAIL there as well.
> 
> First I tried to change mas_node_count_gfp() to not preallocate anything
> anywhere, but that lead to warns and panics, even though the other
> caller mas_node_count() uses GFP_NOWAIT | __GFP_NOWARN so it has no
> guarantees... So I changed just mas_preallocate(). I let it still to
> truly preallocate a single node, but maybe it's not necessary?

Ah, yes.  I added a check to make sure we didn't allocate more nodes
when using preallocations.  This check is what you are hitting when you
don't allocate anything.  This is tracked in mas_flags by
setting/clearing MA_STATE_PREALLOC.  Good news, that check works!

> ---
>  lib/maple_tree.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index 7a8e7c467d7c..5a209d88c318 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -5534,7 +5534,12 @@ void mas_store_prealloc(struct ma_state *mas, void *entry)
>  
>  	mas_wr_store_setup(&wr_mas);
>  	trace_ma_write(__func__, mas, 0, entry);
> +
> +retry:
>  	mas_wr_store_entry(&wr_mas);
> +	if (unlikely(mas_nomem(mas, GFP_ATOMIC | __GFP_NOFAIL)))
> +		goto retry;
> +
>  	MAS_WR_BUG_ON(&wr_mas, mas_is_err(mas));
>  	mas_destroy(mas);
>  }
> @@ -5550,9 +5555,10 @@ EXPORT_SYMBOL_GPL(mas_store_prealloc);
>  int mas_preallocate(struct ma_state *mas, gfp_t gfp)
>  {
>  	int ret;
> +	int count = 1 + mas_mt_height(mas) * 3;
>  
> -	mas_node_count_gfp(mas, 1 + mas_mt_height(mas) * 3, gfp);
> -	mas->mas_flags |= MA_STATE_PREALLOC;
> +	mas_node_count_gfp(mas, 1, gfp);
> +	kmem_cache_prefill_percpu_array(maple_node_cache, count, gfp);
>  	if (likely(!mas_is_err(mas)))
>  		return 0;
>  
> -- 
> 2.41.0
> 


* Re: [RFC v1 4/5] maple_tree: avoid bulk alloc/free to use percpu array more
  2023-08-08 14:29     ` Liam R. Howlett
@ 2023-08-08 15:08       ` Vlastimil Babka
  0 siblings, 0 replies; 14+ messages in thread
From: Vlastimil Babka @ 2023-08-08 15:08 UTC (permalink / raw)
  To: Liam R. Howlett, Peng Zhang, Hyeonggon Yoo, Joonsoo Kim,
	Pekka Enberg, David Rientjes, Christoph Lameter, Roman Gushchin,
	linux-mm, linux-kernel, patches, Matthew Wilcox

On 8/8/23 16:29, Liam R. Howlett wrote:
> * Peng Zhang <zhangpeng.00@bytedance.com> [230808 07:17]:
>> 
>> 
>> On 2023/8/8 17:53, Vlastimil Babka wrote:
>> > Using bulk alloc/free on a cache with percpu array should not be
>> > necessary and the bulk alloc actually bypasses the array (the prefill
>> > functionality currently relies on this).
>> > 
>> > The simplest change is just to convert the respective maple tree
>> > wrappers to do a loop of normal alloc/free.
>> > ---
>> >   lib/maple_tree.c | 11 +++++++++--
>> >   1 file changed, 9 insertions(+), 2 deletions(-)
>> > 
>> > diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>> > index 1196d0a17f03..7a8e7c467d7c 100644
>> > --- a/lib/maple_tree.c
>> > +++ b/lib/maple_tree.c
>> > @@ -161,12 +161,19 @@ static inline struct maple_node *mt_alloc_one(gfp_t gfp)
>> >   static inline int mt_alloc_bulk(gfp_t gfp, size_t size, void **nodes)
>> >   {
>> > -	return kmem_cache_alloc_bulk(maple_node_cache, gfp, size, nodes);
>> > +	int allocated = 0;
>> > +	for (size_t i = 0; i < size; i++) {
>> > +		nodes[i] = kmem_cache_alloc(maple_node_cache, gfp);
>> > +		if (nodes[i])
>> If the i-th allocation fails, node[i] will be NULL. This is wrong. We'd
>> better guarantee that mt_alloc_bulk() allocates completely successfully,
>> or returns 0. The following cases are not allowed:
>> nodes: [addr1][addr2][NULL][addr3].

Thanks, indeed. I guess it should just break; in case of failure and return
how many allocations succeeded so far.
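
Something like this, just as a sketch of that fix:

	static inline int mt_alloc_bulk(gfp_t gfp, size_t size, void **nodes)
	{
		int allocated = 0;

		for (size_t i = 0; i < size; i++) {
			nodes[i] = kmem_cache_alloc(maple_node_cache, gfp);
			if (!nodes[i])
				break;	/* return a dense prefix with no NULL holes */
			allocated++;
		}
		return allocated;
	}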

But note this is really a quick RFC proof-of-concept hack. I'd expect if
the whole idea is deemed as good, the maple tree node handling could be
redesigned (simplified?) around it and maybe there's no mt_alloc_bulk()
anymore as a result?

> Thanks for pointing this out Peng.
> 
> We can handle a lower number than requested being returned, but we
> cannot handle the sparse data.
> 
> The kmem_cache_alloc_bulk() can return a failure today - leaving the
> array to be cleaned by the caller, so if this is changed to a full
> success or full fail, then we will also have to change the caller to
> handle whatever state is returned if it differs from
> kmem_cache_alloc_bulk().
> 
> It might be best to return the size already allocated when a failure is
> encountered. This will make the caller, mas_alloc_nodes(), request more
> nodes.  Only in the case of zero allocations would this be seen as an
> OOM event.
> 
> Vlastimil, Is the first kmem_cache_alloc() call failing a possibility?

Sure, if there's no memory, it can fail. In practice if gfp is one that
allows reclaim, it will ultimately be the "too small to fail" allocation on
the page allocator level. But there are exceptions, like having received a
fatal signal, IIRC :)

> If so, what should be the corrective action?

Depends on your context, if you can pass on -ENOMEM to the caller, or need
to succeed.

>> > +			allocated++;
>> > +	}
>> > +	return allocated;
>> >   }
>> >   static inline void mt_free_bulk(size_t size, void __rcu **nodes)
>> >   {
>> > -	kmem_cache_free_bulk(maple_node_cache, size, (void **)nodes);
>> > +	for (size_t i = 0; i < size; i++)
>> > +		kmem_cache_free(maple_node_cache, nodes[i]);
>> >   }
>> >   static void mt_free_rcu(struct rcu_head *head)



* Re: [RFC v1 5/5] maple_tree: replace preallocation with slub percpu array prefill
  2023-08-08 14:37   ` Liam R. Howlett
@ 2023-08-08 19:01     ` Liam R. Howlett
  0 siblings, 0 replies; 14+ messages in thread
From: Liam R. Howlett @ 2023-08-08 19:01 UTC (permalink / raw)
  To: Vlastimil Babka, Matthew Wilcox, Christoph Lameter,
	David Rientjes, Pekka Enberg, Joonsoo Kim, Hyeonggon Yoo,
	Roman Gushchin, linux-mm, linux-kernel, patches

[-- Attachment #1: Type: text/plain, Size: 2350 bytes --]

* Liam R. Howlett <Liam.Howlett@Oracle.com> [230808 10:37]:
> * Vlastimil Babka <vbabka@suse.cz> [230808 05:53]:
> > With the percpu array we can try not doing the preallocations in maple
> > tree, and instead make sure the percpu array is prefilled, and using
> > GFP_ATOMIC in places that relied on the preallocation (in case we miss
> > or fail trylock on the array), i.e. mas_store_prealloc(). For now simply
> > add __GFP_NOFAIL there as well.
> > 
> > First I tried to change mas_node_count_gfp() to not preallocate anything
> > anywhere, but that lead to warns and panics, even though the other
> > caller mas_node_count() uses GFP_NOWAIT | __GFP_NOWARN so it has no
> > guarantees... So I changed just mas_preallocate(). I let it still to
> > truly preallocate a single node, but maybe it's not necessary?
> 
> Ah, yes.  I added a check to make sure we didn't allocate more nodes
> when using preallocations.  This check is what you are hitting when you
> don't allocate anything.  This is tracked in mas_flags by
> setting/clearing MA_STATE_PREALLOC.  Good news, that check works!

Adding the attached patch to your series prior to the below allows for
the removal of the extra preallocation.

> 
> > ---
> >  lib/maple_tree.c | 10 ++++++++--
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> > 
> > diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> > index 7a8e7c467d7c..5a209d88c318 100644
> > --- a/lib/maple_tree.c
> > +++ b/lib/maple_tree.c
> > @@ -5534,7 +5534,12 @@ void mas_store_prealloc(struct ma_state *mas, void *entry)
> >  
> >  	mas_wr_store_setup(&wr_mas);
> >  	trace_ma_write(__func__, mas, 0, entry);
> > +
> > +retry:
> >  	mas_wr_store_entry(&wr_mas);
> > +	if (unlikely(mas_nomem(mas, GFP_ATOMIC | __GFP_NOFAIL)))
> > +		goto retry;
> > +
> >  	MAS_WR_BUG_ON(&wr_mas, mas_is_err(mas));
> >  	mas_destroy(mas);
> >  }
> > @@ -5550,9 +5555,10 @@ EXPORT_SYMBOL_GPL(mas_store_prealloc);
> >  int mas_preallocate(struct ma_state *mas, gfp_t gfp)
> >  {
> >  	int ret;
> > +	int count = 1 + mas_mt_height(mas) * 3;
> >  
> > -	mas_node_count_gfp(mas, 1 + mas_mt_height(mas) * 3, gfp);
> > -	mas->mas_flags |= MA_STATE_PREALLOC;
> > +	mas_node_count_gfp(mas, 1, gfp);
> > +	kmem_cache_prefill_percpu_array(maple_node_cache, count, gfp);
> >  	if (likely(!mas_is_err(mas)))
> >  		return 0;
> >  
> > -- 
> > 2.41.0
> > 

[-- Attachment #2: 0001-maple_tree-Remove-MA_STATE_PREALLOC.patch --]
[-- Type: text/x-diff, Size: 2823 bytes --]

From 5f1f230ed424ec37d65a3537f03c7e6e961cc713 Mon Sep 17 00:00:00 2001
From: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Date: Tue, 8 Aug 2023 14:54:27 -0400
Subject: [PATCH] maple_tree: Remove MA_STATE_PREALLOC

MA_STATE_PREALLOC was added to catch any writes that try to allocate when
the maple state is being used in preallocation mode.  This can safely be
removed in favour of the percpu array of nodes.

Note that mas_expected_entries() still expects no allocations during
operation and so MA_STATE_BULK can be used in place of preallocations
for this case, which is primarily used for forking.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
 lib/maple_tree.c | 19 ++++++-------------
 1 file changed, 6 insertions(+), 13 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 5a209d88c318..78796342caf0 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -68,11 +68,9 @@
  * Maple state flags
  * * MA_STATE_BULK		- Bulk insert mode
  * * MA_STATE_REBALANCE		- Indicate a rebalance during bulk insert
- * * MA_STATE_PREALLOC		- Preallocated nodes, WARN_ON allocation
  */
 #define MA_STATE_BULK		1
 #define MA_STATE_REBALANCE	2
-#define MA_STATE_PREALLOC	4
 
 #define ma_parent_ptr(x) ((struct maple_pnode *)(x))
 #define ma_mnode_ptr(x) ((struct maple_node *)(x))
@@ -1279,11 +1277,8 @@ static inline void mas_alloc_nodes(struct ma_state *mas, gfp_t gfp)
 		return;
 
 	mas_set_alloc_req(mas, 0);
-	if (mas->mas_flags & MA_STATE_PREALLOC) {
-		if (allocated)
-			return;
-		WARN_ON(!allocated);
-	}
+	if (mas->mas_flags & MA_STATE_BULK)
+		return;
 
 	if (!allocated || mas->alloc->node_count == MAPLE_ALLOC_SLOTS) {
 		node = (struct maple_alloc *)mt_alloc_one(gfp);
@@ -5601,7 +5596,7 @@ void mas_destroy(struct ma_state *mas)
 
 		mas->mas_flags &= ~MA_STATE_REBALANCE;
 	}
-	mas->mas_flags &= ~(MA_STATE_BULK|MA_STATE_PREALLOC);
+	mas->mas_flags &= ~MA_STATE_BULK;
 
 	total = mas_allocated(mas);
 	while (total) {
@@ -5650,9 +5645,6 @@ int mas_expected_entries(struct ma_state *mas, unsigned long nr_entries)
 	 * of nodes during the operation.
 	 */
 
-	/* Optimize splitting for bulk insert in-order */
-	mas->mas_flags |= MA_STATE_BULK;
-
 	/*
 	 * Avoid overflow, assume a gap between each entry and a trailing null.
 	 * If this is wrong, it just means allocation can happen during
@@ -5669,8 +5661,9 @@ int mas_expected_entries(struct ma_state *mas, unsigned long nr_entries)
 	/* Add working room for split (2 nodes) + new parents */
 	mas_node_count(mas, nr_nodes + 3);
 
-	/* Detect if allocations run out */
-	mas->mas_flags |= MA_STATE_PREALLOC;
+	/* Optimize splitting for bulk insert in-order */
+	mas->mas_flags |= MA_STATE_BULK;
+
 
 	if (!mas_is_err(mas))
 		return 0;
-- 
2.39.2



* Re: [RFC v1 5/5] maple_tree: replace preallocation with slub percpu array prefill
  2023-08-08  9:53 ` [RFC v1 5/5] maple_tree: replace preallocation with slub percpu array prefill Vlastimil Babka
  2023-08-08 14:37   ` Liam R. Howlett
@ 2023-08-08 19:03   ` Liam R. Howlett
  1 sibling, 0 replies; 14+ messages in thread
From: Liam R. Howlett @ 2023-08-08 19:03 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Matthew Wilcox, Christoph Lameter, David Rientjes, Pekka Enberg,
	Joonsoo Kim, Hyeonggon Yoo, Roman Gushchin, linux-mm,
	linux-kernel, patches

[-- Attachment #1: Type: text/plain, Size: 1935 bytes --]

* Vlastimil Babka <vbabka@suse.cz> [230808 05:53]:
> With the percpu array we can try not doing the preallocations in maple
> tree, and instead make sure the percpu array is prefilled, and using
> GFP_ATOMIC in places that relied on the preallocation (in case we miss
> or fail trylock on the array), i.e. mas_store_prealloc(). For now simply
> add __GFP_NOFAIL there as well.
> 
> First I tried to change mas_node_count_gfp() to not preallocate anything
> anywhere, but that lead to warns and panics, even though the other
> caller mas_node_count() uses GFP_NOWAIT | __GFP_NOWARN so it has no
> guarantees... So I changed just mas_preallocate(). I let it still to
> truly preallocate a single node, but maybe it's not necessary?

Here's a patch to add the percpu array interface to the testing code.

Note that the maple tree preallocation testing isn't updated.

> ---
>  lib/maple_tree.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index 7a8e7c467d7c..5a209d88c318 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -5534,7 +5534,12 @@ void mas_store_prealloc(struct ma_state *mas, void *entry)
>  
>  	mas_wr_store_setup(&wr_mas);
>  	trace_ma_write(__func__, mas, 0, entry);
> +
> +retry:
>  	mas_wr_store_entry(&wr_mas);
> +	if (unlikely(mas_nomem(mas, GFP_ATOMIC | __GFP_NOFAIL)))
> +		goto retry;
> +
>  	MAS_WR_BUG_ON(&wr_mas, mas_is_err(mas));
>  	mas_destroy(mas);
>  }
> @@ -5550,9 +5555,10 @@ EXPORT_SYMBOL_GPL(mas_store_prealloc);
>  int mas_preallocate(struct ma_state *mas, gfp_t gfp)
>  {
>  	int ret;
> +	int count = 1 + mas_mt_height(mas) * 3;
>  
> -	mas_node_count_gfp(mas, 1 + mas_mt_height(mas) * 3, gfp);
> -	mas->mas_flags |= MA_STATE_PREALLOC;
> +	mas_node_count_gfp(mas, 1, gfp);
> +	kmem_cache_prefill_percpu_array(maple_node_cache, count, gfp);
>  	if (likely(!mas_is_err(mas)))
>  		return 0;
>  
> -- 
> 2.41.0
> 

[-- Attachment #2: 0001-tools-Add-SLUB-percpu-array-functions-for-testing.patch --]
[-- Type: text/x-diff, Size: 2645 bytes --]

From 3d4ac32997cab4fc8e957deb4a58a5a7afa485bc Mon Sep 17 00:00:00 2001
From: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Date: Tue, 8 Aug 2023 14:58:13 -0400
Subject: [PATCH] tools: Add SLUB percpu array functions for testing

Support the new percpu array functions in the test code so they can be
used in the maple tree testing.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
 tools/include/linux/slab.h              |  4 ++++
 tools/testing/radix-tree/linux.c        | 14 ++++++++++++++
 tools/testing/radix-tree/linux/kernel.h |  1 +
 3 files changed, 19 insertions(+)

diff --git a/tools/include/linux/slab.h b/tools/include/linux/slab.h
index 311759ea25e9..1043f9c5ef4e 100644
--- a/tools/include/linux/slab.h
+++ b/tools/include/linux/slab.h
@@ -7,6 +7,7 @@
 
 #define SLAB_PANIC 2
 #define SLAB_RECLAIM_ACCOUNT    0x00020000UL            /* Objects are reclaimable */
+#define SLAB_NO_MERGE		0x01000000UL		/* Prevent merging with compatible kmem caches */
 
 #define kzalloc_node(size, flags, node) kmalloc(size, flags)
 
@@ -45,4 +46,7 @@ void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t size, void **list);
 int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
 			  void **list);
 
+int kmem_cache_setup_percpu_array(struct kmem_cache *s, unsigned int count);
+int kmem_cache_prefill_percpu_array(struct kmem_cache *s, unsigned int count,
+		gfp_t gfp);
 #endif		/* _TOOLS_SLAB_H */
diff --git a/tools/testing/radix-tree/linux.c b/tools/testing/radix-tree/linux.c
index d587a558997f..cbe7937fdd5e 100644
--- a/tools/testing/radix-tree/linux.c
+++ b/tools/testing/radix-tree/linux.c
@@ -187,6 +187,20 @@ int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
 	return size;
 }
 
+int kmem_cache_setup_percpu_array(struct kmem_cache *s, unsigned int count)
+{
+	return 0;
+}
+
+int kmem_cache_prefill_percpu_array(struct kmem_cache *s, unsigned int count,
+		gfp_t gfp)
+{
+	if (count > s->non_kernel)
+		return s->non_kernel;
+
+	return count;
+}
+
 struct kmem_cache *
 kmem_cache_create(const char *name, unsigned int size, unsigned int align,
 		unsigned int flags, void (*ctor)(void *))
diff --git a/tools/testing/radix-tree/linux/kernel.h b/tools/testing/radix-tree/linux/kernel.h
index c5c9d05f29da..fc75018974de 100644
--- a/tools/testing/radix-tree/linux/kernel.h
+++ b/tools/testing/radix-tree/linux/kernel.h
@@ -15,6 +15,7 @@
 
 #define printk printf
 #define pr_err printk
+#define pr_warn printk
 #define pr_info printk
 #define pr_debug printk
 #define pr_cont printk
-- 
2.39.2



* Re: [RFC v1 1/5] mm, slub: fix bulk alloc and free stats
  2023-08-08  9:53 ` [RFC v1 1/5] mm, slub: fix bulk alloc and free stats Vlastimil Babka
@ 2023-08-18 11:47   ` Hyeonggon Yoo
  0 siblings, 0 replies; 14+ messages in thread
From: Hyeonggon Yoo @ 2023-08-18 11:47 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Matthew Wilcox, Christoph Lameter,
	David Rientjes, Pekka Enberg, Joonsoo Kim, Roman Gushchin,
	linux-mm, linux-kernel, patches

On Tue, Aug 8, 2023 at 6:53 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> The SLUB sysfs stats enabled CONFIG_SLUB_STATS have two deficiencies
> identified wrt bulk alloc/free operations:
>
> - Bulk allocations from cpu freelist are not counted. Add the
>   ALLOC_FASTPATH counter there.
>
> - Bulk fastpath freeing will count a list of multiple objects with a
>   single FREE_FASTPATH inc. Add a stat_add() variant to count them all.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/slub.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index e3b5d5c0eb3a..a9437d48840c 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -341,6 +341,14 @@ static inline void stat(const struct kmem_cache *s, enum stat_item si)
>  #endif
>  }
>
> +static inline void stat_add(const struct kmem_cache *s, enum stat_item si, int v)
> +{
> +#ifdef CONFIG_SLUB_STATS
> +       raw_cpu_add(s->cpu_slab->stat[si], v);
> +#endif
> +}
> +
> +
>  /*
>   * Tracks for which NUMA nodes we have kmem_cache_nodes allocated.
>   * Corresponds to node_state[N_NORMAL_MEMORY], but can temporarily
> @@ -3776,7 +3784,7 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
>
>                 local_unlock(&s->cpu_slab->lock);
>         }
> -       stat(s, FREE_FASTPATH);
> +       stat_add(s, FREE_FASTPATH, cnt);

Should bulk free slowpath also be counted in the same way?

>  }
>  #else /* CONFIG_SLUB_TINY */
>  static void do_slab_free(struct kmem_cache *s,
> @@ -3978,6 +3986,7 @@ static inline int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
>                 c->freelist = get_freepointer(s, object);
>                 p[i] = object;
>                 maybe_wipe_obj_freeptr(s, p[i]);
> +               stat(s, ALLOC_FASTPATH);
>         }
>         c->tid = next_tid(c->tid);
>         local_unlock_irqrestore(&s->cpu_slab->lock, irqflags);
> --
> 2.41.0
>

