* [PATCH v9 0/5] lib,kprobes: kretprobe scalability improvement
@ 2023-09-05  1:52 wuqiang.matt
  2023-09-05  1:52 ` [PATCH v9 1/5] lib: objpool added: ring-array based lockless MPMC wuqiang.matt
                   ` (5 more replies)
  0 siblings, 6 replies; 25+ messages in thread
From: wuqiang.matt @ 2023-09-05  1:52 UTC (permalink / raw)
  To: linux-trace-kernel, mhiramat, davem, anil.s.keshavamurthy,
	naveen.n.rao, rostedt, peterz, akpm, sander, ebiggers,
	dan.j.williams, jpoimboe
  Cc: linux-kernel, lkp, mattwu, wuqiang.matt

This patch series introduces a scalable and lockless ring-array based
object pool and replaces the original freelist (a LIFO queue based on
singly linked list) to improve scalability of kretprobed routines.

v9:
  1) objpool: raw_local_irq_save/restore added to prevent interruption

     To avoid possible ABA issues, we must ensure objpool_try_get_slot
     and objpool_try_add_slot are uninterruptible. If either operation
     is blocked or interrupted in the middle, other cores could wrap
     the slot's 32-bit ages[] around; after resuming, the interrupted
     pop() or push() could then see the same ages[] value again, which
     is a typical ABA problem, though the probability is small.

     A pop()/push() pair costs about 8.53 cpu cycles, as measured by
     IACA (Intel Architecture Code Analyzer). That is, with a 4GHz core
     dedicated to pop() & push(), a 32-bit value could theoretically
     wrap in roughly 9 seconds (2^32 pairs * 8.53 cycles at 4GHz). A
     test on an Intel i7-10700 (2.90GHz) took 71.88 seconds to wrap a
     32-bit integer. A minimal sketch of the guarding pattern follows
     this change list.

  2) code improvements: thanks to Masami for the thorough inspection
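
     For reference, the guarding pattern reads roughly as below. This is
     a lightly simplified sketch of objpool_push() from lib/objpool.c in
     this series (parameter renamed for readability); the point is only
     the raw_local_irq_save/restore wrapping added in v9:

	int objpool_push(void *obj, struct objpool_head *pool)
	{
		unsigned long flags;
		int cpu, rc;

		/* no preemption or interrupt between the tail CAS and
		 * the matching ages[] update on this cpu */
		raw_local_irq_save(flags);
		cpu = raw_smp_processor_id();
		do {
			rc = objpool_try_add_slot(obj, pool->cpu_slots[cpu]);
			if (!rc)
				break;
			cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
		} while (1);
		raw_local_irq_restore(flags);

		return rc;
	}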

v8:
  1) objpool: refcount added for objpool lifecycle management

wuqiang.matt (5):
  lib: objpool added: ring-array based lockless MPMC
  lib: objpool test module added
  kprobes: kretprobe scalability improvement with objpool
  kprobes: freelist.h removed
  MAINTAINERS: objpool added

 MAINTAINERS              |   7 +
 include/linux/freelist.h | 129 --------
 include/linux/kprobes.h  |  11 +-
 include/linux/objpool.h  | 174 ++++++++++
 include/linux/rethook.h  |  16 +-
 kernel/kprobes.c         |  93 +++---
 kernel/trace/fprobe.c    |  32 +-
 kernel/trace/rethook.c   |  90 +++--
 lib/Kconfig.debug        |  11 +
 lib/Makefile             |   4 +-
 lib/objpool.c            | 338 +++++++++++++++++++
 lib/test_objpool.c       | 689 +++++++++++++++++++++++++++++++++++++++
 12 files changed, 1320 insertions(+), 274 deletions(-)
 delete mode 100644 include/linux/freelist.h
 create mode 100644 include/linux/objpool.h
 create mode 100644 lib/objpool.c
 create mode 100644 lib/test_objpool.c

-- 
2.40.1



* [PATCH v9 1/5] lib: objpool added: ring-array based lockless MPMC
  2023-09-05  1:52 [PATCH v9 0/5] lib,kprobes: kretprobe scalability improvement wuqiang.matt
@ 2023-09-05  1:52 ` wuqiang.matt
  2023-09-23  9:48   ` Masami Hiramatsu
                     ` (2 more replies)
  2023-09-05  1:52 ` [PATCH v9 2/5] lib: objpool test module added wuqiang.matt
                   ` (4 subsequent siblings)
  5 siblings, 3 replies; 25+ messages in thread
From: wuqiang.matt @ 2023-09-05  1:52 UTC (permalink / raw)
  To: linux-trace-kernel, mhiramat, davem, anil.s.keshavamurthy,
	naveen.n.rao, rostedt, peterz, akpm, sander, ebiggers,
	dan.j.williams, jpoimboe
  Cc: linux-kernel, lkp, mattwu, wuqiang.matt

The object pool is a scalable implementation of a high-performance
queue for object allocation and reclamation, such as kretprobe
instances.

By leveraging percpu ring arrays to mitigate the hot spots of memory
contention, it delivers near-linear scalability in highly parallel
scenarios. The objpool is best suited for the following cases:
1) Memory allocation or reclamation is prohibited or too expensive
2) Consumers are of different priorities, such as irqs and threads

Limitations:
1) The maximum number of objects (capacity) is determined at pool
   initialization and can't be extended after the objpool is created
2) The memory of objects won't be freed until the objpool is finalized
3) Object allocation (pop) may fail after trying all cpu slots
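
A minimal usage sketch of the API added by this patch (caller-side only;
the names my_obj, my_objinit, my_release and my_pool_demo are illustrative
and not part of the patch):

	#include <linux/objpool.h>

	struct my_obj {
		int val;
	};

	/* called once per pre-allocated object, right after allocation */
	static int my_objinit(void *obj, void *context)
	{
		((struct my_obj *)obj)->val = 0;
		return 0;
	}

	/* called when the last reference of the pool is dropped */
	static int my_release(struct objpool_head *head, void *context)
	{
		return 0;
	}

	static int my_pool_demo(void)
	{
		struct objpool_head pool;
		struct my_obj *obj;
		int rc;

		/* pre-allocate 128 zeroed objects of sizeof(struct my_obj) */
		rc = objpool_init(&pool, 128, sizeof(struct my_obj),
				  GFP_KERNEL, NULL, my_objinit, my_release);
		if (rc)
			return rc;

		obj = objpool_pop(&pool);	/* NULL if no object available */
		if (obj)
			objpool_push(obj, &pool);	/* return it for reuse */

		objpool_fini(&pool);	/* drop unused objects, deref the pool */
		return 0;
	}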

Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com>
---
 include/linux/objpool.h | 174 +++++++++++++++++++++
 lib/Makefile            |   2 +-
 lib/objpool.c           | 338 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 513 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/objpool.h
 create mode 100644 lib/objpool.c

diff --git a/include/linux/objpool.h b/include/linux/objpool.h
new file mode 100644
index 000000000000..33c832216b98
--- /dev/null
+++ b/include/linux/objpool.h
@@ -0,0 +1,174 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _LINUX_OBJPOOL_H
+#define _LINUX_OBJPOOL_H
+
+#include <linux/types.h>
+#include <linux/refcount.h>
+
+/*
+ * objpool: ring-array based lockless MPMC queue
+ *
+ * Copyright: wuqiang.matt@bytedance.com
+ *
+ * The object pool is a scalable implementation of a high-performance
+ * queue for object allocation and reclamation, such as kretprobe instances.
+ *
+ * By leveraging per-cpu ring arrays to mitigate the hot spots of memory
+ * contention, it delivers near-linear scalability in highly parallel
+ * scenarios. The ring array is compactly managed in a single cache line
+ * to benefit from a warm L1 cache in most cases (<= 4 objects per core).
+ * The bodies of the pre-allocated objects are stored in contiguous
+ * cache lines just after the ring array.
+ *
+ * The object pool is interrupt safe. Both allocation and reclamation
+ * (object pop and push) are safe in preemptible or interrupt context.
+ *
+ * It's best suited for the following cases:
+ * 1) Memory allocation or reclamation is prohibited or too expensive
+ * 2) Consumers are of different priorities, such as irqs and threads
+ *
+ * Limitations:
+ * 1) The maximum number of objects (capacity) is fixed at pool init
+ * 2) The memory of objects won't be freed until the pool is finalized
+ * 3) Object allocation (pop) may fail after trying all cpu slots
+ */
+
+/**
+ * struct objpool_slot - percpu ring array of objpool
+ * @head: head of the local ring array (to retrieve at)
+ * @tail: tail of the local ring array (to append at)
+ * @bits: log2 of capacity (for bitwise operations)
+ * @mask: capacity - 1
+ *
+ * Represents a cpu-local array-based ring buffer; its size is specified
+ * during initialization of the object pool. The percpu objpool slot is
+ * allocated from node-local memory on NUMA systems and kept compact in
+ * contiguous memory: ages[] is stored just after the objpool_slot body,
+ * followed by entries[]. ages[] holds the revision of each item, used
+ * solely to avoid ABA; entries[] contains pointers to the actual objects
+ *
+ * Layout of objpool_slot in memory:
+ *
+ * 64bit:
+ *        4      8      12     16        32                 64
+ * | head | tail | bits | mask | ages[4] | ents[4]: (8 * 4) | objects
+ *
+ * 32bit:
+ *        4      8      12     16        32        48       64
+ * | head | tail | bits | mask | ages[4] | ents[4] | unused | objects
+ *
+ */
+struct objpool_slot {
+	uint32_t                head;
+	uint32_t                tail;
+	uint32_t                bits;
+	uint32_t                mask;
+} __packed;
+
+struct objpool_head;
+
+/*
+ * caller-specified callback for initial object setup; called only once
+ * per object, just after the object's memory is allocated
+ */
+typedef int (*objpool_init_obj_cb)(void *obj, void *context);
+
+/* caller-specified cleanup callback for objpool destruction */
+typedef int (*objpool_fini_cb)(struct objpool_head *head, void *context);
+
+/**
+ * struct objpool_head - object pooling metadata
+ * @obj_size:   object & element size
+ * @nr_objs:    total objs (to be pre-allocated)
+ * @nr_cpus:    nr_cpu_ids
+ * @capacity:   max objects per cpuslot
+ * @gfp:        gfp flags for kmalloc & vmalloc
+ * @ref:        refcount for objpool
+ * @flags:      flags for objpool management
+ * @cpu_slots:  array of percpu slots
+ * @slot_sizes:	size in bytes of slots
+ * @release:    resource cleanup callback
+ * @context:    caller-provided context
+ */
+struct objpool_head {
+	int                     obj_size;
+	int                     nr_objs;
+	int                     nr_cpus;
+	int                     capacity;
+	gfp_t                   gfp;
+	refcount_t              ref;
+	unsigned long           flags;
+	struct objpool_slot   **cpu_slots;
+	int                    *slot_sizes;
+	objpool_fini_cb         release;
+	void                   *context;
+};
+
+#define OBJPOOL_FROM_VMALLOC	(0x800000000)	/* objpool allocated from vmalloc area */
+#define OBJPOOL_HAVE_OBJECTS	(0x400000000)	/* objects allocated along with objpool */
+
+/**
+ * objpool_init() - initialize objpool and pre-allocated objects
+ * @head:    the object pool to be initialized, declared by caller
+ * @nr_objs: total objects to be pre-allocated by this object pool
+ * @object_size: size of an object (should be > 0)
+ * @gfp:     flags for memory allocation (via kmalloc or vmalloc)
+ * @context: user context for object initialization callback
+ * @objinit: object initialization callback for extra setup
+ * @release: cleanup callback for extra cleanup task
+ *
+ * return value: 0 for success, otherwise error code
+ *
+ * All pre-allocated objects are zeroed after memory allocation. The
+ * caller can do extra initialization in the objinit callback, which
+ * is invoked just after slot allocation and only once per object.
+ * From then on the objpool never touches the contents of the objects,
+ * so it's the caller's duty to re-initialize an object after each pop
+ * (object allocation) or to clear it before each push (object
+ * reclamation).
+ */
+int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
+		 gfp_t gfp, void *context, objpool_init_obj_cb objinit,
+		 objpool_fini_cb release);
+
+/**
+ * objpool_pop() - allocate an object from objpool
+ * @head: object pool
+ *
+ * return value: object ptr or NULL if failed
+ */
+void *objpool_pop(struct objpool_head *head);
+
+/**
+ * objpool_push() - reclaim the object and return back to objpool
+ * @obj:  object ptr to be pushed to objpool
+ * @head: object pool
+ *
+ * return: 0 or error code (it fails only if the user pushes the same
+ * object multiple times or pushes objects not belonging to the objpool)
+ */
+int objpool_push(void *obj, struct objpool_head *head);
+
+/**
+ * objpool_drop() - discard the object and deref objpool
+ * @obj:  object ptr to be discarded
+ * @head: object pool
+ *
+ * return: 0 if objpool was released or error code
+ */
+int objpool_drop(void *obj, struct objpool_head *head);
+
+/**
+ * objpool_free() - forcibly release the objpool (all objects will be freed)
+ * @head: object pool to be released
+ */
+void objpool_free(struct objpool_head *head);
+
+/**
+ * objpool_fini() - deref object pool (also releasing unused objects)
+ * @head: object pool to be dereferenced
+ */
+void objpool_fini(struct objpool_head *head);
+
+#endif /* _LINUX_OBJPOOL_H */
diff --git a/lib/Makefile b/lib/Makefile
index 1ffae65bb7ee..7a84c922d9ff 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -34,7 +34,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
 	 earlycpio.o seq_buf.o siphash.o dec_and_lock.o \
 	 nmi_backtrace.o win_minmax.o memcat_p.o \
-	 buildid.o
+	 buildid.o objpool.o
 
 lib-$(CONFIG_PRINTK) += dump_stack.o
 lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/objpool.c b/lib/objpool.c
new file mode 100644
index 000000000000..22e752371820
--- /dev/null
+++ b/lib/objpool.c
@@ -0,0 +1,338 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/objpool.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/atomic.h>
+#include <linux/prefetch.h>
+#include <linux/irqflags.h>
+#include <linux/cpumask.h>
+#include <linux/log2.h>
+
+/*
+ * objpool: ring-array based lockless MPMC/FIFO queues
+ *
+ * Copyright: wuqiang.matt@bytedance.com
+ */
+
+#define SLOT_AGES(s) ((uint32_t *)((char *)(s) + sizeof(struct objpool_slot)))
+#define SLOT_ENTS(s) ((void **)((char *)(s) + sizeof(struct objpool_slot) + \
+			(sizeof(uint32_t) << (s)->bits)))
+#define SLOT_OBJS(s) ((void *)((char *)(s) + sizeof(struct objpool_slot) + \
+			((sizeof(uint32_t) + sizeof(void *)) << (s)->bits)))
+#define SLOT_CORE(n) cpumask_nth((n) % num_possible_cpus(), cpu_possible_mask)
+
+/* compute the suitable num of objects to be managed per slot */
+static int objpool_nobjs(int size)
+{
+	return rounddown_pow_of_two((size - sizeof(struct objpool_slot)) /
+			(sizeof(uint32_t) + sizeof(void *)));
+}
+
+/* allocate and initialize percpu slots */
+static int
+objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
+			void *context, objpool_init_obj_cb objinit)
+{
+	int i, j, n, size, objsz, cpu = 0, nents = head->capacity;
+
+	/* align object size to sizeof(void *) */
+	objsz = ALIGN(head->obj_size, sizeof(void *));
+	/* shall we allocate objects along with the percpu slots */
+	if (objsz)
+		head->flags |= OBJPOOL_HAVE_OBJECTS;
+
+	/* vmalloc is used by default to allocate the percpu slots */
+	if (!(head->gfp & GFP_ATOMIC))
+		head->flags |= OBJPOOL_FROM_VMALLOC;
+
+	for (i = 0; i < head->nr_cpus; i++) {
+		struct objpool_slot *os;
+
+		/* skip the cpus which could never be present */
+		if (!cpu_possible(i))
+			continue;
+
+		/* compute how many objects to be managed by this slot */
+		n = nobjs / num_possible_cpus();
+		if (cpu < (nobjs % num_possible_cpus()))
+			n++;
+		size = sizeof(struct objpool_slot) + sizeof(void *) * nents +
+		       sizeof(uint32_t) * nents + objsz * n;
+
+		/*
+		 * here we allocate percpu-slot & objects together in a single
+		 * allocation, taking advantage of warm caches and TLB hits as
+		 * vmalloc always aligns the request size to pages
+		 */
+		if (head->flags & OBJPOOL_FROM_VMALLOC)
+			os = __vmalloc_node(size, sizeof(void *), head->gfp,
+				cpu_to_node(i), __builtin_return_address(0));
+		else
+			os = kmalloc_node(size, head->gfp, cpu_to_node(i));
+		if (!os)
+			return -ENOMEM;
+
+		/* initialize the i-th percpu slot */
+		memset(os, 0, size);
+		os->bits = ilog2(head->capacity);
+		os->mask = head->capacity - 1;
+		head->cpu_slots[i] = os;
+		head->slot_sizes[i] = size;
+		cpu = cpu + 1;
+
+		/*
+		 * Manually set head & tail to avoid a possible conflict:
+		 * we assume the head item is ready for retrieval iff head
+		 * equals ages[head & mask]. But ages[] is initialized to 0,
+		 * so from a pop() caller's view the first (0th) item would
+		 * always appear ready, while in reality the corresponding
+		 * push() might be stalled just before its final ages[]
+		 * update, and the inserted item would be lost forever.
+		 */
+		os->head = os->tail = head->capacity;
+
+		if (!objsz)
+			continue;
+
+		for (j = 0; j < n; j++) {
+			uint32_t *ages = SLOT_AGES(os);
+			void **ents = SLOT_ENTS(os);
+			void *obj = SLOT_OBJS(os) + j * objsz;
+			uint32_t ie = os->tail & os->mask;
+
+			/* perform object initialization */
+			if (objinit) {
+				int rc = objinit(obj, context);
+				if (rc)
+					return rc;
+			}
+
+			/* add obj into the ring array */
+			ents[ie] = obj;
+			ages[ie] = os->tail;
+			os->tail++;
+			head->nr_objs++;
+		}
+	}
+
+	return 0;
+}
+
+/* cleanup all percpu slots of the object pool */
+static void objpool_fini_percpu_slots(struct objpool_head *head)
+{
+	int i;
+
+	if (!head->cpu_slots)
+		return;
+
+	for (i = 0; i < head->nr_cpus; i++) {
+		if (!head->cpu_slots[i])
+			continue;
+		if (head->flags & OBJPOOL_FROM_VMALLOC)
+			vfree(head->cpu_slots[i]);
+		else
+			kfree(head->cpu_slots[i]);
+	}
+	kfree(head->cpu_slots);
+	head->cpu_slots = NULL;
+	head->slot_sizes = NULL;
+}
+
+/* initialize object pool and pre-allocate objects */
+int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
+		gfp_t gfp, void *context, objpool_init_obj_cb objinit,
+		objpool_fini_cb release)
+{
+	int nents, rc;
+
+	/* check input parameters */
+	if (nr_objs <= 0 || object_size <= 0)
+		return -EINVAL;
+
+	/* calculate percpu slot size (rounded to pow of 2) */
+	nents = max_t(int, roundup_pow_of_two(nr_objs),
+			objpool_nobjs(L1_CACHE_BYTES));
+
+	/* initialize objpool head */
+	memset(head, 0, sizeof(struct objpool_head));
+	head->nr_cpus = nr_cpu_ids;
+	head->obj_size = object_size;
+	head->capacity = nents;
+	head->gfp = gfp & ~__GFP_ZERO;
+	head->context = context;
+	head->release = release;
+
+	/* allocate array for percpu slots */
+	head->cpu_slots = kzalloc(head->nr_cpus * sizeof(void *) +
+			       head->nr_cpus * sizeof(int), head->gfp);
+	if (!head->cpu_slots)
+		return -ENOMEM;
+	head->slot_sizes = (int *)&head->cpu_slots[head->nr_cpus];
+
+	/* initialize per-cpu slots */
+	rc = objpool_init_percpu_slots(head, nr_objs, context, objinit);
+	if (rc)
+		objpool_fini_percpu_slots(head);
+	else
+		refcount_set(&head->ref, nr_objs + 1);
+
+	return rc;
+}
+EXPORT_SYMBOL_GPL(objpool_init);
+
+/* adding object to slot, abort if the slot was already full */
+static inline int objpool_try_add_slot(void *obj, struct objpool_slot *os)
+{
+	uint32_t *ages = SLOT_AGES(os);
+	void **ents = SLOT_ENTS(os);
+	uint32_t head, tail;
+
+	do {
+		/* perform memory loading for both head and tail */
+		head = READ_ONCE(os->head);
+		tail = READ_ONCE(os->tail);
+		/* just abort if slot is full */
+		if (tail - head > os->mask)
+			return -ENOENT;
+		/* try to extend tail by 1 using CAS to avoid races */
+		if (try_cmpxchg_acquire(&os->tail, &tail, tail + 1))
+			break;
+	} while (1);
+
+	/* the tail-th of slot is reserved for the given obj */
+	WRITE_ONCE(ents[tail & os->mask], obj);
+	/* update epoch id to make this object available for pop() */
+	smp_store_release(&ages[tail & os->mask], tail);
+	return 0;
+}
+
+/* reclaim an object to object pool */
+int objpool_push(void *obj, struct objpool_head *oh)
+{
+	unsigned long flags;
+	int cpu, rc;
+
+	/* disable local irq to avoid preemption & interruption */
+	raw_local_irq_save(flags);
+	cpu = raw_smp_processor_id();
+	do {
+		rc = objpool_try_add_slot(obj, oh->cpu_slots[cpu]);
+		if (!rc)
+			break;
+		cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
+	} while (1);
+	raw_local_irq_restore(flags);
+
+	return rc;
+}
+EXPORT_SYMBOL_GPL(objpool_push);
+
+/* drop the allocated object, rather reclaim it to objpool */
+int objpool_drop(void *obj, struct objpool_head *head)
+{
+	if (!obj || !head)
+		return -EINVAL;
+
+	if (refcount_dec_and_test(&head->ref)) {
+		objpool_free(head);
+		return 0;
+	}
+
+	return -EAGAIN;
+}
+EXPORT_SYMBOL_GPL(objpool_drop);
+
+/* try to retrieve object from slot */
+static inline void *objpool_try_get_slot(struct objpool_slot *os)
+{
+	uint32_t *ages = SLOT_AGES(os);
+	void **ents = SLOT_ENTS(os);
+	/* do memory load of head to local head */
+	uint32_t head = smp_load_acquire(&os->head);
+
+	/* loop if slot isn't empty */
+	while (head != READ_ONCE(os->tail)) {
+		uint32_t id = head & os->mask, prev = head;
+
+		/* do prefetching of object ents */
+		prefetch(&ents[id]);
+
+		/* check whether this item was ready for retrieval */
+		if (smp_load_acquire(&ages[id]) == head) {
+			/* node must have been updated by push() */
+			void *node = READ_ONCE(ents[id]);
+			/* commit and move forward head of the slot */
+			if (try_cmpxchg_release(&os->head, &head, head + 1))
+				return node;
+			/* head was already updated by others */
+		}
+
+		/* re-load head from memory and continue trying */
+		head = READ_ONCE(os->head);
+		/*
+		 * head stays unchanged, so most likely there's an ongoing
+		 * push() on another cpu that has reserved the slot but not
+		 * yet updated ages[] to mark its completion
+		 */
+		if (head == prev)
+			break;
+	}
+
+	return NULL;
+}
+
+/* allocate an object from object pool */
+void *objpool_pop(struct objpool_head *head)
+{
+	unsigned long flags;
+	int i, cpu;
+	void *obj = NULL;
+
+	/* disable local irq to avoid preemption & interruption */
+	raw_local_irq_save(flags);
+
+	cpu = raw_smp_processor_id();
+	for (i = 0; i < num_possible_cpus(); i++) {
+		obj = objpool_try_get_slot(head->cpu_slots[cpu]);
+		if (obj)
+			break;
+		cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
+	}
+	raw_local_irq_restore(flags);
+
+	return obj;
+}
+EXPORT_SYMBOL_GPL(objpool_pop);
+
+/* forcibly release the whole objpool */
+void objpool_free(struct objpool_head *head)
+{
+	if (!head->cpu_slots)
+		return;
+
+	/* release percpu slots */
+	objpool_fini_percpu_slots(head);
+
+	/* call user's cleanup callback if provided */
+	if (head->release)
+		head->release(head, head->context);
+}
+EXPORT_SYMBOL_GPL(objpool_free);
+
+/* drop unused objects and deref the objpool for releasing */
+void objpool_fini(struct objpool_head *head)
+{
+	void *nod;
+
+	do {
+		/* grab object from objpool and drop it */
+		nod = objpool_pop(head);
+
+		/* drop the extra ref of objpool */
+		if (refcount_dec_and_test(&head->ref))
+			objpool_free(head);
+	} while (nod);
+}
+EXPORT_SYMBOL_GPL(objpool_fini);
-- 
2.40.1



* [PATCH v9 2/5] lib: objpool test module added
  2023-09-05  1:52 [PATCH v9 0/5] lib,kprobes: kretprobe scalability improvement wuqiang.matt
  2023-09-05  1:52 ` [PATCH v9 1/5] lib: objpool added: ring-array based lockless MPMC wuqiang.matt
@ 2023-09-05  1:52 ` wuqiang.matt
  2023-09-05  1:52 ` [PATCH v9 3/5] kprobes: kretprobe scalability improvement with objpool wuqiang.matt
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 25+ messages in thread
From: wuqiang.matt @ 2023-09-05  1:52 UTC (permalink / raw)
  To: linux-trace-kernel, mhiramat, davem, anil.s.keshavamurthy,
	naveen.n.rao, rostedt, peterz, akpm, sander, ebiggers,
	dan.j.williams, jpoimboe
  Cc: linux-kernel, lkp, mattwu, wuqiang.matt

The test_objpool module runs several testcases for objpool stress and
performance evaluation. Each testcase involves all available cpu cores
to create a situation of high parallelism and high contention.

As of now there are 5 groups, each with 2 testcases (10 in total):

1) group 1: synchronous mode
   The objpool is managed synchronously, that is, all objects are
   reclaimed before objpool finalization and the objpool owner makes
   sure of it. All threads on different cores run at the same pace
2) group 2: synchronous mode + hrtimer
   This case has 2 consumers: normal threads and hrtimer softirqs
3) group 3: synchronous + overrun mode
   This group is mainly for performance evaluation of the miss cases
   that occur when fewer objects are pre-allocated than requested
4) group 4: asynchronous mode
   This case is an emulation of kretprobe, with a refcount used to
   control the objpool lifecycle (see the sketch after this list)
5) group 5: asynchronous mode with hrtimer
   hrtimer softirq is introduced to stress async objpool operations
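
The refcount-driven lifecycle used by groups 4/5 is roughly the sketch
below (the helper names are illustrative, not symbols from this module;
they mirror ot_nod_recycle() and the rcu finalization path):

	#include <linux/objpool.h>

	/* hot path: recycle while running, drop once shutdown has begun */
	static void demo_recycle(void *obj, struct objpool_head *pool, int stop)
	{
		if (!stop)
			objpool_push(obj, pool);	/* back to pool for reuse */
		else
			objpool_drop(obj, pool);	/* discard obj, deref pool */
	}

	/* teardown path: drop the pool's own reference; the backing memory
	 * is freed only after the last outstanding object is dropped too */
	static void demo_teardown(struct objpool_head *pool)
	{
		objpool_fini(pool);
	}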

Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com>
---
 lib/Kconfig.debug  |  11 +
 lib/Makefile       |   2 +
 lib/test_objpool.c | 689 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 702 insertions(+)
 create mode 100644 lib/test_objpool.c

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index d6798513a8c2..6598604cf6c8 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2931,6 +2931,17 @@ config TEST_CLOCKSOURCE_WATCHDOG
 
 	  If unsure, say N.
 
+config TEST_OBJPOOL
+	tristate "Test module for correctness and stress of objpool"
+	default n
+	depends on m && DEBUG_KERNEL
+	help
+	  This builds the "test_objpool" module that should be used for
+	  correctness verification and concurrent testing of object
+	  allocation and reclamation.
+
+	  If unsure, say N.
+
 endif # RUNTIME_TESTING_MENU
 
 config ARCH_USE_MEMTEST
diff --git a/lib/Makefile b/lib/Makefile
index 7a84c922d9ff..19b936f2af1c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -106,6 +106,8 @@ obj-$(CONFIG_KPROBES_SANITY_TEST) += test_kprobes.o
 obj-$(CONFIG_TEST_REF_TRACKER) += test_ref_tracker.o
 CFLAGS_test_fprobe.o += $(CC_FLAGS_FTRACE)
 obj-$(CONFIG_FPROBE_SANITY_TEST) += test_fprobe.o
+obj-$(CONFIG_TEST_OBJPOOL) += test_objpool.o
+
 #
 # CFLAGS for compiling floating point code inside the kernel. x86/Makefile turns
 # off the generation of FPU/SSE* instructions for kernel proper but FPU_FLAGS
diff --git a/lib/test_objpool.c b/lib/test_objpool.c
new file mode 100644
index 000000000000..d329472f8ab6
--- /dev/null
+++ b/lib/test_objpool.c
@@ -0,0 +1,689 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Test module for lockless object pool
+ *
+ * Copyright: wuqiang.matt@bytedance.com
+ */
+
+#include <linux/version.h>
+#include <linux/errno.h>
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/sched.h>
+#include <linux/cpumask.h>
+#include <linux/completion.h>
+#include <linux/kthread.h>
+#include <linux/cpu.h>
+#include <linux/cpuset.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/delay.h>
+#include <linux/hrtimer.h>
+#include <linux/interrupt.h>
+#include <linux/objpool.h>
+
+#define OT_NR_MAX_BULK (16)
+
+/* memory usage */
+struct ot_mem_stat {
+	atomic_long_t alloc;
+	atomic_long_t free;
+};
+
+/* object allocation results */
+struct ot_obj_stat {
+	unsigned long nhits;
+	unsigned long nmiss;
+};
+
+/* control & results per testcase */
+struct ot_data {
+	struct rw_semaphore start;
+	struct completion wait;
+	struct completion rcu;
+	atomic_t nthreads ____cacheline_aligned_in_smp;
+	atomic_t stop ____cacheline_aligned_in_smp;
+	struct ot_mem_stat kmalloc;
+	struct ot_mem_stat vmalloc;
+	struct ot_obj_stat objects;
+	u64    duration;
+};
+
+/* testcase */
+struct ot_test {
+	int async; /* synchronous or asynchronous */
+	int mode; /* only mode 0 supported */
+	int objsz; /* object size */
+	int duration; /* ms */
+	int delay; /* ms */
+	int bulk_normal;
+	int bulk_irq;
+	unsigned long hrtimer; /* ms */
+	const char *name;
+	struct ot_data data;
+};
+
+/* per-cpu worker */
+struct ot_item {
+	struct objpool_head *pool; /* pool head */
+	struct ot_test *test; /* test parameters */
+
+	void (*worker)(struct ot_item *item, int irq);
+
+	/* hrtimer control */
+	ktime_t hrtcycle;
+	struct hrtimer hrtimer;
+
+	int bulk[2]; /* for thread and irq */
+	int delay;
+	u32 niters;
+
+	/* summary per thread */
+	struct ot_obj_stat stat[2]; /* thread and irq */
+	u64 duration;
+};
+
+/*
+ * memory leakage checking
+ */
+
+static void *ot_kzalloc(struct ot_test *test, long size)
+{
+	void *ptr = kzalloc(size, GFP_KERNEL);
+
+	if (ptr)
+		atomic_long_add(size, &test->data.kmalloc.alloc);
+	return ptr;
+}
+
+static void ot_kfree(struct ot_test *test, void *ptr, long size)
+{
+	if (!ptr)
+		return;
+	atomic_long_add(size, &test->data.kmalloc.free);
+	kfree(ptr);
+}
+
+static void ot_mem_report(struct ot_test *test)
+{
+	long alloc, free;
+
+	pr_info("memory allocation summary for %s\n", test->name);
+
+	alloc = atomic_long_read(&test->data.kmalloc.alloc);
+	free = atomic_long_read(&test->data.kmalloc.free);
+	pr_info("  kmalloc: %lu - %lu = %lu\n", alloc, free, alloc - free);
+
+	alloc = atomic_long_read(&test->data.vmalloc.alloc);
+	free = atomic_long_read(&test->data.vmalloc.free);
+	pr_info("  vmalloc: %lu - %lu = %lu\n", alloc, free, alloc - free);
+}
+
+/* user object instance */
+struct ot_node {
+	void *owner;
+	unsigned long data;
+	unsigned long refs;
+	unsigned long payload[32];
+};
+
+/* user objpool manager */
+struct ot_context {
+	struct objpool_head pool; /* objpool head */
+	struct ot_test *test; /* test parameters */
+	void *ptr; /* user pool buffer */
+	unsigned long size; /* buffer size */
+	struct rcu_head rcu;
+};
+
+static DEFINE_PER_CPU(struct ot_item, ot_pcup_items);
+
+static int ot_init_data(struct ot_data *data)
+{
+	memset(data, 0, sizeof(*data));
+	init_rwsem(&data->start);
+	init_completion(&data->wait);
+	init_completion(&data->rcu);
+	atomic_set(&data->nthreads, 1);
+
+	return 0;
+}
+
+static int ot_init_node(void *nod, void *context)
+{
+	struct ot_context *sop = context;
+	struct ot_node *on = nod;
+
+	on->owner = &sop->pool;
+	return 0;
+}
+
+static enum hrtimer_restart ot_hrtimer_handler(struct hrtimer *hrt)
+{
+	struct ot_item *item = container_of(hrt, struct ot_item, hrtimer);
+	struct ot_test *test = item->test;
+
+	if (atomic_read_acquire(&test->data.stop))
+		return HRTIMER_NORESTART;
+
+	/* do bulk testing of object pop/push */
+	item->worker(item, 1);
+
+	hrtimer_forward(hrt, hrt->base->get_time(), item->hrtcycle);
+	return HRTIMER_RESTART;
+}
+
+static void ot_start_hrtimer(struct ot_item *item)
+{
+	if (!item->test->hrtimer)
+		return;
+	hrtimer_start(&item->hrtimer, item->hrtcycle, HRTIMER_MODE_REL);
+}
+
+static void ot_stop_hrtimer(struct ot_item *item)
+{
+	if (!item->test->hrtimer)
+		return;
+	hrtimer_cancel(&item->hrtimer);
+}
+
+static int ot_init_hrtimer(struct ot_item *item, unsigned long hrtimer)
+{
+	struct hrtimer *hrt = &item->hrtimer;
+
+	if (!hrtimer)
+		return -ENOENT;
+
+	item->hrtcycle = ktime_set(0, hrtimer * 1000000UL);
+	hrtimer_init(hrt, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrt->function = ot_hrtimer_handler;
+	return 0;
+}
+
+static int ot_init_cpu_item(struct ot_item *item,
+			struct ot_test *test,
+			struct objpool_head *pool,
+			void (*worker)(struct ot_item *, int))
+{
+	memset(item, 0, sizeof(*item));
+	item->pool = pool;
+	item->test = test;
+	item->worker = worker;
+
+	item->bulk[0] = test->bulk_normal;
+	item->bulk[1] = test->bulk_irq;
+	item->delay = test->delay;
+
+	/* initialize hrtimer */
+	ot_init_hrtimer(item, item->test->hrtimer);
+	return 0;
+}
+
+static int ot_thread_worker(void *arg)
+{
+	struct ot_item *item = arg;
+	struct ot_test *test = item->test;
+	ktime_t start;
+
+	atomic_inc(&test->data.nthreads);
+	down_read(&test->data.start);
+	up_read(&test->data.start);
+	start = ktime_get();
+	ot_start_hrtimer(item);
+	do {
+		if (atomic_read_acquire(&test->data.stop))
+			break;
+		/* do bulk testing of object pop/push */
+		item->worker(item, 0);
+	} while (!kthread_should_stop());
+	ot_stop_hrtimer(item);
+	item->duration = (u64) ktime_us_delta(ktime_get(), start);
+	if (atomic_dec_and_test(&test->data.nthreads))
+		complete(&test->data.wait);
+
+	return 0;
+}
+
+static void ot_perf_report(struct ot_test *test, u64 duration)
+{
+	struct ot_obj_stat total, normal = {0}, irq = {0};
+	int cpu, nthreads = 0;
+
+	pr_info("\n");
+	pr_info("Testing summary for %s\n", test->name);
+
+	for_each_possible_cpu(cpu) {
+		struct ot_item *item = per_cpu_ptr(&ot_pcup_items, cpu);
+		if (!item->duration)
+			continue;
+		normal.nhits += item->stat[0].nhits;
+		normal.nmiss += item->stat[0].nmiss;
+		irq.nhits += item->stat[1].nhits;
+		irq.nmiss += item->stat[1].nmiss;
+		pr_info("CPU: %d  duration: %lluus\n", cpu, item->duration);
+		pr_info("\tthread:\t%16lu hits \t%16lu miss\n",
+			item->stat[0].nhits, item->stat[0].nmiss);
+		pr_info("\tirq:   \t%16lu hits \t%16lu miss\n",
+			item->stat[1].nhits, item->stat[1].nmiss);
+		pr_info("\ttotal: \t%16lu hits \t%16lu miss\n",
+			item->stat[0].nhits + item->stat[1].nhits,
+			item->stat[0].nmiss + item->stat[1].nmiss);
+		nthreads++;
+	}
+
+	total.nhits = normal.nhits + irq.nhits;
+	total.nmiss = normal.nmiss + irq.nmiss;
+
+	pr_info("ALL: \tnthreads: %d  duration: %lluus\n", nthreads, duration);
+	pr_info("SUM: \t%16lu hits \t%16lu miss\n",
+		total.nhits, total.nmiss);
+
+	test->data.objects = total;
+	test->data.duration = duration;
+}
+
+/*
+ * synchronous test cases for objpool manipulation
+ */
+
+/* objpool manipulation for synchronous mode (percpu objpool) */
+static struct ot_context *ot_init_sync_m0(struct ot_test *test)
+{
+	struct ot_context *sop = NULL;
+	int max = num_possible_cpus() << 3;
+
+	sop = (struct ot_context *)ot_kzalloc(test, sizeof(*sop));
+	if (!sop)
+		return NULL;
+	sop->test = test;
+
+	if (objpool_init(&sop->pool, max, test->objsz,
+			GFP_KERNEL, sop, ot_init_node, NULL)) {
+		ot_kfree(test, sop, sizeof(*sop));
+		return NULL;
+	}
+	WARN_ON(max != sop->pool.nr_objs);
+
+	return sop;
+}
+
+static void ot_fini_sync(struct ot_context *sop)
+{
+	objpool_fini(&sop->pool);
+	ot_kfree(sop->test, sop, sizeof(*sop));
+}
+
+struct {
+	struct ot_context * (*init)(struct ot_test *oc);
+	void (*fini)(struct ot_context *sop);
+} g_ot_sync_ops[] = {
+	{.init = ot_init_sync_m0, .fini = ot_fini_sync},
+};
+
+/*
+ * synchronous test cases: performance mode
+ */
+
+static void ot_bulk_sync(struct ot_item *item, int irq)
+{
+	struct ot_node *nods[OT_NR_MAX_BULK];
+	int i;
+
+	for (i = 0; i < item->bulk[irq]; i++)
+		nods[i] = objpool_pop(item->pool);
+
+	if (!irq && (item->delay || !(++(item->niters) & 0x7FFF)))
+		msleep(item->delay);
+
+	while (i-- > 0) {
+		struct ot_node *on = nods[i];
+		if (on) {
+			on->refs++;
+			objpool_push(on, item->pool);
+			item->stat[irq].nhits++;
+		} else {
+			item->stat[irq].nmiss++;
+		}
+	}
+}
+
+static int ot_start_sync(struct ot_test *test)
+{
+	struct ot_context *sop;
+	ktime_t start;
+	u64 duration;
+	unsigned long timeout;
+	int cpu;
+
+	/* initialize objpool for synchronous testcase */
+	sop = g_ot_sync_ops[test->mode].init(test);
+	if (!sop)
+		return -ENOMEM;
+
+	/* grab rwsem to block testing threads */
+	down_write(&test->data.start);
+
+	for_each_possible_cpu(cpu) {
+		struct ot_item *item = per_cpu_ptr(&ot_pcup_items, cpu);
+		struct task_struct *work;
+
+		ot_init_cpu_item(item, test, &sop->pool, ot_bulk_sync);
+
+		/* skip offline cpus */
+		if (!cpu_online(cpu))
+			continue;
+
+		work = kthread_create_on_node(ot_thread_worker, item,
+				cpu_to_node(cpu), "ot_worker_%d", cpu);
+		if (IS_ERR(work)) {
+			pr_err("failed to create thread for cpu %d\n", cpu);
+		} else {
+			kthread_bind(work, cpu);
+			wake_up_process(work);
+		}
+	}
+
+	/* wait a while to make sure all threads are waiting at the start line */
+	msleep(20);
+
+	/* in case no threads were created: memory insufficient ? */
+	if (atomic_dec_and_test(&test->data.nthreads))
+		complete(&test->data.wait);
+
+	// sched_set_fifo_low(current);
+
+	/* start objpool testing threads */
+	start = ktime_get();
+	up_write(&test->data.start);
+
+	/* yield the cpu to worker threads for 'duration' ms */
+	timeout = msecs_to_jiffies(test->duration);
+	schedule_timeout_interruptible(timeout);
+
+	/* tell worker threads to quit */
+	atomic_set_release(&test->data.stop, 1);
+
+	/* wait for all worker threads to finish and quit */
+	wait_for_completion(&test->data.wait);
+	duration = (u64) ktime_us_delta(ktime_get(), start);
+
+	/* cleanup objpool */
+	g_ot_sync_ops[test->mode].fini(sop);
+
+	/* report testing summary and performance results */
+	ot_perf_report(test, duration);
+
+	/* report memory allocation summary */
+	ot_mem_report(test);
+
+	return 0;
+}
+
+/*
+ * asynchronous test cases: pool lifecycle controlled by refcount
+ */
+
+static void ot_fini_async_rcu(struct rcu_head *rcu)
+{
+	struct ot_context *sop = container_of(rcu, struct ot_context, rcu);
+	struct ot_test *test = sop->test;
+
+	/* here all cpus are aware of the stop event: test->data.stop = 1 */
+	WARN_ON(!atomic_read_acquire(&test->data.stop));
+
+	objpool_fini(&sop->pool);
+	complete(&test->data.rcu);
+}
+
+static void ot_fini_async(struct ot_context *sop)
+{
+	/* make sure the stop event is acknowledged by all cores */
+	call_rcu(&sop->rcu, ot_fini_async_rcu);
+}
+
+static int ot_objpool_release(struct objpool_head *head, void *context)
+{
+	struct ot_context *sop = context;
+
+	WARN_ON(!head || !sop || head != &sop->pool);
+
+	/* do context cleaning if needed */
+	if (sop)
+		ot_kfree(sop->test, sop, sizeof(*sop));
+
+	return 0;
+}
+
+static struct ot_context *ot_init_async_m0(struct ot_test *test)
+{
+	struct ot_context *sop = NULL;
+	int max = num_possible_cpus() << 3;
+
+	sop = (struct ot_context *)ot_kzalloc(test, sizeof(*sop));
+	if (!sop)
+		return NULL;
+	sop->test = test;
+
+	if (objpool_init(&sop->pool, max, test->objsz, GFP_KERNEL,
+			sop, ot_init_node, ot_objpool_release)) {
+		ot_kfree(test, sop, sizeof(*sop));
+		return NULL;
+	}
+	WARN_ON(max != sop->pool.nr_objs);
+
+	return sop;
+}
+
+struct {
+	struct ot_context * (*init)(struct ot_test *oc);
+	void (*fini)(struct ot_context *sop);
+} g_ot_async_ops[] = {
+	{.init = ot_init_async_m0, .fini = ot_fini_async},
+};
+
+static void ot_nod_recycle(struct ot_node *on, struct objpool_head *pool,
+			int release)
+{
+	struct ot_context *sop;
+
+	on->refs++;
+
+	if (!release) {
+		/* push the object back to the objpool for reuse */
+		objpool_push(on, pool);
+		return;
+	}
+
+	sop = container_of(pool, struct ot_context, pool);
+	WARN_ON(sop != pool->context);
+
+	/* unref objpool with nod removed forever */
+	objpool_drop(on, pool);
+}
+
+static void ot_bulk_async(struct ot_item *item, int irq)
+{
+	struct ot_test *test = item->test;
+	struct ot_node *nods[OT_NR_MAX_BULK];
+	int i, stop;
+
+	for (i = 0; i < item->bulk[irq]; i++)
+		nods[i] = objpool_pop(item->pool);
+
+	if (!irq) {
+		if (item->delay || !(++(item->niters) & 0x7FFF))
+			msleep(item->delay);
+		get_cpu();
+	}
+
+	stop = atomic_read_acquire(&test->data.stop);
+
+	/* drop all objects and deref objpool */
+	while (i-- > 0) {
+		struct ot_node *on = nods[i];
+
+		if (on) {
+			on->refs++;
+			ot_nod_recycle(on, item->pool, stop);
+			item->stat[irq].nhits++;
+		} else {
+			item->stat[irq].nmiss++;
+		}
+	}
+
+	if (!irq)
+		put_cpu();
+}
+
+static int ot_start_async(struct ot_test *test)
+{
+	struct ot_context *sop;
+	ktime_t start;
+	u64 duration;
+	unsigned long timeout;
+	int cpu;
+
+	/* initialize objpool for asynchronous testcase */
+	sop = g_ot_async_ops[test->mode].init(test);
+	if (!sop)
+		return -ENOMEM;
+
+	/* grab rwsem to block testing threads */
+	down_write(&test->data.start);
+
+	for_each_possible_cpu(cpu) {
+		struct ot_item *item = per_cpu_ptr(&ot_pcup_items, cpu);
+		struct task_struct *work;
+
+		ot_init_cpu_item(item, test, &sop->pool, ot_bulk_async);
+
+		/* skip offline cpus */
+		if (!cpu_online(cpu))
+			continue;
+
+		work = kthread_create_on_node(ot_thread_worker, item,
+				cpu_to_node(cpu), "ot_worker_%d", cpu);
+		if (IS_ERR(work)) {
+			pr_err("failed to create thread for cpu %d\n", cpu);
+		} else {
+			kthread_bind(work, cpu);
+			wake_up_process(work);
+		}
+	}
+
+	/* wait a while to make sure all threads are waiting at the start line */
+	msleep(20);
+
+	/* in case no threads were created: memory insufficient ? */
+	if (atomic_dec_and_test(&test->data.nthreads))
+		complete(&test->data.wait);
+
+	/* start objpool testing threads */
+	start = ktime_get();
+	up_write(&test->data.start);
+
+	/* yield the cpu to worker threads for 'duration' ms */
+	timeout = msecs_to_jiffies(test->duration);
+	schedule_timeout_interruptible(timeout);
+
+	/* tell worker threads to quit */
+	atomic_set_release(&test->data.stop, 1);
+
+	/* do async-finalization */
+	g_ot_async_ops[test->mode].fini(sop);
+
+	/* wait for all worker threads to finish and quit */
+	wait_for_completion(&test->data.wait);
+	duration = (u64) ktime_us_delta(ktime_get(), start);
+
+	/* assure rcu callback is triggered */
+	wait_for_completion(&test->data.rcu);
+
+	/*
+	 * now we are sure that objpool is finalized either
+	 * by rcu callback or by worker threads
+	 */
+
+	/* report testing summary and performance results */
+	ot_perf_report(test, duration);
+
+	/* report memory allocation summary */
+	ot_mem_report(test);
+
+	return 0;
+}
+
+/*
+ * predefined testing cases:
+ *   synchronous case / overrun case / async case
+ *
+ * async: synchronous or asynchronous testing
+ * mode: only mode 0 supported
+ * duration: int, total test time in ms
+ * delay: int, delay (in ms) between each iteration
+ * bulk_normal: int, repeat times for thread worker
+ * bulk_irq: int, repeat times for irq consumer
+ * hrtimer: unsigned long, hrtimer interval in ms
+ * name: char *, tag for current test ot_item
+ */
+
+#define NODE_COMPACT sizeof(struct ot_node)
+#define NODE_VMALLOC (512)
+
+struct ot_test g_testcases[] = {
+
+	/* sync & normal */
+	{0, 0, NODE_COMPACT, 1000, 0,  1,  0,  0, "sync: percpu objpool"},
+	{0, 0, NODE_VMALLOC, 1000, 0,  1,  0,  0, "sync: percpu objpool from vmalloc"},
+
+	/* sync & hrtimer */
+	{0, 0, NODE_COMPACT, 1000, 0,  1,  1,  4, "sync & hrtimer: percpu objpool"},
+	{0, 0, NODE_VMALLOC, 1000, 0,  1,  1,  4, "sync & hrtimer: percpu objpool from vmalloc"},
+
+	/* sync & overrun */
+	{0, 0, NODE_COMPACT, 1000, 0, 16,  0,  0, "sync overrun: percpu objpool"},
+	{0, 0, NODE_VMALLOC, 1000, 0, 16,  0,  0, "sync overrun: percpu objpool from vmalloc"},
+
+	/* async mode */
+	{1, 0, NODE_COMPACT, 1000, 0,  1,  0,  0, "async: percpu objpool"},
+	{1, 0, NODE_VMALLOC, 1000, 0,  1,  0,  0, "async: percpu objpool from vmalloc"},
+
+	/* async + hrtimer mode */
+	{1, 0, NODE_COMPACT, 1000, 0,  4,  4,  4, "async & hrtimer: percpu objpool"},
+	{1, 0, NODE_VMALLOC, 1000, 0,  4,  4,  4, "async & hrtimer: percpu objpool from vmalloc"},
+};
+
+static int __init ot_mod_init(void)
+{
+	int i;
+
+	/* perform testings */
+	for (i = 0; i < ARRAY_SIZE(g_testcases); i++) {
+		ot_init_data(&g_testcases[i].data);
+		if (g_testcases[i].async)
+			ot_start_async(&g_testcases[i]);
+		else
+			ot_start_sync(&g_testcases[i]);
+	}
+
+	/* show tests summary */
+	pr_info("\n");
+	pr_info("Summary of testcases:\n");
+	for (i = 0; i < ARRAY_SIZE(g_testcases); i++) {
+		pr_info("    duration: %lluus \thits: %10lu \tmiss: %10lu \t%s\n",
+			g_testcases[i].data.duration, g_testcases[i].data.objects.nhits,
+			g_testcases[i].data.objects.nmiss, g_testcases[i].name);
+	}
+
+	return -EAGAIN;
+}
+
+static void __exit ot_mod_exit(void)
+{
+}
+
+module_init(ot_mod_init);
+module_exit(ot_mod_exit);
+
+MODULE_LICENSE("GPL");
\ No newline at end of file
-- 
2.40.1



* [PATCH v9 3/5] kprobes: kretprobe scalability improvement with objpool
  2023-09-05  1:52 [PATCH v9 0/5] lib,kprobes: kretprobe scalability improvement wuqiang.matt
  2023-09-05  1:52 ` [PATCH v9 1/5] lib: objpool added: ring-array based lockless MPMC wuqiang.matt
  2023-09-05  1:52 ` [PATCH v9 2/5] lib: objpool test module added wuqiang.matt
@ 2023-09-05  1:52 ` wuqiang.matt
  2023-10-07  2:02   ` Masami Hiramatsu
  2023-09-05  1:52 ` [PATCH v9 4/5] kprobes: freelist.h removed wuqiang.matt
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: wuqiang.matt @ 2023-09-05  1:52 UTC (permalink / raw)
  To: linux-trace-kernel, mhiramat, davem, anil.s.keshavamurthy,
	naveen.n.rao, rostedt, peterz, akpm, sander, ebiggers,
	dan.j.williams, jpoimboe
  Cc: linux-kernel, lkp, mattwu, wuqiang.matt

kretprobe uses freelist to manage return instances, but freelist, as
a LIFO queue based on a singly linked list, scales badly and reduces
the overall throughput of kretprobed routines, especially in high
contention scenarios.

Here's a typical throughput test of sys_prctl (counts in 10 seconds,
measured with perf stat -a -I 10000 -e syscalls:sys_enter_prctl):

OS: Debian 10 X86_64, Linux 6.5rc7 with freelist
HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s

      1T       2T       4T       8T      16T      24T
24150045 29317964 15446741 12494489 18287272 18287272
     32T      48T      64T      72T      96T     128T
16200682  1620068 11645677 11269858 10470118  9931051

This patch introduces objpool to kretprobe and rethook, replacing the
original freelist and bringing near-linear scalability to kretprobed
routines. Kretprobe throughput tests show an improvement of up to
166.7x over the original freelist. Here's the comparison:

                  1T         2T         4T         8T        16T
native:     41186213   82336866  164250978  328662645  658810299
freelist:   24150045   29317964   15446741   12494489   18287272
objpool:    24663700   49410571   98544977  197725940  396294399
                 32T        48T        64T        96T       128T
native:   1330338351 1969957941 2512291791 1514690434 2671040914
freelist:   16200682   13737658   11645677   10470118    9931051
objpool:    78673470 1177354156 1514690434 1604472348 1655086824
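
(For reference, the 166.7x peak ratio corresponds to the 128-thread
column above: 1655086824 / 9931051 ~= 166.7.)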

Tests on 96-core ARM64 system output similarly, but with the biggest
ratio up to 454.8x:

OS: Debian 10 AARCH64, Linux 6.5rc7
HW: Kunpeng-920 96 cores/2 sockets/4 NUMA nodes, DDR4 2933 MT/s

                  1T         2T         4T         8T        16T
native:     30066096   13733813  126194076  257447289  505800181
freelist:   16152090   11064397   11124068    7215768    5663013
objpool:    13733813   27749031   56540679  112291770  223482778
                 24T        32T        48T        64T        96T
native:    763305277 1015925192 1521075123 2033009392 3021013752
freelist:    5015810    4602893    3766792    3382478    2945292
objpool:   334605663  448310646  675018951  903449904 1339693418
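
(Similarly, the quoted arm64 peak ratio corresponds to the 96-thread
column above: 1339693418 / 2945292, i.e. roughly 455x.)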

Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com>
---
 include/linux/kprobes.h | 11 ++---
 include/linux/rethook.h | 16 ++-----
 kernel/kprobes.c        | 93 +++++++++++++++++------------------------
 kernel/trace/fprobe.c   | 32 ++++++--------
 kernel/trace/rethook.c  | 90 ++++++++++++++++++---------------------
 5 files changed, 98 insertions(+), 144 deletions(-)

diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
index 85a64cb95d75..365eb092e9c4 100644
--- a/include/linux/kprobes.h
+++ b/include/linux/kprobes.h
@@ -26,8 +26,7 @@
 #include <linux/rcupdate.h>
 #include <linux/mutex.h>
 #include <linux/ftrace.h>
-#include <linux/refcount.h>
-#include <linux/freelist.h>
+#include <linux/objpool.h>
 #include <linux/rethook.h>
 #include <asm/kprobes.h>
 
@@ -141,7 +140,7 @@ static inline bool kprobe_ftrace(struct kprobe *p)
  */
 struct kretprobe_holder {
 	struct kretprobe	*rp;
-	refcount_t		ref;
+	struct objpool_head	pool;
 };
 
 struct kretprobe {
@@ -154,7 +153,6 @@ struct kretprobe {
 #ifdef CONFIG_KRETPROBE_ON_RETHOOK
 	struct rethook *rh;
 #else
-	struct freelist_head freelist;
 	struct kretprobe_holder *rph;
 #endif
 };
@@ -165,10 +163,7 @@ struct kretprobe_instance {
 #ifdef CONFIG_KRETPROBE_ON_RETHOOK
 	struct rethook_node node;
 #else
-	union {
-		struct freelist_node freelist;
-		struct rcu_head rcu;
-	};
+	struct rcu_head rcu;
 	struct llist_node llist;
 	struct kretprobe_holder *rph;
 	kprobe_opcode_t *ret_addr;
diff --git a/include/linux/rethook.h b/include/linux/rethook.h
index 26b6f3c81a76..ce69b2b7bc35 100644
--- a/include/linux/rethook.h
+++ b/include/linux/rethook.h
@@ -6,11 +6,10 @@
 #define _LINUX_RETHOOK_H
 
 #include <linux/compiler.h>
-#include <linux/freelist.h>
+#include <linux/objpool.h>
 #include <linux/kallsyms.h>
 #include <linux/llist.h>
 #include <linux/rcupdate.h>
-#include <linux/refcount.h>
 
 struct rethook_node;
 
@@ -30,14 +29,12 @@ typedef void (*rethook_handler_t) (struct rethook_node *, void *, unsigned long,
 struct rethook {
 	void			*data;
 	rethook_handler_t	handler;
-	struct freelist_head	pool;
-	refcount_t		ref;
+	struct objpool_head	pool;
 	struct rcu_head		rcu;
 };
 
 /**
  * struct rethook_node - The rethook shadow-stack entry node.
- * @freelist: The freelist, linked to struct rethook::pool.
  * @rcu: The rcu_head for deferred freeing.
  * @llist: The llist, linked to a struct task_struct::rethooks.
  * @rethook: The pointer to the struct rethook.
@@ -48,20 +45,16 @@ struct rethook {
  * on each entry of the shadow stack.
  */
 struct rethook_node {
-	union {
-		struct freelist_node freelist;
-		struct rcu_head      rcu;
-	};
+	struct rcu_head		rcu;
 	struct llist_node	llist;
 	struct rethook		*rethook;
 	unsigned long		ret_addr;
 	unsigned long		frame;
 };
 
-struct rethook *rethook_alloc(void *data, rethook_handler_t handler);
+struct rethook *rethook_alloc(void *data, rethook_handler_t handler, int size, int num);
 void rethook_stop(struct rethook *rh);
 void rethook_free(struct rethook *rh);
-void rethook_add_node(struct rethook *rh, struct rethook_node *node);
 struct rethook_node *rethook_try_get(struct rethook *rh);
 void rethook_recycle(struct rethook_node *node);
 void rethook_hook(struct rethook_node *node, struct pt_regs *regs, bool mcount);
@@ -98,4 +91,3 @@ void rethook_flush_task(struct task_struct *tk);
 #endif
 
 #endif
-
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index ca385b61d546..075a632e6c7c 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1877,13 +1877,27 @@ static struct notifier_block kprobe_exceptions_nb = {
 #ifdef CONFIG_KRETPROBES
 
 #if !defined(CONFIG_KRETPROBE_ON_RETHOOK)
+
+/* callbacks for objpool of kretprobe instances */
+static int kretprobe_init_inst(void *nod, void *context)
+{
+	struct kretprobe_instance *ri = nod;
+
+	ri->rph = context;
+	return 0;
+}
+static int kretprobe_fini_pool(struct objpool_head *head, void *context)
+{
+	kfree(context);
+	return 0;
+}
+
 static void free_rp_inst_rcu(struct rcu_head *head)
 {
 	struct kretprobe_instance *ri = container_of(head, struct kretprobe_instance, rcu);
+	struct kretprobe_holder *rph = ri->rph;
 
-	if (refcount_dec_and_test(&ri->rph->ref))
-		kfree(ri->rph);
-	kfree(ri);
+	objpool_drop(ri, &rph->pool);
 }
 NOKPROBE_SYMBOL(free_rp_inst_rcu);
 
@@ -1892,7 +1906,7 @@ static void recycle_rp_inst(struct kretprobe_instance *ri)
 	struct kretprobe *rp = get_kretprobe(ri);
 
 	if (likely(rp))
-		freelist_add(&ri->freelist, &rp->freelist);
+		objpool_push(ri, &rp->rph->pool);
 	else
 		call_rcu(&ri->rcu, free_rp_inst_rcu);
 }
@@ -1929,23 +1943,12 @@ NOKPROBE_SYMBOL(kprobe_flush_task);
 
 static inline void free_rp_inst(struct kretprobe *rp)
 {
-	struct kretprobe_instance *ri;
-	struct freelist_node *node;
-	int count = 0;
-
-	node = rp->freelist.head;
-	while (node) {
-		ri = container_of(node, struct kretprobe_instance, freelist);
-		node = node->next;
-
-		kfree(ri);
-		count++;
-	}
+	struct kretprobe_holder *rph = rp->rph;
 
-	if (refcount_sub_and_test(count, &rp->rph->ref)) {
-		kfree(rp->rph);
-		rp->rph = NULL;
-	}
+	if (!rph)
+		return;
+	rp->rph = NULL;
+	objpool_fini(&rph->pool);
 }
 
 /* This assumes the 'tsk' is the current task or the is not running. */
@@ -2087,19 +2090,17 @@ NOKPROBE_SYMBOL(__kretprobe_trampoline_handler)
 static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs)
 {
 	struct kretprobe *rp = container_of(p, struct kretprobe, kp);
+	struct kretprobe_holder *rph = rp->rph;
 	struct kretprobe_instance *ri;
-	struct freelist_node *fn;
 
-	fn = freelist_try_get(&rp->freelist);
-	if (!fn) {
+	ri = objpool_pop(&rph->pool);
+	if (!ri) {
 		rp->nmissed++;
 		return 0;
 	}
 
-	ri = container_of(fn, struct kretprobe_instance, freelist);
-
 	if (rp->entry_handler && rp->entry_handler(ri, regs)) {
-		freelist_add(&ri->freelist, &rp->freelist);
+		objpool_push(ri, &rph->pool);
 		return 0;
 	}
 
@@ -2193,7 +2194,6 @@ int kprobe_on_func_entry(kprobe_opcode_t *addr, const char *sym, unsigned long o
 int register_kretprobe(struct kretprobe *rp)
 {
 	int ret;
-	struct kretprobe_instance *inst;
 	int i;
 	void *addr;
 
@@ -2227,20 +2227,12 @@ int register_kretprobe(struct kretprobe *rp)
 		rp->maxactive = max_t(unsigned int, 10, 2*num_possible_cpus());
 
 #ifdef CONFIG_KRETPROBE_ON_RETHOOK
-	rp->rh = rethook_alloc((void *)rp, kretprobe_rethook_handler);
-	if (!rp->rh)
-		return -ENOMEM;
+	rp->rh = rethook_alloc((void *)rp, kretprobe_rethook_handler,
+				sizeof(struct kretprobe_instance) +
+				rp->data_size, rp->maxactive);
+	if (IS_ERR(rp->rh))
+		return PTR_ERR(rp->rh);
 
-	for (i = 0; i < rp->maxactive; i++) {
-		inst = kzalloc(sizeof(struct kretprobe_instance) +
-			       rp->data_size, GFP_KERNEL);
-		if (inst == NULL) {
-			rethook_free(rp->rh);
-			rp->rh = NULL;
-			return -ENOMEM;
-		}
-		rethook_add_node(rp->rh, &inst->node);
-	}
 	rp->nmissed = 0;
 	/* Establish function entry probe point */
 	ret = register_kprobe(&rp->kp);
@@ -2249,25 +2241,18 @@ int register_kretprobe(struct kretprobe *rp)
 		rp->rh = NULL;
 	}
 #else	/* !CONFIG_KRETPROBE_ON_RETHOOK */
-	rp->freelist.head = NULL;
 	rp->rph = kzalloc(sizeof(struct kretprobe_holder), GFP_KERNEL);
 	if (!rp->rph)
 		return -ENOMEM;
 
-	rp->rph->rp = rp;
-	for (i = 0; i < rp->maxactive; i++) {
-		inst = kzalloc(sizeof(struct kretprobe_instance) +
-			       rp->data_size, GFP_KERNEL);
-		if (inst == NULL) {
-			refcount_set(&rp->rph->ref, i);
-			free_rp_inst(rp);
-			return -ENOMEM;
-		}
-		inst->rph = rp->rph;
-		freelist_add(&inst->freelist, &rp->freelist);
+	if (objpool_init(&rp->rph->pool, rp->maxactive, rp->data_size +
+			sizeof(struct kretprobe_instance), GFP_KERNEL,
+			rp->rph, kretprobe_init_inst, kretprobe_fini_pool)) {
+		kfree(rp->rph);
+		rp->rph = NULL;
+		return -ENOMEM;
 	}
-	refcount_set(&rp->rph->ref, i);
-
+	rp->rph->rp = rp;
 	rp->nmissed = 0;
 	/* Establish function entry probe point */
 	ret = register_kprobe(&rp->kp);
diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index 3b21f4063258..f5bf98e6b2ac 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -187,9 +187,9 @@ static void fprobe_init(struct fprobe *fp)
 
 static int fprobe_init_rethook(struct fprobe *fp, int num)
 {
-	int i, size;
+	int size;
 
-	if (num < 0)
+	if (num <= 0)
 		return -EINVAL;
 
 	if (!fp->exit_handler) {
@@ -202,29 +202,21 @@ static int fprobe_init_rethook(struct fprobe *fp, int num)
 		size = fp->nr_maxactive;
 	else
 		size = num * num_possible_cpus() * 2;
-	if (size < 0)
+	if (size <= 0)
 		return -E2BIG;
 
-	fp->rethook = rethook_alloc((void *)fp, fprobe_exit_handler);
-	if (!fp->rethook)
-		return -ENOMEM;
-	for (i = 0; i < size; i++) {
-		struct fprobe_rethook_node *node;
-
-		node = kzalloc(sizeof(*node) + fp->entry_data_size, GFP_KERNEL);
-		if (!node) {
-			rethook_free(fp->rethook);
-			fp->rethook = NULL;
-			return -ENOMEM;
-		}
-		rethook_add_node(fp->rethook, &node->node);
-	}
+	/* Initialize rethook */
+	fp->rethook = rethook_alloc((void *)fp, fprobe_exit_handler,
+				sizeof(struct fprobe_rethook_node), size);
+	if (IS_ERR(fp->rethook))
+		return PTR_ERR(fp->rethook);
+
 	return 0;
 }
 
 static void fprobe_fail_cleanup(struct fprobe *fp)
 {
-	if (fp->rethook) {
+	if (!IS_ERR_OR_NULL(fp->rethook)) {
 		/* Don't need to cleanup rethook->handler because this is not used. */
 		rethook_free(fp->rethook);
 		fp->rethook = NULL;
@@ -379,14 +371,14 @@ int unregister_fprobe(struct fprobe *fp)
 	if (!fprobe_is_registered(fp))
 		return -EINVAL;
 
-	if (fp->rethook)
+	if (!IS_ERR_OR_NULL(fp->rethook))
 		rethook_stop(fp->rethook);
 
 	ret = unregister_ftrace_function(&fp->ops);
 	if (ret < 0)
 		return ret;
 
-	if (fp->rethook)
+	if (!IS_ERR_OR_NULL(fp->rethook))
 		rethook_free(fp->rethook);
 
 	ftrace_free_filter(&fp->ops);
diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
index 5eb9b598f4e9..13c8e6773892 100644
--- a/kernel/trace/rethook.c
+++ b/kernel/trace/rethook.c
@@ -9,6 +9,7 @@
 #include <linux/rethook.h>
 #include <linux/slab.h>
 #include <linux/sort.h>
+#include <linux/smp.h>
 
 /* Return hook list (shadow stack by list) */
 
@@ -36,21 +37,7 @@ void rethook_flush_task(struct task_struct *tk)
 static void rethook_free_rcu(struct rcu_head *head)
 {
 	struct rethook *rh = container_of(head, struct rethook, rcu);
-	struct rethook_node *rhn;
-	struct freelist_node *node;
-	int count = 1;
-
-	node = rh->pool.head;
-	while (node) {
-		rhn = container_of(node, struct rethook_node, freelist);
-		node = node->next;
-		kfree(rhn);
-		count++;
-	}
-
-	/* The rh->ref is the number of pooled node + 1 */
-	if (refcount_sub_and_test(count, &rh->ref))
-		kfree(rh);
+	objpool_fini(&rh->pool);
 }
 
 /**
@@ -83,54 +70,62 @@ void rethook_free(struct rethook *rh)
 	call_rcu(&rh->rcu, rethook_free_rcu);
 }
 
+static int rethook_init_node(void *nod, void *context)
+{
+	struct rethook_node *node = nod;
+
+	node->rethook = context;
+	return 0;
+}
+
+static int rethook_fini_pool(struct objpool_head *head, void *context)
+{
+	kfree(context);
+	return 0;
+}
+
 /**
  * rethook_alloc() - Allocate struct rethook.
  * @data: a data to pass the @handler when hooking the return.
- * @handler: the return hook callback function.
+ * @handler: the return hook callback function, must NOT be NULL
+ * @size: node size: rethook node and additional data
+ * @num: number of rethook nodes to be preallocated
  *
  * Allocate and initialize a new rethook with @data and @handler.
- * Return NULL if memory allocation fails or @handler is NULL.
+ * Return a pointer to the new rethook, or an error code on failure.
+ *
  * Note that @handler == NULL means this rethook is going to be freed.
  */
-struct rethook *rethook_alloc(void *data, rethook_handler_t handler)
+struct rethook *rethook_alloc(void *data, rethook_handler_t handler,
+			      int size, int num)
 {
-	struct rethook *rh = kzalloc(sizeof(struct rethook), GFP_KERNEL);
+	struct rethook *rh;
 
-	if (!rh || !handler) {
-		kfree(rh);
-		return NULL;
-	}
+	if (!handler || num <= 0 || size < sizeof(struct rethook_node))
+		return ERR_PTR(-EINVAL);
+
+	rh = kzalloc(sizeof(struct rethook), GFP_KERNEL);
+	if (!rh)
+		return ERR_PTR(-ENOMEM);
 
 	rh->data = data;
 	rh->handler = handler;
-	rh->pool.head = NULL;
-	refcount_set(&rh->ref, 1);
 
+	/* initialize the objpool for rethook nodes */
+	if (objpool_init(&rh->pool, num, size, GFP_KERNEL, rh,
+			 rethook_init_node, rethook_fini_pool)) {
+		kfree(rh);
+		return ERR_PTR(-ENOMEM);
+	}
 	return rh;
 }
 
-/**
- * rethook_add_node() - Add a new node to the rethook.
- * @rh: the struct rethook.
- * @node: the struct rethook_node to be added.
- *
- * Add @node to @rh. User must allocate @node (as a part of user's
- * data structure.) The @node fields are initialized in this function.
- */
-void rethook_add_node(struct rethook *rh, struct rethook_node *node)
-{
-	node->rethook = rh;
-	freelist_add(&node->freelist, &rh->pool);
-	refcount_inc(&rh->ref);
-}
-
 static void free_rethook_node_rcu(struct rcu_head *head)
 {
 	struct rethook_node *node = container_of(head, struct rethook_node, rcu);
+	struct rethook *rh = node->rethook;
 
-	if (refcount_dec_and_test(&node->rethook->ref))
-		kfree(node->rethook);
-	kfree(node);
+	objpool_drop(node, &rh->pool);
 }
 
 /**
@@ -145,7 +140,7 @@ void rethook_recycle(struct rethook_node *node)
 	lockdep_assert_preemption_disabled();
 
 	if (likely(READ_ONCE(node->rethook->handler)))
-		freelist_add(&node->freelist, &node->rethook->pool);
+		objpool_push(node, &node->rethook->pool);
 	else
 		call_rcu(&node->rcu, free_rethook_node_rcu);
 }
@@ -161,7 +156,6 @@ NOKPROBE_SYMBOL(rethook_recycle);
 struct rethook_node *rethook_try_get(struct rethook *rh)
 {
 	rethook_handler_t handler = READ_ONCE(rh->handler);
-	struct freelist_node *fn;
 
 	lockdep_assert_preemption_disabled();
 
@@ -178,11 +172,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
 	if (unlikely(!rcu_is_watching()))
 		return NULL;
 
-	fn = freelist_try_get(&rh->pool);
-	if (!fn)
-		return NULL;
-
-	return container_of(fn, struct rethook_node, freelist);
+	return (struct rethook_node *)objpool_pop(&rh->pool);
 }
 NOKPROBE_SYMBOL(rethook_try_get);
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v9 4/5] kprobes: freelist.h removed
  2023-09-05  1:52 [PATCH v9 0/5] lib,kprobes: kretprobe scalability improvement wuqiang.matt
                   ` (2 preceding siblings ...)
  2023-09-05  1:52 ` [PATCH v9 3/5] kprobes: kretprobe scalability improvement with objpool wuqiang.matt
@ 2023-09-05  1:52 ` wuqiang.matt
  2023-09-05  1:52 ` [PATCH v9 5/5] MAINTAINERS: objpool added wuqiang.matt
  2023-09-23  8:57 ` [PATCH v9 0/5] lib,kprobes: kretprobe scalability improvement Masami Hiramatsu
  5 siblings, 0 replies; 25+ messages in thread
From: wuqiang.matt @ 2023-09-05  1:52 UTC (permalink / raw)
  To: linux-trace-kernel, mhiramat, davem, anil.s.keshavamurthy,
	naveen.n.rao, rostedt, peterz, akpm, sander, ebiggers,
	dan.j.williams, jpoimboe
  Cc: linux-kernel, lkp, mattwu, wuqiang.matt

Remove freelist.h from the kernel source tree, since its only users
(kretprobe and rethook) have been converted to objpool.

Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com>
---
 include/linux/freelist.h | 129 ---------------------------------------
 1 file changed, 129 deletions(-)
 delete mode 100644 include/linux/freelist.h

diff --git a/include/linux/freelist.h b/include/linux/freelist.h
deleted file mode 100644
index fc1842b96469..000000000000
--- a/include/linux/freelist.h
+++ /dev/null
@@ -1,129 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause */
-#ifndef FREELIST_H
-#define FREELIST_H
-
-#include <linux/atomic.h>
-
-/*
- * Copyright: cameron@moodycamel.com
- *
- * A simple CAS-based lock-free free list. Not the fastest thing in the world
- * under heavy contention, but simple and correct (assuming nodes are never
- * freed until after the free list is destroyed), and fairly speedy under low
- * contention.
- *
- * Adapted from: https://moodycamel.com/blog/2014/solving-the-aba-problem-for-lock-free-free-lists
- */
-
-struct freelist_node {
-	atomic_t		refs;
-	struct freelist_node	*next;
-};
-
-struct freelist_head {
-	struct freelist_node	*head;
-};
-
-#define REFS_ON_FREELIST 0x80000000
-#define REFS_MASK	 0x7FFFFFFF
-
-static inline void __freelist_add(struct freelist_node *node, struct freelist_head *list)
-{
-	/*
-	 * Since the refcount is zero, and nobody can increase it once it's
-	 * zero (except us, and we run only one copy of this method per node at
-	 * a time, i.e. the single thread case), then we know we can safely
-	 * change the next pointer of the node; however, once the refcount is
-	 * back above zero, then other threads could increase it (happens under
-	 * heavy contention, when the refcount goes to zero in between a load
-	 * and a refcount increment of a node in try_get, then back up to
-	 * something non-zero, then the refcount increment is done by the other
-	 * thread) -- so if the CAS to add the node to the actual list fails,
-	 * decrese the refcount and leave the add operation to the next thread
-	 * who puts the refcount back to zero (which could be us, hence the
-	 * loop).
-	 */
-	struct freelist_node *head = READ_ONCE(list->head);
-
-	for (;;) {
-		WRITE_ONCE(node->next, head);
-		atomic_set_release(&node->refs, 1);
-
-		if (!try_cmpxchg_release(&list->head, &head, node)) {
-			/*
-			 * Hmm, the add failed, but we can only try again when
-			 * the refcount goes back to zero.
-			 */
-			if (atomic_fetch_add_release(REFS_ON_FREELIST - 1, &node->refs) == 1)
-				continue;
-		}
-		return;
-	}
-}
-
-static inline void freelist_add(struct freelist_node *node, struct freelist_head *list)
-{
-	/*
-	 * We know that the should-be-on-freelist bit is 0 at this point, so
-	 * it's safe to set it using a fetch_add.
-	 */
-	if (!atomic_fetch_add_release(REFS_ON_FREELIST, &node->refs)) {
-		/*
-		 * Oh look! We were the last ones referencing this node, and we
-		 * know we want to add it to the free list, so let's do it!
-		 */
-		__freelist_add(node, list);
-	}
-}
-
-static inline struct freelist_node *freelist_try_get(struct freelist_head *list)
-{
-	struct freelist_node *prev, *next, *head = smp_load_acquire(&list->head);
-	unsigned int refs;
-
-	while (head) {
-		prev = head;
-		refs = atomic_read(&head->refs);
-		if ((refs & REFS_MASK) == 0 ||
-		    !atomic_try_cmpxchg_acquire(&head->refs, &refs, refs+1)) {
-			head = smp_load_acquire(&list->head);
-			continue;
-		}
-
-		/*
-		 * Good, reference count has been incremented (it wasn't at
-		 * zero), which means we can read the next and not worry about
-		 * it changing between now and the time we do the CAS.
-		 */
-		next = READ_ONCE(head->next);
-		if (try_cmpxchg_acquire(&list->head, &head, next)) {
-			/*
-			 * Yay, got the node. This means it was on the list,
-			 * which means should-be-on-freelist must be false no
-			 * matter the refcount (because nobody else knows it's
-			 * been taken off yet, it can't have been put back on).
-			 */
-			WARN_ON_ONCE(atomic_read(&head->refs) & REFS_ON_FREELIST);
-
-			/*
-			 * Decrease refcount twice, once for our ref, and once
-			 * for the list's ref.
-			 */
-			atomic_fetch_add(-2, &head->refs);
-
-			return head;
-		}
-
-		/*
-		 * OK, the head must have changed on us, but we still need to decrement
-		 * the refcount we increased.
-		 */
-		refs = atomic_fetch_add(-1, &prev->refs);
-		if (refs == REFS_ON_FREELIST + 1)
-			__freelist_add(prev, list);
-	}
-
-	return NULL;
-}
-
-#endif /* FREELIST_H */
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v9 5/5] MAINTAINERS: objpool added
  2023-09-05  1:52 [PATCH v9 0/5] lib,kprobes: kretprobe scalability improvement wuqiang.matt
                   ` (3 preceding siblings ...)
  2023-09-05  1:52 ` [PATCH v9 4/5] kprobes: freelist.h removed wuqiang.matt
@ 2023-09-05  1:52 ` wuqiang.matt
  2023-09-23  8:57 ` [PATCH v9 0/5] lib,kprobes: kretprobe scalability improvement Masami Hiramatsu
  5 siblings, 0 replies; 25+ messages in thread
From: wuqiang.matt @ 2023-09-05  1:52 UTC (permalink / raw)
  To: linux-trace-kernel, mhiramat, davem, anil.s.keshavamurthy,
	naveen.n.rao, rostedt, peterz, akpm, sander, ebiggers,
	dan.j.williams, jpoimboe
  Cc: linux-kernel, lkp, mattwu, wuqiang.matt

objpool, a scalable and lockless ring-array based object pool, was
introduced to replace the original freelist (a LIFO queue based on
singly linked list) to improve kretprobe scalability.

Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com>
---
 MAINTAINERS | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 48abe1a281f2..1c0a38d763a2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -15328,6 +15328,13 @@ F:	include/linux/objagg.h
 F:	lib/objagg.c
 F:	lib/test_objagg.c
 
+OBJPOOL
+M:	Matt Wu <wuqiang.matt@bytedance.com>
+S:	Supported
+F:	include/linux/objpool.h
+F:	lib/objpool.c
+F:	lib/test_objpool.c
+
 OBJTOOL
 M:	Josh Poimboeuf <jpoimboe@kernel.org>
 M:	Peter Zijlstra <peterz@infradead.org>
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v9 0/5] lib,kprobes: kretprobe scalability improvement
  2023-09-05  1:52 [PATCH v9 0/5] lib,kprobes: kretprobe scalability improvement wuqiang.matt
                   ` (4 preceding siblings ...)
  2023-09-05  1:52 ` [PATCH v9 5/5] MAINTAINERS: objpool added wuqiang.matt
@ 2023-09-23  8:57 ` Masami Hiramatsu
  2023-10-08 18:33   ` wuqiang
  5 siblings, 1 reply; 25+ messages in thread
From: Masami Hiramatsu @ 2023-09-23  8:57 UTC (permalink / raw)
  To: wuqiang.matt
  Cc: linux-trace-kernel, davem, anil.s.keshavamurthy, naveen.n.rao,
	rostedt, peterz, akpm, sander, ebiggers, dan.j.williams,
	jpoimboe, linux-kernel, lkp, mattwu

Hi Wuqiang,

I dug my mail box and found this. Sorry for replying late.

On Tue,  5 Sep 2023 09:52:50 +0800
"wuqiang.matt" <wuqiang.matt@bytedance.com> wrote:

> This patch series introduces a scalable and lockless ring-array based
> object pool and replaces the original freelist (a LIFO queue based on
> singly linked list) to improve scalability of kretprobed routines.
> 
> v9:
>   1) objpool: raw_local_irq_save/restore added to prevent interruption
> 
>      To avoid possible ABA issues, we must ensure objpool_try_add_slot
>      and objpool_try_add_slot are uninterruptible. If these operations
>      are blocked or interrupted in the middle, other cores could overrun
>      the same slot's ages[] of uint32, then after resuming back, the
>      interrupted pop() or push() could see same value of ages[], which
>      is a typical ABA problem though the possibility is small.
> 
>      The pair of pop()/push() costs about 8.53 cpu cycles, measured
>      by IACA (Intel Architecture Code Analyzer). That is, on a 4Ghz
>      core dedicated for pop() & push(), theoretically it would only
>      need 8.53 seconds to overflow a 32bit value. Testings upon Intel
>      i7-10700 (2.90GHz) cost 71.88 seconds to overrun a 32bit integer.

What does this mean? This sounds like "there is a timing issue if it's
fast enough".

Let me review the patch itself.

Thanks,

> 
>   2) codes improvements: thanks to Masami for the thorough inspection
> 
> v8:
>   1) objpool: refcount added for objpool lifecycle management
> 
> wuqiang.matt (5):
>   lib: objpool added: ring-array based lockless MPMC
>   lib: objpool test module added
>   kprobes: kretprobe scalability improvement with objpool
>   kprobes: freelist.h removed
>   MAINTAINERS: objpool added
> 
>  MAINTAINERS              |   7 +
>  include/linux/freelist.h | 129 --------
>  include/linux/kprobes.h  |  11 +-
>  include/linux/objpool.h  | 174 ++++++++++
>  include/linux/rethook.h  |  16 +-
>  kernel/kprobes.c         |  93 +++---
>  kernel/trace/fprobe.c    |  32 +-
>  kernel/trace/rethook.c   |  90 +++--
>  lib/Kconfig.debug        |  11 +
>  lib/Makefile             |   4 +-
>  lib/objpool.c            | 338 +++++++++++++++++++
>  lib/test_objpool.c       | 689 +++++++++++++++++++++++++++++++++++++++
>  12 files changed, 1320 insertions(+), 274 deletions(-)
>  delete mode 100644 include/linux/freelist.h
>  create mode 100644 include/linux/objpool.h
>  create mode 100644 lib/objpool.c
>  create mode 100644 lib/test_objpool.c
> 
> -- 
> 2.40.1
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v9 1/5] lib: objpool added: ring-array based lockless MPMC
  2023-09-05  1:52 ` [PATCH v9 1/5] lib: objpool added: ring-array based lockless MPMC wuqiang.matt
@ 2023-09-23  9:48   ` Masami Hiramatsu
  2023-10-08 18:40     ` wuqiang
  2023-09-24 14:21   ` Masami Hiramatsu
  2023-09-25  9:42   ` Masami Hiramatsu
  2 siblings, 1 reply; 25+ messages in thread
From: Masami Hiramatsu @ 2023-09-23  9:48 UTC (permalink / raw)
  To: wuqiang.matt
  Cc: linux-trace-kernel, davem, anil.s.keshavamurthy, naveen.n.rao,
	rostedt, peterz, akpm, sander, ebiggers, dan.j.williams,
	jpoimboe, linux-kernel, lkp, mattwu

Hi Wuqiang,

Sorry for the late reply.

On Tue,  5 Sep 2023 09:52:51 +0800
"wuqiang.matt" <wuqiang.matt@bytedance.com> wrote:

> The object pool is a scalable implementaion of high performance queue
> for object allocation and reclamation, such as kretprobe instances.
> 
> With leveraging percpu ring-array to mitigate the hot spot of memory
> contention, it could deliver near-linear scalability for high parallel
> scenarios. The objpool is best suited for following cases:
> 1) Memory allocation or reclamation are prohibited or too expensive
> 2) Consumers are of different priorities, such as irqs and threads
> 
> Limitations:
> 1) Maximum objects (capacity) is determined during pool initializing
>    and can't be modified (extended) after objpool creation

So the pool size is fixed in initialization.

> 2) The memory of objects won't be freed until objpool is finalized
> 3) Object allocation (pop) may fail after trying all cpu slots

This means that object allocation will fail if all the pools are empty,
right?
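
A minimal caller sketch for reference (my_item, my_pool and the error
handling here are made up; only objpool_pop()/objpool_push() come from
this patch): pop() can return NULL once every per-cpu slot has been
drained, so the caller has to treat that as an allocation failure.

/* hypothetical caller */
struct my_item *item = objpool_pop(&my_pool);

if (!item)
	return -ENOMEM;	/* all per-cpu slots were empty */

/* use the object; the caller must re-initialize it as needed */

objpool_push(item, &my_pool);	/* reclaim it into the pool */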

> 
> Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com>
> ---
>  include/linux/objpool.h | 174 +++++++++++++++++++++
>  lib/Makefile            |   2 +-
>  lib/objpool.c           | 338 ++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 513 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/objpool.h
>  create mode 100644 lib/objpool.c
> 
> diff --git a/include/linux/objpool.h b/include/linux/objpool.h
> new file mode 100644
> index 000000000000..33c832216b98
> --- /dev/null
> +++ b/include/linux/objpool.h
> @@ -0,0 +1,174 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef _LINUX_OBJPOOL_H
> +#define _LINUX_OBJPOOL_H
> +
> +#include <linux/types.h>
> +#include <linux/refcount.h>
> +
> +/*
> + * objpool: ring-array based lockless MPMC queue
> + *
> + * Copyright: wuqiang.matt@bytedance.com
> + *
> + * The object pool is a scalable implementaion of high performance queue
> + * for objects allocation and reclamation, such as kretprobe instances.
> + *
> + * With leveraging per-cpu ring-array to mitigate the hot spots of memory
> + * contention, it could deliver near-linear scalability for high parallel
> + * scenarios. The ring-array is compactly managed in a single cache-line
> + * to benefit from warmed L1 cache for most cases (<= 4 objects per-core).
> + * The body of pre-allocated objects is stored in continuous cache-lines
> + * just after the ring-array.

I think the entries array may become big if we have a larger number of
CPU cores, like 72 cores, and the user specifies (2^n) + 1 objects.
In this case, each CPU has (2^n - 1)/72 objects, but has 2^(n + 1)
entries in its ring buffer. So this should be noted.
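
A concrete example of the concern, using the code in this patch: with
n == 10, i.e. 1025 objects, on a 72-core machine each per-cpu slot
holds only about 14 objects, yet its ring array is sized
roundup_pow_of_two(1025) == 2048 entries.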

> + *
> + * The object pool is interrupt safe. Both allocation and reclamation
> + * (object pop and push operations) can be preemptible or interruptable.

You've added raw_local_irq_save/restore(), so it is not preemptible or
interruptible anymore. (Anyway, the caller doesn't take care of that.)

> + *
> + * It's best suited for following cases:
> + * 1) Memory allocation or reclamation are prohibited or too expensive
> + * 2) Consumers are of different priorities, such as irqs and threads
> + *
> + * Limitations:
> + * 1) Maximum objects (capacity) is determined during pool initializing
> + * 2) The memory of objects won't be freed until the pool is finalized
> + * 3) Object allocation (pop) may fail after trying all cpu slots
> + */
> +
> +/**
> + * struct objpool_slot - percpu ring array of objpool
> + * @head: head of the local ring array (to retrieve at)
> + * @tail: tail of the local ring array (to append at)
> + * @bits: log2 of capacity (for bitwise operations)
> + * @mask: capacity - 1

These descriptions don't give an idea of what those roles are.

> + *
> + * Represents a cpu-local array-based ring buffer, its size is specialized
> + * during initialization of object pool. The percpu objpool slot is to be
> + * allocated from local memory for NUMA system, and to be kept compact in
> + * continuous memory: ages[] is stored just after the body of objpool_slot,
> + * and then entries[]. ages[] describes revision of each item, solely used
> + * to avoid ABA; entries[] contains pointers of the actual objects
> + *
> + * Layout of objpool_slot in memory:
> + *
> + * 64bit:
> + *        4      8      12     16        32                 64
> + * | head | tail | bits | mask | ages[4] | ents[4]: (8 * 4) | objects
> + *
> + * 32bit:
> + *        4      8      12     16        32        48       64
> + * | head | tail | bits | mask | ages[4] | ents[4] | unused | objects

Hm, does the '4' here mean the number of objects after this objpool_slot?
I don't recommend allocating several arrays after the header; instead,
use another data structure like this:

|head|tail|bits|mask|ents[N]{age:4|offs:4}|padding|objects

here N means the number of total objects.

struct objpool_entry {
	uint32_t	age;
	void *	ptr;
} __attribute__((packed,aligned(8))) ;

> + *
> + */
> +struct objpool_slot {
> +	uint32_t                head;
> +	uint32_t                tail;
> +	uint32_t                bits;
> +	uint32_t                mask;

	struct objpool_entry	entries[];

> +} __packed;

Then you don't need complex macros to access the objects; you only need
one inline function to get the actual object address.

static inline void *objpool_slot_object(struct objpool_slot *slot, int nth)
{
	if (nth >= (1 << slot->bits))
		return NULL;

	return (void *)((unsigned long)slot + slot->entries[nth].offs);
}


> +
> +struct objpool_head;
> +
> +/*
> + * caller-specified callback for object initial setup, it's only called
> + * once for each object (just after the memory allocation of the object)
> + */
> +typedef int (*objpool_init_obj_cb)(void *obj, void *context);
> +
> +/* caller-specified cleanup callback for objpool destruction */
> +typedef int (*objpool_fini_cb)(struct objpool_head *head, void *context);
> +
> +/**
> + * struct objpool_head - object pooling metadata
> + * @obj_size:   object & element size

What does the 'element' mean?

> + * @nr_objs:    total objs (to be pre-allocated)

But all objects must be pre-allocated, right? Then it is simply:

@nr_objs: the total number of objects in this objpool.

> + * @nr_cpus:    nr_cpu_ids

would we have to save it? or just use 'nr_cpu_ids'?

> + * @capacity:   max objects per cpuslot

What is 'cpuslot'?
This seems to be the size of the objpool_entry array in objpool_slot.

> + * @gfp:        gfp flags for kmalloc & vmalloc
> + * @ref:        refcount for objpool
> + * @flags:      flags for objpool management
> + * @cpu_slots:  array of percpu slots
> + * @slot_sizes:	size in bytes of slots
> + * @release:    resource cleanup callback
> + * @context:    caller-provided context
> + */
> +struct objpool_head {
> +	int                     obj_size;
> +	int                     nr_objs;
> +	int                     nr_cpus;
> +	int                     capacity;
> +	gfp_t                   gfp;
> +	refcount_t              ref;
> +	unsigned long           flags;
> +	struct objpool_slot   **cpu_slots;
> +	int                    *slot_sizes;
> +	objpool_fini_cb         release;
> +	void                   *context;
> +};
> +
> +#define OBJPOOL_FROM_VMALLOC	(0x800000000)	/* objpool allocated from vmalloc area */
> +#define OBJPOOL_HAVE_OBJECTS	(0x400000000)	/* objects allocated along with objpool */
> +
> +/**
> + * objpool_init() - initialize objpool and pre-allocated objects
> + * @head:    the object pool to be initialized, declared by caller
> + * @nr_objs: total objects to be pre-allocated by this object pool
> + * @object_size: size of an object (should be > 0)
> + * @gfp:     flags for memory allocation (via kmalloc or vmalloc)
> + * @context: user context for object initialization callback
> + * @objinit: object initialization callback for extra setup
> + * @release: cleanup callback for extra cleanup task
> + *
> + * return value: 0 for success, otherwise error code
> + *
> + * All pre-allocated objects are to be zeroed after memory allocation.
> + * Caller could do extra initialization in objinit callback. objinit()
> + * will be called just after slot allocation and will be only once for
> + * each object. Since then the objpool won't touch any content of the
> + * objects. It's caller's duty to perform reinitialization after each
> + * pop (object allocation) or do clearance before each push (object
> + * reclamation).
> + */
> +int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
> +		 gfp_t gfp, void *context, objpool_init_obj_cb objinit,
> +		 objpool_fini_cb release);
> +
> +/**
> + * objpool_pop() - allocate an object from objpool
> + * @head: object pool
> + *
> + * return value: object ptr or NULL if failed
> + */
> +void *objpool_pop(struct objpool_head *head);
> +
> +/**
> + * objpool_push() - reclaim the object and return back to objpool
> + * @obj:  object ptr to be pushed to objpool
> + * @head: object pool
> + *
> + * return: 0 or error code (it fails only when user tries to push
> + * the same object multiple times or wrong "objects" into objpool)
> + */
> +int objpool_push(void *obj, struct objpool_head *head);
> +
> +/**
> + * objpool_drop() - discard the object and deref objpool
> + * @obj:  object ptr to be discarded
> + * @head: object pool
> + *
> + * return: 0 if objpool was released or error code
> + */
> +int objpool_drop(void *obj, struct objpool_head *head);
> +
> +/**
> + * objpool_free() - release objpool forcely (all objects to be freed)
> + * @head: object pool to be released
> + */
> +void objpool_free(struct objpool_head *head);
> +
> +/**
> + * objpool_fini() - deref object pool (also releasing unused objects)
> + * @head: object pool to be dereferenced
> + */
> +void objpool_fini(struct objpool_head *head);
> +
> +#endif /* _LINUX_OBJPOOL_H */
> diff --git a/lib/Makefile b/lib/Makefile
> index 1ffae65bb7ee..7a84c922d9ff 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -34,7 +34,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
>  	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
>  	 earlycpio.o seq_buf.o siphash.o dec_and_lock.o \
>  	 nmi_backtrace.o win_minmax.o memcat_p.o \
> -	 buildid.o
> +	 buildid.o objpool.o
>  
>  lib-$(CONFIG_PRINTK) += dump_stack.o
>  lib-$(CONFIG_SMP) += cpumask.o
> diff --git a/lib/objpool.c b/lib/objpool.c
> new file mode 100644
> index 000000000000..22e752371820
> --- /dev/null
> +++ b/lib/objpool.c
> @@ -0,0 +1,338 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/objpool.h>
> +#include <linux/slab.h>
> +#include <linux/vmalloc.h>
> +#include <linux/atomic.h>
> +#include <linux/prefetch.h>
> +#include <linux/irqflags.h>
> +#include <linux/cpumask.h>
> +#include <linux/log2.h>
> +
> +/*
> + * objpool: ring-array based lockless MPMC/FIFO queues
> + *
> + * Copyright: wuqiang.matt@bytedance.com
> + */
> +
> +#define SLOT_AGES(s) ((uint32_t *)((char *)(s) + sizeof(struct objpool_slot)))
> +#define SLOT_ENTS(s) ((void **)((char *)(s) + sizeof(struct objpool_slot) + \
> +			(sizeof(uint32_t) << (s)->bits)))
> +#define SLOT_OBJS(s) ((void *)((char *)(s) + sizeof(struct objpool_slot) + \
> +			((sizeof(uint32_t) + sizeof(void *)) << (s)->bits)))
> +#define SLOT_CORE(n) cpumask_nth((n) % num_possible_cpus(), cpu_possible_mask)
> +
> +/* compute the suitable num of objects to be managed per slot */
> +static int objpool_nobjs(int size)
> +{
> +	return rounddown_pow_of_two((size - sizeof(struct objpool_slot)) /
> +			(sizeof(uint32_t) + sizeof(void *)));
> +}
> +
> +/* allocate and initialize percpu slots */

@head: the objpool_head for managing this objpool
@nobjs: the total number of objects in this objpool
@context: context data for @objinit
@objinit: initialize callback for each object.

> +static int
> +objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
> +			void *context, objpool_init_obj_cb objinit)
> +{
> +	int i, j, n, size, objsz, cpu = 0, nents = head->capacity;

'nents' is the total number of objects *rounded up to a power of 2*.

> +
> +	/* aligned object size by sizeof(void *) */
> +	objsz = ALIGN(head->obj_size, sizeof(void *));
> +	/* shall we allocate objects along with percpu-slot */
> +	if (objsz)
> +		head->flags |= OBJPOOL_HAVE_OBJECTS;

Is there any chance that objsz == 0?

> +
> +	/* vmalloc is used in default to allocate percpu-slots */
> +	if (!(head->gfp & GFP_ATOMIC))
> +		head->flags |= OBJPOOL_FROM_VMALLOC;
> +
> +	for (i = 0; i < head->nr_cpus; i++) {
> +		struct objpool_slot *os;
> +
> +		/* skip the cpus which could never be present */
> +		if (!cpu_possible(i))
> +			continue;
> +
> +		/* compute how many objects to be managed by this slot */

"to be managed"? or "to be allocated with"?
It seems all objects can be managed by any slot.

> +		n = nobjs / num_possible_cpus();
> +		if (cpu < (nobjs % num_possible_cpus()))
> +			n++;
> +		size = sizeof(struct objpool_slot) + sizeof(void *) * nents +
> +		       sizeof(uint32_t) * nents + objsz * n;
> +
> +		/*
> +		 * here we allocate percpu-slot & objects together in a single
> +		 * allocation, taking advantage of warm caches and TLB hits as
> +		 * vmalloc always aligns the request size to pages

"Since the objpool_entry array in the slot is mostly accessed from the
 i-th CPU, it should be allocated from the memory node for that CPU."

I think the reason for the node-local allocation is mainly to reduce
the cache-miss penalty, which is bigger when running on NUMA.

> +		 */
> +		if (head->flags & OBJPOOL_FROM_VMALLOC)
> +			os = __vmalloc_node(size, sizeof(void *), head->gfp,
> +				cpu_to_node(i), __builtin_return_address(0));
> +		else
> +			os = kmalloc_node(size, head->gfp, cpu_to_node(i));
> +		if (!os)
> +			return -ENOMEM;
> +
> +		/* initialize percpu slot for the i-th slot */
> +		memset(os, 0, size);
> +		os->bits = ilog2(head->capacity);
> +		os->mask = head->capacity - 1;
> +		head->cpu_slots[i] = os;
> +		head->slot_sizes[i] = size;
> +		cpu = cpu + 1;
> +
> +		/*
> +		 * manually set head & tail to avoid possible conflict:
> +		 * We assume that the head item is ready for retrieval
> +		 * iff head is equal to ages[head & mask]. but ages is
> +		 * initialized as 0, so in view of the caller of pop(),
> +		 * the 1st item (0th) is always ready, but the reality
> +		 * could be: push() is stalled before the final update,
> +		 * thus the item being inserted will be lost forever
> +		 */
> +		os->head = os->tail = head->capacity;
> +
> +		if (!objsz)
> +			continue;

Is it possible? and for what?

> +
> +		for (j = 0; j < n; j++) {
> +			uint32_t *ages = SLOT_AGES(os);
> +			void **ents = SLOT_ENTS(os);
> +			void *obj = SLOT_OBJS(os) + j * objsz;
> +			uint32_t ie = os->tail & os->mask;
> +
> +			/* perform object initialization */
> +			if (objinit) {
> +				int rc = objinit(obj, context);
> +				if (rc)
> +					return rc;
> +			}
> +
> +			/* add obj into the ring array */
> +			ents[ie] = obj;
> +			ages[ie] = os->tail;
> +			os->tail++;
> +			head->nr_objs++;
> +		}

To simplify the code, this loop should be another static function.
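
Something along these lines, for example (a rough sketch only; the
helper name is made up, and it reuses the SLOT_* macros and
objpool_init_obj_cb from this patch):

/* populate one per-cpu slot with its n pre-allocated objects */
static int objpool_init_slot_objs(struct objpool_head *head,
				  struct objpool_slot *os, int n, int objsz,
				  void *context, objpool_init_obj_cb objinit)
{
	uint32_t *ages = SLOT_AGES(os);
	void **ents = SLOT_ENTS(os);
	int j;

	for (j = 0; j < n; j++) {
		void *obj = SLOT_OBJS(os) + j * objsz;
		uint32_t ie = os->tail & os->mask;

		/* perform object initialization */
		if (objinit) {
			int rc = objinit(obj, context);
			if (rc)
				return rc;
		}

		/* add obj into the ring array */
		ents[ie] = obj;
		ages[ie] = os->tail;
		os->tail++;
		head->nr_objs++;
	}

	return 0;
}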

> +	}
> +
> +	return 0;
> +}
> +
> +/* cleanup all percpu slots of the object pool */
> +static void objpool_fini_percpu_slots(struct objpool_head *head)
> +{
> +	int i;
> +
> +	if (!head->cpu_slots)
> +		return;
> +
> +	for (i = 0; i < head->nr_cpus; i++) {
> +		if (!head->cpu_slots[i])
> +			continue;
> +		if (head->flags & OBJPOOL_FROM_VMALLOC)
> +			vfree(head->cpu_slots[i]);
> +		else
> +			kfree(head->cpu_slots[i]);
> +	}
> +	kfree(head->cpu_slots);
> +	head->cpu_slots = NULL;
> +	head->slot_sizes = NULL;
> +}
> +
> +/* initialize object pool and pre-allocate objects */
> +int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
> +		gfp_t gfp, void *context, objpool_init_obj_cb objinit,
> +		objpool_fini_cb release)
> +{
> +	int nents, rc;
> +
> +	/* check input parameters */
> +	if (nr_objs <= 0 || object_size <= 0)
> +		return -EINVAL;
> +
> +	/* calculate percpu slot size (rounded to pow of 2) */
> +	nents = max_t(int, roundup_pow_of_two(nr_objs),
> +			objpool_nobjs(L1_CACHE_BYTES));
> +
> +	/* initialize objpool head */
> +	memset(head, 0, sizeof(struct objpool_head));
> +	head->nr_cpus = nr_cpu_ids;
> +	head->obj_size = object_size;
> +	head->capacity = nents;
> +	head->gfp = gfp & ~__GFP_ZERO;
> +	head->context = context;
> +	head->release = release;
> +
> +	/* allocate array for percpu slots */
> +	head->cpu_slots = kzalloc(head->nr_cpus * sizeof(void *) +
> +			       head->nr_cpus * sizeof(int), head->gfp);
> +	if (!head->cpu_slots)
> +		return -ENOMEM;
> +	head->slot_sizes = (int *)&head->cpu_slots[head->nr_cpus];
> +
> +	/* initialize per-cpu slots */
> +	rc = objpool_init_percpu_slots(head, nr_objs, context, objinit);
> +	if (rc)
> +		objpool_fini_percpu_slots(head);
> +	else
> +		refcount_set(&head->ref, nr_objs + 1);
> +
> +	return rc;
> +}
> +EXPORT_SYMBOL_GPL(objpool_init);
> +
> +/* adding object to slot, abort if the slot was already full */
> +static inline int objpool_try_add_slot(void *obj, struct objpool_slot *os)
> +{
> +	uint32_t *ages = SLOT_AGES(os);
> +	void **ents = SLOT_ENTS(os);
> +	uint32_t head, tail;
> +
> +	do {
> +		/* perform memory loading for both head and tail */
> +		head = READ_ONCE(os->head);
> +		tail = READ_ONCE(os->tail);
> +		/* just abort if slot is full */
> +		if (tail - head > os->mask)
> +			return -ENOENT;

Is this really possible? The total number of objects must be less than
or equal to os->mask. If this indicates a bug, please use WARN_ON_ONCE()
here for debugging.

> +		/* try to extend tail by 1 using CAS to avoid races */
> +		if (try_cmpxchg_acquire(&os->tail, &tail, tail + 1))
> +			break;
> +	} while (1);

"if(cond) ~ break; } while(1)" should be "} (!cond);"

And this seems to be buggy since tail + 1 can wrap to 0, and then
"tail - head" conceptually goes negative (it wraps as unsigned).

if (tail < head) {
	if (WARN_ON_ONCE(tail + (UINT32_MAX - head) > os->mask))
		return -ENOENT;
} else {
	if (WARN_ON_ONCE(tail - head > os->mask))
		return -ENOENT;
}

> +
> +	/* the tail-th of slot is reserved for the given obj */
> +	WRITE_ONCE(ents[tail & os->mask], obj);
> +	/* update epoch id to make this object available for pop() */
> +	smp_store_release(&ages[tail & os->mask], tail);

Note: since the ages array size is a power of 2, this is effectively
just a (32 - os->bits)-bit loop counter. :)
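
(For example, with capacity 4 (os->bits == 2), ages[id] only ever takes
values congruent to id modulo 4, advancing by the capacity each time
that index is reused, so it cycles through 2^(32 - 2) == 2^30 distinct
values before wrapping around.)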

> +	return 0;
> +}
> +
> +/* reclaim an object to object pool */
> +int objpool_push(void *obj, struct objpool_head *oh)
> +{
> +	unsigned long flags;
> +	int cpu, rc;
> +
> +	/* disable local irq to avoid preemption & interruption */
> +	raw_local_irq_save(flags);
> +	cpu = raw_smp_processor_id();
> +	do {
> +		rc = objpool_try_add_slot(obj, oh->cpu_slots[cpu]);
> +		if (!rc)
> +			break;
> +		cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
> +	} while (1);

Hmm, as I said, head->capacity >= nr_all_obj, this must not happen,
we can always push it on this CPU's slot, right?

> +	raw_local_irq_restore(flags);
> +
> +	return rc;
> +}
> +EXPORT_SYMBOL_GPL(objpool_push);
> +
> +/* drop the allocated object, rather reclaim it to objpool */
> +int objpool_drop(void *obj, struct objpool_head *head)
> +{
> +	if (!obj || !head)
> +		return -EINVAL;
> +
> +	if (refcount_dec_and_test(&head->ref)) {
> +		objpool_free(head);
> +		return 0;
> +	}
> +
> +	return -EAGAIN;
> +}
> +EXPORT_SYMBOL_GPL(objpool_drop);
> +
> +/* try to retrieve object from slot */
> +static inline void *objpool_try_get_slot(struct objpool_slot *os)
> +{
> +	uint32_t *ages = SLOT_AGES(os);
> +	void **ents = SLOT_ENTS(os);
> +	/* do memory load of head to local head */
> +	uint32_t head = smp_load_acquire(&os->head);
> +
> +	/* loop if slot isn't empty */
> +	while (head != READ_ONCE(os->tail)) {
> +		uint32_t id = head & os->mask, prev = head;
> +
> +		/* do prefetching of object ents */
> +		prefetch(&ents[id]);
> +
> +		/* check whether this item was ready for retrieval */
> +		if (smp_load_acquire(&ages[id]) == head) {

We may not need this check, since we know head != tail and the
sizeof(ages) >= nr_all_objs.

Hmm, I guess we can remove ages[] from the code.

> +			/* node must have been udpated by push() */
> +			void *node = READ_ONCE(ents[id]);

Please use the same word for the same object.
I mean this is not 'node' but 'object'.

> +			/* commit and move forward head of the slot */
> +			if (try_cmpxchg_release(&os->head, &head, head + 1))
> +				return node;
> +			/* head was already updated by others */
> +		}
> +
> +		/* re-load head from memory and continue trying */
> +		head = READ_ONCE(os->head);
> +		/*
> +		 * head stays unchanged, so it's very likely there's an
> +		 * ongoing push() on other cpu nodes but yet not update
> +		 * ages[] to mark it's completion
> +		 */
> +		if (head == prev)
> +			break;

This is OK. If we always push() on the current CPU's slot, and pop()
from any CPU, we could try again here if this slot is not the current
CPU's. But that may be too much :P

Thank you,

> +	}
> +
> +	return NULL;
> +}
> +
> +/* allocate an object from object pool */
> +void *objpool_pop(struct objpool_head *head)
> +{
> +	unsigned long flags;
> +	int i, cpu;
> +	void *obj = NULL;
> +
> +	/* disable local irq to avoid preemption & interruption */
> +	raw_local_irq_save(flags);
> +
> +	cpu = raw_smp_processor_id();
> +	for (i = 0; i < num_possible_cpus(); i++) {
> +		obj = objpool_try_get_slot(head->cpu_slots[cpu]);
> +		if (obj)
> +			break;
> +		cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
> +	}
> +	raw_local_irq_restore(flags);
> +
> +	return obj;
> +}
> +EXPORT_SYMBOL_GPL(objpool_pop);
> +
> +/* release whole objpool forcely */
> +void objpool_free(struct objpool_head *head)
> +{
> +	if (!head->cpu_slots)
> +		return;
> +
> +	/* release percpu slots */
> +	objpool_fini_percpu_slots(head);
> +
> +	/* call user's cleanup callback if provided */
> +	if (head->release)
> +		head->release(head, head->context);
> +}
> +EXPORT_SYMBOL_GPL(objpool_free);
> +
> +/* drop unused objects and defref objpool for releasing */
> +void objpool_fini(struct objpool_head *head)
> +{
> +	void *nod;
> +
> +	do {
> +		/* grab object from objpool and drop it */
> +		nod = objpool_pop(head);
> +
> +		/* drop the extra ref of objpool */
> +		if (refcount_dec_and_test(&head->ref))
> +			objpool_free(head);
> +	} while (nod);
> +}
> +EXPORT_SYMBOL_GPL(objpool_fini);
> -- 
> 2.40.1
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v9 1/5] lib: objpool added: ring-array based lockless MPMC
  2023-09-05  1:52 ` [PATCH v9 1/5] lib: objpool added: ring-array based lockless MPMC wuqiang.matt
  2023-09-23  9:48   ` Masami Hiramatsu
@ 2023-09-24 14:21   ` Masami Hiramatsu
  2023-09-25  9:42   ` Masami Hiramatsu
  2 siblings, 0 replies; 25+ messages in thread
From: Masami Hiramatsu @ 2023-09-24 14:21 UTC (permalink / raw)
  To: wuqiang.matt
  Cc: linux-trace-kernel, davem, anil.s.keshavamurthy, naveen.n.rao,
	rostedt, peterz, akpm, sander, ebiggers, dan.j.williams,
	jpoimboe, linux-kernel, lkp, mattwu

Hi,

On Tue,  5 Sep 2023 09:52:51 +0800
"wuqiang.matt" <wuqiang.matt@bytedance.com> wrote:

> +/* cleanup all percpu slots of the object pool */
> +static void objpool_fini_percpu_slots(struct objpool_head *head)
> +{
> +	int i;
> +
> +	if (!head->cpu_slots)
> +		return;
> +
> +	for (i = 0; i < head->nr_cpus; i++) {
> +		if (!head->cpu_slots[i])
> +			continue;
> +		if (head->flags & OBJPOOL_FROM_VMALLOC)
> +			vfree(head->cpu_slots[i]);
> +		else
> +			kfree(head->cpu_slots[i]);

You can use kvfree() for both kmalloc'ed and vmalloc'ed objects.

> +	}
> +	kfree(head->cpu_slots);
> +	head->cpu_slots = NULL;
> +	head->slot_sizes = NULL;
> +}

...

> +/* drop the allocated object, rather reclaim it to objpool */
> +int objpool_drop(void *obj, struct objpool_head *head)
> +{
> +	if (!obj || !head)
> +		return -EINVAL;
> +
> +	if (refcount_dec_and_test(&head->ref)) {
> +		objpool_free(head);
> +		return 0;
> +	}
> +
> +	return -EAGAIN;
> +}
> +EXPORT_SYMBOL_GPL(objpool_drop);

To make this work correctly, you need to disable the objpool (so that
no more objects can be popped from it) and ensure that it really is
disabled. Also, when disabling the objpool, its refcount must be set to
the number of "active" objects.

Thank you,


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v9 1/5] lib: objpool added: ring-array based lockless MPMC
  2023-09-05  1:52 ` [PATCH v9 1/5] lib: objpool added: ring-array based lockless MPMC wuqiang.matt
  2023-09-23  9:48   ` Masami Hiramatsu
  2023-09-24 14:21   ` Masami Hiramatsu
@ 2023-09-25  9:42   ` Masami Hiramatsu
  2023-10-08 19:04     ` wuqiang
  2023-10-09  9:23     ` wuqiang
  2 siblings, 2 replies; 25+ messages in thread
From: Masami Hiramatsu @ 2023-09-25  9:42 UTC (permalink / raw)
  To: wuqiang.matt
  Cc: linux-trace-kernel, davem, anil.s.keshavamurthy, naveen.n.rao,
	rostedt, peterz, akpm, sander, ebiggers, dan.j.williams,
	jpoimboe, linux-kernel, lkp, mattwu

Hi Wuqiang,

On Tue,  5 Sep 2023 09:52:51 +0800
"wuqiang.matt" <wuqiang.matt@bytedance.com> wrote:

> The object pool is a scalable implementaion of high performance queue
> for object allocation and reclamation, such as kretprobe instances.
> 
> With leveraging percpu ring-array to mitigate the hot spot of memory
> contention, it could deliver near-linear scalability for high parallel
> scenarios. The objpool is best suited for following cases:
> 1) Memory allocation or reclamation are prohibited or too expensive
> 2) Consumers are of different priorities, such as irqs and threads
> 
> Limitations:
> 1) Maximum objects (capacity) is determined during pool initializing
>    and can't be modified (extended) after objpool creation
> 2) The memory of objects won't be freed until objpool is finalized
> 3) Object allocation (pop) may fail after trying all cpu slots

I made a simplifying patch for this by (mainly) removing the ages
array. I also renamed the local variables to more readable names, like
slot, pool, and obj.

Here are the results of running test_objpool.ko.

Original:
[   50.500235] Summary of testcases:
[   50.503139]     duration: 1027135us 	hits:   30628293 	miss:          0 	sync: percpu objpool
[   50.510416]     duration: 1047977us 	hits:   30453023 	miss:          0 	sync: percpu objpool from vmalloc
[   50.517421]     duration: 1047098us 	hits:   31145034 	miss:          0 	sync & hrtimer: percpu objpool
[   50.524671]     duration: 1053233us 	hits:   30919640 	miss:          0 	sync & hrtimer: percpu objpool from vmalloc
[   50.531382]     duration: 1055822us 	hits:    3407221 	miss:     830219 	sync overrun: percpu objpool
[   50.538135]     duration: 1055998us 	hits:    3404624 	miss:     854160 	sync overrun: percpu objpool from vmalloc
[   50.546686]     duration: 1046104us 	hits:   19464798 	miss:          0 	async: percpu objpool
[   50.552989]     duration: 1033054us 	hits:   18957836 	miss:          0 	async: percpu objpool from vmalloc
[   50.560289]     duration: 1041778us 	hits:   33806868 	miss:          0 	async & hrtimer: percpu objpool
[   50.567425]     duration: 1048901us 	hits:   34211860 	miss:          0 	async & hrtimer: percpu objpool from vmalloc

Simplified:
[   48.393236] Summary of testcases:
[   48.395384]     duration: 1013002us 	hits:   29661448 	miss:          0 	sync: percpu objpool
[   48.400351]     duration: 1057231us 	hits:   31035578 	miss:          0 	sync: percpu objpool from vmalloc
[   48.405660]     duration: 1043196us 	hits:   30546652 	miss:          0 	sync & hrtimer: percpu objpool
[   48.411216]     duration: 1047128us 	hits:   30601306 	miss:          0 	sync & hrtimer: percpu objpool from vmalloc
[   48.417231]     duration: 1051695us 	hits:    3468287 	miss:     892881 	sync overrun: percpu objpool
[   48.422405]     duration: 1054003us 	hits:    3549491 	miss:     898141 	sync overrun: percpu objpool from vmalloc
[   48.428425]     duration: 1052946us 	hits:   19005228 	miss:          0 	async: percpu objpool
[   48.433597]     duration: 1051757us 	hits:   19670866 	miss:          0 	async: percpu objpool from vmalloc
[   48.439280]     duration: 1042951us 	hits:   37135332 	miss:          0 	async & hrtimer: percpu objpool
[   48.445085]     duration: 1029803us 	hits:   37093248 	miss:          0 	async & hrtimer: percpu objpool from vmalloc

Can you test it too?

Thanks,

From f1f442ff653e329839e5452b8b88463a80a12ff3 Mon Sep 17 00:00:00 2001
From: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Date: Mon, 25 Sep 2023 16:07:12 +0900
Subject: [PATCH] objpool: Simplify objpool by removing ages array

Simplify the objpool code by removing the ages array from the per-cpu
slot. Since the capacity chosen for the entries array is large enough
(nr_objects + 1 rounded up to a power of 2), the tail never catches up
to the head in a per-cpu slot. Thus tail == head means the slot is
empty.
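
For example, with nr_objs == 4 the capacity becomes
roundup_pow_of_two(4 + 1) == 8; even if all 4 objects end up pushed
onto a single per-cpu slot, tail - head never exceeds 4, so the ring
can never become full and tail == head unambiguously means the slot is
empty.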

This also uses consistent local variable names for pool, slot and obj.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 include/linux/objpool.h |  61 ++++----
 lib/objpool.c           | 310 ++++++++++++++++------------------------
 2 files changed, 147 insertions(+), 224 deletions(-)

diff --git a/include/linux/objpool.h b/include/linux/objpool.h
index 33c832216b98..ecd5ecaffcd3 100644
--- a/include/linux/objpool.h
+++ b/include/linux/objpool.h
@@ -38,33 +38,23 @@
  * struct objpool_slot - percpu ring array of objpool
  * @head: head of the local ring array (to retrieve at)
  * @tail: tail of the local ring array (to append at)
- * @bits: log2 of capacity (for bitwise operations)
- * @mask: capacity - 1
+ * @mask: capacity of entries - 1
+ * @entries: object entries on this slot.
  *
  * Represents a cpu-local array-based ring buffer, its size is specialized
  * during initialization of object pool. The percpu objpool slot is to be
  * allocated from local memory for NUMA system, and to be kept compact in
- * continuous memory: ages[] is stored just after the body of objpool_slot,
- * and then entries[]. ages[] describes revision of each item, solely used
- * to avoid ABA; entries[] contains pointers of the actual objects
- *
- * Layout of objpool_slot in memory:
- *
- * 64bit:
- *        4      8      12     16        32                 64
- * | head | tail | bits | mask | ages[4] | ents[4]: (8 * 4) | objects
- *
- * 32bit:
- *        4      8      12     16        32        48       64
- * | head | tail | bits | mask | ages[4] | ents[4] | unused | objects
+ * continuous memory: the objects assigned to this CPU are stored just after
+ * the body of objpool_slot.
  *
  */
 struct objpool_slot {
-	uint32_t                head;
-	uint32_t                tail;
-	uint32_t                bits;
-	uint32_t                mask;
-} __packed;
+	uint32_t	head;
+	uint32_t	tail;
+	uint32_t	mask;
+	uint32_t	dummyr;
+	void *		entries[];
+};
 
 struct objpool_head;
 
@@ -82,12 +72,11 @@ typedef int (*objpool_fini_cb)(struct objpool_head *head, void *context);
  * @obj_size:   object & element size
  * @nr_objs:    total objs (to be pre-allocated)
  * @nr_cpus:    nr_cpu_ids
- * @capacity:   max objects per cpuslot
+ * @capacity:   max objects on percpu slot
  * @gfp:        gfp flags for kmalloc & vmalloc
  * @ref:        refcount for objpool
  * @flags:      flags for objpool management
  * @cpu_slots:  array of percpu slots
- * @slot_sizes:	size in bytes of slots
  * @release:    resource cleanup callback
  * @context:    caller-provided context
  */
@@ -100,7 +89,6 @@ struct objpool_head {
 	refcount_t              ref;
 	unsigned long           flags;
 	struct objpool_slot   **cpu_slots;
-	int                    *slot_sizes;
 	objpool_fini_cb         release;
 	void                   *context;
 };
@@ -108,9 +96,12 @@ struct objpool_head {
 #define OBJPOOL_FROM_VMALLOC	(0x800000000)	/* objpool allocated from vmalloc area */
 #define OBJPOOL_HAVE_OBJECTS	(0x400000000)	/* objects allocated along with objpool */
 
+#define OBJPOOL_NR_OBJECT_MAX	(1 << 24)
+#define OBJPOOL_OBJECT_SIZE_MAX	(1 << 16)
+
 /**
  * objpool_init() - initialize objpool and pre-allocated objects
- * @head:    the object pool to be initialized, declared by caller
+ * @pool:    the object pool to be initialized, declared by caller
  * @nr_objs: total objects to be pre-allocated by this object pool
  * @object_size: size of an object (should be > 0)
  * @gfp:     flags for memory allocation (via kmalloc or vmalloc)
@@ -128,47 +119,47 @@ struct objpool_head {
  * pop (object allocation) or do clearance before each push (object
  * reclamation).
  */
-int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
+int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
 		 gfp_t gfp, void *context, objpool_init_obj_cb objinit,
 		 objpool_fini_cb release);
 
 /**
  * objpool_pop() - allocate an object from objpool
- * @head: object pool
+ * @pool: object pool
  *
  * return value: object ptr or NULL if failed
  */
-void *objpool_pop(struct objpool_head *head);
+void *objpool_pop(struct objpool_head *pool);
 
 /**
  * objpool_push() - reclaim the object and return back to objpool
  * @obj:  object ptr to be pushed to objpool
- * @head: object pool
+ * @pool: object pool
  *
  * return: 0 or error code (it fails only when user tries to push
  * the same object multiple times or wrong "objects" into objpool)
  */
-int objpool_push(void *obj, struct objpool_head *head);
+int objpool_push(void *obj, struct objpool_head *pool);
 
 /**
  * objpool_drop() - discard the object and deref objpool
  * @obj:  object ptr to be discarded
- * @head: object pool
+ * @pool: object pool
  *
  * return: 0 if objpool was released or error code
  */
-int objpool_drop(void *obj, struct objpool_head *head);
+int objpool_drop(void *obj, struct objpool_head *pool);
 
 /**
  * objpool_free() - release objpool forcely (all objects to be freed)
- * @head: object pool to be released
+ * @pool: object pool to be released
  */
-void objpool_free(struct objpool_head *head);
+void objpool_free(struct objpool_head *pool);
 
 /**
  * objpool_fini() - deref object pool (also releasing unused objects)
- * @head: object pool to be dereferenced
+ * @pool: object pool to be dereferenced
  */
-void objpool_fini(struct objpool_head *head);
+void objpool_fini(struct objpool_head *pool);
 
 #endif /* _LINUX_OBJPOOL_H */
diff --git a/lib/objpool.c b/lib/objpool.c
index 22e752371820..f8e8f70d7253 100644
--- a/lib/objpool.c
+++ b/lib/objpool.c
@@ -15,104 +15,55 @@
  * Copyright: wuqiang.matt@bytedance.com
  */
 
-#define SLOT_AGES(s) ((uint32_t *)((char *)(s) + sizeof(struct objpool_slot)))
-#define SLOT_ENTS(s) ((void **)((char *)(s) + sizeof(struct objpool_slot) + \
-			(sizeof(uint32_t) << (s)->bits)))
-#define SLOT_OBJS(s) ((void *)((char *)(s) + sizeof(struct objpool_slot) + \
-			((sizeof(uint32_t) + sizeof(void *)) << (s)->bits)))
-#define SLOT_CORE(n) cpumask_nth((n) % num_possible_cpus(), cpu_possible_mask)
-
-/* compute the suitable num of objects to be managed per slot */
-static int objpool_nobjs(int size)
-{
-	return rounddown_pow_of_two((size - sizeof(struct objpool_slot)) /
-			(sizeof(uint32_t) + sizeof(void *)));
-}
-
 /* allocate and initialize percpu slots */
 static int
-objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
-			void *context, objpool_init_obj_cb objinit)
+objpool_init_percpu_slots(struct objpool_head *pool, void *context,
+			  objpool_init_obj_cb objinit)
 {
-	int i, j, n, size, objsz, cpu = 0, nents = head->capacity;
-
-	/* aligned object size by sizeof(void *) */
-	objsz = ALIGN(head->obj_size, sizeof(void *));
-	/* shall we allocate objects along with percpu-slot */
-	if (objsz)
-		head->flags |= OBJPOOL_HAVE_OBJECTS;
-
-	/* vmalloc is used in default to allocate percpu-slots */
-	if (!(head->gfp & GFP_ATOMIC))
-		head->flags |= OBJPOOL_FROM_VMALLOC;
-
-	for (i = 0; i < head->nr_cpus; i++) {
-		struct objpool_slot *os;
+	int i, j, n, size, slot_size, cpu_count = 0;
+	struct objpool_slot *slot;
 
+	for (i = 0; i < pool->nr_cpus; i++) {
 		/* skip the cpus which could never be present */
 		if (!cpu_possible(i))
 			continue;
 
 		/* compute how many objects to be managed by this slot */
-		n = nobjs / num_possible_cpus();
-		if (cpu < (nobjs % num_possible_cpus()))
+		n = pool->nr_objs / num_possible_cpus();
+		if (cpu_count < (pool->nr_objs % num_possible_cpus()))
 			n++;
-		size = sizeof(struct objpool_slot) + sizeof(void *) * nents +
-		       sizeof(uint32_t) * nents + objsz * n;
+		cpu_count++;
+
+		slot_size = struct_size(slot, entries, pool->capacity);
+		size = slot_size + pool->obj_size * n;
 
 		/*
 		 * here we allocate percpu-slot & objects together in a single
-		 * allocation, taking advantage of warm caches and TLB hits as
-		 * vmalloc always aligns the request size to pages
+		 * allocation, taking advantage of NUMA locality.
 		 */
-		if (head->flags & OBJPOOL_FROM_VMALLOC)
-			os = __vmalloc_node(size, sizeof(void *), head->gfp,
+		if (pool->flags & OBJPOOL_FROM_VMALLOC)
+			slot = __vmalloc_node(size, sizeof(void *), pool->gfp,
 				cpu_to_node(i), __builtin_return_address(0));
 		else
-			os = kmalloc_node(size, head->gfp, cpu_to_node(i));
-		if (!os)
+			slot = kmalloc_node(size, pool->gfp, cpu_to_node(i));
+		if (!slot)
 			return -ENOMEM;
 
 		/* initialize percpu slot for the i-th slot */
-		memset(os, 0, size);
-		os->bits = ilog2(head->capacity);
-		os->mask = head->capacity - 1;
-		head->cpu_slots[i] = os;
-		head->slot_sizes[i] = size;
-		cpu = cpu + 1;
-
-		/*
-		 * manually set head & tail to avoid possible conflict:
-		 * We assume that the head item is ready for retrieval
-		 * iff head is equal to ages[head & mask]. but ages is
-		 * initialized as 0, so in view of the caller of pop(),
-		 * the 1st item (0th) is always ready, but the reality
-		 * could be: push() is stalled before the final update,
-		 * thus the item being inserted will be lost forever
-		 */
-		os->head = os->tail = head->capacity;
-
-		if (!objsz)
-			continue;
+		memset(slot, 0, size);
+		slot->mask = pool->capacity - 1;
+		pool->cpu_slots[i] = slot;
 
 		for (j = 0; j < n; j++) {
-			uint32_t *ages = SLOT_AGES(os);
-			void **ents = SLOT_ENTS(os);
-			void *obj = SLOT_OBJS(os) + j * objsz;
-			uint32_t ie = os->tail & os->mask;
+			void *obj = (void *)slot + slot_size + pool->obj_size * j;
 
-			/* perform object initialization */
 			if (objinit) {
 				int rc = objinit(obj, context);
 				if (rc)
 					return rc;
 			}
-
-			/* add obj into the ring array */
-			ents[ie] = obj;
-			ages[ie] = os->tail;
-			os->tail++;
-			head->nr_objs++;
+			slot->entries[j] = obj;
+			slot->tail++;
 		}
 	}
 
@@ -120,164 +71,135 @@ objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
 }
 
 /* cleanup all percpu slots of the object pool */
-static void objpool_fini_percpu_slots(struct objpool_head *head)
+static void objpool_fini_percpu_slots(struct objpool_head *pool)
 {
 	int i;
 
-	if (!head->cpu_slots)
+	if (!pool->cpu_slots)
 		return;
 
-	for (i = 0; i < head->nr_cpus; i++) {
-		if (!head->cpu_slots[i])
-			continue;
-		if (head->flags & OBJPOOL_FROM_VMALLOC)
-			vfree(head->cpu_slots[i]);
-		else
-			kfree(head->cpu_slots[i]);
-	}
-	kfree(head->cpu_slots);
-	head->cpu_slots = NULL;
-	head->slot_sizes = NULL;
+	for (i = 0; i < pool->nr_cpus; i++)
+		kvfree(pool->cpu_slots[i]);
+	kfree(pool->cpu_slots);
 }
 
 /* initialize object pool and pre-allocate objects */
-int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
+int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
 		gfp_t gfp, void *context, objpool_init_obj_cb objinit,
 		objpool_fini_cb release)
 {
 	int nents, rc;
 
 	/* check input parameters */
-	if (nr_objs <= 0 || object_size <= 0)
+	if (nr_objs <= 0 || object_size <= 0 ||
+	    nr_objs > OBJPOOL_NR_OBJECT_MAX || object_size > OBJPOOL_OBJECT_SIZE_MAX)
+		return -EINVAL;
+
+	/* Align up to unsigned long size */
+	object_size = ALIGN(object_size, sizeof(unsigned long));
+
+	/*
+	 * To avoid filling up the entries array in the per-cpu slot,
+	 * use a power of 2 that is not less than N + 1. Thus, the tail never
+	 * catches up to the head, and head/tail can be used as sequential
+	 * numbers.
+	 */
+	nents = roundup_pow_of_two(nr_objs + 1);
+	if (!nents)
 		return -EINVAL;
 
-	/* calculate percpu slot size (rounded to pow of 2) */
-	nents = max_t(int, roundup_pow_of_two(nr_objs),
-			objpool_nobjs(L1_CACHE_BYTES));
-
-	/* initialize objpool head */
-	memset(head, 0, sizeof(struct objpool_head));
-	head->nr_cpus = nr_cpu_ids;
-	head->obj_size = object_size;
-	head->capacity = nents;
-	head->gfp = gfp & ~__GFP_ZERO;
-	head->context = context;
-	head->release = release;
-
-	/* allocate array for percpu slots */
-	head->cpu_slots = kzalloc(head->nr_cpus * sizeof(void *) +
-			       head->nr_cpus * sizeof(int), head->gfp);
-	if (!head->cpu_slots)
+	/* initialize objpool pool */
+	memset(pool, 0, sizeof(struct objpool_head));
+	pool->nr_cpus = nr_cpu_ids;
+	pool->obj_size = object_size;
+	pool->nr_objs = nr_objs;
+	/* just prevents the per-cpu ring array from being filled up */
+	pool->capacity = nents;
+	pool->gfp = gfp & ~__GFP_ZERO;
+	pool->context = context;
+	pool->release = release;
+	/* vmalloc is used in default to allocate percpu-slots */
+	if (!(pool->gfp & GFP_ATOMIC))
+		pool->flags |= OBJPOOL_FROM_VMALLOC;
+
+	pool->cpu_slots = kzalloc(pool->nr_cpus * sizeof(void *), pool->gfp);
+	if (!pool->cpu_slots)
 		return -ENOMEM;
-	head->slot_sizes = (int *)&head->cpu_slots[head->nr_cpus];
 
 	/* initialize per-cpu slots */
-	rc = objpool_init_percpu_slots(head, nr_objs, context, objinit);
+	rc = objpool_init_percpu_slots(pool, context, objinit);
 	if (rc)
-		objpool_fini_percpu_slots(head);
+		objpool_fini_percpu_slots(pool);
 	else
-		refcount_set(&head->ref, nr_objs + 1);
+		refcount_set(&pool->ref, nr_objs + 1);
 
 	return rc;
 }
 EXPORT_SYMBOL_GPL(objpool_init);
 
 /* adding object to slot, abort if the slot was already full */
-static inline int objpool_try_add_slot(void *obj, struct objpool_slot *os)
+static inline int objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu)
 {
-	uint32_t *ages = SLOT_AGES(os);
-	void **ents = SLOT_ENTS(os);
-	uint32_t head, tail;
+	struct objpool_slot *slot = pool->cpu_slots[cpu];
+	uint32_t tail, next;
 
 	do {
-		/* perform memory loading for both head and tail */
-		head = READ_ONCE(os->head);
-		tail = READ_ONCE(os->tail);
-		/* just abort if slot is full */
-		if (tail - head > os->mask)
-			return -ENOENT;
-		/* try to extend tail by 1 using CAS to avoid races */
-		if (try_cmpxchg_acquire(&os->tail, &tail, tail + 1))
-			break;
-	} while (1);
+		uint32_t head = READ_ONCE(slot->head);
 
-	/* the tail-th of slot is reserved for the given obj */
-	WRITE_ONCE(ents[tail & os->mask], obj);
-	/* update epoch id to make this object available for pop() */
-	smp_store_release(&ages[tail & os->mask], tail);
+		tail = READ_ONCE(slot->tail);
+		next = tail + 1;
+
+		/* This must never happen because capacity >= N + 1 */
+		if (WARN_ON_ONCE((next > head && next - head > pool->nr_objs) ||
+				 (next < head && next > head + pool->nr_objs)))
+			return -EINVAL;
+
+	} while (!try_cmpxchg_acquire(&slot->tail, &tail, next));
+
+	WRITE_ONCE(slot->entries[tail & slot->mask], obj);
 	return 0;
 }
 
 /* reclaim an object to object pool */
-int objpool_push(void *obj, struct objpool_head *oh)
+int objpool_push(void *obj, struct objpool_head *pool)
 {
 	unsigned long flags;
-	int cpu, rc;
+	int rc;
 
 	/* disable local irq to avoid preemption & interruption */
 	raw_local_irq_save(flags);
-	cpu = raw_smp_processor_id();
-	do {
-		rc = objpool_try_add_slot(obj, oh->cpu_slots[cpu]);
-		if (!rc)
-			break;
-		cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
-	} while (1);
+	rc = objpool_try_add_slot(obj, pool, raw_smp_processor_id());
 	raw_local_irq_restore(flags);
 
 	return rc;
 }
 EXPORT_SYMBOL_GPL(objpool_push);
 
-/* drop the allocated object, rather reclaim it to objpool */
-int objpool_drop(void *obj, struct objpool_head *head)
-{
-	if (!obj || !head)
-		return -EINVAL;
-
-	if (refcount_dec_and_test(&head->ref)) {
-		objpool_free(head);
-		return 0;
-	}
-
-	return -EAGAIN;
-}
-EXPORT_SYMBOL_GPL(objpool_drop);
-
 /* try to retrieve object from slot */
-static inline void *objpool_try_get_slot(struct objpool_slot *os)
+static inline void *objpool_try_get_slot(struct objpool_slot *slot)
 {
-	uint32_t *ages = SLOT_AGES(os);
-	void **ents = SLOT_ENTS(os);
 	/* do memory load of head to local head */
-	uint32_t head = smp_load_acquire(&os->head);
+	uint32_t head = smp_load_acquire(&slot->head);
 
 	/* loop if slot isn't empty */
-	while (head != READ_ONCE(os->tail)) {
-		uint32_t id = head & os->mask, prev = head;
+	while (head != READ_ONCE(slot->tail)) {
 
 		/* do prefetching of object ents */
-		prefetch(&ents[id]);
-
-		/* check whether this item was ready for retrieval */
-		if (smp_load_acquire(&ages[id]) == head) {
-			/* node must have been udpated by push() */
-			void *node = READ_ONCE(ents[id]);
-			/* commit and move forward head of the slot */
-			if (try_cmpxchg_release(&os->head, &head, head + 1))
-				return node;
-			/* head was already updated by others */
-		}
+		prefetch(&slot->entries[head & slot->mask]);
+
+		/* commit and move forward head of the slot */
+		if (try_cmpxchg_release(&slot->head, &head, head + 1))
+			/*
+			 * TBD: check for wraparound of the tail/head counter
+			 * and warn if it is broken. But this happens only if
+			 * this process slows down a lot and another CPU
+			 * updates the head/tail just 2^32 + 1 times, and this
+			 * slot is empty.
+			 */
+			return slot->entries[head & slot->mask];
 
 		/* re-load head from memory and continue trying */
-		head = READ_ONCE(os->head);
-		/*
-		 * head stays unchanged, so it's very likely there's an
-		 * ongoing push() on other cpu nodes but yet not update
-		 * ages[] to mark it's completion
-		 */
-		if (head == prev)
-			break;
+		head = READ_ONCE(slot->head);
 	}
 
 	return NULL;
@@ -307,32 +229,42 @@ void *objpool_pop(struct objpool_head *head)
 EXPORT_SYMBOL_GPL(objpool_pop);
 
 /* release whole objpool forcely */
-void objpool_free(struct objpool_head *head)
+void objpool_free(struct objpool_head *pool)
 {
-	if (!head->cpu_slots)
+	if (!pool->cpu_slots)
 		return;
 
 	/* release percpu slots */
-	objpool_fini_percpu_slots(head);
+	objpool_fini_percpu_slots(pool);
 
 	/* call user's cleanup callback if provided */
-	if (head->release)
-		head->release(head, head->context);
+	if (pool->release)
+		pool->release(pool, pool->context);
 }
 EXPORT_SYMBOL_GPL(objpool_free);
 
-/* drop unused objects and defref objpool for releasing */
-void objpool_fini(struct objpool_head *head)
+/* drop the allocated object rather than reclaiming it to the objpool */
+int objpool_drop(void *obj, struct objpool_head *pool)
 {
-	void *nod;
+	if (!obj || !pool)
+		return -EINVAL;
 
-	do {
-		/* grab object from objpool and drop it */
-		nod = objpool_pop(head);
+	if (refcount_dec_and_test(&pool->ref)) {
+		objpool_free(pool);
+		return 0;
+	}
+
+	return -EAGAIN;
+}
+EXPORT_SYMBOL_GPL(objpool_drop);
+
+/* drop all unused objects and deref the objpool so it can be released */
+void objpool_fini(struct objpool_head *pool)
+{
+	void *obj;
 
-		/* drop the extra ref of objpool */
-		if (refcount_dec_and_test(&head->ref))
-			objpool_free(head);
-	} while (nod);
+	/* grab object from objpool and drop it */
+	while ((obj = objpool_pop(pool)))
+		objpool_drop(obj, pool);
 }
 EXPORT_SYMBOL_GPL(objpool_fini);
-- 
2.34.1


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v9 3/5] kprobes: kretprobe scalability improvement with objpool
  2023-09-05  1:52 ` [PATCH v9 3/5] kprobes: kretprobe scalability improvement with objpool wuqiang.matt
@ 2023-10-07  2:02   ` Masami Hiramatsu
  2023-10-08 18:31     ` wuqiang
  0 siblings, 1 reply; 25+ messages in thread
From: Masami Hiramatsu @ 2023-10-07  2:02 UTC (permalink / raw)
  To: wuqiang.matt
  Cc: linux-trace-kernel, davem, anil.s.keshavamurthy, naveen.n.rao,
	rostedt, peterz, akpm, sander, ebiggers, dan.j.williams,
	jpoimboe, linux-kernel, lkp, mattwu

On Tue,  5 Sep 2023 09:52:53 +0800
"wuqiang.matt" <wuqiang.matt@bytedance.com> wrote:

> kretprobe is using freelist to manage return-instances, but freelist,
> as LIFO queue based on singly linked list, scales badly and reduces
> the overall throughput of kretprobed routines, especially for high
> contention scenarios.
> 
> Here's a typical throughput test of sys_prctl (counts in 10 seconds,
> measured with perf stat -a -I 10000 -e syscalls:sys_enter_prctl):
> 
> OS: Debian 10 X86_64, Linux 6.5rc7 with freelist
> HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s
> 
>       1T       2T       4T       8T      16T      24T
> 24150045 29317964 15446741 12494489 18287272 18287272
>      32T      48T      64T      72T      96T     128T
> 16200682  1620068 11645677 11269858 10470118  9931051
> 
> This patch introduces objpool to kretprobe and rethook, with orginal
> freelist replaced and brings near-linear scalability to kretprobed
> routines. Tests of kretprobe throughput show the biggest ratio as
> 166.7x of the original freelist. Here's the comparison:
> 
>                   1T         2T         4T         8T        16T
> native:     41186213   82336866  164250978  328662645  658810299
> freelist:   24150045   29317964   15446741   12494489   18287272
> objpool:    24663700   49410571   98544977  197725940  396294399
>                  32T        48T        64T        96T       128T
> native:   1330338351 1969957941 2512291791 1514690434 2671040914
> freelist:   16200682   13737658   11645677   10470118    9931051
> objpool:    78673470 1177354156 1514690434 1604472348 1655086824
> 
> Tests on 96-core ARM64 system output similarly, but with the biggest
> ratio up to 454.8x:
> 
> OS: Debian 10 AARCH64, Linux 6.5rc7
> HW: Kunpeng-920 96 cores/2 sockets/4 NUMA nodes, DDR4 2933 MT/s
> 
>                   1T         2T         4T         8T        16T
> native: .   30066096   13733813  126194076  257447289  505800181
> freelist:   16152090   11064397   11124068    7215768    5663013
> objpool:    13733813   27749031   56540679  112291770  223482778
>                  24T        32T        48T        64T        96T
> native:    763305277 1015925192 1521075123 2033009392 3021013752
> freelist:    5015810    4602893    3766792    3382478    2945292
> objpool:   334605663  448310646  675018951  903449904 1339693418
> 

This looks good to me (and I have tested with updated objpool)

Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Wuqiang, can you update the above number with the simplified
objpool? I got better number (always 80% of the native performance)
with 128 node/probe.
(*) https://lore.kernel.org/all/20231003003923.eabc33bb3f4ffb8eac71f2af@kernel.org/

Thank you,

> Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com>
> ---
>  include/linux/kprobes.h | 11 ++---
>  include/linux/rethook.h | 16 ++-----
>  kernel/kprobes.c        | 93 +++++++++++++++++------------------------
>  kernel/trace/fprobe.c   | 32 ++++++--------
>  kernel/trace/rethook.c  | 90 ++++++++++++++++++---------------------
>  5 files changed, 98 insertions(+), 144 deletions(-)
> 
> diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
> index 85a64cb95d75..365eb092e9c4 100644
> --- a/include/linux/kprobes.h
> +++ b/include/linux/kprobes.h
> @@ -26,8 +26,7 @@
>  #include <linux/rcupdate.h>
>  #include <linux/mutex.h>
>  #include <linux/ftrace.h>
> -#include <linux/refcount.h>
> -#include <linux/freelist.h>
> +#include <linux/objpool.h>
>  #include <linux/rethook.h>
>  #include <asm/kprobes.h>
>  
> @@ -141,7 +140,7 @@ static inline bool kprobe_ftrace(struct kprobe *p)
>   */
>  struct kretprobe_holder {
>  	struct kretprobe	*rp;
> -	refcount_t		ref;
> +	struct objpool_head	pool;
>  };
>  
>  struct kretprobe {
> @@ -154,7 +153,6 @@ struct kretprobe {
>  #ifdef CONFIG_KRETPROBE_ON_RETHOOK
>  	struct rethook *rh;
>  #else
> -	struct freelist_head freelist;
>  	struct kretprobe_holder *rph;
>  #endif
>  };
> @@ -165,10 +163,7 @@ struct kretprobe_instance {
>  #ifdef CONFIG_KRETPROBE_ON_RETHOOK
>  	struct rethook_node node;
>  #else
> -	union {
> -		struct freelist_node freelist;
> -		struct rcu_head rcu;
> -	};
> +	struct rcu_head rcu;
>  	struct llist_node llist;
>  	struct kretprobe_holder *rph;
>  	kprobe_opcode_t *ret_addr;
> diff --git a/include/linux/rethook.h b/include/linux/rethook.h
> index 26b6f3c81a76..ce69b2b7bc35 100644
> --- a/include/linux/rethook.h
> +++ b/include/linux/rethook.h
> @@ -6,11 +6,10 @@
>  #define _LINUX_RETHOOK_H
>  
>  #include <linux/compiler.h>
> -#include <linux/freelist.h>
> +#include <linux/objpool.h>
>  #include <linux/kallsyms.h>
>  #include <linux/llist.h>
>  #include <linux/rcupdate.h>
> -#include <linux/refcount.h>
>  
>  struct rethook_node;
>  
> @@ -30,14 +29,12 @@ typedef void (*rethook_handler_t) (struct rethook_node *, void *, unsigned long,
>  struct rethook {
>  	void			*data;
>  	rethook_handler_t	handler;
> -	struct freelist_head	pool;
> -	refcount_t		ref;
> +	struct objpool_head	pool;
>  	struct rcu_head		rcu;
>  };
>  
>  /**
>   * struct rethook_node - The rethook shadow-stack entry node.
> - * @freelist: The freelist, linked to struct rethook::pool.
>   * @rcu: The rcu_head for deferred freeing.
>   * @llist: The llist, linked to a struct task_struct::rethooks.
>   * @rethook: The pointer to the struct rethook.
> @@ -48,20 +45,16 @@ struct rethook {
>   * on each entry of the shadow stack.
>   */
>  struct rethook_node {
> -	union {
> -		struct freelist_node freelist;
> -		struct rcu_head      rcu;
> -	};
> +	struct rcu_head		rcu;
>  	struct llist_node	llist;
>  	struct rethook		*rethook;
>  	unsigned long		ret_addr;
>  	unsigned long		frame;
>  };
>  
> -struct rethook *rethook_alloc(void *data, rethook_handler_t handler);
> +struct rethook *rethook_alloc(void *data, rethook_handler_t handler, int size, int num);
>  void rethook_stop(struct rethook *rh);
>  void rethook_free(struct rethook *rh);
> -void rethook_add_node(struct rethook *rh, struct rethook_node *node);
>  struct rethook_node *rethook_try_get(struct rethook *rh);
>  void rethook_recycle(struct rethook_node *node);
>  void rethook_hook(struct rethook_node *node, struct pt_regs *regs, bool mcount);
> @@ -98,4 +91,3 @@ void rethook_flush_task(struct task_struct *tk);
>  #endif
>  
>  #endif
> -
> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> index ca385b61d546..075a632e6c7c 100644
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -1877,13 +1877,27 @@ static struct notifier_block kprobe_exceptions_nb = {
>  #ifdef CONFIG_KRETPROBES
>  
>  #if !defined(CONFIG_KRETPROBE_ON_RETHOOK)
> +
> +/* callbacks for objpool of kretprobe instances */
> +static int kretprobe_init_inst(void *nod, void *context)
> +{
> +	struct kretprobe_instance *ri = nod;
> +
> +	ri->rph = context;
> +	return 0;
> +}
> +static int kretprobe_fini_pool(struct objpool_head *head, void *context)
> +{
> +	kfree(context);
> +	return 0;
> +}
> +
>  static void free_rp_inst_rcu(struct rcu_head *head)
>  {
>  	struct kretprobe_instance *ri = container_of(head, struct kretprobe_instance, rcu);
> +	struct kretprobe_holder *rph = ri->rph;
>  
> -	if (refcount_dec_and_test(&ri->rph->ref))
> -		kfree(ri->rph);
> -	kfree(ri);
> +	objpool_drop(ri, &rph->pool);
>  }
>  NOKPROBE_SYMBOL(free_rp_inst_rcu);
>  
> @@ -1892,7 +1906,7 @@ static void recycle_rp_inst(struct kretprobe_instance *ri)
>  	struct kretprobe *rp = get_kretprobe(ri);
>  
>  	if (likely(rp))
> -		freelist_add(&ri->freelist, &rp->freelist);
> +		objpool_push(ri, &rp->rph->pool);
>  	else
>  		call_rcu(&ri->rcu, free_rp_inst_rcu);
>  }
> @@ -1929,23 +1943,12 @@ NOKPROBE_SYMBOL(kprobe_flush_task);
>  
>  static inline void free_rp_inst(struct kretprobe *rp)
>  {
> -	struct kretprobe_instance *ri;
> -	struct freelist_node *node;
> -	int count = 0;
> -
> -	node = rp->freelist.head;
> -	while (node) {
> -		ri = container_of(node, struct kretprobe_instance, freelist);
> -		node = node->next;
> -
> -		kfree(ri);
> -		count++;
> -	}
> +	struct kretprobe_holder *rph = rp->rph;
>  
> -	if (refcount_sub_and_test(count, &rp->rph->ref)) {
> -		kfree(rp->rph);
> -		rp->rph = NULL;
> -	}
> +	if (!rph)
> +		return;
> +	rp->rph = NULL;
> +	objpool_fini(&rph->pool);
>  }
>  
>  /* This assumes the 'tsk' is the current task or the is not running. */
> @@ -2087,19 +2090,17 @@ NOKPROBE_SYMBOL(__kretprobe_trampoline_handler)
>  static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs)
>  {
>  	struct kretprobe *rp = container_of(p, struct kretprobe, kp);
> +	struct kretprobe_holder *rph = rp->rph;
>  	struct kretprobe_instance *ri;
> -	struct freelist_node *fn;
>  
> -	fn = freelist_try_get(&rp->freelist);
> -	if (!fn) {
> +	ri = objpool_pop(&rph->pool);
> +	if (!ri) {
>  		rp->nmissed++;
>  		return 0;
>  	}
>  
> -	ri = container_of(fn, struct kretprobe_instance, freelist);
> -
>  	if (rp->entry_handler && rp->entry_handler(ri, regs)) {
> -		freelist_add(&ri->freelist, &rp->freelist);
> +		objpool_push(ri, &rph->pool);
>  		return 0;
>  	}
>  
> @@ -2193,7 +2194,6 @@ int kprobe_on_func_entry(kprobe_opcode_t *addr, const char *sym, unsigned long o
>  int register_kretprobe(struct kretprobe *rp)
>  {
>  	int ret;
> -	struct kretprobe_instance *inst;
>  	int i;
>  	void *addr;
>  
> @@ -2227,20 +2227,12 @@ int register_kretprobe(struct kretprobe *rp)
>  		rp->maxactive = max_t(unsigned int, 10, 2*num_possible_cpus());
>  
>  #ifdef CONFIG_KRETPROBE_ON_RETHOOK
> -	rp->rh = rethook_alloc((void *)rp, kretprobe_rethook_handler);
> -	if (!rp->rh)
> -		return -ENOMEM;
> +	rp->rh = rethook_alloc((void *)rp, kretprobe_rethook_handler,
> +				sizeof(struct kretprobe_instance) +
> +				rp->data_size, rp->maxactive);
> +	if (IS_ERR(rp->rh))
> +		return PTR_ERR(rp->rh);
>  
> -	for (i = 0; i < rp->maxactive; i++) {
> -		inst = kzalloc(sizeof(struct kretprobe_instance) +
> -			       rp->data_size, GFP_KERNEL);
> -		if (inst == NULL) {
> -			rethook_free(rp->rh);
> -			rp->rh = NULL;
> -			return -ENOMEM;
> -		}
> -		rethook_add_node(rp->rh, &inst->node);
> -	}
>  	rp->nmissed = 0;
>  	/* Establish function entry probe point */
>  	ret = register_kprobe(&rp->kp);
> @@ -2249,25 +2241,18 @@ int register_kretprobe(struct kretprobe *rp)
>  		rp->rh = NULL;
>  	}
>  #else	/* !CONFIG_KRETPROBE_ON_RETHOOK */
> -	rp->freelist.head = NULL;
>  	rp->rph = kzalloc(sizeof(struct kretprobe_holder), GFP_KERNEL);
>  	if (!rp->rph)
>  		return -ENOMEM;
>  
> -	rp->rph->rp = rp;
> -	for (i = 0; i < rp->maxactive; i++) {
> -		inst = kzalloc(sizeof(struct kretprobe_instance) +
> -			       rp->data_size, GFP_KERNEL);
> -		if (inst == NULL) {
> -			refcount_set(&rp->rph->ref, i);
> -			free_rp_inst(rp);
> -			return -ENOMEM;
> -		}
> -		inst->rph = rp->rph;
> -		freelist_add(&inst->freelist, &rp->freelist);
> +	if (objpool_init(&rp->rph->pool, rp->maxactive, rp->data_size +
> +			sizeof(struct kretprobe_instance), GFP_KERNEL,
> +			rp->rph, kretprobe_init_inst, kretprobe_fini_pool)) {
> +		kfree(rp->rph);
> +		rp->rph = NULL;
> +		return -ENOMEM;
>  	}
> -	refcount_set(&rp->rph->ref, i);
> -
> +	rp->rph->rp = rp;
>  	rp->nmissed = 0;
>  	/* Establish function entry probe point */
>  	ret = register_kprobe(&rp->kp);
> diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
> index 3b21f4063258..f5bf98e6b2ac 100644
> --- a/kernel/trace/fprobe.c
> +++ b/kernel/trace/fprobe.c
> @@ -187,9 +187,9 @@ static void fprobe_init(struct fprobe *fp)
>  
>  static int fprobe_init_rethook(struct fprobe *fp, int num)
>  {
> -	int i, size;
> +	int size;
>  
> -	if (num < 0)
> +	if (num <= 0)
>  		return -EINVAL;
>  
>  	if (!fp->exit_handler) {
> @@ -202,29 +202,21 @@ static int fprobe_init_rethook(struct fprobe *fp, int num)
>  		size = fp->nr_maxactive;
>  	else
>  		size = num * num_possible_cpus() * 2;
> -	if (size < 0)
> +	if (size <= 0)
>  		return -E2BIG;
>  
> -	fp->rethook = rethook_alloc((void *)fp, fprobe_exit_handler);
> -	if (!fp->rethook)
> -		return -ENOMEM;
> -	for (i = 0; i < size; i++) {
> -		struct fprobe_rethook_node *node;
> -
> -		node = kzalloc(sizeof(*node) + fp->entry_data_size, GFP_KERNEL);
> -		if (!node) {
> -			rethook_free(fp->rethook);
> -			fp->rethook = NULL;
> -			return -ENOMEM;
> -		}
> -		rethook_add_node(fp->rethook, &node->node);
> -	}
> +	/* Initialize rethook */
> +	fp->rethook = rethook_alloc((void *)fp, fprobe_exit_handler,
> +				sizeof(struct fprobe_rethook_node), size);
> +	if (IS_ERR(fp->rethook))
> +		return PTR_ERR(fp->rethook);
> +
>  	return 0;
>  }
>  
>  static void fprobe_fail_cleanup(struct fprobe *fp)
>  {
> -	if (fp->rethook) {
> +	if (!IS_ERR_OR_NULL(fp->rethook)) {
>  		/* Don't need to cleanup rethook->handler because this is not used. */
>  		rethook_free(fp->rethook);
>  		fp->rethook = NULL;
> @@ -379,14 +371,14 @@ int unregister_fprobe(struct fprobe *fp)
>  	if (!fprobe_is_registered(fp))
>  		return -EINVAL;
>  
> -	if (fp->rethook)
> +	if (!IS_ERR_OR_NULL(fp->rethook))
>  		rethook_stop(fp->rethook);
>  
>  	ret = unregister_ftrace_function(&fp->ops);
>  	if (ret < 0)
>  		return ret;
>  
> -	if (fp->rethook)
> +	if (!IS_ERR_OR_NULL(fp->rethook))
>  		rethook_free(fp->rethook);
>  
>  	ftrace_free_filter(&fp->ops);
> diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
> index 5eb9b598f4e9..13c8e6773892 100644
> --- a/kernel/trace/rethook.c
> +++ b/kernel/trace/rethook.c
> @@ -9,6 +9,7 @@
>  #include <linux/rethook.h>
>  #include <linux/slab.h>
>  #include <linux/sort.h>
> +#include <linux/smp.h>
>  
>  /* Return hook list (shadow stack by list) */
>  
> @@ -36,21 +37,7 @@ void rethook_flush_task(struct task_struct *tk)
>  static void rethook_free_rcu(struct rcu_head *head)
>  {
>  	struct rethook *rh = container_of(head, struct rethook, rcu);
> -	struct rethook_node *rhn;
> -	struct freelist_node *node;
> -	int count = 1;
> -
> -	node = rh->pool.head;
> -	while (node) {
> -		rhn = container_of(node, struct rethook_node, freelist);
> -		node = node->next;
> -		kfree(rhn);
> -		count++;
> -	}
> -
> -	/* The rh->ref is the number of pooled node + 1 */
> -	if (refcount_sub_and_test(count, &rh->ref))
> -		kfree(rh);
> +	objpool_fini(&rh->pool);
>  }
>  
>  /**
> @@ -83,54 +70,62 @@ void rethook_free(struct rethook *rh)
>  	call_rcu(&rh->rcu, rethook_free_rcu);
>  }
>  
> +static int rethook_init_node(void *nod, void *context)
> +{
> +	struct rethook_node *node = nod;
> +
> +	node->rethook = context;
> +	return 0;
> +}
> +
> +static int rethook_fini_pool(struct objpool_head *head, void *context)
> +{
> +	kfree(context);
> +	return 0;
> +}
> +
>  /**
>   * rethook_alloc() - Allocate struct rethook.
>   * @data: a data to pass the @handler when hooking the return.
> - * @handler: the return hook callback function.
> + * @handler: the return hook callback function, must NOT be NULL
> + * @size: node size: rethook node and additional data
> + * @num: number of rethook nodes to be preallocated
>   *
>   * Allocate and initialize a new rethook with @data and @handler.
> - * Return NULL if memory allocation fails or @handler is NULL.
> + * Return pointer of new rethook, or error codes for failures.
> + *
>   * Note that @handler == NULL means this rethook is going to be freed.
>   */
> -struct rethook *rethook_alloc(void *data, rethook_handler_t handler)
> +struct rethook *rethook_alloc(void *data, rethook_handler_t handler,
> +			      int size, int num)
>  {
> -	struct rethook *rh = kzalloc(sizeof(struct rethook), GFP_KERNEL);
> +	struct rethook *rh;
>  
> -	if (!rh || !handler) {
> -		kfree(rh);
> -		return NULL;
> -	}
> +	if (!handler || num <= 0 || size < sizeof(struct rethook_node))
> +		return ERR_PTR(-EINVAL);
> +
> +	rh = kzalloc(sizeof(struct rethook), GFP_KERNEL);
> +	if (!rh)
> +		return ERR_PTR(-ENOMEM);
>  
>  	rh->data = data;
>  	rh->handler = handler;
> -	rh->pool.head = NULL;
> -	refcount_set(&rh->ref, 1);
>  
> +	/* initialize the objpool for rethook nodes */
> +	if (objpool_init(&rh->pool, num, size, GFP_KERNEL, rh,
> +			 rethook_init_node, rethook_fini_pool)) {
> +		kfree(rh);
> +		return ERR_PTR(-ENOMEM);
> +	}
>  	return rh;
>  }
>  
> -/**
> - * rethook_add_node() - Add a new node to the rethook.
> - * @rh: the struct rethook.
> - * @node: the struct rethook_node to be added.
> - *
> - * Add @node to @rh. User must allocate @node (as a part of user's
> - * data structure.) The @node fields are initialized in this function.
> - */
> -void rethook_add_node(struct rethook *rh, struct rethook_node *node)
> -{
> -	node->rethook = rh;
> -	freelist_add(&node->freelist, &rh->pool);
> -	refcount_inc(&rh->ref);
> -}
> -
>  static void free_rethook_node_rcu(struct rcu_head *head)
>  {
>  	struct rethook_node *node = container_of(head, struct rethook_node, rcu);
> +	struct rethook *rh = node->rethook;
>  
> -	if (refcount_dec_and_test(&node->rethook->ref))
> -		kfree(node->rethook);
> -	kfree(node);
> +	objpool_drop(node, &rh->pool);
>  }
>  
>  /**
> @@ -145,7 +140,7 @@ void rethook_recycle(struct rethook_node *node)
>  	lockdep_assert_preemption_disabled();
>  
>  	if (likely(READ_ONCE(node->rethook->handler)))
> -		freelist_add(&node->freelist, &node->rethook->pool);
> +		objpool_push(node, &node->rethook->pool);
>  	else
>  		call_rcu(&node->rcu, free_rethook_node_rcu);
>  }
> @@ -161,7 +156,6 @@ NOKPROBE_SYMBOL(rethook_recycle);
>  struct rethook_node *rethook_try_get(struct rethook *rh)
>  {
>  	rethook_handler_t handler = READ_ONCE(rh->handler);
> -	struct freelist_node *fn;
>  
>  	lockdep_assert_preemption_disabled();
>  
> @@ -178,11 +172,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
>  	if (unlikely(!rcu_is_watching()))
>  		return NULL;
>  
> -	fn = freelist_try_get(&rh->pool);
> -	if (!fn)
> -		return NULL;
> -
> -	return container_of(fn, struct rethook_node, freelist);
> +	return (struct rethook_node *)objpool_pop(&rh->pool);
>  }
>  NOKPROBE_SYMBOL(rethook_try_get);
>  
> -- 
> 2.40.1
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v9 3/5] kprobes: kretprobe scalability improvement with objpool
  2023-10-07  2:02   ` Masami Hiramatsu
@ 2023-10-08 18:31     ` wuqiang
  2023-10-08 23:20       ` Masami Hiramatsu
  0 siblings, 1 reply; 25+ messages in thread
From: wuqiang @ 2023-10-08 18:31 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: linux-trace-kernel, davem, anil.s.keshavamurthy, naveen.n.rao,
	rostedt, peterz, akpm, sander, ebiggers, dan.j.williams,
	jpoimboe, linux-kernel, lkp, mattwu

On 2023/10/7 10:02, Masami Hiramatsu (Google) wrote:
> On Tue,  5 Sep 2023 09:52:53 +0800
> "wuqiang.matt" <wuqiang.matt@bytedance.com> wrote:
> 
>> kretprobe is using freelist to manage return-instances, but freelist,
>> as LIFO queue based on singly linked list, scales badly and reduces
>> the overall throughput of kretprobed routines, especially for high
>> contention scenarios.
>>
>> Here's a typical throughput test of sys_prctl (counts in 10 seconds,
>> measured with perf stat -a -I 10000 -e syscalls:sys_enter_prctl):
>>
>> OS: Debian 10 X86_64, Linux 6.5rc7 with freelist
>> HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s
>>
>>        1T       2T       4T       8T      16T      24T
>> 24150045 29317964 15446741 12494489 18287272 18287272
>>       32T      48T      64T      72T      96T     128T
>> 16200682  1620068 11645677 11269858 10470118  9931051
>>
>> This patch introduces objpool to kretprobe and rethook, with orginal
>> freelist replaced and brings near-linear scalability to kretprobed
>> routines. Tests of kretprobe throughput show the biggest ratio as
>> 166.7x of the original freelist. Here's the comparison:
>>
>>                    1T         2T         4T         8T        16T
>> native:     41186213   82336866  164250978  328662645  658810299
>> freelist:   24150045   29317964   15446741   12494489   18287272
>> objpool:    24663700   49410571   98544977  197725940  396294399
>>                   32T        48T        64T        96T       128T
>> native:   1330338351 1969957941 2512291791 1514690434 2671040914
>> freelist:   16200682   13737658   11645677   10470118    9931051
>> objpool:    78673470 1177354156 1514690434 1604472348 1655086824
>>
>> Tests on 96-core ARM64 system output similarly, but with the biggest
>> ratio up to 454.8x:
>>
>> OS: Debian 10 AARCH64, Linux 6.5rc7
>> HW: Kunpeng-920 96 cores/2 sockets/4 NUMA nodes, DDR4 2933 MT/s
>>
>>                    1T         2T         4T         8T        16T
>> native: .   30066096   13733813  126194076  257447289  505800181
>> freelist:   16152090   11064397   11124068    7215768    5663013
>> objpool:    13733813   27749031   56540679  112291770  223482778
>>                   24T        32T        48T        64T        96T
>> native:    763305277 1015925192 1521075123 2033009392 3021013752
>> freelist:    5015810    4602893    3766792    3382478    2945292
>> objpool:   334605663  448310646  675018951  903449904 1339693418
>>
> 
> This looks good to me (and I have tested with updated objpool)
> 
> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> 
> Wuqiang, can you update the above number with the simplified
> objpool? I got better number (always 80% of the native performance)
> with 128 node/probe.
> (*) https://lore.kernel.org/all/20231003003923.eabc33bb3f4ffb8eac71f2af@kernel.org/

That's great. I'll prepare a new patch and try to set aside the testbeds for
another round of testing.

> Thank you,

Thanks for your effort. Sorry for the late response, just back from
a 'long' vacation.




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v9 0/5] lib,kprobes: kretprobe scalability improvement
  2023-09-23  8:57 ` [PATCH v9 0/5] lib,kprobes: kretprobe scalability improvement Masami Hiramatsu
@ 2023-10-08 18:33   ` wuqiang
  2023-10-08 23:17     ` Masami Hiramatsu
  0 siblings, 1 reply; 25+ messages in thread
From: wuqiang @ 2023-10-08 18:33 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: linux-trace-kernel, davem, anil.s.keshavamurthy, naveen.n.rao,
	rostedt, peterz, akpm, sander, ebiggers, dan.j.williams,
	jpoimboe, linux-kernel, lkp, mattwu

On 2023/9/23 16:57, Masami Hiramatsu (Google) wrote:
> Hi Wuqiang,
> 
> I dug my mail box and found this. Sorry for replying late.
> 
> On Tue,  5 Sep 2023 09:52:50 +0800
> "wuqiang.matt" <wuqiang.matt@bytedance.com> wrote:
> 
>> This patch series introduces a scalable and lockless ring-array based
>> object pool and replaces the original freelist (a LIFO queue based on
>> singly linked list) to improve scalability of kretprobed routines.
>>
>> v9:
>>    1) objpool: raw_local_irq_save/restore added to prevent interruption
>>
>>       To avoid possible ABA issues, we must ensure objpool_try_add_slot
>>       and objpool_try_add_slot are uninterruptible. If these operations
>>       are blocked or interrupted in the middle, other cores could overrun
>>       the same slot's ages[] of uint32, then after resuming back, the
>>       interrupted pop() or push() could see same value of ages[], which
>>       is a typical ABA problem though the possibility is small.
>>
>>       The pair of pop()/push() costs about 8.53 cpu cycles, measured
>>       by IACA (Intel Architecture Code Analyzer). That is, on a 4Ghz
>>       core dedicated for pop() & push(), theoretically it would only
>>       need 8.53 seconds to overflow a 32bit value. Testings upon Intel
>>       i7-10700 (2.90GHz) cost 71.88 seconds to overrun a 32bit integer.
> 
> What does this mean? This sounds like "There is a timing issue if it's fast enough".

Yes, that's why local irqs must be disabled. If push()/pop() is interrupted or
preempted for long enough (> 10 seconds in the extreme cases), other cores could
overrun the same 32-bit ages[] entry; after resuming execution, the interrupted
push() or pop() would see the same ages[] value without noticing the overrun,
which is a typical ABA problem.

Changing ages[] to 64-bit could be a solution, but that's inappropriate for
32-bit OSes and looks too heavy. With local irqs disabled, push() and pop() run
uninterrupted, thus the ABA issue is avoided.

A push() or pop() takes only ~4 cycles to complete in most use cases, so
raw_local_irq_save/restore are used instead of local_irq_save/restore to
minimize the overhead.
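
For reference, the window in question is just the short irq-off section of
the v9 push path (essentially as in patch 1/5, with objpool_try_add_slot()
being the single-CAS loop):

	int objpool_push(void *obj, struct objpool_head *oh)
	{
		unsigned long flags;
		int cpu, rc;

		/*
		 * With local irqs off, nothing on this cpu can delay us long
		 * enough for another core to wrap the 32-bit ages[] of the
		 * slot we are working on.
		 */
		raw_local_irq_save(flags);
		cpu = raw_smp_processor_id();
		do {
			rc = objpool_try_add_slot(obj, oh->cpu_slots[cpu]);
			if (!rc)
				break;
			cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
		} while (1);
		raw_local_irq_restore(flags);

		return rc;
	}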

> Let me review the patch itself.
> 
> Thanks,
> 
>>
>>    2) codes improvements: thanks to Masami for the thorough inspection
>>
>> v8:
>>    1) objpool: refcount added for objpool lifecycle management
>>
>> wuqiang.matt (5):
>>    lib: objpool added: ring-array based lockless MPMC
>>    lib: objpool test module added
>>    kprobes: kretprobe scalability improvement with objpool
>>    kprobes: freelist.h removed
>>    MAINTAINERS: objpool added
>>
>>   MAINTAINERS              |   7 +
>>   include/linux/freelist.h | 129 --------
>>   include/linux/kprobes.h  |  11 +-
>>   include/linux/objpool.h  | 174 ++++++++++
>>   include/linux/rethook.h  |  16 +-
>>   kernel/kprobes.c         |  93 +++---
>>   kernel/trace/fprobe.c    |  32 +-
>>   kernel/trace/rethook.c   |  90 +++--
>>   lib/Kconfig.debug        |  11 +
>>   lib/Makefile             |   4 +-
>>   lib/objpool.c            | 338 +++++++++++++++++++
>>   lib/test_objpool.c       | 689 +++++++++++++++++++++++++++++++++++++++
>>   12 files changed, 1320 insertions(+), 274 deletions(-)
>>   delete mode 100644 include/linux/freelist.h
>>   create mode 100644 include/linux/objpool.h
>>   create mode 100644 lib/objpool.c
>>   create mode 100644 lib/test_objpool.c
>>
>> -- 
>> 2.40.1
>>
> 
> 



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v9 1/5] lib: objpool added: ring-array based lockless MPMC
  2023-09-23  9:48   ` Masami Hiramatsu
@ 2023-10-08 18:40     ` wuqiang
  2023-10-09 14:19       ` Masami Hiramatsu
  0 siblings, 1 reply; 25+ messages in thread
From: wuqiang @ 2023-10-08 18:40 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: linux-trace-kernel, davem, anil.s.keshavamurthy, naveen.n.rao,
	rostedt, peterz, akpm, sander, ebiggers, dan.j.williams,
	jpoimboe, linux-kernel, lkp, mattwu

On 2023/9/23 17:48, Masami Hiramatsu (Google) wrote:
> Hi Wuqiang,
> 
> Sorry for replying later.
> 
> On Tue,  5 Sep 2023 09:52:51 +0800
> "wuqiang.matt" <wuqiang.matt@bytedance.com> wrote:
> 
>> The object pool is a scalable implementaion of high performance queue
>> for object allocation and reclamation, such as kretprobe instances.
>>
>> With leveraging percpu ring-array to mitigate the hot spot of memory
>> contention, it could deliver near-linear scalability for high parallel
>> scenarios. The objpool is best suited for following cases:
>> 1) Memory allocation or reclamation are prohibited or too expensive
>> 2) Consumers are of different priorities, such as irqs and threads
>>
>> Limitations:
>> 1) Maximum objects (capacity) is determined during pool initializing
>>     and can't be modified (extended) after objpool creation
> 
> So the pool size is fixed in initialization.

Right. The array size is rounded up to a power of 2, but the actual
number of objects (to be allocated) is exactly the value specified by
the user.
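
For example, with nr_objs = 100 the per-cpu ring capacity becomes
roundup_pow_of_two(100) = 128 entries, yet only 100 objects are pre-allocated
in total and distributed across the slots.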

> 
>> 2) The memory of objects won't be freed until objpool is finalized
>> 3) Object allocation (pop) may fail after trying all cpu slots
> 
> This means that object allocation will fail if the all pools are empty,
> right?

Yes, pop() will return NULL in this case. pop() only checks one round of
all the cpu slots.

The objpool might not actually be empty, since new objects could have been
pushed back in the meantime by other cores, which is natural for lockless
queues.
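
So callers just treat a NULL return as a transient miss and decide what to do
with it; e.g. the kretprobe pre-handler in patch 3/5 simply bumps the missed
counter:

	ri = objpool_pop(&rph->pool);
	if (!ri) {
		/* no instance available right now */
		rp->nmissed++;
		return 0;
	}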

> 
>>
>> Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com>
>> ---
>>   include/linux/objpool.h | 174 +++++++++++++++++++++
>>   lib/Makefile            |   2 +-
>>   lib/objpool.c           | 338 ++++++++++++++++++++++++++++++++++++++++
>>   3 files changed, 513 insertions(+), 1 deletion(-)
>>   create mode 100644 include/linux/objpool.h
>>   create mode 100644 lib/objpool.c
>>
>> diff --git a/include/linux/objpool.h b/include/linux/objpool.h
>> new file mode 100644
>> index 000000000000..33c832216b98
>> --- /dev/null
>> +++ b/include/linux/objpool.h
>> @@ -0,0 +1,174 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +
>> +#ifndef _LINUX_OBJPOOL_H
>> +#define _LINUX_OBJPOOL_H
>> +
>> +#include <linux/types.h>
>> +#include <linux/refcount.h>
>> +
>> +/*
>> + * objpool: ring-array based lockless MPMC queue
>> + *
>> + * Copyright: wuqiang.matt@bytedance.com
>> + *
>> + * The object pool is a scalable implementaion of high performance queue
>> + * for objects allocation and reclamation, such as kretprobe instances.
>> + *
>> + * With leveraging per-cpu ring-array to mitigate the hot spots of memory
>> + * contention, it could deliver near-linear scalability for high parallel
>> + * scenarios. The ring-array is compactly managed in a single cache-line
>> + * to benefit from warmed L1 cache for most cases (<= 4 objects per-core).
>> + * The body of pre-allocated objects is stored in continuous cache-lines
>> + * just after the ring-array.
> 
> I consider the size of entries may be big if we have larger number of
> CPU cores, like 72-cores. And if user specifies (2^n) + 1 entries.
> In this case, each CPU has (2^n - 1)/72 objects, but has 2^(n + 1)
> entries in ring buffer. So it should be noted.

Yes for the array size, since it's rounded up to a power of 2, but the
actual number of pre-allocated objects stays exactly as the user specified.

>> + *
>> + * The object pool is interrupt safe. Both allocation and reclamation
>> + * (object pop and push operations) can be preemptible or interruptable.
> 
> You've added raw_spinlock_disable/enable(), so it is not preemptible
> or interruptible anymore. (Anyway, caller doesn't take care of that)

Sure, this description is improper and unnecessary. It will be removed.

>> + *
>> + * It's best suited for following cases:
>> + * 1) Memory allocation or reclamation are prohibited or too expensive
>> + * 2) Consumers are of different priorities, such as irqs and threads
>> + *
>> + * Limitations:
>> + * 1) Maximum objects (capacity) is determined during pool initializing
>> + * 2) The memory of objects won't be freed until the pool is finalized
>> + * 3) Object allocation (pop) may fail after trying all cpu slots
>> + */
>> +
>> +/**
>> + * struct objpool_slot - percpu ring array of objpool
>> + * @head: head of the local ring array (to retrieve at)
>> + * @tail: tail of the local ring array (to append at)
>> + * @bits: log2 of capacity (for bitwise operations)
>> + * @mask: capacity - 1
> 
> These description does not give idea what those roles are.

I'll refine the description. objpool_slot is totally internal to objpool.

> 
>> + *
>> + * Represents a cpu-local array-based ring buffer, its size is specialized
>> + * during initialization of object pool. The percpu objpool slot is to be
>> + * allocated from local memory for NUMA system, and to be kept compact in
>> + * continuous memory: ages[] is stored just after the body of objpool_slot,
>> + * and then entries[]. ages[] describes revision of each item, solely used
>> + * to avoid ABA; entries[] contains pointers of the actual objects
>> + *
>> + * Layout of objpool_slot in memory:
>> + *
>> + * 64bit:
>> + *        4      8      12     16        32                 64
>> + * | head | tail | bits | mask | ages[4] | ents[4]: (8 * 4) | objects
>> + *
>> + * 32bit:
>> + *        4      8      12     16        32        48       64
>> + * | head | tail | bits | mask | ages[4] | ents[4] | unused | objects
> 
> Hm, the '4' here means number of objects after this objpool_slot?
> I don't recommend you to allocate several arraies after the header, instead,
> using another data structure like this;
> 
> |head|tail|bits|mask|ents[N]{age:4|offs:4}|padding|objects
> 
> here N means the number of total objects.

Sorry for the confusion, I will make it clearer. Here 4/8/.../64 are offsets
in bytes. The above is an example with the objpool_slot compacted into a single
cache line.
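
As a worked example of the 64-bit layout above with 4 objects per core:
16 bytes of head/tail/bits/mask + 4 * sizeof(uint32_t) for ages[] +
4 * sizeof(void *) for ents[] = 16 + 16 + 32 = 64 bytes, so the whole ring
metadata occupies exactly one cache line and the object bodies start right
after it.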

> 
> struct objpool_entry {
> 	uint32_t	age;
> 	void *	ptr;
> } __attribute__((packed,aligned(8))) ;
> 
>> + *
>> + */
>> +struct objpool_slot {
>> +	uint32_t                head;
>> +	uint32_t                tail;
>> +	uint32_t                bits;
>> +	uint32_t                mask;
> 
> 	struct objpool_entry	entries[];
> 
>> +} __packed;
> 
> Then, you don't need complex macros to access object, but you need only one
> inline function to get the actual object address.
> 
> static inline void *objpool_slot_object(struct objpool_slot *slot, int nth)
> {
> 	if (nth >= (1 << slot->bits))
> 		return NULL;
> 
> 	return (void *)((unsigned long)slot + slot->entries[nth].offs);
> }

The reason for these macros is to keep objpool_slot/ages/ents compacted into
hot cache lines and also to minimize the memory footprint.

objpool_head could be a better place to manage these pointers, similar to
cpu_slots. I'll recheck the overhead.
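
If those pointers do move into objpool_head, the per-cpu slot itself could
shrink to something like the following (rough sketch only; it matches what
your simplified diff already does with slot->mask and slot->entries[], with
the separate ages[] array dropped):

	struct objpool_slot {
		uint32_t	head;
		uint32_t	tail;
		uint32_t	bits;
		uint32_t	mask;
		void		*entries[];
	} __packed;

	/* object lookup then becomes a single indexed load */
	obj = slot->entries[head & slot->mask];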


>> +
>> +struct objpool_head;
>> +
>> +/*
>> + * caller-specified callback for object initial setup, it's only called
>> + * once for each object (just after the memory allocation of the object)
>> + */
>> +typedef int (*objpool_init_obj_cb)(void *obj, void *context);
>> +
>> +/* caller-specified cleanup callback for objpool destruction */
>> +typedef int (*objpool_fini_cb)(struct objpool_head *head, void *context);
>> +
>> +/**
>> + * struct objpool_head - object pooling metadata
>> + * @obj_size:   object & element size
> 
> What does the 'element' mean?

"object size" should be enough. "element" means object, so it's unnecessary.

> 
>> + * @nr_objs:    total objs (to be pre-allocated)
> 
> but all objects must be pre-allocated, right? then it is simply

Yes, all objects are pre-allocated for this implementation.

> 
> @nr_objs: the total number of objects in this objpool.
> 
>> + * @nr_cpus:    nr_cpu_ids
> 
> would we have to save it? or just use 'nr_cpu_ids'?

Yes, it's just a local copy of nr_cpu_ids, kept to make the members of
objpool_head 64-bit aligned (there would be a 4-byte hole anyway). And
possibly a small benefit from the value staying hot in the cache?

> 
>> + * @capacity:   max objects per cpuslot
> 
> what is 'cpuslot'?
> This seems the size of objpool_entry array in objpool_slot.

Yes, it should be "capacity per objpool_slot", i.e. "the maximum number of
objects that can be stored in an objpool_slot".

>> + * @gfp:        gfp flags for kmalloc & vmalloc
>> + * @ref:        refcount for objpool
>> + * @flags:      flags for objpool management
>> + * @cpu_slots:  array of percpu slots
>> + * @slot_sizes:	size in bytes of slots
>> + * @release:    resource cleanup callback
>> + * @context:    caller-provided context
>> + */
>> +struct objpool_head {
>> +	int                     obj_size;
>> +	int                     nr_objs;
>> +	int                     nr_cpus;
>> +	int                     capacity;
>> +	gfp_t                   gfp;
>> +	refcount_t              ref;
>> +	unsigned long           flags;
>> +	struct objpool_slot   **cpu_slots;
>> +	int                    *slot_sizes;
>> +	objpool_fini_cb         release;
>> +	void                   *context;
>> +};
>> +
>> +#define OBJPOOL_FROM_VMALLOC	(0x800000000)	/* objpool allocated from vmalloc area */
>> +#define OBJPOOL_HAVE_OBJECTS	(0x400000000)	/* objects allocated along with objpool */
>> +
>> +/**
>> + * objpool_init() - initialize objpool and pre-allocated objects
>> + * @head:    the object pool to be initialized, declared by caller
>> + * @nr_objs: total objects to be pre-allocated by this object pool
>> + * @object_size: size of an object (should be > 0)
>> + * @gfp:     flags for memory allocation (via kmalloc or vmalloc)
>> + * @context: user context for object initialization callback
>> + * @objinit: object initialization callback for extra setup
>> + * @release: cleanup callback for extra cleanup task
>> + *
>> + * return value: 0 for success, otherwise error code
>> + *
>> + * All pre-allocated objects are to be zeroed after memory allocation.
>> + * Caller could do extra initialization in objinit callback. objinit()
>> + * will be called just after slot allocation and will be only once for
>> + * each object. Since then the objpool won't touch any content of the
>> + * objects. It's caller's duty to perform reinitialization after each
>> + * pop (object allocation) or do clearance before each push (object
>> + * reclamation).
>> + */
>> +int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
>> +		 gfp_t gfp, void *context, objpool_init_obj_cb objinit,
>> +		 objpool_fini_cb release);
>> +
>> +/**
>> + * objpool_pop() - allocate an object from objpool
>> + * @head: object pool
>> + *
>> + * return value: object ptr or NULL if failed
>> + */
>> +void *objpool_pop(struct objpool_head *head);
>> +
>> +/**
>> + * objpool_push() - reclaim the object and return back to objpool
>> + * @obj:  object ptr to be pushed to objpool
>> + * @head: object pool
>> + *
>> + * return: 0 or error code (it fails only when user tries to push
>> + * the same object multiple times or wrong "objects" into objpool)
>> + */
>> +int objpool_push(void *obj, struct objpool_head *head);
>> +
>> +/**
>> + * objpool_drop() - discard the object and deref objpool
>> + * @obj:  object ptr to be discarded
>> + * @head: object pool
>> + *
>> + * return: 0 if objpool was released or error code
>> + */
>> +int objpool_drop(void *obj, struct objpool_head *head);
>> +
>> +/**
>> + * objpool_free() - release objpool forcely (all objects to be freed)
>> + * @head: object pool to be released
>> + */
>> +void objpool_free(struct objpool_head *head);
>> +
>> +/**
>> + * objpool_fini() - deref object pool (also releasing unused objects)
>> + * @head: object pool to be dereferenced
>> + */
>> +void objpool_fini(struct objpool_head *head);
>> +
>> +#endif /* _LINUX_OBJPOOL_H */
>> diff --git a/lib/Makefile b/lib/Makefile
>> index 1ffae65bb7ee..7a84c922d9ff 100644
>> --- a/lib/Makefile
>> +++ b/lib/Makefile
>> @@ -34,7 +34,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
>>   	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
>>   	 earlycpio.o seq_buf.o siphash.o dec_and_lock.o \
>>   	 nmi_backtrace.o win_minmax.o memcat_p.o \
>> -	 buildid.o
>> +	 buildid.o objpool.o
>>   
>>   lib-$(CONFIG_PRINTK) += dump_stack.o
>>   lib-$(CONFIG_SMP) += cpumask.o
>> diff --git a/lib/objpool.c b/lib/objpool.c
>> new file mode 100644
>> index 000000000000..22e752371820
>> --- /dev/null
>> +++ b/lib/objpool.c
>> @@ -0,0 +1,338 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +
>> +#include <linux/objpool.h>
>> +#include <linux/slab.h>
>> +#include <linux/vmalloc.h>
>> +#include <linux/atomic.h>
>> +#include <linux/prefetch.h>
>> +#include <linux/irqflags.h>
>> +#include <linux/cpumask.h>
>> +#include <linux/log2.h>
>> +
>> +/*
>> + * objpool: ring-array based lockless MPMC/FIFO queues
>> + *
>> + * Copyright: wuqiang.matt@bytedance.com
>> + */
>> +
>> +#define SLOT_AGES(s) ((uint32_t *)((char *)(s) + sizeof(struct objpool_slot)))
>> +#define SLOT_ENTS(s) ((void **)((char *)(s) + sizeof(struct objpool_slot) + \
>> +			(sizeof(uint32_t) << (s)->bits)))
>> +#define SLOT_OBJS(s) ((void *)((char *)(s) + sizeof(struct objpool_slot) + \
>> +			((sizeof(uint32_t) + sizeof(void *)) << (s)->bits)))
>> +#define SLOT_CORE(n) cpumask_nth((n) % num_possible_cpus(), cpu_possible_mask)
>> +
>> +/* compute the suitable num of objects to be managed per slot */
>> +static int objpool_nobjs(int size)
>> +{
>> +	return rounddown_pow_of_two((size - sizeof(struct objpool_slot)) /
>> +			(sizeof(uint32_t) + sizeof(void *)));
>> +}
>> +
>> +/* allocate and initialize percpu slots */
> 
> @head: the objpool_head for managing this objpool
> @nobjs: the total number of objects in this objpool
> @context: context data for @objinit
> @objinit: initialize callback for each object.

Got it. I didn't add the kernel-doc since objpool_init_percpu_slots is not public.

>> +static int
>> +objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
>> +			void *context, objpool_init_obj_cb objinit)
>> +{
>> +	int i, j, n, size, objsz, cpu = 0, nents = head->capacity;
> 
> 'nents' is *round up to the power of 2* of the total number of objects.
> 
>> +
>> +	/* aligned object size by sizeof(void *) */
>> +	objsz = ALIGN(head->obj_size, sizeof(void *));
>> +	/* shall we allocate objects along with percpu-slot */
>> +	if (objsz)
>> +		head->flags |= OBJPOOL_HAVE_OBJECTS;
> 
> Is there any chance that objsz == 0?

No chance. We always require a non-zero objsz. Will update in the next version.

> 
>> +
>> +	/* vmalloc is used in default to allocate percpu-slots */
>> +	if (!(head->gfp & GFP_ATOMIC))
>> +		head->flags |= OBJPOOL_FROM_VMALLOC;
>> +
>> +	for (i = 0; i < head->nr_cpus; i++) {
>> +		struct objpool_slot *os;
>> +
>> +		/* skip the cpus which could never be present */
>> +		if (!cpu_possible(i))
>> +			continue;
>> +
>> +		/* compute how many objects to be managed by this slot */
> 
> "to be managed"? or "to be allocated with"?
> It seems all objects are possible to be managed by each slot.

Right. "to be allocated with" is preferable. Thanks.

>> +		n = nobjs / num_possible_cpus();
>> +		if (cpu < (nobjs % num_possible_cpus()))
>> +			n++;
>> +		size = sizeof(struct objpool_slot) + sizeof(void *) * nents +
>> +		       sizeof(uint32_t) * nents + objsz * n;
>> +
>> +		/*
>> +		 * here we allocate percpu-slot & objects together in a single
>> +		 * allocation, taking advantage of warm caches and TLB hits as
>> +		 * vmalloc always aligns the request size to pages
> 
> "Since the objpool_entry array in the slot is mostly accessed from the
>   i-th CPU, it should be allocated from the memory node for that CPU."
> 
> I think the reason of the memory node allocation is mainly for reducing the
> penalty of the cache-miss, since it will be bigger if running on NUMA.

Right, NUMA is addressed by objpool_slot. The above description is meant to explain
why a single memory allocation is used (rather than multiple). I'll try to make it clearer.

> 
>> +		 */
>> +		if (head->flags & OBJPOOL_FROM_VMALLOC)
>> +			os = __vmalloc_node(size, sizeof(void *), head->gfp,
>> +				cpu_to_node(i), __builtin_return_address(0));
>> +		else
>> +			os = kmalloc_node(size, head->gfp, cpu_to_node(i));
>> +		if (!os)
>> +			return -ENOMEM;
>> +
>> +		/* initialize percpu slot for the i-th slot */
>> +		memset(os, 0, size);
>> +		os->bits = ilog2(head->capacity);
>> +		os->mask = head->capacity - 1;
>> +		head->cpu_slots[i] = os;
>> +		head->slot_sizes[i] = size;
>> +		cpu = cpu + 1;
>> +
>> +		/*
>> +		 * manually set head & tail to avoid possible conflict:
>> +		 * We assume that the head item is ready for retrieval
>> +		 * iff head is equal to ages[head & mask]. but ages is
>> +		 * initialized as 0, so in view of the caller of pop(),
>> +		 * the 1st item (0th) is always ready, but the reality
>> +		 * could be: push() is stalled before the final update,
>> +		 * thus the item being inserted will be lost forever
>> +		 */
>> +		os->head = os->tail = head->capacity;
>> +
>> +		if (!objsz)
>> +			continue;
> 
> Is it possible? and for what?

Will be removed in next version.

> 
>> +
>> +		for (j = 0; j < n; j++) {
>> +			uint32_t *ages = SLOT_AGES(os);
>> +			void **ents = SLOT_ENTS(os);
>> +			void *obj = SLOT_OBJS(os) + j * objsz;
>> +			uint32_t ie = os->tail & os->mask;
>> +
>> +			/* perform object initialization */
>> +			if (objinit) {
>> +				int rc = objinit(obj, context);
>> +				if (rc)
>> +					return rc;
>> +			}
>> +
>> +			/* add obj into the ring array */
>> +			ents[ie] = obj;
>> +			ages[ie] = os->tail;
>> +			os->tail++;
>> +			head->nr_objs++;
>> +		}
> 
> To simplify the code, this loop should be another static function.

I'll reconsider the implementation. The repeated computation of ages/ents inside
the loop should be avoided too.
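Something along these lines, perhaps (a rough sketch only, reusing the names from
the patch above; it could also become the separate helper you suggested):

		/* computed once per slot instead of once per object */
		uint32_t *ages = SLOT_AGES(os);
		void **ents = SLOT_ENTS(os);

		for (j = 0; j < n; j++) {
			void *obj = SLOT_OBJS(os) + j * objsz;
			uint32_t ie = os->tail & os->mask;

			/* perform object initialization */
			if (objinit) {
				int rc = objinit(obj, context);
				if (rc)
					return rc;
			}

			/* add obj into the ring array */
			ents[ie] = obj;
			ages[ie] = os->tail;
			os->tail++;
			head->nr_objs++;
		}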

> 
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +/* cleanup all percpu slots of the object pool */
>> +static void objpool_fini_percpu_slots(struct objpool_head *head)
>> +{
>> +	int i;
>> +
>> +	if (!head->cpu_slots)
>> +		return;
>> +
>> +	for (i = 0; i < head->nr_cpus; i++) {
>> +		if (!head->cpu_slots[i])
>> +			continue;
>> +		if (head->flags & OBJPOOL_FROM_VMALLOC)
>> +			vfree(head->cpu_slots[i]);
>> +		else
>> +			kfree(head->cpu_slots[i]);
>> +	}
>> +	kfree(head->cpu_slots);
>> +	head->cpu_slots = NULL;
>> +	head->slot_sizes = NULL;
>> +}
>> +
>> +/* initialize object pool and pre-allocate objects */
>> +int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
>> +		gfp_t gfp, void *context, objpool_init_obj_cb objinit,
>> +		objpool_fini_cb release)
>> +{
>> +	int nents, rc;
>> +
>> +	/* check input parameters */
>> +	if (nr_objs <= 0 || object_size <= 0)
>> +		return -EINVAL;
>> +
>> +	/* calculate percpu slot size (rounded to pow of 2) */
>> +	nents = max_t(int, roundup_pow_of_two(nr_objs),
>> +			objpool_nobjs(L1_CACHE_BYTES));
>> +
>> +	/* initialize objpool head */
>> +	memset(head, 0, sizeof(struct objpool_head));
>> +	head->nr_cpus = nr_cpu_ids;
>> +	head->obj_size = object_size;
>> +	head->capacity = nents;
>> +	head->gfp = gfp & ~__GFP_ZERO;
>> +	head->context = context;
>> +	head->release = release;
>> +
>> +	/* allocate array for percpu slots */
>> +	head->cpu_slots = kzalloc(head->nr_cpus * sizeof(void *) +
>> +			       head->nr_cpus * sizeof(int), head->gfp);
>> +	if (!head->cpu_slots)
>> +		return -ENOMEM;
>> +	head->slot_sizes = (int *)&head->cpu_slots[head->nr_cpus];
>> +
>> +	/* initialize per-cpu slots */
>> +	rc = objpool_init_percpu_slots(head, nr_objs, context, objinit);
>> +	if (rc)
>> +		objpool_fini_percpu_slots(head);
>> +	else
>> +		refcount_set(&head->ref, nr_objs + 1);
>> +
>> +	return rc;
>> +}
>> +EXPORT_SYMBOL_GPL(objpool_init);
>> +
>> +/* adding object to slot, abort if the slot was already full */
>> +static inline int objpool_try_add_slot(void *obj, struct objpool_slot *os)
>> +{
>> +	uint32_t *ages = SLOT_AGES(os);
>> +	void **ents = SLOT_ENTS(os);
>> +	uint32_t head, tail;
>> +
>> +	do {
>> +		/* perform memory loading for both head and tail */
>> +		head = READ_ONCE(os->head);
>> +		tail = READ_ONCE(os->tail);
>> +		/* just abort if slot is full */
>> +		if (tail - head > os->mask)
>> +			return -ENOENT;
> 
> Is this really possible? The total number of objects must be less euqal to
> the os->mask. If it means a bug, please use WARN_ON_ONCE() here for debug.

Yes, it's a BUG and the caller's fault. When a user pushes a wrong object, or
pushes the same object repeatedly, it could break the objpool's consistency.
It's a choice between 'bad' and 'worse': returning an error rather than breaking
the consistency.

As you advised, it's better to fail loudly than to run on with a latent problem.
I'll update in the next version.

> 
>> +		/* try to extend tail by 1 using CAS to avoid races */
>> +		if (try_cmpxchg_acquire(&os->tail, &tail, tail + 1))
>> +			break;
>> +	} while (1);
> 
> "if(cond) ~ break; } while(1)" should be "} (!cond);"

I see. It was just to keep the code more "balanced" with the comments :)

> 
> And this seems to be buggy since tail++ can be 0, then "tail - head" < 0.
> 
> if (tail < head)
> 	if (WARN_ON_ONCE(tail + (UINT32_MAX - head) > os->mask))
> 		return -ENOENT;
> else
> 	if (WARN_ON_ONCE(tail - head > os->mask))
> 		return -ENOENT;

tail and head are unsigned 32-bit values, so "tail - head" is computed modulo
2^32 and always equals the number of objects currently held in (i.e. free for
allocation from) the objpool_slot, even when tail has already wrapped past 0
while head has not.
>> +
>> +	/* the tail-th of slot is reserved for the given obj */
>> +	WRITE_ONCE(ents[tail & os->mask], obj);
>> +	/* update epoch id to make this object available for pop() */
>> +	smp_store_release(&ages[tail & os->mask], tail);
> 
> Note: since the ages array size is the power of 2, this is just a
> (32 - os->bits) loop counter. :)
> 
>> +	return 0;
>> +}
>> +
>> +/* reclaim an object to object pool */
>> +int objpool_push(void *obj, struct objpool_head *oh)
>> +{
>> +	unsigned long flags;
>> +	int cpu, rc;
>> +
>> +	/* disable local irq to avoid preemption & interruption */
>> +	raw_local_irq_save(flags);
>> +	cpu = raw_smp_processor_id();
>> +	do {
>> +		rc = objpool_try_add_slot(obj, oh->cpu_slots[cpu]);
>> +		if (!rc)
>> +			break;
>> +		cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
>> +	} while (1);
> 
> Hmm, as I said, head->capacity >= nr_all_obj, this must not happen,
> we can always push it on this CPU's slot, right?

Right. If that happens, it means the user has made a mistake. I'll refine the
code.

> 
>> +	raw_local_irq_restore(flags);
>> +
>> +	return rc;
>> +}
>> +EXPORT_SYMBOL_GPL(objpool_push);
>> +
>> +/* drop the allocated object, rather reclaim it to objpool */
>> +int objpool_drop(void *obj, struct objpool_head *head)
>> +{
>> +	if (!obj || !head)
>> +		return -EINVAL;
>> +
>> +	if (refcount_dec_and_test(&head->ref)) {
>> +		objpool_free(head);
>> +		return 0;
>> +	}
>> +
>> +	return -EAGAIN;
>> +}
>> +EXPORT_SYMBOL_GPL(objpool_drop);
>> +
>> +/* try to retrieve object from slot */
>> +static inline void *objpool_try_get_slot(struct objpool_slot *os)
>> +{
>> +	uint32_t *ages = SLOT_AGES(os);
>> +	void **ents = SLOT_ENTS(os);
>> +	/* do memory load of head to local head */
>> +	uint32_t head = smp_load_acquire(&os->head);
>> +
>> +	/* loop if slot isn't empty */
>> +	while (head != READ_ONCE(os->tail)) {
>> +		uint32_t id = head & os->mask, prev = head;
>> +
>> +		/* do prefetching of object ents */
>> +		prefetch(&ents[id]);
>> +
>> +		/* check whether this item was ready for retrieval */
>> +		if (smp_load_acquire(&ages[id]) == head) {
> 
> We may not need this check, since we know head != tail and the
> sizeof(ages) >= nr_all_objs.
> 
> Hmm, I guess we can remove ages[] from the code.

It's just a quick peek to avoid an unnecessary call of try_cmpxchg_release.
try_cmpxchg_release is implemented with a heavyweight LOCK-prefixed instruction
(on x86), which can cause cache-line invalidation across CPUs.

> 
>> +			/* node must have been udpated by push() */
>> +			void *node = READ_ONCE(ents[id]);
> 
> Please use the same word for the same object.
> I mean this is not 'node' but 'object'.

Got it.

> 
>> +			/* commit and move forward head of the slot */
>> +			if (try_cmpxchg_release(&os->head, &head, head + 1))
>> +				return node;
>> +			/* head was already updated by others */
>> +		}
>> +
>> +		/* re-load head from memory and continue trying */
>> +		head = READ_ONCE(os->head);
>> +		/*
>> +		 * head stays unchanged, so it's very likely there's an
>> +		 * ongoing push() on other cpu nodes but yet not update
>> +		 * ages[] to mark it's completion
>> +		 */
>> +		if (head == prev)
>> +			break;
> 
> This is OK. If we always push() on the current CPU slot, and pop() from
> any cpus, we can try again here if this slot is not current CPU. But that
> maybe to much :P

Yes. In most cases, every CPU should only touch its own objpool_slot.

> Thank you,

Thanks for your time.

> 
>> +	}
>> +
>> +	return NULL;
>> +}
>> +
>> +/* allocate an object from object pool */
>> +void *objpool_pop(struct objpool_head *head)
>> +{
>> +	unsigned long flags;
>> +	int i, cpu;
>> +	void *obj = NULL;
>> +
>> +	/* disable local irq to avoid preemption & interruption */
>> +	raw_local_irq_save(flags);
>> +
>> +	cpu = raw_smp_processor_id();
>> +	for (i = 0; i < num_possible_cpus(); i++) {
>> +		obj = objpool_try_get_slot(head->cpu_slots[cpu]);
>> +		if (obj)
>> +			break;
>> +		cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
>> +	}
>> +	raw_local_irq_restore(flags);
>> +
>> +	return obj;
>> +}
>> +EXPORT_SYMBOL_GPL(objpool_pop);
>> +
>> +/* release whole objpool forcely */
>> +void objpool_free(struct objpool_head *head)
>> +{
>> +	if (!head->cpu_slots)
>> +		return;
>> +
>> +	/* release percpu slots */
>> +	objpool_fini_percpu_slots(head);
>> +
>> +	/* call user's cleanup callback if provided */
>> +	if (head->release)
>> +		head->release(head, head->context);
>> +}
>> +EXPORT_SYMBOL_GPL(objpool_free);
>> +
>> +/* drop unused objects and defref objpool for releasing */
>> +void objpool_fini(struct objpool_head *head)
>> +{
>> +	void *nod;
>> +
>> +	do {
>> +		/* grab object from objpool and drop it */
>> +		nod = objpool_pop(head);
>> +
>> +		/* drop the extra ref of objpool */
>> +		if (refcount_dec_and_test(&head->ref))
>> +			objpool_free(head);
>> +	} while (nod);
>> +}
>> +EXPORT_SYMBOL_GPL(objpool_fini);
>> -- 
>> 2.40.1
>>
> 
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v9 1/5] lib: objpool added: ring-array based lockless MPMC
  2023-09-25  9:42   ` Masami Hiramatsu
@ 2023-10-08 19:04     ` wuqiang
  2023-10-09  9:23     ` wuqiang
  1 sibling, 0 replies; 25+ messages in thread
From: wuqiang @ 2023-10-08 19:04 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: linux-trace-kernel, davem, anil.s.keshavamurthy, naveen.n.rao,
	rostedt, peterz, akpm, sander, ebiggers, dan.j.williams,
	jpoimboe, linux-kernel, lkp, mattwu

On 2023/9/25 17:42, Masami Hiramatsu (Google) wrote:
> Hi Wuqiang,
> 
> On Tue,  5 Sep 2023 09:52:51 +0800
> "wuqiang.matt" <wuqiang.matt@bytedance.com> wrote:
> 
>> The object pool is a scalable implementaion of high performance queue
>> for object allocation and reclamation, such as kretprobe instances.
>>
>> With leveraging percpu ring-array to mitigate the hot spot of memory
>> contention, it could deliver near-linear scalability for high parallel
>> scenarios. The objpool is best suited for following cases:
>> 1) Memory allocation or reclamation are prohibited or too expensive
>> 2) Consumers are of different priorities, such as irqs and threads
>>
>> Limitations:
>> 1) Maximum objects (capacity) is determined during pool initializing
>>     and can't be modified (extended) after objpool creation
>> 2) The memory of objects won't be freed until objpool is finalized
>> 3) Object allocation (pop) may fail after trying all cpu slots
> 
> I made a simplifying patch on this by (mainly) removing ages array.
> I also rename local variable to use more readable names, like slot,
> pool, and obj.
> 
> Here the results which I run the test_objpool.ko.
> 
> Original:
> [   50.500235] Summary of testcases:
> [   50.503139]     duration: 1027135us 	hits:   30628293 	miss:          0 	sync: percpu objpool
> [   50.510416]     duration: 1047977us 	hits:   30453023 	miss:          0 	sync: percpu objpool from vmalloc
> [   50.517421]     duration: 1047098us 	hits:   31145034 	miss:          0 	sync & hrtimer: percpu objpool
> [   50.524671]     duration: 1053233us 	hits:   30919640 	miss:          0 	sync & hrtimer: percpu objpool from vmalloc
> [   50.531382]     duration: 1055822us 	hits:    3407221 	miss:     830219 	sync overrun: percpu objpool
> [   50.538135]     duration: 1055998us 	hits:    3404624 	miss:     854160 	sync overrun: percpu objpool from vmalloc
> [   50.546686]     duration: 1046104us 	hits:   19464798 	miss:          0 	async: percpu objpool
> [   50.552989]     duration: 1033054us 	hits:   18957836 	miss:          0 	async: percpu objpool from vmalloc
> [   50.560289]     duration: 1041778us 	hits:   33806868 	miss:          0 	async & hrtimer: percpu objpool
> [   50.567425]     duration: 1048901us 	hits:   34211860 	miss:          0 	async & hrtimer: percpu objpool from vmalloc
> 
> Simplified:
> [   48.393236] Summary of testcases:
> [   48.395384]     duration: 1013002us 	hits:   29661448 	miss:          0 	sync: percpu objpool
> [   48.400351]     duration: 1057231us 	hits:   31035578 	miss:          0 	sync: percpu objpool from vmalloc
> [   48.405660]     duration: 1043196us 	hits:   30546652 	miss:          0 	sync & hrtimer: percpu objpool
> [   48.411216]     duration: 1047128us 	hits:   30601306 	miss:          0 	sync & hrtimer: percpu objpool from vmalloc
> [   48.417231]     duration: 1051695us 	hits:    3468287 	miss:     892881 	sync overrun: percpu objpool
> [   48.422405]     duration: 1054003us 	hits:    3549491 	miss:     898141 	sync overrun: percpu objpool from vmalloc
> [   48.428425]     duration: 1052946us 	hits:   19005228 	miss:          0 	async: percpu objpool
> [   48.433597]     duration: 1051757us 	hits:   19670866 	miss:          0 	async: percpu objpool from vmalloc
> [   48.439280]     duration: 1042951us 	hits:   37135332 	miss:          0 	async & hrtimer: percpu objpool
> [   48.445085]     duration: 1029803us 	hits:   37093248 	miss:          0 	async & hrtimer: percpu objpool from vmalloc
> 
> Can you test it too?

Sure, I'll do some testing locally, and also a performance comparison with
'entries' managed in objpool_head.

> 
> Thanks,

Thank you for your time, I appreciate it.

> 
>  From f1f442ff653e329839e5452b8b88463a80a12ff3 Mon Sep 17 00:00:00 2001
> From: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
> Date: Mon, 25 Sep 2023 16:07:12 +0900
> Subject: [PATCH] objpool: Simplify objpool by removing ages array
> 
> Simplify the objpool code by removing ages array from per-cpu slot.
> It chooses enough big capacity (which is a rounded up power of 2 value
> of nr_objects + 1) for the entries array, the tail never catch up to
> the head in per-cpu slot. Thus tail == head means the slot is empty.
> 
> This also uses consistent local variable names for pool, slot and obj.
> 
> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> ---
>   include/linux/objpool.h |  61 ++++----
>   lib/objpool.c           | 310 ++++++++++++++++------------------------
>   2 files changed, 147 insertions(+), 224 deletions(-)
> 
> diff --git a/include/linux/objpool.h b/include/linux/objpool.h
> index 33c832216b98..ecd5ecaffcd3 100644
> --- a/include/linux/objpool.h
> +++ b/include/linux/objpool.h
> @@ -38,33 +38,23 @@
>    * struct objpool_slot - percpu ring array of objpool
>    * @head: head of the local ring array (to retrieve at)
>    * @tail: tail of the local ring array (to append at)
> - * @bits: log2 of capacity (for bitwise operations)
> - * @mask: capacity - 1
> + * @mask: capacity of entries - 1
> + * @entries: object entries on this slot.
>    *
>    * Represents a cpu-local array-based ring buffer, its size is specialized
>    * during initialization of object pool. The percpu objpool slot is to be
>    * allocated from local memory for NUMA system, and to be kept compact in
> - * continuous memory: ages[] is stored just after the body of objpool_slot,
> - * and then entries[]. ages[] describes revision of each item, solely used
> - * to avoid ABA; entries[] contains pointers of the actual objects
> - *
> - * Layout of objpool_slot in memory:
> - *
> - * 64bit:
> - *        4      8      12     16        32                 64
> - * | head | tail | bits | mask | ages[4] | ents[4]: (8 * 4) | objects
> - *
> - * 32bit:
> - *        4      8      12     16        32        48       64
> - * | head | tail | bits | mask | ages[4] | ents[4] | unused | objects
> + * continuous memory: CPU assigned number of objects are stored just after
> + * the body of objpool_slot.
>    *
>    */
>   struct objpool_slot {
> -	uint32_t                head;
> -	uint32_t                tail;
> -	uint32_t                bits;
> -	uint32_t                mask;
> -} __packed;
> +	uint32_t	head;
> +	uint32_t	tail;
> +	uint32_t	mask;
> +	uint32_t	dummyr;
> +	void *		entries[];
> +};
>   
>   struct objpool_head;
>   
> @@ -82,12 +72,11 @@ typedef int (*objpool_fini_cb)(struct objpool_head *head, void *context);
>    * @obj_size:   object & element size
>    * @nr_objs:    total objs (to be pre-allocated)
>    * @nr_cpus:    nr_cpu_ids
> - * @capacity:   max objects per cpuslot
> + * @capacity:   max objects on percpu slot
>    * @gfp:        gfp flags for kmalloc & vmalloc
>    * @ref:        refcount for objpool
>    * @flags:      flags for objpool management
>    * @cpu_slots:  array of percpu slots
> - * @slot_sizes:	size in bytes of slots
>    * @release:    resource cleanup callback
>    * @context:    caller-provided context
>    */
> @@ -100,7 +89,6 @@ struct objpool_head {
>   	refcount_t              ref;
>   	unsigned long           flags;
>   	struct objpool_slot   **cpu_slots;
> -	int                    *slot_sizes;
>   	objpool_fini_cb         release;
>   	void                   *context;
>   };
> @@ -108,9 +96,12 @@ struct objpool_head {
>   #define OBJPOOL_FROM_VMALLOC	(0x800000000)	/* objpool allocated from vmalloc area */
>   #define OBJPOOL_HAVE_OBJECTS	(0x400000000)	/* objects allocated along with objpool */
>   
> +#define OBJPOOL_NR_OBJECT_MAX	(1 << 24)
> +#define OBJPOOL_OBJECT_SIZE_MAX	(1 << 16)
> +
>   /**
>    * objpool_init() - initialize objpool and pre-allocated objects
> - * @head:    the object pool to be initialized, declared by caller
> + * @pool:    the object pool to be initialized, declared by caller
>    * @nr_objs: total objects to be pre-allocated by this object pool
>    * @object_size: size of an object (should be > 0)
>    * @gfp:     flags for memory allocation (via kmalloc or vmalloc)
> @@ -128,47 +119,47 @@ struct objpool_head {
>    * pop (object allocation) or do clearance before each push (object
>    * reclamation).
>    */
> -int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
> +int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
>   		 gfp_t gfp, void *context, objpool_init_obj_cb objinit,
>   		 objpool_fini_cb release);
>   
>   /**
>    * objpool_pop() - allocate an object from objpool
> - * @head: object pool
> + * @pool: object pool
>    *
>    * return value: object ptr or NULL if failed
>    */
> -void *objpool_pop(struct objpool_head *head);
> +void *objpool_pop(struct objpool_head *pool);
>   
>   /**
>    * objpool_push() - reclaim the object and return back to objpool
>    * @obj:  object ptr to be pushed to objpool
> - * @head: object pool
> + * @pool: object pool
>    *
>    * return: 0 or error code (it fails only when user tries to push
>    * the same object multiple times or wrong "objects" into objpool)
>    */
> -int objpool_push(void *obj, struct objpool_head *head);
> +int objpool_push(void *obj, struct objpool_head *pool);
>   
>   /**
>    * objpool_drop() - discard the object and deref objpool
>    * @obj:  object ptr to be discarded
> - * @head: object pool
> + * @pool: object pool
>    *
>    * return: 0 if objpool was released or error code
>    */
> -int objpool_drop(void *obj, struct objpool_head *head);
> +int objpool_drop(void *obj, struct objpool_head *pool);
>   
>   /**
>    * objpool_free() - release objpool forcely (all objects to be freed)
> - * @head: object pool to be released
> + * @pool: object pool to be released
>    */
> -void objpool_free(struct objpool_head *head);
> +void objpool_free(struct objpool_head *pool);
>   
>   /**
>    * objpool_fini() - deref object pool (also releasing unused objects)
> - * @head: object pool to be dereferenced
> + * @pool: object pool to be dereferenced
>    */
> -void objpool_fini(struct objpool_head *head);
> +void objpool_fini(struct objpool_head *pool);
>   
>   #endif /* _LINUX_OBJPOOL_H */
> diff --git a/lib/objpool.c b/lib/objpool.c
> index 22e752371820..f8e8f70d7253 100644
> --- a/lib/objpool.c
> +++ b/lib/objpool.c
> @@ -15,104 +15,55 @@
>    * Copyright: wuqiang.matt@bytedance.com
>    */
>   
> -#define SLOT_AGES(s) ((uint32_t *)((char *)(s) + sizeof(struct objpool_slot)))
> -#define SLOT_ENTS(s) ((void **)((char *)(s) + sizeof(struct objpool_slot) + \
> -			(sizeof(uint32_t) << (s)->bits)))
> -#define SLOT_OBJS(s) ((void *)((char *)(s) + sizeof(struct objpool_slot) + \
> -			((sizeof(uint32_t) + sizeof(void *)) << (s)->bits)))
> -#define SLOT_CORE(n) cpumask_nth((n) % num_possible_cpus(), cpu_possible_mask)
> -
> -/* compute the suitable num of objects to be managed per slot */
> -static int objpool_nobjs(int size)
> -{
> -	return rounddown_pow_of_two((size - sizeof(struct objpool_slot)) /
> -			(sizeof(uint32_t) + sizeof(void *)));
> -}
> -
>   /* allocate and initialize percpu slots */
>   static int
> -objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
> -			void *context, objpool_init_obj_cb objinit)
> +objpool_init_percpu_slots(struct objpool_head *pool, void *context,
> +			  objpool_init_obj_cb objinit)
>   {
> -	int i, j, n, size, objsz, cpu = 0, nents = head->capacity;
> -
> -	/* aligned object size by sizeof(void *) */
> -	objsz = ALIGN(head->obj_size, sizeof(void *));
> -	/* shall we allocate objects along with percpu-slot */
> -	if (objsz)
> -		head->flags |= OBJPOOL_HAVE_OBJECTS;
> -
> -	/* vmalloc is used in default to allocate percpu-slots */
> -	if (!(head->gfp & GFP_ATOMIC))
> -		head->flags |= OBJPOOL_FROM_VMALLOC;
> -
> -	for (i = 0; i < head->nr_cpus; i++) {
> -		struct objpool_slot *os;
> +	int i, j, n, size, slot_size, cpu_count = 0;
> +	struct objpool_slot *slot;
>   
> +	for (i = 0; i < pool->nr_cpus; i++) {
>   		/* skip the cpus which could never be present */
>   		if (!cpu_possible(i))
>   			continue;
>   
>   		/* compute how many objects to be managed by this slot */
> -		n = nobjs / num_possible_cpus();
> -		if (cpu < (nobjs % num_possible_cpus()))
> +		n = pool->nr_objs / num_possible_cpus();
> +		if (cpu_count < (pool->nr_objs % num_possible_cpus()))
>   			n++;
> -		size = sizeof(struct objpool_slot) + sizeof(void *) * nents +
> -		       sizeof(uint32_t) * nents + objsz * n;
> +		cpu_count++;
> +
> +		slot_size = struct_size(slot, entries, pool->capacity);
> +		size = slot_size + pool->obj_size * n;
>   
>   		/*
>   		 * here we allocate percpu-slot & objects together in a single
> -		 * allocation, taking advantage of warm caches and TLB hits as
> -		 * vmalloc always aligns the request size to pages
> +		 * allocation, taking advantage on NUMA.
>   		 */
> -		if (head->flags & OBJPOOL_FROM_VMALLOC)
> -			os = __vmalloc_node(size, sizeof(void *), head->gfp,
> +		if (pool->flags & OBJPOOL_FROM_VMALLOC)
> +			slot = __vmalloc_node(size, sizeof(void *), pool->gfp,
>   				cpu_to_node(i), __builtin_return_address(0));
>   		else
> -			os = kmalloc_node(size, head->gfp, cpu_to_node(i));
> -		if (!os)
> +			slot = kmalloc_node(size, pool->gfp, cpu_to_node(i));
> +		if (!slot)
>   			return -ENOMEM;
>   
>   		/* initialize percpu slot for the i-th slot */
> -		memset(os, 0, size);
> -		os->bits = ilog2(head->capacity);
> -		os->mask = head->capacity - 1;
> -		head->cpu_slots[i] = os;
> -		head->slot_sizes[i] = size;
> -		cpu = cpu + 1;
> -
> -		/*
> -		 * manually set head & tail to avoid possible conflict:
> -		 * We assume that the head item is ready for retrieval
> -		 * iff head is equal to ages[head & mask]. but ages is
> -		 * initialized as 0, so in view of the caller of pop(),
> -		 * the 1st item (0th) is always ready, but the reality
> -		 * could be: push() is stalled before the final update,
> -		 * thus the item being inserted will be lost forever
> -		 */
> -		os->head = os->tail = head->capacity;
> -
> -		if (!objsz)
> -			continue;
> +		memset(slot, 0, size);
> +		slot->mask = pool->capacity - 1;
> +		pool->cpu_slots[i] = slot;
>   
>   		for (j = 0; j < n; j++) {
> -			uint32_t *ages = SLOT_AGES(os);
> -			void **ents = SLOT_ENTS(os);
> -			void *obj = SLOT_OBJS(os) + j * objsz;
> -			uint32_t ie = os->tail & os->mask;
> +			void *obj = (void *)slot + slot_size + pool->obj_size * j;
>   
> -			/* perform object initialization */
>   			if (objinit) {
>   				int rc = objinit(obj, context);
>   				if (rc)
>   					return rc;
>   			}
> -
> -			/* add obj into the ring array */
> -			ents[ie] = obj;
> -			ages[ie] = os->tail;
> -			os->tail++;
> -			head->nr_objs++;
> +			slot->entries[j] = obj;
> +			slot->tail++;
>   		}
>   	}
>   
> @@ -120,164 +71,135 @@ objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
>   }
>   
>   /* cleanup all percpu slots of the object pool */
> -static void objpool_fini_percpu_slots(struct objpool_head *head)
> +static void objpool_fini_percpu_slots(struct objpool_head *pool)
>   {
>   	int i;
>   
> -	if (!head->cpu_slots)
> +	if (!pool->cpu_slots)
>   		return;
>   
> -	for (i = 0; i < head->nr_cpus; i++) {
> -		if (!head->cpu_slots[i])
> -			continue;
> -		if (head->flags & OBJPOOL_FROM_VMALLOC)
> -			vfree(head->cpu_slots[i]);
> -		else
> -			kfree(head->cpu_slots[i]);
> -	}
> -	kfree(head->cpu_slots);
> -	head->cpu_slots = NULL;
> -	head->slot_sizes = NULL;
> +	for (i = 0; i < pool->nr_cpus; i++)
> +		kvfree(pool->cpu_slots[i]);
> +	kfree(pool->cpu_slots);
>   }
>   
>   /* initialize object pool and pre-allocate objects */
> -int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
> +int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
>   		gfp_t gfp, void *context, objpool_init_obj_cb objinit,
>   		objpool_fini_cb release)
>   {
>   	int nents, rc;
>   
>   	/* check input parameters */
> -	if (nr_objs <= 0 || object_size <= 0)
> +	if (nr_objs <= 0 || object_size <= 0 ||
> +	    nr_objs > OBJPOOL_NR_OBJECT_MAX || object_size > OBJPOOL_OBJECT_SIZE_MAX)
> +		return -EINVAL;
> +
> +	/* Align up to unsigned long size */
> +	object_size = ALIGN(object_size, sizeof(unsigned long));
> +
> +	/*
> +	 * To avoid filling up the entries array in the per-cpu slot,
> +	 * use the power of 2 which is more than N + 1. Thus, the tail never
> +	 * catch up to the pool, and able to use pool/tail as the sequencial
> +	 * number.
> +	 */
> +	nents = roundup_pow_of_two(nr_objs + 1);
> +	if (!nents)
>   		return -EINVAL;
>   
> -	/* calculate percpu slot size (rounded to pow of 2) */
> -	nents = max_t(int, roundup_pow_of_two(nr_objs),
> -			objpool_nobjs(L1_CACHE_BYTES));
> -
> -	/* initialize objpool head */
> -	memset(head, 0, sizeof(struct objpool_head));
> -	head->nr_cpus = nr_cpu_ids;
> -	head->obj_size = object_size;
> -	head->capacity = nents;
> -	head->gfp = gfp & ~__GFP_ZERO;
> -	head->context = context;
> -	head->release = release;
> -
> -	/* allocate array for percpu slots */
> -	head->cpu_slots = kzalloc(head->nr_cpus * sizeof(void *) +
> -			       head->nr_cpus * sizeof(int), head->gfp);
> -	if (!head->cpu_slots)
> +	/* initialize objpool pool */
> +	memset(pool, 0, sizeof(struct objpool_head));
> +	pool->nr_cpus = nr_cpu_ids;
> +	pool->obj_size = object_size;
> +	pool->nr_objs = nr_objs;
> +	/* just prevent to fullfill the per-cpu ring array */
> +	pool->capacity = nents;
> +	pool->gfp = gfp & ~__GFP_ZERO;
> +	pool->context = context;
> +	pool->release = release;
> +	/* vmalloc is used in default to allocate percpu-slots */
> +	if (!(pool->gfp & GFP_ATOMIC))
> +		pool->flags |= OBJPOOL_FROM_VMALLOC;
> +
> +	pool->cpu_slots = kzalloc(pool->nr_cpus * sizeof(void *), pool->gfp);
> +	if (!pool->cpu_slots)
>   		return -ENOMEM;
> -	head->slot_sizes = (int *)&head->cpu_slots[head->nr_cpus];
>   
>   	/* initialize per-cpu slots */
> -	rc = objpool_init_percpu_slots(head, nr_objs, context, objinit);
> +	rc = objpool_init_percpu_slots(pool, context, objinit);
>   	if (rc)
> -		objpool_fini_percpu_slots(head);
> +		objpool_fini_percpu_slots(pool);
>   	else
> -		refcount_set(&head->ref, nr_objs + 1);
> +		refcount_set(&pool->ref, nr_objs + 1);
>   
>   	return rc;
>   }
>   EXPORT_SYMBOL_GPL(objpool_init);
>   
>   /* adding object to slot, abort if the slot was already full */
> -static inline int objpool_try_add_slot(void *obj, struct objpool_slot *os)
> +static inline int objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu)
>   {
> -	uint32_t *ages = SLOT_AGES(os);
> -	void **ents = SLOT_ENTS(os);
> -	uint32_t head, tail;
> +	struct objpool_slot *slot = pool->cpu_slots[cpu];
> +	uint32_t tail, next;
>   
>   	do {
> -		/* perform memory loading for both head and tail */
> -		head = READ_ONCE(os->head);
> -		tail = READ_ONCE(os->tail);
> -		/* just abort if slot is full */
> -		if (tail - head > os->mask)
> -			return -ENOENT;
> -		/* try to extend tail by 1 using CAS to avoid races */
> -		if (try_cmpxchg_acquire(&os->tail, &tail, tail + 1))
> -			break;
> -	} while (1);
> +		uint32_t head = READ_ONCE(slot->head);
>   
> -	/* the tail-th of slot is reserved for the given obj */
> -	WRITE_ONCE(ents[tail & os->mask], obj);
> -	/* update epoch id to make this object available for pop() */
> -	smp_store_release(&ages[tail & os->mask], tail);
> +		tail = READ_ONCE(slot->tail);
> +		next = tail + 1;
> +
> +		/* This must never happen because capacity >= N + 1 */
> +		if (WARN_ON_ONCE((next > head && next - head > pool->nr_objs) ||
> +				 (next < head && next > head + pool->nr_objs)))
> +			return -EINVAL;
> +
> +	} while (!try_cmpxchg_acquire(&slot->tail, &tail, next));
> +
> +	WRITE_ONCE(slot->entries[tail & slot->mask], obj);
>   	return 0;
>   }
>   
>   /* reclaim an object to object pool */
> -int objpool_push(void *obj, struct objpool_head *oh)
> +int objpool_push(void *obj, struct objpool_head *pool)
>   {
>   	unsigned long flags;
> -	int cpu, rc;
> +	int rc;
>   
>   	/* disable local irq to avoid preemption & interruption */
>   	raw_local_irq_save(flags);
> -	cpu = raw_smp_processor_id();
> -	do {
> -		rc = objpool_try_add_slot(obj, oh->cpu_slots[cpu]);
> -		if (!rc)
> -			break;
> -		cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
> -	} while (1);
> +	rc = objpool_try_add_slot(obj, pool, raw_smp_processor_id());
>   	raw_local_irq_restore(flags);
>   
>   	return rc;
>   }
>   EXPORT_SYMBOL_GPL(objpool_push);
>   
> -/* drop the allocated object, rather reclaim it to objpool */
> -int objpool_drop(void *obj, struct objpool_head *head)
> -{
> -	if (!obj || !head)
> -		return -EINVAL;
> -
> -	if (refcount_dec_and_test(&head->ref)) {
> -		objpool_free(head);
> -		return 0;
> -	}
> -
> -	return -EAGAIN;
> -}
> -EXPORT_SYMBOL_GPL(objpool_drop);
> -
>   /* try to retrieve object from slot */
> -static inline void *objpool_try_get_slot(struct objpool_slot *os)
> +static inline void *objpool_try_get_slot(struct objpool_slot *slot)
>   {
> -	uint32_t *ages = SLOT_AGES(os);
> -	void **ents = SLOT_ENTS(os);
>   	/* do memory load of head to local head */
> -	uint32_t head = smp_load_acquire(&os->head);
> +	uint32_t head = smp_load_acquire(&slot->head);
>   
>   	/* loop if slot isn't empty */
> -	while (head != READ_ONCE(os->tail)) {
> -		uint32_t id = head & os->mask, prev = head;
> +	while (head != READ_ONCE(slot->tail)) {
>   
>   		/* do prefetching of object ents */
> -		prefetch(&ents[id]);
> -
> -		/* check whether this item was ready for retrieval */
> -		if (smp_load_acquire(&ages[id]) == head) {
> -			/* node must have been udpated by push() */
> -			void *node = READ_ONCE(ents[id]);
> -			/* commit and move forward head of the slot */
> -			if (try_cmpxchg_release(&os->head, &head, head + 1))
> -				return node;
> -			/* head was already updated by others */
> -		}
> +		prefetch(&slot->entries[head & slot->mask]);
> +
> +		/* commit and move forward head of the slot */
> +		if (try_cmpxchg_release(&slot->head, &head, head + 1))
> +			/*
> +			 * TBD: check overwrap the tail/head counter and warn
> +			 * if it is broken. But this happens only if this
> +			 * process slows down a lot and another CPU updates
> +			 * the haed/tail just 2^32 + 1 times, and this slot
> +			 * is empty.
> +			 */
> +			return slot->entries[head & slot->mask];
>   
>   		/* re-load head from memory and continue trying */
> -		head = READ_ONCE(os->head);
> -		/*
> -		 * head stays unchanged, so it's very likely there's an
> -		 * ongoing push() on other cpu nodes but yet not update
> -		 * ages[] to mark it's completion
> -		 */
> -		if (head == prev)
> -			break;
> +		head = READ_ONCE(slot->head);
>   	}
>   
>   	return NULL;
> @@ -307,32 +229,42 @@ void *objpool_pop(struct objpool_head *head)
>   EXPORT_SYMBOL_GPL(objpool_pop);
>   
>   /* release whole objpool forcely */
> -void objpool_free(struct objpool_head *head)
> +void objpool_free(struct objpool_head *pool)
>   {
> -	if (!head->cpu_slots)
> +	if (!pool->cpu_slots)
>   		return;
>   
>   	/* release percpu slots */
> -	objpool_fini_percpu_slots(head);
> +	objpool_fini_percpu_slots(pool);
>   
>   	/* call user's cleanup callback if provided */
> -	if (head->release)
> -		head->release(head, head->context);
> +	if (pool->release)
> +		pool->release(pool, pool->context);
>   }
>   EXPORT_SYMBOL_GPL(objpool_free);
>   
> -/* drop unused objects and defref objpool for releasing */
> -void objpool_fini(struct objpool_head *head)
> +/* drop the allocated object, rather reclaim it to objpool */
> +int objpool_drop(void *obj, struct objpool_head *pool)
>   {
> -	void *nod;
> +	if (!obj || !pool)
> +		return -EINVAL;
>   
> -	do {
> -		/* grab object from objpool and drop it */
> -		nod = objpool_pop(head);
> +	if (refcount_dec_and_test(&pool->ref)) {
> +		objpool_free(pool);
> +		return 0;
> +	}
> +
> +	return -EAGAIN;
> +}
> +EXPORT_SYMBOL_GPL(objpool_drop);
> +
> +/* drop unused objects and defref objpool for releasing */
> +void objpool_fini(struct objpool_head *pool)
> +{
> +	void *obj;
>   
> -		/* drop the extra ref of objpool */
> -		if (refcount_dec_and_test(&head->ref))
> -			objpool_free(head);
> -	} while (nod);
> +	/* grab object from objpool and drop it */
> +	while ((obj = objpool_pop(pool)))
> +		objpool_drop(obj, pool);
>   }
>   EXPORT_SYMBOL_GPL(objpool_fini);


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v9 0/5] lib,kprobes: kretprobe scalability improvement
  2023-10-08 18:33   ` wuqiang
@ 2023-10-08 23:17     ` Masami Hiramatsu
  0 siblings, 0 replies; 25+ messages in thread
From: Masami Hiramatsu @ 2023-10-08 23:17 UTC (permalink / raw)
  To: wuqiang
  Cc: linux-trace-kernel, davem, anil.s.keshavamurthy, naveen.n.rao,
	rostedt, peterz, akpm, sander, ebiggers, dan.j.williams,
	jpoimboe, linux-kernel, lkp, mattwu

On Mon, 9 Oct 2023 02:33:09 +0800
wuqiang <wuqiang.matt@bytedance.com> wrote:

> On 2023/9/23 16:57, Masami Hiramatsu (Google) wrote:
> > Hi Wuqiang,
> > 
> > I dug my mail box and found this. Sorry for replying late.
> > 
> > On Tue,  5 Sep 2023 09:52:50 +0800
> > "wuqiang.matt" <wuqiang.matt@bytedance.com> wrote:
> > 
> >> This patch series introduces a scalable and lockless ring-array based
> >> object pool and replaces the original freelist (a LIFO queue based on
> >> singly linked list) to improve scalability of kretprobed routines.
> >>
> >> v9:
> >>    1) objpool: raw_local_irq_save/restore added to prevent interruption
> >>
> >>       To avoid possible ABA issues, we must ensure objpool_try_add_slot
> >>       and objpool_try_add_slot are uninterruptible. If these operations
> >>       are blocked or interrupted in the middle, other cores could overrun
> >>       the same slot's ages[] of uint32, then after resuming back, the
> >>       interrupted pop() or push() could see same value of ages[], which
> >>       is a typical ABA problem though the possibility is small.
> >>
> >>       The pair of pop()/push() costs about 8.53 cpu cycles, measured
> >>       by IACA (Intel Architecture Code Analyzer). That is, on a 4Ghz
> >>       core dedicated for pop() & push(), theoretically it would only
> >>       need 8.53 seconds to overflow a 32bit value. Testings upon Intel
> >>       i7-10700 (2.90GHz) cost 71.88 seconds to overrun a 32bit integer.
> > 
> > What does this mean? This sounds like "There is a timing issue if it's enough fast".
> 
> Yes, that's why local irq must be disabled. If push()/pop() is interrupted or
> preempted long enough (> 10 seconds for the extreme cases), other CPUs could
> overrun the same 32-bit ages[] entry; after resuming execution, the interrupted
> push() or pop() would see the same value without noticing the wraparound, which
> is a typical ABA.

Yeah, indeed.
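
Just for reference, a rough back-of-envelope with the numbers from the cover
letter (treating ~8.53 cycles per push()/pop() pair and a 4GHz core as
assumptions):

	2^32 pairs * 8.53 cycles/pair / 4e9 cycles/sec ~= 9 seconds

so a push()/pop() stalled for on the order of ten seconds could indeed see a
reused ages[] value.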

> 
> Changing ages[] to 64-bit could be a solution, but it's inappropriate for a
> 32-bit OS and looks too heavy. With local irq disabled, push() or pop() is
> uninterrupted, thus the ABA is avoided.

As I found, ages[] can be removed. In that case, you would only need to widen
head and tail to 64 bits (but then the cmpxchg becomes more complicated).

Thank you,

> 
> push() or pop() consumes only ~4 cycles to complete (in most use cases), so
> raw_local_irq_save/restore are used instead of local_irq_save/restore to
> minimize the overhead.
> 
> > Let me reivew the patch itself.
> > 
> > Thanks,
> > 
> >>
> >>    2) codes improvements: thanks to Masami for the thorough inspection
> >>
> >> v8:
> >>    1) objpool: refcount added for objpool lifecycle management
> >>
> >> wuqiang.matt (5):
> >>    lib: objpool added: ring-array based lockless MPMC
> >>    lib: objpool test module added
> >>    kprobes: kretprobe scalability improvement with objpool
> >>    kprobes: freelist.h removed
> >>    MAINTAINERS: objpool added
> >>
> >>   MAINTAINERS              |   7 +
> >>   include/linux/freelist.h | 129 --------
> >>   include/linux/kprobes.h  |  11 +-
> >>   include/linux/objpool.h  | 174 ++++++++++
> >>   include/linux/rethook.h  |  16 +-
> >>   kernel/kprobes.c         |  93 +++---
> >>   kernel/trace/fprobe.c    |  32 +-
> >>   kernel/trace/rethook.c   |  90 +++--
> >>   lib/Kconfig.debug        |  11 +
> >>   lib/Makefile             |   4 +-
> >>   lib/objpool.c            | 338 +++++++++++++++++++
> >>   lib/test_objpool.c       | 689 +++++++++++++++++++++++++++++++++++++++
> >>   12 files changed, 1320 insertions(+), 274 deletions(-)
> >>   delete mode 100644 include/linux/freelist.h
> >>   create mode 100644 include/linux/objpool.h
> >>   create mode 100644 lib/objpool.c
> >>   create mode 100644 lib/test_objpool.c
> >>
> >> -- 
> >> 2.40.1
> >>
> > 
> > 
> 
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v9 3/5] kprobes: kretprobe scalability improvement with objpool
  2023-10-08 18:31     ` wuqiang
@ 2023-10-08 23:20       ` Masami Hiramatsu
  0 siblings, 0 replies; 25+ messages in thread
From: Masami Hiramatsu @ 2023-10-08 23:20 UTC (permalink / raw)
  To: wuqiang
  Cc: linux-trace-kernel, davem, anil.s.keshavamurthy, naveen.n.rao,
	rostedt, peterz, akpm, sander, ebiggers, dan.j.williams,
	jpoimboe, linux-kernel, lkp, mattwu

On Mon, 9 Oct 2023 02:31:34 +0800
wuqiang <wuqiang.matt@bytedance.com> wrote:

> On 2023/10/7 10:02, Masami Hiramatsu (Google) wrote:
> > On Tue,  5 Sep 2023 09:52:53 +0800
> > "wuqiang.matt" <wuqiang.matt@bytedance.com> wrote:
> > 
> >> kretprobe is using freelist to manage return-instances, but freelist,
> >> as LIFO queue based on singly linked list, scales badly and reduces
> >> the overall throughput of kretprobed routines, especially for high
> >> contention scenarios.
> >>
> >> Here's a typical throughput test of sys_prctl (counts in 10 seconds,
> >> measured with perf stat -a -I 10000 -e syscalls:sys_enter_prctl):
> >>
> >> OS: Debian 10 X86_64, Linux 6.5rc7 with freelist
> >> HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s
> >>
> >>        1T       2T       4T       8T      16T      24T
> >> 24150045 29317964 15446741 12494489 18287272 18287272
> >>       32T      48T      64T      72T      96T     128T
> >> 16200682  1620068 11645677 11269858 10470118  9931051
> >>
> >> This patch introduces objpool to kretprobe and rethook, with orginal
> >> freelist replaced and brings near-linear scalability to kretprobed
> >> routines. Tests of kretprobe throughput show the biggest ratio as
> >> 166.7x of the original freelist. Here's the comparison:
> >>
> >>                    1T         2T         4T         8T        16T
> >> native:     41186213   82336866  164250978  328662645  658810299
> >> freelist:   24150045   29317964   15446741   12494489   18287272
> >> objpool:    24663700   49410571   98544977  197725940  396294399
> >>                   32T        48T        64T        96T       128T
> >> native:   1330338351 1969957941 2512291791 1514690434 2671040914
> >> freelist:   16200682   13737658   11645677   10470118    9931051
> >> objpool:    78673470 1177354156 1514690434 1604472348 1655086824
> >>
> >> Tests on 96-core ARM64 system output similarly, but with the biggest
> >> ratio up to 454.8x:
> >>
> >> OS: Debian 10 AARCH64, Linux 6.5rc7
> >> HW: Kunpeng-920 96 cores/2 sockets/4 NUMA nodes, DDR4 2933 MT/s
> >>
> >>                    1T         2T         4T         8T        16T
> >> native: .   30066096   13733813  126194076  257447289  505800181
> >> freelist:   16152090   11064397   11124068    7215768    5663013
> >> objpool:    13733813   27749031   56540679  112291770  223482778
> >>                   24T        32T        48T        64T        96T
> >> native:    763305277 1015925192 1521075123 2033009392 3021013752
> >> freelist:    5015810    4602893    3766792    3382478    2945292
> >> objpool:   334605663  448310646  675018951  903449904 1339693418
> >>
> > 
> > This looks good to me (and I have tested with updated objpool)
> > 
> > Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > 
> > Wuqiang, can you update the above number with the simplified
> > objpool? I got better number (always 80% of the native performance)
> > with 128 node/probe.
> > (*) https://lore.kernel.org/all/20231003003923.eabc33bb3f4ffb8eac71f2af@kernel.org/
> 
> That's great. I'll prepare a new patch and try to spare the testbeds for
> another round of testing.

Thanks for updating! I'm looking forward to the next version :)

Thank you,

> 
> > Thank you,
> 
> Thanks for your effort. Sorry for the late response, just back from
> a 'long' vacation.
> 
> > 
> >> Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com>
> >> ---
> >>   include/linux/kprobes.h | 11 ++---
> >>   include/linux/rethook.h | 16 ++-----
> >>   kernel/kprobes.c        | 93 +++++++++++++++++------------------------
> >>   kernel/trace/fprobe.c   | 32 ++++++--------
> >>   kernel/trace/rethook.c  | 90 ++++++++++++++++++---------------------
> >>   5 files changed, 98 insertions(+), 144 deletions(-)
> >>
> >> diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
> >> index 85a64cb95d75..365eb092e9c4 100644
> >> --- a/include/linux/kprobes.h
> >> +++ b/include/linux/kprobes.h
> >> @@ -26,8 +26,7 @@
> >>   #include <linux/rcupdate.h>
> >>   #include <linux/mutex.h>
> >>   #include <linux/ftrace.h>
> >> -#include <linux/refcount.h>
> >> -#include <linux/freelist.h>
> >> +#include <linux/objpool.h>
> >>   #include <linux/rethook.h>
> >>   #include <asm/kprobes.h>
> >>   
> >> @@ -141,7 +140,7 @@ static inline bool kprobe_ftrace(struct kprobe *p)
> >>    */
> >>   struct kretprobe_holder {
> >>   	struct kretprobe	*rp;
> >> -	refcount_t		ref;
> >> +	struct objpool_head	pool;
> >>   };
> >>   
> >>   struct kretprobe {
> >> @@ -154,7 +153,6 @@ struct kretprobe {
> >>   #ifdef CONFIG_KRETPROBE_ON_RETHOOK
> >>   	struct rethook *rh;
> >>   #else
> >> -	struct freelist_head freelist;
> >>   	struct kretprobe_holder *rph;
> >>   #endif
> >>   };
> >> @@ -165,10 +163,7 @@ struct kretprobe_instance {
> >>   #ifdef CONFIG_KRETPROBE_ON_RETHOOK
> >>   	struct rethook_node node;
> >>   #else
> >> -	union {
> >> -		struct freelist_node freelist;
> >> -		struct rcu_head rcu;
> >> -	};
> >> +	struct rcu_head rcu;
> >>   	struct llist_node llist;
> >>   	struct kretprobe_holder *rph;
> >>   	kprobe_opcode_t *ret_addr;
> >> diff --git a/include/linux/rethook.h b/include/linux/rethook.h
> >> index 26b6f3c81a76..ce69b2b7bc35 100644
> >> --- a/include/linux/rethook.h
> >> +++ b/include/linux/rethook.h
> >> @@ -6,11 +6,10 @@
> >>   #define _LINUX_RETHOOK_H
> >>   
> >>   #include <linux/compiler.h>
> >> -#include <linux/freelist.h>
> >> +#include <linux/objpool.h>
> >>   #include <linux/kallsyms.h>
> >>   #include <linux/llist.h>
> >>   #include <linux/rcupdate.h>
> >> -#include <linux/refcount.h>
> >>   
> >>   struct rethook_node;
> >>   
> >> @@ -30,14 +29,12 @@ typedef void (*rethook_handler_t) (struct rethook_node *, void *, unsigned long,
> >>   struct rethook {
> >>   	void			*data;
> >>   	rethook_handler_t	handler;
> >> -	struct freelist_head	pool;
> >> -	refcount_t		ref;
> >> +	struct objpool_head	pool;
> >>   	struct rcu_head		rcu;
> >>   };
> >>   
> >>   /**
> >>    * struct rethook_node - The rethook shadow-stack entry node.
> >> - * @freelist: The freelist, linked to struct rethook::pool.
> >>    * @rcu: The rcu_head for deferred freeing.
> >>    * @llist: The llist, linked to a struct task_struct::rethooks.
> >>    * @rethook: The pointer to the struct rethook.
> >> @@ -48,20 +45,16 @@ struct rethook {
> >>    * on each entry of the shadow stack.
> >>    */
> >>   struct rethook_node {
> >> -	union {
> >> -		struct freelist_node freelist;
> >> -		struct rcu_head      rcu;
> >> -	};
> >> +	struct rcu_head		rcu;
> >>   	struct llist_node	llist;
> >>   	struct rethook		*rethook;
> >>   	unsigned long		ret_addr;
> >>   	unsigned long		frame;
> >>   };
> >>   
> >> -struct rethook *rethook_alloc(void *data, rethook_handler_t handler);
> >> +struct rethook *rethook_alloc(void *data, rethook_handler_t handler, int size, int num);
> >>   void rethook_stop(struct rethook *rh);
> >>   void rethook_free(struct rethook *rh);
> >> -void rethook_add_node(struct rethook *rh, struct rethook_node *node);
> >>   struct rethook_node *rethook_try_get(struct rethook *rh);
> >>   void rethook_recycle(struct rethook_node *node);
> >>   void rethook_hook(struct rethook_node *node, struct pt_regs *regs, bool mcount);
> >> @@ -98,4 +91,3 @@ void rethook_flush_task(struct task_struct *tk);
> >>   #endif
> >>   
> >>   #endif
> >> -
> >> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> >> index ca385b61d546..075a632e6c7c 100644
> >> --- a/kernel/kprobes.c
> >> +++ b/kernel/kprobes.c
> >> @@ -1877,13 +1877,27 @@ static struct notifier_block kprobe_exceptions_nb = {
> >>   #ifdef CONFIG_KRETPROBES
> >>   
> >>   #if !defined(CONFIG_KRETPROBE_ON_RETHOOK)
> >> +
> >> +/* callbacks for objpool of kretprobe instances */
> >> +static int kretprobe_init_inst(void *nod, void *context)
> >> +{
> >> +	struct kretprobe_instance *ri = nod;
> >> +
> >> +	ri->rph = context;
> >> +	return 0;
> >> +}
> >> +static int kretprobe_fini_pool(struct objpool_head *head, void *context)
> >> +{
> >> +	kfree(context);
> >> +	return 0;
> >> +}
> >> +
> >>   static void free_rp_inst_rcu(struct rcu_head *head)
> >>   {
> >>   	struct kretprobe_instance *ri = container_of(head, struct kretprobe_instance, rcu);
> >> +	struct kretprobe_holder *rph = ri->rph;
> >>   
> >> -	if (refcount_dec_and_test(&ri->rph->ref))
> >> -		kfree(ri->rph);
> >> -	kfree(ri);
> >> +	objpool_drop(ri, &rph->pool);
> >>   }
> >>   NOKPROBE_SYMBOL(free_rp_inst_rcu);
> >>   
> >> @@ -1892,7 +1906,7 @@ static void recycle_rp_inst(struct kretprobe_instance *ri)
> >>   	struct kretprobe *rp = get_kretprobe(ri);
> >>   
> >>   	if (likely(rp))
> >> -		freelist_add(&ri->freelist, &rp->freelist);
> >> +		objpool_push(ri, &rp->rph->pool);
> >>   	else
> >>   		call_rcu(&ri->rcu, free_rp_inst_rcu);
> >>   }
> >> @@ -1929,23 +1943,12 @@ NOKPROBE_SYMBOL(kprobe_flush_task);
> >>   
> >>   static inline void free_rp_inst(struct kretprobe *rp)
> >>   {
> >> -	struct kretprobe_instance *ri;
> >> -	struct freelist_node *node;
> >> -	int count = 0;
> >> -
> >> -	node = rp->freelist.head;
> >> -	while (node) {
> >> -		ri = container_of(node, struct kretprobe_instance, freelist);
> >> -		node = node->next;
> >> -
> >> -		kfree(ri);
> >> -		count++;
> >> -	}
> >> +	struct kretprobe_holder *rph = rp->rph;
> >>   
> >> -	if (refcount_sub_and_test(count, &rp->rph->ref)) {
> >> -		kfree(rp->rph);
> >> -		rp->rph = NULL;
> >> -	}
> >> +	if (!rph)
> >> +		return;
> >> +	rp->rph = NULL;
> >> +	objpool_fini(&rph->pool);
> >>   }
> >>   
> >>   /* This assumes the 'tsk' is the current task or the is not running. */
> >> @@ -2087,19 +2090,17 @@ NOKPROBE_SYMBOL(__kretprobe_trampoline_handler)
> >>   static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs)
> >>   {
> >>   	struct kretprobe *rp = container_of(p, struct kretprobe, kp);
> >> +	struct kretprobe_holder *rph = rp->rph;
> >>   	struct kretprobe_instance *ri;
> >> -	struct freelist_node *fn;
> >>   
> >> -	fn = freelist_try_get(&rp->freelist);
> >> -	if (!fn) {
> >> +	ri = objpool_pop(&rph->pool);
> >> +	if (!ri) {
> >>   		rp->nmissed++;
> >>   		return 0;
> >>   	}
> >>   
> >> -	ri = container_of(fn, struct kretprobe_instance, freelist);
> >> -
> >>   	if (rp->entry_handler && rp->entry_handler(ri, regs)) {
> >> -		freelist_add(&ri->freelist, &rp->freelist);
> >> +		objpool_push(ri, &rph->pool);
> >>   		return 0;
> >>   	}
> >>   
> >> @@ -2193,7 +2194,6 @@ int kprobe_on_func_entry(kprobe_opcode_t *addr, const char *sym, unsigned long o
> >>   int register_kretprobe(struct kretprobe *rp)
> >>   {
> >>   	int ret;
> >> -	struct kretprobe_instance *inst;
> >>   	int i;
> >>   	void *addr;
> >>   
> >> @@ -2227,20 +2227,12 @@ int register_kretprobe(struct kretprobe *rp)
> >>   		rp->maxactive = max_t(unsigned int, 10, 2*num_possible_cpus());
> >>   
> >>   #ifdef CONFIG_KRETPROBE_ON_RETHOOK
> >> -	rp->rh = rethook_alloc((void *)rp, kretprobe_rethook_handler);
> >> -	if (!rp->rh)
> >> -		return -ENOMEM;
> >> +	rp->rh = rethook_alloc((void *)rp, kretprobe_rethook_handler,
> >> +				sizeof(struct kretprobe_instance) +
> >> +				rp->data_size, rp->maxactive);
> >> +	if (IS_ERR(rp->rh))
> >> +		return PTR_ERR(rp->rh);
> >>   
> >> -	for (i = 0; i < rp->maxactive; i++) {
> >> -		inst = kzalloc(sizeof(struct kretprobe_instance) +
> >> -			       rp->data_size, GFP_KERNEL);
> >> -		if (inst == NULL) {
> >> -			rethook_free(rp->rh);
> >> -			rp->rh = NULL;
> >> -			return -ENOMEM;
> >> -		}
> >> -		rethook_add_node(rp->rh, &inst->node);
> >> -	}
> >>   	rp->nmissed = 0;
> >>   	/* Establish function entry probe point */
> >>   	ret = register_kprobe(&rp->kp);
> >> @@ -2249,25 +2241,18 @@ int register_kretprobe(struct kretprobe *rp)
> >>   		rp->rh = NULL;
> >>   	}
> >>   #else	/* !CONFIG_KRETPROBE_ON_RETHOOK */
> >> -	rp->freelist.head = NULL;
> >>   	rp->rph = kzalloc(sizeof(struct kretprobe_holder), GFP_KERNEL);
> >>   	if (!rp->rph)
> >>   		return -ENOMEM;
> >>   
> >> -	rp->rph->rp = rp;
> >> -	for (i = 0; i < rp->maxactive; i++) {
> >> -		inst = kzalloc(sizeof(struct kretprobe_instance) +
> >> -			       rp->data_size, GFP_KERNEL);
> >> -		if (inst == NULL) {
> >> -			refcount_set(&rp->rph->ref, i);
> >> -			free_rp_inst(rp);
> >> -			return -ENOMEM;
> >> -		}
> >> -		inst->rph = rp->rph;
> >> -		freelist_add(&inst->freelist, &rp->freelist);
> >> +	if (objpool_init(&rp->rph->pool, rp->maxactive, rp->data_size +
> >> +			sizeof(struct kretprobe_instance), GFP_KERNEL,
> >> +			rp->rph, kretprobe_init_inst, kretprobe_fini_pool)) {
> >> +		kfree(rp->rph);
> >> +		rp->rph = NULL;
> >> +		return -ENOMEM;
> >>   	}
> >> -	refcount_set(&rp->rph->ref, i);
> >> -
> >> +	rp->rph->rp = rp;
> >>   	rp->nmissed = 0;
> >>   	/* Establish function entry probe point */
> >>   	ret = register_kprobe(&rp->kp);
> >> diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
> >> index 3b21f4063258..f5bf98e6b2ac 100644
> >> --- a/kernel/trace/fprobe.c
> >> +++ b/kernel/trace/fprobe.c
> >> @@ -187,9 +187,9 @@ static void fprobe_init(struct fprobe *fp)
> >>   
> >>   static int fprobe_init_rethook(struct fprobe *fp, int num)
> >>   {
> >> -	int i, size;
> >> +	int size;
> >>   
> >> -	if (num < 0)
> >> +	if (num <= 0)
> >>   		return -EINVAL;
> >>   
> >>   	if (!fp->exit_handler) {
> >> @@ -202,29 +202,21 @@ static int fprobe_init_rethook(struct fprobe *fp, int num)
> >>   		size = fp->nr_maxactive;
> >>   	else
> >>   		size = num * num_possible_cpus() * 2;
> >> -	if (size < 0)
> >> +	if (size <= 0)
> >>   		return -E2BIG;
> >>   
> >> -	fp->rethook = rethook_alloc((void *)fp, fprobe_exit_handler);
> >> -	if (!fp->rethook)
> >> -		return -ENOMEM;
> >> -	for (i = 0; i < size; i++) {
> >> -		struct fprobe_rethook_node *node;
> >> -
> >> -		node = kzalloc(sizeof(*node) + fp->entry_data_size, GFP_KERNEL);
> >> -		if (!node) {
> >> -			rethook_free(fp->rethook);
> >> -			fp->rethook = NULL;
> >> -			return -ENOMEM;
> >> -		}
> >> -		rethook_add_node(fp->rethook, &node->node);
> >> -	}
> >> +	/* Initialize rethook */
> >> +	fp->rethook = rethook_alloc((void *)fp, fprobe_exit_handler,
> >> +				sizeof(struct fprobe_rethook_node), size);
> >> +	if (IS_ERR(fp->rethook))
> >> +		return PTR_ERR(fp->rethook);
> >> +
> >>   	return 0;
> >>   }
> >>   
> >>   static void fprobe_fail_cleanup(struct fprobe *fp)
> >>   {
> >> -	if (fp->rethook) {
> >> +	if (!IS_ERR_OR_NULL(fp->rethook)) {
> >>   		/* Don't need to cleanup rethook->handler because this is not used. */
> >>   		rethook_free(fp->rethook);
> >>   		fp->rethook = NULL;
> >> @@ -379,14 +371,14 @@ int unregister_fprobe(struct fprobe *fp)
> >>   	if (!fprobe_is_registered(fp))
> >>   		return -EINVAL;
> >>   
> >> -	if (fp->rethook)
> >> +	if (!IS_ERR_OR_NULL(fp->rethook))
> >>   		rethook_stop(fp->rethook);
> >>   
> >>   	ret = unregister_ftrace_function(&fp->ops);
> >>   	if (ret < 0)
> >>   		return ret;
> >>   
> >> -	if (fp->rethook)
> >> +	if (!IS_ERR_OR_NULL(fp->rethook))
> >>   		rethook_free(fp->rethook);
> >>   
> >>   	ftrace_free_filter(&fp->ops);
> >> diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
> >> index 5eb9b598f4e9..13c8e6773892 100644
> >> --- a/kernel/trace/rethook.c
> >> +++ b/kernel/trace/rethook.c
> >> @@ -9,6 +9,7 @@
> >>   #include <linux/rethook.h>
> >>   #include <linux/slab.h>
> >>   #include <linux/sort.h>
> >> +#include <linux/smp.h>
> >>   
> >>   /* Return hook list (shadow stack by list) */
> >>   
> >> @@ -36,21 +37,7 @@ void rethook_flush_task(struct task_struct *tk)
> >>   static void rethook_free_rcu(struct rcu_head *head)
> >>   {
> >>   	struct rethook *rh = container_of(head, struct rethook, rcu);
> >> -	struct rethook_node *rhn;
> >> -	struct freelist_node *node;
> >> -	int count = 1;
> >> -
> >> -	node = rh->pool.head;
> >> -	while (node) {
> >> -		rhn = container_of(node, struct rethook_node, freelist);
> >> -		node = node->next;
> >> -		kfree(rhn);
> >> -		count++;
> >> -	}
> >> -
> >> -	/* The rh->ref is the number of pooled node + 1 */
> >> -	if (refcount_sub_and_test(count, &rh->ref))
> >> -		kfree(rh);
> >> +	objpool_fini(&rh->pool);
> >>   }
> >>   
> >>   /**
> >> @@ -83,54 +70,62 @@ void rethook_free(struct rethook *rh)
> >>   	call_rcu(&rh->rcu, rethook_free_rcu);
> >>   }
> >>   
> >> +static int rethook_init_node(void *nod, void *context)
> >> +{
> >> +	struct rethook_node *node = nod;
> >> +
> >> +	node->rethook = context;
> >> +	return 0;
> >> +}
> >> +
> >> +static int rethook_fini_pool(struct objpool_head *head, void *context)
> >> +{
> >> +	kfree(context);
> >> +	return 0;
> >> +}
> >> +
> >>   /**
> >>    * rethook_alloc() - Allocate struct rethook.
> >>    * @data: a data to pass the @handler when hooking the return.
> >> - * @handler: the return hook callback function.
> >> + * @handler: the return hook callback function, must NOT be NULL
> >> + * @size: node size: rethook node and additional data
> >> + * @num: number of rethook nodes to be preallocated
> >>    *
> >>    * Allocate and initialize a new rethook with @data and @handler.
> >> - * Return NULL if memory allocation fails or @handler is NULL.
> >> + * Return pointer of new rethook, or error codes for failures.
> >> + *
> >>    * Note that @handler == NULL means this rethook is going to be freed.
> >>    */
> >> -struct rethook *rethook_alloc(void *data, rethook_handler_t handler)
> >> +struct rethook *rethook_alloc(void *data, rethook_handler_t handler,
> >> +			      int size, int num)
> >>   {
> >> -	struct rethook *rh = kzalloc(sizeof(struct rethook), GFP_KERNEL);
> >> +	struct rethook *rh;
> >>   
> >> -	if (!rh || !handler) {
> >> -		kfree(rh);
> >> -		return NULL;
> >> -	}
> >> +	if (!handler || num <= 0 || size < sizeof(struct rethook_node))
> >> +		return ERR_PTR(-EINVAL);
> >> +
> >> +	rh = kzalloc(sizeof(struct rethook), GFP_KERNEL);
> >> +	if (!rh)
> >> +		return ERR_PTR(-ENOMEM);
> >>   
> >>   	rh->data = data;
> >>   	rh->handler = handler;
> >> -	rh->pool.head = NULL;
> >> -	refcount_set(&rh->ref, 1);
> >>   
> >> +	/* initialize the objpool for rethook nodes */
> >> +	if (objpool_init(&rh->pool, num, size, GFP_KERNEL, rh,
> >> +			 rethook_init_node, rethook_fini_pool)) {
> >> +		kfree(rh);
> >> +		return ERR_PTR(-ENOMEM);
> >> +	}
> >>   	return rh;
> >>   }
> >>   
> >> -/**
> >> - * rethook_add_node() - Add a new node to the rethook.
> >> - * @rh: the struct rethook.
> >> - * @node: the struct rethook_node to be added.
> >> - *
> >> - * Add @node to @rh. User must allocate @node (as a part of user's
> >> - * data structure.) The @node fields are initialized in this function.
> >> - */
> >> -void rethook_add_node(struct rethook *rh, struct rethook_node *node)
> >> -{
> >> -	node->rethook = rh;
> >> -	freelist_add(&node->freelist, &rh->pool);
> >> -	refcount_inc(&rh->ref);
> >> -}
> >> -
> >>   static void free_rethook_node_rcu(struct rcu_head *head)
> >>   {
> >>   	struct rethook_node *node = container_of(head, struct rethook_node, rcu);
> >> +	struct rethook *rh = node->rethook;
> >>   
> >> -	if (refcount_dec_and_test(&node->rethook->ref))
> >> -		kfree(node->rethook);
> >> -	kfree(node);
> >> +	objpool_drop(node, &rh->pool);
> >>   }
> >>   
> >>   /**
> >> @@ -145,7 +140,7 @@ void rethook_recycle(struct rethook_node *node)
> >>   	lockdep_assert_preemption_disabled();
> >>   
> >>   	if (likely(READ_ONCE(node->rethook->handler)))
> >> -		freelist_add(&node->freelist, &node->rethook->pool);
> >> +		objpool_push(node, &node->rethook->pool);
> >>   	else
> >>   		call_rcu(&node->rcu, free_rethook_node_rcu);
> >>   }
> >> @@ -161,7 +156,6 @@ NOKPROBE_SYMBOL(rethook_recycle);
> >>   struct rethook_node *rethook_try_get(struct rethook *rh)
> >>   {
> >>   	rethook_handler_t handler = READ_ONCE(rh->handler);
> >> -	struct freelist_node *fn;
> >>   
> >>   	lockdep_assert_preemption_disabled();
> >>   
> >> @@ -178,11 +172,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
> >>   	if (unlikely(!rcu_is_watching()))
> >>   		return NULL;
> >>   
> >> -	fn = freelist_try_get(&rh->pool);
> >> -	if (!fn)
> >> -		return NULL;
> >> -
> >> -	return container_of(fn, struct rethook_node, freelist);
> >> +	return (struct rethook_node *)objpool_pop(&rh->pool);
> >>   }
> >>   NOKPROBE_SYMBOL(rethook_try_get);
> >>   
> >> -- 
> >> 2.40.1
> >>
> > 
> > 
> 
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v9 1/5] lib: objpool added: ring-array based lockless MPMC
  2023-09-25  9:42   ` Masami Hiramatsu
  2023-10-08 19:04     ` wuqiang
@ 2023-10-09  9:23     ` wuqiang
  2023-10-09 13:51       ` Masami Hiramatsu
  2023-10-12 14:02       ` Masami Hiramatsu
  1 sibling, 2 replies; 25+ messages in thread
From: wuqiang @ 2023-10-09  9:23 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: linux-trace-kernel, davem, anil.s.keshavamurthy, naveen.n.rao,
	rostedt, peterz, akpm, sander, ebiggers, dan.j.williams,
	jpoimboe, linux-kernel, lkp, mattwu

Hello Masami,

I just got time for the new patch and noticed that ages[] was removed. ages[] was
introduced as a kind of 2-phase commit to keep consistency and must be kept.

Consider the following 2 cases where two cpu nodes are operating on the same
objpool_slot simultaneously:

Case 1:

   NODE 1:                  NODE 2:
   push to an empty slot    pop will get wrong value

   try_cmpxchg_acquire(&slot->tail, &tail, next)
                            try_cmpxchg_release(&slot->head, &head, head + 1)
                            return slot->entries[head & slot->mask]
   WRITE_ONCE(slot->entries[tail & slot->mask], obj)


Case 2:

   NODE 1:                  NODE 2
   push to slot w/ 1 obj    pop will get wrong value

                            try_cmpxchg_release(&slot->head, &head, head + 1)
   try_cmpxchg_acquire(&slot->tail, &tail, next)
   WRITE_ONCE(slot->entries[tail & slot->mask], obj)
                            return slot->entries[head & slot->mask]
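
To make the role of ages[] concrete, here is a condensed sketch of the v9
push/pop fast paths (simplified from the original code quoted further below
as the removed lines of the simplification patch; the flat struct layout and
the names struct slot / slot_push / slot_pop are illustrative only, not the
exact kernel source). The smp_store_release() to ages[] is the second phase
of the commit: pop() only moves head forward after observing that store,
which closes the windows shown in both cases above.

struct slot {
        uint32_t  head;   /* consume at */
        uint32_t  tail;   /* produce at */
        uint32_t  mask;   /* capacity - 1 */
        uint32_t *ages;   /* per-entry commit marks */
        void    **ents;   /* object pointers */
};

static int slot_push(struct slot *s, void *obj)
{
        uint32_t head, tail;

        do {
                head = READ_ONCE(s->head);
                tail = READ_ONCE(s->tail);
                if (tail - head > s->mask)      /* ring is full */
                        return -ENOENT;
                /* phase 1: reserve index "tail" for this producer */
        } while (!try_cmpxchg_acquire(&s->tail, &tail, tail + 1));

        WRITE_ONCE(s->ents[tail & s->mask], obj);
        /* phase 2: publish; pop() may consume this index only from now on */
        smp_store_release(&s->ages[tail & s->mask], tail);
        return 0;
}

static void *slot_pop(struct slot *s)
{
        uint32_t head = smp_load_acquire(&s->head);

        while (head != READ_ONCE(s->tail)) {
                uint32_t id = head & s->mask, prev = head;

                /* consume only when the phase-2 store for "head" is visible */
                if (smp_load_acquire(&s->ages[id]) == head) {
                        void *obj = READ_ONCE(s->ents[id]);

                        if (try_cmpxchg_release(&s->head, &head, head + 1))
                                return obj;
                        /* lost the race; try_cmpxchg updated "head" for us */
                }
                head = READ_ONCE(s->head);
                if (head == prev)   /* publisher still inside its window */
                        break;
        }
        return NULL;
}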


Regards,
wuqiang

On 2023/9/25 17:42, Masami Hiramatsu (Google) wrote:
> Hi Wuqiang,
> 
> On Tue,  5 Sep 2023 09:52:51 +0800
> "wuqiang.matt" <wuqiang.matt@bytedance.com> wrote:
> 
>> The object pool is a scalable implementaion of high performance queue
>> for object allocation and reclamation, such as kretprobe instances.
>>
>> With leveraging percpu ring-array to mitigate the hot spot of memory
>> contention, it could deliver near-linear scalability for high parallel
>> scenarios. The objpool is best suited for following cases:
>> 1) Memory allocation or reclamation are prohibited or too expensive
>> 2) Consumers are of different priorities, such as irqs and threads
>>
>> Limitations:
>> 1) Maximum objects (capacity) is determined during pool initializing
>>     and can't be modified (extended) after objpool creation
>> 2) The memory of objects won't be freed until objpool is finalized
>> 3) Object allocation (pop) may fail after trying all cpu slots
> 
> I made a simplifying patch on this by (mainly) removing ages array.
> I also rename local variable to use more readable names, like slot,
> pool, and obj.
> 
> Here the results which I run the test_objpool.ko.
> 
> Original:
> [   50.500235] Summary of testcases:
> [   50.503139]     duration: 1027135us 	hits:   30628293 	miss:          0 	sync: percpu objpool
> [   50.510416]     duration: 1047977us 	hits:   30453023 	miss:          0 	sync: percpu objpool from vmalloc
> [   50.517421]     duration: 1047098us 	hits:   31145034 	miss:          0 	sync & hrtimer: percpu objpool
> [   50.524671]     duration: 1053233us 	hits:   30919640 	miss:          0 	sync & hrtimer: percpu objpool from vmalloc
> [   50.531382]     duration: 1055822us 	hits:    3407221 	miss:     830219 	sync overrun: percpu objpool
> [   50.538135]     duration: 1055998us 	hits:    3404624 	miss:     854160 	sync overrun: percpu objpool from vmalloc
> [   50.546686]     duration: 1046104us 	hits:   19464798 	miss:          0 	async: percpu objpool
> [   50.552989]     duration: 1033054us 	hits:   18957836 	miss:          0 	async: percpu objpool from vmalloc
> [   50.560289]     duration: 1041778us 	hits:   33806868 	miss:          0 	async & hrtimer: percpu objpool
> [   50.567425]     duration: 1048901us 	hits:   34211860 	miss:          0 	async & hrtimer: percpu objpool from vmalloc
> 
> Simplified:
> [   48.393236] Summary of testcases:
> [   48.395384]     duration: 1013002us 	hits:   29661448 	miss:          0 	sync: percpu objpool
> [   48.400351]     duration: 1057231us 	hits:   31035578 	miss:          0 	sync: percpu objpool from vmalloc
> [   48.405660]     duration: 1043196us 	hits:   30546652 	miss:          0 	sync & hrtimer: percpu objpool
> [   48.411216]     duration: 1047128us 	hits:   30601306 	miss:          0 	sync & hrtimer: percpu objpool from vmalloc
> [   48.417231]     duration: 1051695us 	hits:    3468287 	miss:     892881 	sync overrun: percpu objpool
> [   48.422405]     duration: 1054003us 	hits:    3549491 	miss:     898141 	sync overrun: percpu objpool from vmalloc
> [   48.428425]     duration: 1052946us 	hits:   19005228 	miss:          0 	async: percpu objpool
> [   48.433597]     duration: 1051757us 	hits:   19670866 	miss:          0 	async: percpu objpool from vmalloc
> [   48.439280]     duration: 1042951us 	hits:   37135332 	miss:          0 	async & hrtimer: percpu objpool
> [   48.445085]     duration: 1029803us 	hits:   37093248 	miss:          0 	async & hrtimer: percpu objpool from vmalloc
> 
> Can you test it too?
> 
> Thanks,
> 
>  From f1f442ff653e329839e5452b8b88463a80a12ff3 Mon Sep 17 00:00:00 2001
> From: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
> Date: Mon, 25 Sep 2023 16:07:12 +0900
> Subject: [PATCH] objpool: Simplify objpool by removing ages array
> 
> Simplify the objpool code by removing ages array from per-cpu slot.
> It chooses enough big capacity (which is a rounded up power of 2 value
> of nr_objects + 1) for the entries array, the tail never catch up to
> the head in per-cpu slot. Thus tail == head means the slot is empty.
> 
> This also uses consistent local variable names for pool, slot and obj.
> 
> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> ---
>   include/linux/objpool.h |  61 ++++----
>   lib/objpool.c           | 310 ++++++++++++++++------------------------
>   2 files changed, 147 insertions(+), 224 deletions(-)
> 
> diff --git a/include/linux/objpool.h b/include/linux/objpool.h
> index 33c832216b98..ecd5ecaffcd3 100644
> --- a/include/linux/objpool.h
> +++ b/include/linux/objpool.h
> @@ -38,33 +38,23 @@
>    * struct objpool_slot - percpu ring array of objpool
>    * @head: head of the local ring array (to retrieve at)
>    * @tail: tail of the local ring array (to append at)
> - * @bits: log2 of capacity (for bitwise operations)
> - * @mask: capacity - 1
> + * @mask: capacity of entries - 1
> + * @entries: object entries on this slot.
>    *
>    * Represents a cpu-local array-based ring buffer, its size is specialized
>    * during initialization of object pool. The percpu objpool slot is to be
>    * allocated from local memory for NUMA system, and to be kept compact in
> - * continuous memory: ages[] is stored just after the body of objpool_slot,
> - * and then entries[]. ages[] describes revision of each item, solely used
> - * to avoid ABA; entries[] contains pointers of the actual objects
> - *
> - * Layout of objpool_slot in memory:
> - *
> - * 64bit:
> - *        4      8      12     16        32                 64
> - * | head | tail | bits | mask | ages[4] | ents[4]: (8 * 4) | objects
> - *
> - * 32bit:
> - *        4      8      12     16        32        48       64
> - * | head | tail | bits | mask | ages[4] | ents[4] | unused | objects
> + * continuous memory: CPU assigned number of objects are stored just after
> + * the body of objpool_slot.
>    *
>    */
>   struct objpool_slot {
> -	uint32_t                head;
> -	uint32_t                tail;
> -	uint32_t                bits;
> -	uint32_t                mask;
> -} __packed;
> +	uint32_t	head;
> +	uint32_t	tail;
> +	uint32_t	mask;
> +	uint32_t	dummyr;
> +	void *		entries[];
> +};
>   
>   struct objpool_head;
>   
> @@ -82,12 +72,11 @@ typedef int (*objpool_fini_cb)(struct objpool_head *head, void *context);
>    * @obj_size:   object & element size
>    * @nr_objs:    total objs (to be pre-allocated)
>    * @nr_cpus:    nr_cpu_ids
> - * @capacity:   max objects per cpuslot
> + * @capacity:   max objects on percpu slot
>    * @gfp:        gfp flags for kmalloc & vmalloc
>    * @ref:        refcount for objpool
>    * @flags:      flags for objpool management
>    * @cpu_slots:  array of percpu slots
> - * @slot_sizes:	size in bytes of slots
>    * @release:    resource cleanup callback
>    * @context:    caller-provided context
>    */
> @@ -100,7 +89,6 @@ struct objpool_head {
>   	refcount_t              ref;
>   	unsigned long           flags;
>   	struct objpool_slot   **cpu_slots;
> -	int                    *slot_sizes;
>   	objpool_fini_cb         release;
>   	void                   *context;
>   };
> @@ -108,9 +96,12 @@ struct objpool_head {
>   #define OBJPOOL_FROM_VMALLOC	(0x800000000)	/* objpool allocated from vmalloc area */
>   #define OBJPOOL_HAVE_OBJECTS	(0x400000000)	/* objects allocated along with objpool */
>   
> +#define OBJPOOL_NR_OBJECT_MAX	(1 << 24)
> +#define OBJPOOL_OBJECT_SIZE_MAX	(1 << 16)
> +
>   /**
>    * objpool_init() - initialize objpool and pre-allocated objects
> - * @head:    the object pool to be initialized, declared by caller
> + * @pool:    the object pool to be initialized, declared by caller
>    * @nr_objs: total objects to be pre-allocated by this object pool
>    * @object_size: size of an object (should be > 0)
>    * @gfp:     flags for memory allocation (via kmalloc or vmalloc)
> @@ -128,47 +119,47 @@ struct objpool_head {
>    * pop (object allocation) or do clearance before each push (object
>    * reclamation).
>    */
> -int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
> +int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
>   		 gfp_t gfp, void *context, objpool_init_obj_cb objinit,
>   		 objpool_fini_cb release);
>   
>   /**
>    * objpool_pop() - allocate an object from objpool
> - * @head: object pool
> + * @pool: object pool
>    *
>    * return value: object ptr or NULL if failed
>    */
> -void *objpool_pop(struct objpool_head *head);
> +void *objpool_pop(struct objpool_head *pool);
>   
>   /**
>    * objpool_push() - reclaim the object and return back to objpool
>    * @obj:  object ptr to be pushed to objpool
> - * @head: object pool
> + * @pool: object pool
>    *
>    * return: 0 or error code (it fails only when user tries to push
>    * the same object multiple times or wrong "objects" into objpool)
>    */
> -int objpool_push(void *obj, struct objpool_head *head);
> +int objpool_push(void *obj, struct objpool_head *pool);
>   
>   /**
>    * objpool_drop() - discard the object and deref objpool
>    * @obj:  object ptr to be discarded
> - * @head: object pool
> + * @pool: object pool
>    *
>    * return: 0 if objpool was released or error code
>    */
> -int objpool_drop(void *obj, struct objpool_head *head);
> +int objpool_drop(void *obj, struct objpool_head *pool);
>   
>   /**
>    * objpool_free() - release objpool forcely (all objects to be freed)
> - * @head: object pool to be released
> + * @pool: object pool to be released
>    */
> -void objpool_free(struct objpool_head *head);
> +void objpool_free(struct objpool_head *pool);
>   
>   /**
>    * objpool_fini() - deref object pool (also releasing unused objects)
> - * @head: object pool to be dereferenced
> + * @pool: object pool to be dereferenced
>    */
> -void objpool_fini(struct objpool_head *head);
> +void objpool_fini(struct objpool_head *pool);
>   
>   #endif /* _LINUX_OBJPOOL_H */
> diff --git a/lib/objpool.c b/lib/objpool.c
> index 22e752371820..f8e8f70d7253 100644
> --- a/lib/objpool.c
> +++ b/lib/objpool.c
> @@ -15,104 +15,55 @@
>    * Copyright: wuqiang.matt@bytedance.com
>    */
>   
> -#define SLOT_AGES(s) ((uint32_t *)((char *)(s) + sizeof(struct objpool_slot)))
> -#define SLOT_ENTS(s) ((void **)((char *)(s) + sizeof(struct objpool_slot) + \
> -			(sizeof(uint32_t) << (s)->bits)))
> -#define SLOT_OBJS(s) ((void *)((char *)(s) + sizeof(struct objpool_slot) + \
> -			((sizeof(uint32_t) + sizeof(void *)) << (s)->bits)))
> -#define SLOT_CORE(n) cpumask_nth((n) % num_possible_cpus(), cpu_possible_mask)
> -
> -/* compute the suitable num of objects to be managed per slot */
> -static int objpool_nobjs(int size)
> -{
> -	return rounddown_pow_of_two((size - sizeof(struct objpool_slot)) /
> -			(sizeof(uint32_t) + sizeof(void *)));
> -}
> -
>   /* allocate and initialize percpu slots */
>   static int
> -objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
> -			void *context, objpool_init_obj_cb objinit)
> +objpool_init_percpu_slots(struct objpool_head *pool, void *context,
> +			  objpool_init_obj_cb objinit)
>   {
> -	int i, j, n, size, objsz, cpu = 0, nents = head->capacity;
> -
> -	/* aligned object size by sizeof(void *) */
> -	objsz = ALIGN(head->obj_size, sizeof(void *));
> -	/* shall we allocate objects along with percpu-slot */
> -	if (objsz)
> -		head->flags |= OBJPOOL_HAVE_OBJECTS;
> -
> -	/* vmalloc is used in default to allocate percpu-slots */
> -	if (!(head->gfp & GFP_ATOMIC))
> -		head->flags |= OBJPOOL_FROM_VMALLOC;
> -
> -	for (i = 0; i < head->nr_cpus; i++) {
> -		struct objpool_slot *os;
> +	int i, j, n, size, slot_size, cpu_count = 0;
> +	struct objpool_slot *slot;
>   
> +	for (i = 0; i < pool->nr_cpus; i++) {
>   		/* skip the cpus which could never be present */
>   		if (!cpu_possible(i))
>   			continue;
>   
>   		/* compute how many objects to be managed by this slot */
> -		n = nobjs / num_possible_cpus();
> -		if (cpu < (nobjs % num_possible_cpus()))
> +		n = pool->nr_objs / num_possible_cpus();
> +		if (cpu_count < (pool->nr_objs % num_possible_cpus()))
>   			n++;
> -		size = sizeof(struct objpool_slot) + sizeof(void *) * nents +
> -		       sizeof(uint32_t) * nents + objsz * n;
> +		cpu_count++;
> +
> +		slot_size = struct_size(slot, entries, pool->capacity);
> +		size = slot_size + pool->obj_size * n;
>   
>   		/*
>   		 * here we allocate percpu-slot & objects together in a single
> -		 * allocation, taking advantage of warm caches and TLB hits as
> -		 * vmalloc always aligns the request size to pages
> +		 * allocation, taking advantage on NUMA.
>   		 */
> -		if (head->flags & OBJPOOL_FROM_VMALLOC)
> -			os = __vmalloc_node(size, sizeof(void *), head->gfp,
> +		if (pool->flags & OBJPOOL_FROM_VMALLOC)
> +			slot = __vmalloc_node(size, sizeof(void *), pool->gfp,
>   				cpu_to_node(i), __builtin_return_address(0));
>   		else
> -			os = kmalloc_node(size, head->gfp, cpu_to_node(i));
> -		if (!os)
> +			slot = kmalloc_node(size, pool->gfp, cpu_to_node(i));
> +		if (!slot)
>   			return -ENOMEM;
>   
>   		/* initialize percpu slot for the i-th slot */
> -		memset(os, 0, size);
> -		os->bits = ilog2(head->capacity);
> -		os->mask = head->capacity - 1;
> -		head->cpu_slots[i] = os;
> -		head->slot_sizes[i] = size;
> -		cpu = cpu + 1;
> -
> -		/*
> -		 * manually set head & tail to avoid possible conflict:
> -		 * We assume that the head item is ready for retrieval
> -		 * iff head is equal to ages[head & mask]. but ages is
> -		 * initialized as 0, so in view of the caller of pop(),
> -		 * the 1st item (0th) is always ready, but the reality
> -		 * could be: push() is stalled before the final update,
> -		 * thus the item being inserted will be lost forever
> -		 */
> -		os->head = os->tail = head->capacity;
> -
> -		if (!objsz)
> -			continue;
> +		memset(slot, 0, size);
> +		slot->mask = pool->capacity - 1;
> +		pool->cpu_slots[i] = slot;
>   
>   		for (j = 0; j < n; j++) {
> -			uint32_t *ages = SLOT_AGES(os);
> -			void **ents = SLOT_ENTS(os);
> -			void *obj = SLOT_OBJS(os) + j * objsz;
> -			uint32_t ie = os->tail & os->mask;
> +			void *obj = (void *)slot + slot_size + pool->obj_size * j;
>   
> -			/* perform object initialization */
>   			if (objinit) {
>   				int rc = objinit(obj, context);
>   				if (rc)
>   					return rc;
>   			}
> -
> -			/* add obj into the ring array */
> -			ents[ie] = obj;
> -			ages[ie] = os->tail;
> -			os->tail++;
> -			head->nr_objs++;
> +			slot->entries[j] = obj;
> +			slot->tail++;
>   		}
>   	}
>   
> @@ -120,164 +71,135 @@ objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
>   }
>   
>   /* cleanup all percpu slots of the object pool */
> -static void objpool_fini_percpu_slots(struct objpool_head *head)
> +static void objpool_fini_percpu_slots(struct objpool_head *pool)
>   {
>   	int i;
>   
> -	if (!head->cpu_slots)
> +	if (!pool->cpu_slots)
>   		return;
>   
> -	for (i = 0; i < head->nr_cpus; i++) {
> -		if (!head->cpu_slots[i])
> -			continue;
> -		if (head->flags & OBJPOOL_FROM_VMALLOC)
> -			vfree(head->cpu_slots[i]);
> -		else
> -			kfree(head->cpu_slots[i]);
> -	}
> -	kfree(head->cpu_slots);
> -	head->cpu_slots = NULL;
> -	head->slot_sizes = NULL;
> +	for (i = 0; i < pool->nr_cpus; i++)
> +		kvfree(pool->cpu_slots[i]);
> +	kfree(pool->cpu_slots);
>   }
>   
>   /* initialize object pool and pre-allocate objects */
> -int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
> +int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
>   		gfp_t gfp, void *context, objpool_init_obj_cb objinit,
>   		objpool_fini_cb release)
>   {
>   	int nents, rc;
>   
>   	/* check input parameters */
> -	if (nr_objs <= 0 || object_size <= 0)
> +	if (nr_objs <= 0 || object_size <= 0 ||
> +	    nr_objs > OBJPOOL_NR_OBJECT_MAX || object_size > OBJPOOL_OBJECT_SIZE_MAX)
> +		return -EINVAL;
> +
> +	/* Align up to unsigned long size */
> +	object_size = ALIGN(object_size, sizeof(unsigned long));
> +
> +	/*
> +	 * To avoid filling up the entries array in the per-cpu slot,
> +	 * use the power of 2 which is more than N + 1. Thus, the tail never
> +	 * catch up to the pool, and able to use pool/tail as the sequencial
> +	 * number.
> +	 */
> +	nents = roundup_pow_of_two(nr_objs + 1);
> +	if (!nents)
>   		return -EINVAL;
>   
> -	/* calculate percpu slot size (rounded to pow of 2) */
> -	nents = max_t(int, roundup_pow_of_two(nr_objs),
> -			objpool_nobjs(L1_CACHE_BYTES));
> -
> -	/* initialize objpool head */
> -	memset(head, 0, sizeof(struct objpool_head));
> -	head->nr_cpus = nr_cpu_ids;
> -	head->obj_size = object_size;
> -	head->capacity = nents;
> -	head->gfp = gfp & ~__GFP_ZERO;
> -	head->context = context;
> -	head->release = release;
> -
> -	/* allocate array for percpu slots */
> -	head->cpu_slots = kzalloc(head->nr_cpus * sizeof(void *) +
> -			       head->nr_cpus * sizeof(int), head->gfp);
> -	if (!head->cpu_slots)
> +	/* initialize objpool pool */
> +	memset(pool, 0, sizeof(struct objpool_head));
> +	pool->nr_cpus = nr_cpu_ids;
> +	pool->obj_size = object_size;
> +	pool->nr_objs = nr_objs;
> +	/* just prevent to fullfill the per-cpu ring array */
> +	pool->capacity = nents;
> +	pool->gfp = gfp & ~__GFP_ZERO;
> +	pool->context = context;
> +	pool->release = release;
> +	/* vmalloc is used in default to allocate percpu-slots */
> +	if (!(pool->gfp & GFP_ATOMIC))
> +		pool->flags |= OBJPOOL_FROM_VMALLOC;
> +
> +	pool->cpu_slots = kzalloc(pool->nr_cpus * sizeof(void *), pool->gfp);
> +	if (!pool->cpu_slots)
>   		return -ENOMEM;
> -	head->slot_sizes = (int *)&head->cpu_slots[head->nr_cpus];
>   
>   	/* initialize per-cpu slots */
> -	rc = objpool_init_percpu_slots(head, nr_objs, context, objinit);
> +	rc = objpool_init_percpu_slots(pool, context, objinit);
>   	if (rc)
> -		objpool_fini_percpu_slots(head);
> +		objpool_fini_percpu_slots(pool);
>   	else
> -		refcount_set(&head->ref, nr_objs + 1);
> +		refcount_set(&pool->ref, nr_objs + 1);
>   
>   	return rc;
>   }
>   EXPORT_SYMBOL_GPL(objpool_init);
>   
>   /* adding object to slot, abort if the slot was already full */
> -static inline int objpool_try_add_slot(void *obj, struct objpool_slot *os)
> +static inline int objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu)
>   {
> -	uint32_t *ages = SLOT_AGES(os);
> -	void **ents = SLOT_ENTS(os);
> -	uint32_t head, tail;
> +	struct objpool_slot *slot = pool->cpu_slots[cpu];
> +	uint32_t tail, next;
>   
>   	do {
> -		/* perform memory loading for both head and tail */
> -		head = READ_ONCE(os->head);
> -		tail = READ_ONCE(os->tail);
> -		/* just abort if slot is full */
> -		if (tail - head > os->mask)
> -			return -ENOENT;
> -		/* try to extend tail by 1 using CAS to avoid races */
> -		if (try_cmpxchg_acquire(&os->tail, &tail, tail + 1))
> -			break;
> -	} while (1);
> +		uint32_t head = READ_ONCE(slot->head);
>   
> -	/* the tail-th of slot is reserved for the given obj */
> -	WRITE_ONCE(ents[tail & os->mask], obj);
> -	/* update epoch id to make this object available for pop() */
> -	smp_store_release(&ages[tail & os->mask], tail);
> +		tail = READ_ONCE(slot->tail);
> +		next = tail + 1;
> +
> +		/* This must never happen because capacity >= N + 1 */
> +		if (WARN_ON_ONCE((next > head && next - head > pool->nr_objs) ||
> +				 (next < head && next > head + pool->nr_objs)))
> +			return -EINVAL;
> +
> +	} while (!try_cmpxchg_acquire(&slot->tail, &tail, next));
> +
> +	WRITE_ONCE(slot->entries[tail & slot->mask], obj);
>   	return 0;
>   }
>   
>   /* reclaim an object to object pool */
> -int objpool_push(void *obj, struct objpool_head *oh)
> +int objpool_push(void *obj, struct objpool_head *pool)
>   {
>   	unsigned long flags;
> -	int cpu, rc;
> +	int rc;
>   
>   	/* disable local irq to avoid preemption & interruption */
>   	raw_local_irq_save(flags);
> -	cpu = raw_smp_processor_id();
> -	do {
> -		rc = objpool_try_add_slot(obj, oh->cpu_slots[cpu]);
> -		if (!rc)
> -			break;
> -		cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
> -	} while (1);
> +	rc = objpool_try_add_slot(obj, pool, raw_smp_processor_id());
>   	raw_local_irq_restore(flags);
>   
>   	return rc;
>   }
>   EXPORT_SYMBOL_GPL(objpool_push);
>   
> -/* drop the allocated object, rather reclaim it to objpool */
> -int objpool_drop(void *obj, struct objpool_head *head)
> -{
> -	if (!obj || !head)
> -		return -EINVAL;
> -
> -	if (refcount_dec_and_test(&head->ref)) {
> -		objpool_free(head);
> -		return 0;
> -	}
> -
> -	return -EAGAIN;
> -}
> -EXPORT_SYMBOL_GPL(objpool_drop);
> -
>   /* try to retrieve object from slot */
> -static inline void *objpool_try_get_slot(struct objpool_slot *os)
> +static inline void *objpool_try_get_slot(struct objpool_slot *slot)
>   {
> -	uint32_t *ages = SLOT_AGES(os);
> -	void **ents = SLOT_ENTS(os);
>   	/* do memory load of head to local head */
> -	uint32_t head = smp_load_acquire(&os->head);
> +	uint32_t head = smp_load_acquire(&slot->head);
>   
>   	/* loop if slot isn't empty */
> -	while (head != READ_ONCE(os->tail)) {
> -		uint32_t id = head & os->mask, prev = head;
> +	while (head != READ_ONCE(slot->tail)) {
>   
>   		/* do prefetching of object ents */
> -		prefetch(&ents[id]);
> -
> -		/* check whether this item was ready for retrieval */
> -		if (smp_load_acquire(&ages[id]) == head) {
> -			/* node must have been udpated by push() */
> -			void *node = READ_ONCE(ents[id]);
> -			/* commit and move forward head of the slot */
> -			if (try_cmpxchg_release(&os->head, &head, head + 1))
> -				return node;
> -			/* head was already updated by others */
> -		}
> +		prefetch(&slot->entries[head & slot->mask]);
> +
> +		/* commit and move forward head of the slot */
> +		if (try_cmpxchg_release(&slot->head, &head, head + 1))
> +			/*
> +			 * TBD: check overwrap the tail/head counter and warn
> +			 * if it is broken. But this happens only if this
> +			 * process slows down a lot and another CPU updates
> +			 * the haed/tail just 2^32 + 1 times, and this slot
> +			 * is empty.
> +			 */
> +			return slot->entries[head & slot->mask];
>   
>   		/* re-load head from memory and continue trying */
> -		head = READ_ONCE(os->head);
> -		/*
> -		 * head stays unchanged, so it's very likely there's an
> -		 * ongoing push() on other cpu nodes but yet not update
> -		 * ages[] to mark it's completion
> -		 */
> -		if (head == prev)
> -			break;
> +		head = READ_ONCE(slot->head);
>   	}
>   
>   	return NULL;
> @@ -307,32 +229,42 @@ void *objpool_pop(struct objpool_head *head)
>   EXPORT_SYMBOL_GPL(objpool_pop);
>   
>   /* release whole objpool forcely */
> -void objpool_free(struct objpool_head *head)
> +void objpool_free(struct objpool_head *pool)
>   {
> -	if (!head->cpu_slots)
> +	if (!pool->cpu_slots)
>   		return;
>   
>   	/* release percpu slots */
> -	objpool_fini_percpu_slots(head);
> +	objpool_fini_percpu_slots(pool);
>   
>   	/* call user's cleanup callback if provided */
> -	if (head->release)
> -		head->release(head, head->context);
> +	if (pool->release)
> +		pool->release(pool, pool->context);
>   }
>   EXPORT_SYMBOL_GPL(objpool_free);
>   
> -/* drop unused objects and defref objpool for releasing */
> -void objpool_fini(struct objpool_head *head)
> +/* drop the allocated object, rather reclaim it to objpool */
> +int objpool_drop(void *obj, struct objpool_head *pool)
>   {
> -	void *nod;
> +	if (!obj || !pool)
> +		return -EINVAL;
>   
> -	do {
> -		/* grab object from objpool and drop it */
> -		nod = objpool_pop(head);
> +	if (refcount_dec_and_test(&pool->ref)) {
> +		objpool_free(pool);
> +		return 0;
> +	}
> +
> +	return -EAGAIN;
> +}
> +EXPORT_SYMBOL_GPL(objpool_drop);
> +
> +/* drop unused objects and defref objpool for releasing */
> +void objpool_fini(struct objpool_head *pool)
> +{
> +	void *obj;
>   
> -		/* drop the extra ref of objpool */
> -		if (refcount_dec_and_test(&head->ref))
> -			objpool_free(head);
> -	} while (nod);
> +	/* grab object from objpool and drop it */
> +	while ((obj = objpool_pop(pool)))
> +		objpool_drop(obj, pool);
>   }
>   EXPORT_SYMBOL_GPL(objpool_fini);


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v9 1/5] lib: objpool added: ring-array based lockless MPMC
  2023-10-09  9:23     ` wuqiang
@ 2023-10-09 13:51       ` Masami Hiramatsu
  2023-10-12 14:02       ` Masami Hiramatsu
  1 sibling, 0 replies; 25+ messages in thread
From: Masami Hiramatsu @ 2023-10-09 13:51 UTC (permalink / raw)
  To: wuqiang
  Cc: linux-trace-kernel, davem, anil.s.keshavamurthy, naveen.n.rao,
	rostedt, peterz, akpm, sander, ebiggers, dan.j.williams,
	jpoimboe, linux-kernel, lkp, mattwu

On Mon, 9 Oct 2023 17:23:34 +0800
wuqiang <wuqiang.matt@bytedance.com> wrote:

> Hello Masami,
> 
> I just got time for the new patch and noticed that ages[] was removed. ages[] was
> introduced as a kind of 2-phase commit to keep consistency and must be kept.
> 
> Thinking of the following 2 cases that two cpu nodes are operating the same
> objpool_slot simultaneously:
> 
> Case 1:
> 
>    NODE 1:                  NODE 2:
>    push to an empty slot    pop will get wrong value
> 
>    try_cmpxchg_acquire(&slot->tail, &tail, next)
>                             try_cmpxchg_release(&slot->head, &head, head + 1)
>                             return slot->entries[head & slot->mask]
>    WRITE_ONCE(slot->entries[tail & slot->mask], obj)

Oh, good catch! Hmm, indeed. I considered using another 'commit_tail' but
it may not work (there is a small window to leak the object in the nested case).
Thanks for reviewing it!


> 
> 
> Case 2:
> 
>    NODE 1:                  NODE 2
>    push to slot w/ 1 obj    pop will get wrong value
> 
>                             try_cmpxchg_release(&slot->head, &head, head + 1)
>    try_cmpxchg_acquire(&slot->tail, &tail, next)
>    WRITE_ONCE(slot->entries[tail & slot->mask], obj)
>                             return slot->entries[head & slot->mask]
> 
> 
> Regards,
> wuqiang
> 
> On 2023/9/25 17:42, Masami Hiramatsu (Google) wrote:
> > Hi Wuqiang,
> > 
> > On Tue,  5 Sep 2023 09:52:51 +0800
> > "wuqiang.matt" <wuqiang.matt@bytedance.com> wrote:
> > 
> >> The object pool is a scalable implementaion of high performance queue
> >> for object allocation and reclamation, such as kretprobe instances.
> >>
> >> With leveraging percpu ring-array to mitigate the hot spot of memory
> >> contention, it could deliver near-linear scalability for high parallel
> >> scenarios. The objpool is best suited for following cases:
> >> 1) Memory allocation or reclamation are prohibited or too expensive
> >> 2) Consumers are of different priorities, such as irqs and threads
> >>
> >> Limitations:
> >> 1) Maximum objects (capacity) is determined during pool initializing
> >>     and can't be modified (extended) after objpool creation
> >> 2) The memory of objects won't be freed until objpool is finalized
> >> 3) Object allocation (pop) may fail after trying all cpu slots
> > 
> > I made a simplifying patch on this by (mainly) removing ages array.
> > I also rename local variable to use more readable names, like slot,
> > pool, and obj.
> > 
> > Here the results which I run the test_objpool.ko.
> > 
> > Original:
> > [   50.500235] Summary of testcases:
> > [   50.503139]     duration: 1027135us 	hits:   30628293 	miss:          0 	sync: percpu objpool
> > [   50.510416]     duration: 1047977us 	hits:   30453023 	miss:          0 	sync: percpu objpool from vmalloc
> > [   50.517421]     duration: 1047098us 	hits:   31145034 	miss:          0 	sync & hrtimer: percpu objpool
> > [   50.524671]     duration: 1053233us 	hits:   30919640 	miss:          0 	sync & hrtimer: percpu objpool from vmalloc
> > [   50.531382]     duration: 1055822us 	hits:    3407221 	miss:     830219 	sync overrun: percpu objpool
> > [   50.538135]     duration: 1055998us 	hits:    3404624 	miss:     854160 	sync overrun: percpu objpool from vmalloc
> > [   50.546686]     duration: 1046104us 	hits:   19464798 	miss:          0 	async: percpu objpool
> > [   50.552989]     duration: 1033054us 	hits:   18957836 	miss:          0 	async: percpu objpool from vmalloc
> > [   50.560289]     duration: 1041778us 	hits:   33806868 	miss:          0 	async & hrtimer: percpu objpool
> > [   50.567425]     duration: 1048901us 	hits:   34211860 	miss:          0 	async & hrtimer: percpu objpool from vmalloc
> > 
> > Simplified:
> > [   48.393236] Summary of testcases:
> > [   48.395384]     duration: 1013002us 	hits:   29661448 	miss:          0 	sync: percpu objpool
> > [   48.400351]     duration: 1057231us 	hits:   31035578 	miss:          0 	sync: percpu objpool from vmalloc
> > [   48.405660]     duration: 1043196us 	hits:   30546652 	miss:          0 	sync & hrtimer: percpu objpool
> > [   48.411216]     duration: 1047128us 	hits:   30601306 	miss:          0 	sync & hrtimer: percpu objpool from vmalloc
> > [   48.417231]     duration: 1051695us 	hits:    3468287 	miss:     892881 	sync overrun: percpu objpool
> > [   48.422405]     duration: 1054003us 	hits:    3549491 	miss:     898141 	sync overrun: percpu objpool from vmalloc
> > [   48.428425]     duration: 1052946us 	hits:   19005228 	miss:          0 	async: percpu objpool
> > [   48.433597]     duration: 1051757us 	hits:   19670866 	miss:          0 	async: percpu objpool from vmalloc
> > [   48.439280]     duration: 1042951us 	hits:   37135332 	miss:          0 	async & hrtimer: percpu objpool
> > [   48.445085]     duration: 1029803us 	hits:   37093248 	miss:          0 	async & hrtimer: percpu objpool from vmalloc
> > 
> > Can you test it too?
> > 
> > Thanks,
> > 
> >  From f1f442ff653e329839e5452b8b88463a80a12ff3 Mon Sep 17 00:00:00 2001
> > From: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
> > Date: Mon, 25 Sep 2023 16:07:12 +0900
> > Subject: [PATCH] objpool: Simplify objpool by removing ages array
> > 
> > Simplify the objpool code by removing ages array from per-cpu slot.
> > It chooses enough big capacity (which is a rounded up power of 2 value
> > of nr_objects + 1) for the entries array, the tail never catch up to
> > the head in per-cpu slot. Thus tail == head means the slot is empty.
> > 
> > This also uses consistent local variable names for pool, slot and obj.
> > 
> > Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > ---
> >   include/linux/objpool.h |  61 ++++----
> >   lib/objpool.c           | 310 ++++++++++++++++------------------------
> >   2 files changed, 147 insertions(+), 224 deletions(-)
> > 
> > diff --git a/include/linux/objpool.h b/include/linux/objpool.h
> > index 33c832216b98..ecd5ecaffcd3 100644
> > --- a/include/linux/objpool.h
> > +++ b/include/linux/objpool.h
> > @@ -38,33 +38,23 @@
> >    * struct objpool_slot - percpu ring array of objpool
> >    * @head: head of the local ring array (to retrieve at)
> >    * @tail: tail of the local ring array (to append at)
> > - * @bits: log2 of capacity (for bitwise operations)
> > - * @mask: capacity - 1
> > + * @mask: capacity of entries - 1
> > + * @entries: object entries on this slot.
> >    *
> >    * Represents a cpu-local array-based ring buffer, its size is specialized
> >    * during initialization of object pool. The percpu objpool slot is to be
> >    * allocated from local memory for NUMA system, and to be kept compact in
> > - * continuous memory: ages[] is stored just after the body of objpool_slot,
> > - * and then entries[]. ages[] describes revision of each item, solely used
> > - * to avoid ABA; entries[] contains pointers of the actual objects
> > - *
> > - * Layout of objpool_slot in memory:
> > - *
> > - * 64bit:
> > - *        4      8      12     16        32                 64
> > - * | head | tail | bits | mask | ages[4] | ents[4]: (8 * 4) | objects
> > - *
> > - * 32bit:
> > - *        4      8      12     16        32        48       64
> > - * | head | tail | bits | mask | ages[4] | ents[4] | unused | objects
> > + * continuous memory: CPU assigned number of objects are stored just after
> > + * the body of objpool_slot.
> >    *
> >    */
> >   struct objpool_slot {
> > -	uint32_t                head;
> > -	uint32_t                tail;
> > -	uint32_t                bits;
> > -	uint32_t                mask;
> > -} __packed;
> > +	uint32_t	head;
> > +	uint32_t	tail;
> > +	uint32_t	mask;
> > +	uint32_t	dummyr;
> > +	void *		entries[];
> > +};
> >   
> >   struct objpool_head;
> >   
> > @@ -82,12 +72,11 @@ typedef int (*objpool_fini_cb)(struct objpool_head *head, void *context);
> >    * @obj_size:   object & element size
> >    * @nr_objs:    total objs (to be pre-allocated)
> >    * @nr_cpus:    nr_cpu_ids
> > - * @capacity:   max objects per cpuslot
> > + * @capacity:   max objects on percpu slot
> >    * @gfp:        gfp flags for kmalloc & vmalloc
> >    * @ref:        refcount for objpool
> >    * @flags:      flags for objpool management
> >    * @cpu_slots:  array of percpu slots
> > - * @slot_sizes:	size in bytes of slots
> >    * @release:    resource cleanup callback
> >    * @context:    caller-provided context
> >    */
> > @@ -100,7 +89,6 @@ struct objpool_head {
> >   	refcount_t              ref;
> >   	unsigned long           flags;
> >   	struct objpool_slot   **cpu_slots;
> > -	int                    *slot_sizes;
> >   	objpool_fini_cb         release;
> >   	void                   *context;
> >   };
> > @@ -108,9 +96,12 @@ struct objpool_head {
> >   #define OBJPOOL_FROM_VMALLOC	(0x800000000)	/* objpool allocated from vmalloc area */
> >   #define OBJPOOL_HAVE_OBJECTS	(0x400000000)	/* objects allocated along with objpool */
> >   
> > +#define OBJPOOL_NR_OBJECT_MAX	(1 << 24)
> > +#define OBJPOOL_OBJECT_SIZE_MAX	(1 << 16)
> > +
> >   /**
> >    * objpool_init() - initialize objpool and pre-allocated objects
> > - * @head:    the object pool to be initialized, declared by caller
> > + * @pool:    the object pool to be initialized, declared by caller
> >    * @nr_objs: total objects to be pre-allocated by this object pool
> >    * @object_size: size of an object (should be > 0)
> >    * @gfp:     flags for memory allocation (via kmalloc or vmalloc)
> > @@ -128,47 +119,47 @@ struct objpool_head {
> >    * pop (object allocation) or do clearance before each push (object
> >    * reclamation).
> >    */
> > -int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
> > +int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
> >   		 gfp_t gfp, void *context, objpool_init_obj_cb objinit,
> >   		 objpool_fini_cb release);
> >   
> >   /**
> >    * objpool_pop() - allocate an object from objpool
> > - * @head: object pool
> > + * @pool: object pool
> >    *
> >    * return value: object ptr or NULL if failed
> >    */
> > -void *objpool_pop(struct objpool_head *head);
> > +void *objpool_pop(struct objpool_head *pool);
> >   
> >   /**
> >    * objpool_push() - reclaim the object and return back to objpool
> >    * @obj:  object ptr to be pushed to objpool
> > - * @head: object pool
> > + * @pool: object pool
> >    *
> >    * return: 0 or error code (it fails only when user tries to push
> >    * the same object multiple times or wrong "objects" into objpool)
> >    */
> > -int objpool_push(void *obj, struct objpool_head *head);
> > +int objpool_push(void *obj, struct objpool_head *pool);
> >   
> >   /**
> >    * objpool_drop() - discard the object and deref objpool
> >    * @obj:  object ptr to be discarded
> > - * @head: object pool
> > + * @pool: object pool
> >    *
> >    * return: 0 if objpool was released or error code
> >    */
> > -int objpool_drop(void *obj, struct objpool_head *head);
> > +int objpool_drop(void *obj, struct objpool_head *pool);
> >   
> >   /**
> >    * objpool_free() - release objpool forcely (all objects to be freed)
> > - * @head: object pool to be released
> > + * @pool: object pool to be released
> >    */
> > -void objpool_free(struct objpool_head *head);
> > +void objpool_free(struct objpool_head *pool);
> >   
> >   /**
> >    * objpool_fini() - deref object pool (also releasing unused objects)
> > - * @head: object pool to be dereferenced
> > + * @pool: object pool to be dereferenced
> >    */
> > -void objpool_fini(struct objpool_head *head);
> > +void objpool_fini(struct objpool_head *pool);
> >   
> >   #endif /* _LINUX_OBJPOOL_H */
> > diff --git a/lib/objpool.c b/lib/objpool.c
> > index 22e752371820..f8e8f70d7253 100644
> > --- a/lib/objpool.c
> > +++ b/lib/objpool.c
> > @@ -15,104 +15,55 @@
> >    * Copyright: wuqiang.matt@bytedance.com
> >    */
> >   
> > -#define SLOT_AGES(s) ((uint32_t *)((char *)(s) + sizeof(struct objpool_slot)))
> > -#define SLOT_ENTS(s) ((void **)((char *)(s) + sizeof(struct objpool_slot) + \
> > -			(sizeof(uint32_t) << (s)->bits)))
> > -#define SLOT_OBJS(s) ((void *)((char *)(s) + sizeof(struct objpool_slot) + \
> > -			((sizeof(uint32_t) + sizeof(void *)) << (s)->bits)))
> > -#define SLOT_CORE(n) cpumask_nth((n) % num_possible_cpus(), cpu_possible_mask)
> > -
> > -/* compute the suitable num of objects to be managed per slot */
> > -static int objpool_nobjs(int size)
> > -{
> > -	return rounddown_pow_of_two((size - sizeof(struct objpool_slot)) /
> > -			(sizeof(uint32_t) + sizeof(void *)));
> > -}
> > -
> >   /* allocate and initialize percpu slots */
> >   static int
> > -objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
> > -			void *context, objpool_init_obj_cb objinit)
> > +objpool_init_percpu_slots(struct objpool_head *pool, void *context,
> > +			  objpool_init_obj_cb objinit)
> >   {
> > -	int i, j, n, size, objsz, cpu = 0, nents = head->capacity;
> > -
> > -	/* aligned object size by sizeof(void *) */
> > -	objsz = ALIGN(head->obj_size, sizeof(void *));
> > -	/* shall we allocate objects along with percpu-slot */
> > -	if (objsz)
> > -		head->flags |= OBJPOOL_HAVE_OBJECTS;
> > -
> > -	/* vmalloc is used in default to allocate percpu-slots */
> > -	if (!(head->gfp & GFP_ATOMIC))
> > -		head->flags |= OBJPOOL_FROM_VMALLOC;
> > -
> > -	for (i = 0; i < head->nr_cpus; i++) {
> > -		struct objpool_slot *os;
> > +	int i, j, n, size, slot_size, cpu_count = 0;
> > +	struct objpool_slot *slot;
> >   
> > +	for (i = 0; i < pool->nr_cpus; i++) {
> >   		/* skip the cpus which could never be present */
> >   		if (!cpu_possible(i))
> >   			continue;
> >   
> >   		/* compute how many objects to be managed by this slot */
> > -		n = nobjs / num_possible_cpus();
> > -		if (cpu < (nobjs % num_possible_cpus()))
> > +		n = pool->nr_objs / num_possible_cpus();
> > +		if (cpu_count < (pool->nr_objs % num_possible_cpus()))
> >   			n++;
> > -		size = sizeof(struct objpool_slot) + sizeof(void *) * nents +
> > -		       sizeof(uint32_t) * nents + objsz * n;
> > +		cpu_count++;
> > +
> > +		slot_size = struct_size(slot, entries, pool->capacity);
> > +		size = slot_size + pool->obj_size * n;
> >   
> >   		/*
> >   		 * here we allocate percpu-slot & objects together in a single
> > -		 * allocation, taking advantage of warm caches and TLB hits as
> > -		 * vmalloc always aligns the request size to pages
> > +		 * allocation, taking advantage on NUMA.
> >   		 */
> > -		if (head->flags & OBJPOOL_FROM_VMALLOC)
> > -			os = __vmalloc_node(size, sizeof(void *), head->gfp,
> > +		if (pool->flags & OBJPOOL_FROM_VMALLOC)
> > +			slot = __vmalloc_node(size, sizeof(void *), pool->gfp,
> >   				cpu_to_node(i), __builtin_return_address(0));
> >   		else
> > -			os = kmalloc_node(size, head->gfp, cpu_to_node(i));
> > -		if (!os)
> > +			slot = kmalloc_node(size, pool->gfp, cpu_to_node(i));
> > +		if (!slot)
> >   			return -ENOMEM;
> >   
> >   		/* initialize percpu slot for the i-th slot */
> > -		memset(os, 0, size);
> > -		os->bits = ilog2(head->capacity);
> > -		os->mask = head->capacity - 1;
> > -		head->cpu_slots[i] = os;
> > -		head->slot_sizes[i] = size;
> > -		cpu = cpu + 1;
> > -
> > -		/*
> > -		 * manually set head & tail to avoid possible conflict:
> > -		 * We assume that the head item is ready for retrieval
> > -		 * iff head is equal to ages[head & mask]. but ages is
> > -		 * initialized as 0, so in view of the caller of pop(),
> > -		 * the 1st item (0th) is always ready, but the reality
> > -		 * could be: push() is stalled before the final update,
> > -		 * thus the item being inserted will be lost forever
> > -		 */
> > -		os->head = os->tail = head->capacity;
> > -
> > -		if (!objsz)
> > -			continue;
> > +		memset(slot, 0, size);
> > +		slot->mask = pool->capacity - 1;
> > +		pool->cpu_slots[i] = slot;
> >   
> >   		for (j = 0; j < n; j++) {
> > -			uint32_t *ages = SLOT_AGES(os);
> > -			void **ents = SLOT_ENTS(os);
> > -			void *obj = SLOT_OBJS(os) + j * objsz;
> > -			uint32_t ie = os->tail & os->mask;
> > +			void *obj = (void *)slot + slot_size + pool->obj_size * j;
> >   
> > -			/* perform object initialization */
> >   			if (objinit) {
> >   				int rc = objinit(obj, context);
> >   				if (rc)
> >   					return rc;
> >   			}
> > -
> > -			/* add obj into the ring array */
> > -			ents[ie] = obj;
> > -			ages[ie] = os->tail;
> > -			os->tail++;
> > -			head->nr_objs++;
> > +			slot->entries[j] = obj;
> > +			slot->tail++;
> >   		}
> >   	}
> >   
> > @@ -120,164 +71,135 @@ objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
> >   }
> >   
> >   /* cleanup all percpu slots of the object pool */
> > -static void objpool_fini_percpu_slots(struct objpool_head *head)
> > +static void objpool_fini_percpu_slots(struct objpool_head *pool)
> >   {
> >   	int i;
> >   
> > -	if (!head->cpu_slots)
> > +	if (!pool->cpu_slots)
> >   		return;
> >   
> > -	for (i = 0; i < head->nr_cpus; i++) {
> > -		if (!head->cpu_slots[i])
> > -			continue;
> > -		if (head->flags & OBJPOOL_FROM_VMALLOC)
> > -			vfree(head->cpu_slots[i]);
> > -		else
> > -			kfree(head->cpu_slots[i]);
> > -	}
> > -	kfree(head->cpu_slots);
> > -	head->cpu_slots = NULL;
> > -	head->slot_sizes = NULL;
> > +	for (i = 0; i < pool->nr_cpus; i++)
> > +		kvfree(pool->cpu_slots[i]);
> > +	kfree(pool->cpu_slots);
> >   }
> >   
> >   /* initialize object pool and pre-allocate objects */
> > -int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
> > +int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
> >   		gfp_t gfp, void *context, objpool_init_obj_cb objinit,
> >   		objpool_fini_cb release)
> >   {
> >   	int nents, rc;
> >   
> >   	/* check input parameters */
> > -	if (nr_objs <= 0 || object_size <= 0)
> > +	if (nr_objs <= 0 || object_size <= 0 ||
> > +	    nr_objs > OBJPOOL_NR_OBJECT_MAX || object_size > OBJPOOL_OBJECT_SIZE_MAX)
> > +		return -EINVAL;
> > +
> > +	/* Align up to unsigned long size */
> > +	object_size = ALIGN(object_size, sizeof(unsigned long));
> > +
> > +	/*
> > +	 * To avoid filling up the entries array in the per-cpu slot,
> > +	 * use the power of 2 which is more than N + 1. Thus, the tail never
> > +	 * catch up to the pool, and able to use pool/tail as the sequencial
> > +	 * number.
> > +	 */
> > +	nents = roundup_pow_of_two(nr_objs + 1);
> > +	if (!nents)
> >   		return -EINVAL;
> >   
> > -	/* calculate percpu slot size (rounded to pow of 2) */
> > -	nents = max_t(int, roundup_pow_of_two(nr_objs),
> > -			objpool_nobjs(L1_CACHE_BYTES));
> > -
> > -	/* initialize objpool head */
> > -	memset(head, 0, sizeof(struct objpool_head));
> > -	head->nr_cpus = nr_cpu_ids;
> > -	head->obj_size = object_size;
> > -	head->capacity = nents;
> > -	head->gfp = gfp & ~__GFP_ZERO;
> > -	head->context = context;
> > -	head->release = release;
> > -
> > -	/* allocate array for percpu slots */
> > -	head->cpu_slots = kzalloc(head->nr_cpus * sizeof(void *) +
> > -			       head->nr_cpus * sizeof(int), head->gfp);
> > -	if (!head->cpu_slots)
> > +	/* initialize objpool pool */
> > +	memset(pool, 0, sizeof(struct objpool_head));
> > +	pool->nr_cpus = nr_cpu_ids;
> > +	pool->obj_size = object_size;
> > +	pool->nr_objs = nr_objs;
> > +	/* just prevent to fullfill the per-cpu ring array */
> > +	pool->capacity = nents;
> > +	pool->gfp = gfp & ~__GFP_ZERO;
> > +	pool->context = context;
> > +	pool->release = release;
> > +	/* vmalloc is used in default to allocate percpu-slots */
> > +	if (!(pool->gfp & GFP_ATOMIC))
> > +		pool->flags |= OBJPOOL_FROM_VMALLOC;
> > +
> > +	pool->cpu_slots = kzalloc(pool->nr_cpus * sizeof(void *), pool->gfp);
> > +	if (!pool->cpu_slots)
> >   		return -ENOMEM;
> > -	head->slot_sizes = (int *)&head->cpu_slots[head->nr_cpus];
> >   
> >   	/* initialize per-cpu slots */
> > -	rc = objpool_init_percpu_slots(head, nr_objs, context, objinit);
> > +	rc = objpool_init_percpu_slots(pool, context, objinit);
> >   	if (rc)
> > -		objpool_fini_percpu_slots(head);
> > +		objpool_fini_percpu_slots(pool);
> >   	else
> > -		refcount_set(&head->ref, nr_objs + 1);
> > +		refcount_set(&pool->ref, nr_objs + 1);
> >   
> >   	return rc;
> >   }
> >   EXPORT_SYMBOL_GPL(objpool_init);
> >   
> >   /* adding object to slot, abort if the slot was already full */
> > -static inline int objpool_try_add_slot(void *obj, struct objpool_slot *os)
> > +static inline int objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu)
> >   {
> > -	uint32_t *ages = SLOT_AGES(os);
> > -	void **ents = SLOT_ENTS(os);
> > -	uint32_t head, tail;
> > +	struct objpool_slot *slot = pool->cpu_slots[cpu];
> > +	uint32_t tail, next;
> >   
> >   	do {
> > -		/* perform memory loading for both head and tail */
> > -		head = READ_ONCE(os->head);
> > -		tail = READ_ONCE(os->tail);
> > -		/* just abort if slot is full */
> > -		if (tail - head > os->mask)
> > -			return -ENOENT;
> > -		/* try to extend tail by 1 using CAS to avoid races */
> > -		if (try_cmpxchg_acquire(&os->tail, &tail, tail + 1))
> > -			break;
> > -	} while (1);
> > +		uint32_t head = READ_ONCE(slot->head);
> >   
> > -	/* the tail-th of slot is reserved for the given obj */
> > -	WRITE_ONCE(ents[tail & os->mask], obj);
> > -	/* update epoch id to make this object available for pop() */
> > -	smp_store_release(&ages[tail & os->mask], tail);
> > +		tail = READ_ONCE(slot->tail);
> > +		next = tail + 1;
> > +
> > +		/* This must never happen because capacity >= N + 1 */
> > +		if (WARN_ON_ONCE((next > head && next - head > pool->nr_objs) ||
> > +				 (next < head && next > head + pool->nr_objs)))
> > +			return -EINVAL;
> > +
> > +	} while (!try_cmpxchg_acquire(&slot->tail, &tail, next));
> > +
> > +	WRITE_ONCE(slot->entries[tail & slot->mask], obj);
> >   	return 0;
> >   }
> >   
> >   /* reclaim an object to object pool */
> > -int objpool_push(void *obj, struct objpool_head *oh)
> > +int objpool_push(void *obj, struct objpool_head *pool)
> >   {
> >   	unsigned long flags;
> > -	int cpu, rc;
> > +	int rc;
> >   
> >   	/* disable local irq to avoid preemption & interruption */
> >   	raw_local_irq_save(flags);
> > -	cpu = raw_smp_processor_id();
> > -	do {
> > -		rc = objpool_try_add_slot(obj, oh->cpu_slots[cpu]);
> > -		if (!rc)
> > -			break;
> > -		cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
> > -	} while (1);
> > +	rc = objpool_try_add_slot(obj, pool, raw_smp_processor_id());
> >   	raw_local_irq_restore(flags);
> >   
> >   	return rc;
> >   }
> >   EXPORT_SYMBOL_GPL(objpool_push);
> >   
> > -/* drop the allocated object, rather reclaim it to objpool */
> > -int objpool_drop(void *obj, struct objpool_head *head)
> > -{
> > -	if (!obj || !head)
> > -		return -EINVAL;
> > -
> > -	if (refcount_dec_and_test(&head->ref)) {
> > -		objpool_free(head);
> > -		return 0;
> > -	}
> > -
> > -	return -EAGAIN;
> > -}
> > -EXPORT_SYMBOL_GPL(objpool_drop);
> > -
> >   /* try to retrieve object from slot */
> > -static inline void *objpool_try_get_slot(struct objpool_slot *os)
> > +static inline void *objpool_try_get_slot(struct objpool_slot *slot)
> >   {
> > -	uint32_t *ages = SLOT_AGES(os);
> > -	void **ents = SLOT_ENTS(os);
> >   	/* do memory load of head to local head */
> > -	uint32_t head = smp_load_acquire(&os->head);
> > +	uint32_t head = smp_load_acquire(&slot->head);
> >   
> >   	/* loop if slot isn't empty */
> > -	while (head != READ_ONCE(os->tail)) {
> > -		uint32_t id = head & os->mask, prev = head;
> > +	while (head != READ_ONCE(slot->tail)) {
> >   
> >   		/* do prefetching of object ents */
> > -		prefetch(&ents[id]);
> > -
> > -		/* check whether this item was ready for retrieval */
> > -		if (smp_load_acquire(&ages[id]) == head) {
> > -			/* node must have been udpated by push() */
> > -			void *node = READ_ONCE(ents[id]);
> > -			/* commit and move forward head of the slot */
> > -			if (try_cmpxchg_release(&os->head, &head, head + 1))
> > -				return node;
> > -			/* head was already updated by others */
> > -		}
> > +		prefetch(&slot->entries[head & slot->mask]);
> > +
> > +		/* commit and move forward head of the slot */
> > +		if (try_cmpxchg_release(&slot->head, &head, head + 1))
> > +			/*
> > +			 * TBD: check for wraparound of the tail/head counter and warn
> > +			 * if it is broken. But this happens only if this
> > +			 * process slows down a lot and another CPU updates
> > +			 * the head/tail just 2^32 + 1 times, and this slot
> > +			 * is empty.
> > +			 */
> > +			return slot->entries[head & slot->mask];
> >   
> >   		/* re-load head from memory and continue trying */
> > -		head = READ_ONCE(os->head);
> > -		/*
> > -		 * head stays unchanged, so it's very likely there's an
> > -		 * ongoing push() on other cpu nodes but yet not update
> > -		 * ages[] to mark it's completion
> > -		 */
> > -		if (head == prev)
> > -			break;
> > +		head = READ_ONCE(slot->head);
> >   	}
> >   
> >   	return NULL;
> > @@ -307,32 +229,42 @@ void *objpool_pop(struct objpool_head *head)
> >   EXPORT_SYMBOL_GPL(objpool_pop);
> >   
> >   /* release whole objpool forcely */
> > -void objpool_free(struct objpool_head *head)
> > +void objpool_free(struct objpool_head *pool)
> >   {
> > -	if (!head->cpu_slots)
> > +	if (!pool->cpu_slots)
> >   		return;
> >   
> >   	/* release percpu slots */
> > -	objpool_fini_percpu_slots(head);
> > +	objpool_fini_percpu_slots(pool);
> >   
> >   	/* call user's cleanup callback if provided */
> > -	if (head->release)
> > -		head->release(head, head->context);
> > +	if (pool->release)
> > +		pool->release(pool, pool->context);
> >   }
> >   EXPORT_SYMBOL_GPL(objpool_free);
> >   
> > -/* drop unused objects and defref objpool for releasing */
> > -void objpool_fini(struct objpool_head *head)
> > +/* drop the allocated object, rather reclaim it to objpool */
> > +int objpool_drop(void *obj, struct objpool_head *pool)
> >   {
> > -	void *nod;
> > +	if (!obj || !pool)
> > +		return -EINVAL;
> >   
> > -	do {
> > -		/* grab object from objpool and drop it */
> > -		nod = objpool_pop(head);
> > +	if (refcount_dec_and_test(&pool->ref)) {
> > +		objpool_free(pool);
> > +		return 0;
> > +	}
> > +
> > +	return -EAGAIN;
> > +}
> > +EXPORT_SYMBOL_GPL(objpool_drop);
> > +
> > +/* drop unused objects and defref objpool for releasing */
> > +void objpool_fini(struct objpool_head *pool)
> > +{
> > +	void *obj;
> >   
> > -		/* drop the extra ref of objpool */
> > -		if (refcount_dec_and_test(&head->ref))
> > -			objpool_free(head);
> > -	} while (nod);
> > +	/* grab object from objpool and drop it */
> > +	while ((obj = objpool_pop(pool)))
> > +		objpool_drop(obj, pool);
> >   }
> >   EXPORT_SYMBOL_GPL(objpool_fini);
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v9 1/5] lib: objpool added: ring-array based lockless MPMC
  2023-10-08 18:40     ` wuqiang
@ 2023-10-09 14:19       ` Masami Hiramatsu
  2023-10-12 16:16         ` wuqiang.matt
  0 siblings, 1 reply; 25+ messages in thread
From: Masami Hiramatsu @ 2023-10-09 14:19 UTC (permalink / raw)
  To: wuqiang
  Cc: linux-trace-kernel, davem, anil.s.keshavamurthy, naveen.n.rao,
	rostedt, peterz, akpm, sander, ebiggers, dan.j.williams,
	jpoimboe, linux-kernel, lkp, mattwu

Hi,

On Mon, 9 Oct 2023 02:40:53 +0800
wuqiang <wuqiang.matt@bytedance.com> wrote:

> On 2023/9/23 17:48, Masami Hiramatsu (Google) wrote:
> > Hi Wuqiang,
> > 
> > Sorry for replying later.
> > 
> > On Tue,  5 Sep 2023 09:52:51 +0800
> > "wuqiang.matt" <wuqiang.matt@bytedance.com> wrote:
> > 
> >> The object pool is a scalable implementation of high performance queue
> >> for object allocation and reclamation, such as kretprobe instances.
> >>
> >> With leveraging percpu ring-array to mitigate the hot spot of memory
> >> contention, it could deliver near-linear scalability for high parallel
> >> scenarios. The objpool is best suited for following cases:
> >> 1) Memory allocation or reclamation are prohibited or too expensive
> >> 2) Consumers are of different priorities, such as irqs and threads
> >>
> >> Limitations:
> >> 1) Maximum objects (capacity) is determined during pool initializing
> >>     and can't be modified (extended) after objpool creation
> > 
> > So the pool size is fixed in initialization.
> 
> Right. The array size will be rounded up to a power of 2, but the
> actual number of objects (to be allocated) is the exact value specified
> by the user.

Yeah, this makes it easy to use the seq-number as the index.

> 
> > 
> >> 2) The memory of objects won't be freed until objpool is finalized
> >> 3) Object allocation (pop) may fail after trying all cpu slots
> > 
> > This means that object allocation will fail if the all pools are empty,
> > right?
> 
> Yes, pop() will return NULL in this case. pop() only checks one
> round of all cpu nodes.
> 
> The objpool might not actually be empty, since new objects could be inserted
> back in the meantime by other nodes, which is natural for lockless queues.
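> 
> Roughly, the pop path behaves like the sketch below (simplified for
> illustration, not the exact patch code): it starts from the local slot
> and wraps around all possible cpus exactly once.
> 
>   void *objpool_pop(struct objpool_head *pool)
>   {
>   	void *obj = NULL;
>   	unsigned long flags;
>   	int i, cpu;
> 
>   	/* disable local irq to avoid preemption & interruption */
>   	raw_local_irq_save(flags);
>   	cpu = raw_smp_processor_id();
>   	/* one round over all possible cpu slots, starting from the local one */
>   	for (i = 0; i < num_possible_cpus(); i++) {
>   		obj = objpool_try_get_slot(pool->cpu_slots[cpu]);
>   		if (obj)
>   			break;
>   		cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
>   	}
>   	raw_local_irq_restore(flags);
> 
>   	return obj;
>   }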

OK.

> 
> > 
> >>
> >> Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com>
> >> ---
> >>   include/linux/objpool.h | 174 +++++++++++++++++++++
> >>   lib/Makefile            |   2 +-
> >>   lib/objpool.c           | 338 ++++++++++++++++++++++++++++++++++++++++
> >>   3 files changed, 513 insertions(+), 1 deletion(-)
> >>   create mode 100644 include/linux/objpool.h
> >>   create mode 100644 lib/objpool.c
> >>
> >> diff --git a/include/linux/objpool.h b/include/linux/objpool.h
> >> new file mode 100644
> >> index 000000000000..33c832216b98
> >> --- /dev/null
> >> +++ b/include/linux/objpool.h
> >> @@ -0,0 +1,174 @@
> >> +/* SPDX-License-Identifier: GPL-2.0 */
> >> +
> >> +#ifndef _LINUX_OBJPOOL_H
> >> +#define _LINUX_OBJPOOL_H
> >> +
> >> +#include <linux/types.h>
> >> +#include <linux/refcount.h>
> >> +
> >> +/*
> >> + * objpool: ring-array based lockless MPMC queue
> >> + *
> >> + * Copyright: wuqiang.matt@bytedance.com
> >> + *
> >> + * The object pool is a scalable implementation of high performance queue
> >> + * for objects allocation and reclamation, such as kretprobe instances.
> >> + *
> >> + * With leveraging per-cpu ring-array to mitigate the hot spots of memory
> >> + * contention, it could deliver near-linear scalability for high parallel
> >> + * scenarios. The ring-array is compactly managed in a single cache-line
> >> + * to benefit from warmed L1 cache for most cases (<= 4 objects per-core).
> >> + * The body of pre-allocated objects is stored in continuous cache-lines
> >> + * just after the ring-array.
> > 
> > I consider the size of entries may be big if we have larger number of
> > CPU cores, like 72-cores. And if user specifies (2^n) + 1 entries.
> > In this case, each CPU has (2^n - 1)/72 objects, but has 2^(n + 1)
> > entries in ring buffer. So it should be noted.
> 
> Yes for the array size, since it's rounded up to a power of 2, but the
> actual number of pre-allocated objects will stay the same as the user
> specified (e.g. for nr_objs = 1025 on a 72-core box, each slot pre-allocates
> only 14 or 15 objects, while its ring array is still rounded up to 2048 entries).
> 
> >> + *
> >> + * The object pool is interrupt safe. Both allocation and reclamation
> >> + * (object pop and push operations) can be preemptible or interruptable.
> > 
> > You've added raw_spinlock_disable/enable(), so it is not preemptible
> > or interruptible anymore. (Anyway, caller doesn't take care of that)
> 
> Sure, this description is improper and unnecessary. It will be removed.
> 
> >> + *
> >> + * It's best suited for following cases:
> >> + * 1) Memory allocation or reclamation are prohibited or too expensive
> >> + * 2) Consumers are of different priorities, such as irqs and threads
> >> + *
> >> + * Limitations:
> >> + * 1) Maximum objects (capacity) is determined during pool initializing
> >> + * 2) The memory of objects won't be freed until the pool is finalized
> >> + * 3) Object allocation (pop) may fail after trying all cpu slots
> >> + */
> >> +
> >> +/**
> >> + * struct objpool_slot - percpu ring array of objpool
> >> + * @head: head of the local ring array (to retrieve at)
> >> + * @tail: tail of the local ring array (to append at)
> >> + * @bits: log2 of capacity (for bitwise operations)
> >> + * @mask: capacity - 1
> > 
> > These description does not give idea what those roles are.
> 
> I'll refine the description. objpool_slot is totally internal to objpool.
> 
> > 
> >> + *
> >> + * Represents a cpu-local array-based ring buffer, its size is specialized
> >> + * during initialization of object pool. The percpu objpool slot is to be
> >> + * allocated from local memory for NUMA system, and to be kept compact in
> >> + * continuous memory: ages[] is stored just after the body of objpool_slot,
> >> + * and then entries[]. ages[] describes revision of each item, solely used
> >> + * to avoid ABA; entries[] contains pointers of the actual objects
> >> + *
> >> + * Layout of objpool_slot in memory:
> >> + *
> >> + * 64bit:
> >> + *        4      8      12     16        32                 64
> >> + * | head | tail | bits | mask | ages[4] | ents[4]: (8 * 4) | objects
> >> + *
> >> + * 32bit:
> >> + *        4      8      12     16        32        48       64
> >> + * | head | tail | bits | mask | ages[4] | ents[4] | unused | objects
> > 
> > Hm, the '4' here means number of objects after this objpool_slot?
> > I don't recommend you to allocate several arraies after the header, instead,
> > using another data structure like this;
> > 
> > |head|tail|bits|mask|ents[N]{age:4|offs:4}|padding|objects
> > 
> > here N means the number of total objects.
> 
> Sorry for the confusion, I will make it clearer. Here 4/8/.../64 are offsets
> in bytes. The above is an example with the objpool_slot compacted into a single
> cache line.

But in that case, the number of entries may not be enough for storing all
objects (or it limits the number of objects).

Actually, since the rethook needs to keep a shadow stack list per task, not
per cpu, the (safe) required number of objects is usually proportional to the
number of active tasks. kretprobes sets the default number of nodes according
to the number of CPUs, but that is only a minimum requirement. This is because:
- most kernel functions are not nested, thus each is called at most once on
  each thread in the kernel.
- a thread can be scheduled out or sleep, thus the function return hook is
  not done until the thread comes back.
So, usually, the recommended number of nodes (objs) will be 100-200 (depending
on the system). If it is a server, it may be more than 1000.

> 
> > 
> > struct objpool_entry {
> > 	uint32_t	age;
> > 	void *	ptr;
> > } __attribute__((packed,aligned(8))) ;
> > 
> >> + *
> >> + */
> >> +struct objpool_slot {
> >> +	uint32_t                head;
> >> +	uint32_t                tail;
> >> +	uint32_t                bits;
> >> +	uint32_t                mask;
> > 
> > 	struct objpool_entry	entries[];
> > 
> >> +} __packed;
> > 
> > Then, you don't need complex macros to access object, but you need only one
> > inline function to get the actual object address.
> > 
> > static inline void *objpool_slot_object(struct objpool_slot *slot, int nth)
> > {
> > 	if (nth >= (1 << slot->bits))
> > 		return NULL;
> > 
> > 	return (void *)((unsigned long)slot + slot->entries[nth].offs);
> > }
> 
> The reason for these macros is to compact objpool_slot/ages/ents into hot cache
> lines and also to minimize the memory footprint.

Hmm, at this moment, I don't recommend sticking to the cache-line layout;
make it easier to read instead. If you have any numbers, you can add an
optimization patch afterwards. But the initial patch should take care of
readability.

> 
> objpool_head could be a better place to manage these pointers, similarly to
> cpu_slots. I'll recheck the overhead.
> 
> 
> >> +
> >> +struct objpool_head;
> >> +
> >> +/*
> >> + * caller-specified callback for object initial setup, it's only called
> >> + * once for each object (just after the memory allocation of the object)
> >> + */
> >> +typedef int (*objpool_init_obj_cb)(void *obj, void *context);
> >> +
> >> +/* caller-specified cleanup callback for objpool destruction */
> >> +typedef int (*objpool_fini_cb)(struct objpool_head *head, void *context);
> >> +
> >> +/**
> >> + * struct objpool_head - object pooling metadata
> >> + * @obj_size:   object & element size
> > 
> > What does the 'element' mean?
> 
> "object size" should be enough. "element" means object, so it's unnecessary.
> 
> > 
> >> + * @nr_objs:    total objs (to be pre-allocated)
> > 
> > but all objects must be pre-allocated, right? then it is simply
> 
> Yes, all objects are pre-allocated for this implementation.
> 
> > 
> > @nr_objs: the total number of objects in this objpool.
> > 
> >> + * @nr_cpus:    nr_cpu_ids
> > 
> > would we have to save it? or just use 'nr_cpu_ids'?
> 
> Yes, it's just a local copy of nr_cpu_ids, just to keep the members of
> objpool_head aligned to 64 bits (there would be a 4-byte hole anyway).
> And a possible benefit from hot TLB caches?

Unless you pack the data structure, you don't need to care about
the cache. And the compiler works better than a human for the initial work.
At this moment, it is more important to keep the set of members as simple
as possible.

> 
> > 
> >> + * @capacity:   max objects per cpuslot
> > 
> > what is 'cpuslot'?
> > This seems the size of objpool_entry array in objpool_slot.
> 
> Yes, it should be "capacity per objpool_slot", i.e. "the maximum number of
> objects that can be stored in an objpool_slot".
> 
> >> + * @gfp:        gfp flags for kmalloc & vmalloc
> >> + * @ref:        refcount for objpool
> >> + * @flags:      flags for objpool management
> >> + * @cpu_slots:  array of percpu slots
> >> + * @slot_sizes:	size in bytes of slots
> >> + * @release:    resource cleanup callback
> >> + * @context:    caller-provided context
> >> + */
> >> +struct objpool_head {
> >> +	int                     obj_size;
> >> +	int                     nr_objs;
> >> +	int                     nr_cpus;
> >> +	int                     capacity;
> >> +	gfp_t                   gfp;
> >> +	refcount_t              ref;
> >> +	unsigned long           flags;
> >> +	struct objpool_slot   **cpu_slots;
> >> +	int                    *slot_sizes;
> >> +	objpool_fini_cb         release;
> >> +	void                   *context;
> >> +};
> >> +
> >> +#define OBJPOOL_FROM_VMALLOC	(0x800000000)	/* objpool allocated from vmalloc area */
> >> +#define OBJPOOL_HAVE_OBJECTS	(0x400000000)	/* objects allocated along with objpool */
> >> +
> >> +/**
> >> + * objpool_init() - initialize objpool and pre-allocated objects
> >> + * @head:    the object pool to be initialized, declared by caller
> >> + * @nr_objs: total objects to be pre-allocated by this object pool
> >> + * @object_size: size of an object (should be > 0)
> >> + * @gfp:     flags for memory allocation (via kmalloc or vmalloc)
> >> + * @context: user context for object initialization callback
> >> + * @objinit: object initialization callback for extra setup
> >> + * @release: cleanup callback for extra cleanup task
> >> + *
> >> + * return value: 0 for success, otherwise error code
> >> + *
> >> + * All pre-allocated objects are to be zeroed after memory allocation.
> >> + * Caller could do extra initialization in objinit callback. objinit()
> >> + * will be called just after slot allocation and will be only once for
> >> + * each object. Since then the objpool won't touch any content of the
> >> + * objects. It's caller's duty to perform reinitialization after each
> >> + * pop (object allocation) or do clearance before each push (object
> >> + * reclamation).
> >> + */
> >> +int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
> >> +		 gfp_t gfp, void *context, objpool_init_obj_cb objinit,
> >> +		 objpool_fini_cb release);
> >> +
> >> +/**
> >> + * objpool_pop() - allocate an object from objpool
> >> + * @head: object pool
> >> + *
> >> + * return value: object ptr or NULL if failed
> >> + */
> >> +void *objpool_pop(struct objpool_head *head);
> >> +
> >> +/**
> >> + * objpool_push() - reclaim the object and return back to objpool
> >> + * @obj:  object ptr to be pushed to objpool
> >> + * @head: object pool
> >> + *
> >> + * return: 0 or error code (it fails only when user tries to push
> >> + * the same object multiple times or wrong "objects" into objpool)
> >> + */
> >> +int objpool_push(void *obj, struct objpool_head *head);
> >> +
> >> +/**
> >> + * objpool_drop() - discard the object and deref objpool
> >> + * @obj:  object ptr to be discarded
> >> + * @head: object pool
> >> + *
> >> + * return: 0 if objpool was released or error code
> >> + */
> >> +int objpool_drop(void *obj, struct objpool_head *head);
> >> +
> >> +/**
> >> + * objpool_free() - release objpool forcely (all objects to be freed)
> >> + * @head: object pool to be released
> >> + */
> >> +void objpool_free(struct objpool_head *head);
> >> +
> >> +/**
> >> + * objpool_fini() - deref object pool (also releasing unused objects)
> >> + * @head: object pool to be dereferenced
> >> + */
> >> +void objpool_fini(struct objpool_head *head);
> >> +
> >> +#endif /* _LINUX_OBJPOOL_H */
> >> diff --git a/lib/Makefile b/lib/Makefile
> >> index 1ffae65bb7ee..7a84c922d9ff 100644
> >> --- a/lib/Makefile
> >> +++ b/lib/Makefile
> >> @@ -34,7 +34,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
> >>   	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
> >>   	 earlycpio.o seq_buf.o siphash.o dec_and_lock.o \
> >>   	 nmi_backtrace.o win_minmax.o memcat_p.o \
> >> -	 buildid.o
> >> +	 buildid.o objpool.o
> >>   
> >>   lib-$(CONFIG_PRINTK) += dump_stack.o
> >>   lib-$(CONFIG_SMP) += cpumask.o
> >> diff --git a/lib/objpool.c b/lib/objpool.c
> >> new file mode 100644
> >> index 000000000000..22e752371820
> >> --- /dev/null
> >> +++ b/lib/objpool.c
> >> @@ -0,0 +1,338 @@
> >> +// SPDX-License-Identifier: GPL-2.0
> >> +
> >> +#include <linux/objpool.h>
> >> +#include <linux/slab.h>
> >> +#include <linux/vmalloc.h>
> >> +#include <linux/atomic.h>
> >> +#include <linux/prefetch.h>
> >> +#include <linux/irqflags.h>
> >> +#include <linux/cpumask.h>
> >> +#include <linux/log2.h>
> >> +
> >> +/*
> >> + * objpool: ring-array based lockless MPMC/FIFO queues
> >> + *
> >> + * Copyright: wuqiang.matt@bytedance.com
> >> + */
> >> +
> >> +#define SLOT_AGES(s) ((uint32_t *)((char *)(s) + sizeof(struct objpool_slot)))
> >> +#define SLOT_ENTS(s) ((void **)((char *)(s) + sizeof(struct objpool_slot) + \
> >> +			(sizeof(uint32_t) << (s)->bits)))
> >> +#define SLOT_OBJS(s) ((void *)((char *)(s) + sizeof(struct objpool_slot) + \
> >> +			((sizeof(uint32_t) + sizeof(void *)) << (s)->bits)))
> >> +#define SLOT_CORE(n) cpumask_nth((n) % num_possible_cpus(), cpu_possible_mask)
> >> +
> >> +/* compute the suitable num of objects to be managed per slot */
> >> +static int objpool_nobjs(int size)
> >> +{
> >> +	return rounddown_pow_of_two((size - sizeof(struct objpool_slot)) /
> >> +			(sizeof(uint32_t) + sizeof(void *)));
> >> +}
> >> +
> >> +/* allocate and initialize percpu slots */
> > 
> > @head: the objpool_head for managing this objpool
> > @nobjs: the total number of objects in this objpool
> > @context: context data for @objinit
> > @objinit: initialize callback for each object.
> 
> Got it. I didn't document them since objpool_init_percpu_slots is not public.
> 
> >> +static int
> >> +objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
> >> +			void *context, objpool_init_obj_cb objinit)
> >> +{
> >> +	int i, j, n, size, objsz, cpu = 0, nents = head->capacity;
> > 
> > 'nents' is *round up to the power of 2* of the total number of objects.
> > 
> >> +
> >> +	/* aligned object size by sizeof(void *) */
> >> +	objsz = ALIGN(head->obj_size, sizeof(void *));
> >> +	/* shall we allocate objects along with percpu-slot */
> >> +	if (objsz)
> >> +		head->flags |= OBJPOOL_HAVE_OBJECTS;
> > 
> > Is there any chance that objsz == 0?
> 
> No chance. We always require a non-zero objsz. Will update in the next version.
> 
> > 
> >> +
> >> +	/* vmalloc is used in default to allocate percpu-slots */
> >> +	if (!(head->gfp & GFP_ATOMIC))
> >> +		head->flags |= OBJPOOL_FROM_VMALLOC;
> >> +
> >> +	for (i = 0; i < head->nr_cpus; i++) {
> >> +		struct objpool_slot *os;
> >> +
> >> +		/* skip the cpus which could never be present */
> >> +		if (!cpu_possible(i))
> >> +			continue;
> >> +
> >> +		/* compute how many objects to be managed by this slot */
> > 
> > "to be managed"? or "to be allocated with"?
> > It seems all objects are possible to be managed by each slot.
> 
> Right. "to be allocated with" is preferable. Thanks.
> 
> >> +		n = nobjs / num_possible_cpus();
> >> +		if (cpu < (nobjs % num_possible_cpus()))
> >> +			n++;
> >> +		size = sizeof(struct objpool_slot) + sizeof(void *) * nents +
> >> +		       sizeof(uint32_t) * nents + objsz * n;
> >> +
> >> +		/*
> >> +		 * here we allocate percpu-slot & objects together in a single
> >> +		 * allocation, taking advantage of warm caches and TLB hits as
> >> +		 * vmalloc always aligns the request size to pages
> > 
> > "Since the objpool_entry array in the slot is mostly accessed from the
> >   i-th CPU, it should be allocated from the memory node for that CPU."
> > 
> > I think the reason of the memory node allocation is mainly for reducing the
> > penalty of the cache-miss, since it will be bigger if running on NUMA.
> 
> Right, NUMA is addressed by objpool_slot. The above description is to explain
> why a single memory allocation is used (rather than multiple). I'll try to make
> it clearer.
> 
> > 
> >> +		 */
> >> +		if (head->flags & OBJPOOL_FROM_VMALLOC)
> >> +			os = __vmalloc_node(size, sizeof(void *), head->gfp,
> >> +				cpu_to_node(i), __builtin_return_address(0));
> >> +		else
> >> +			os = kmalloc_node(size, head->gfp, cpu_to_node(i));
> >> +		if (!os)
> >> +			return -ENOMEM;
> >> +
> >> +		/* initialize percpu slot for the i-th slot */
> >> +		memset(os, 0, size);
> >> +		os->bits = ilog2(head->capacity);
> >> +		os->mask = head->capacity - 1;
> >> +		head->cpu_slots[i] = os;
> >> +		head->slot_sizes[i] = size;
> >> +		cpu = cpu + 1;
> >> +
> >> +		/*
> >> +		 * manually set head & tail to avoid possible conflict:
> >> +		 * We assume that the head item is ready for retrieval
> >> +		 * iff head is equal to ages[head & mask]. but ages is
> >> +		 * initialized as 0, so in view of the caller of pop(),
> >> +		 * the 1st item (0th) is always ready, but the reality
> >> +		 * could be: push() is stalled before the final update,
> >> +		 * thus the item being inserted will be lost forever
> >> +		 */
> >> +		os->head = os->tail = head->capacity;
> >> +
> >> +		if (!objsz)
> >> +			continue;
> > 
> > Is it possible? and for what?
> 
> Will be removed in next version.
> 
> > 
> >> +
> >> +		for (j = 0; j < n; j++) {
> >> +			uint32_t *ages = SLOT_AGES(os);
> >> +			void **ents = SLOT_ENTS(os);
> >> +			void *obj = SLOT_OBJS(os) + j * objsz;
> >> +			uint32_t ie = os->tail & os->mask;
> >> +
> >> +			/* perform object initialization */
> >> +			if (objinit) {
> >> +				int rc = objinit(obj, context);
> >> +				if (rc)
> >> +					return rc;
> >> +			}
> >> +
> >> +			/* add obj into the ring array */
> >> +			ents[ie] = obj;
> >> +			ages[ie] = os->tail;
> >> +			os->tail++;
> >> +			head->nr_objs++;
> >> +		}
> > 
> > To simplify the code, this loop should be another static function.
> 
> I'll reconsider the implementation. And the multiple computations of ages/ents
> should be avoided too.
> 
> > 
> >> +	}
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +/* cleanup all percpu slots of the object pool */
> >> +static void objpool_fini_percpu_slots(struct objpool_head *head)
> >> +{
> >> +	int i;
> >> +
> >> +	if (!head->cpu_slots)
> >> +		return;
> >> +
> >> +	for (i = 0; i < head->nr_cpus; i++) {
> >> +		if (!head->cpu_slots[i])
> >> +			continue;
> >> +		if (head->flags & OBJPOOL_FROM_VMALLOC)
> >> +			vfree(head->cpu_slots[i]);
> >> +		else
> >> +			kfree(head->cpu_slots[i]);
> >> +	}
> >> +	kfree(head->cpu_slots);
> >> +	head->cpu_slots = NULL;
> >> +	head->slot_sizes = NULL;
> >> +}
> >> +
> >> +/* initialize object pool and pre-allocate objects */
> >> +int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
> >> +		gfp_t gfp, void *context, objpool_init_obj_cb objinit,
> >> +		objpool_fini_cb release)
> >> +{
> >> +	int nents, rc;
> >> +
> >> +	/* check input parameters */
> >> +	if (nr_objs <= 0 || object_size <= 0)
> >> +		return -EINVAL;
> >> +
> >> +	/* calculate percpu slot size (rounded to pow of 2) */
> >> +	nents = max_t(int, roundup_pow_of_two(nr_objs),
> >> +			objpool_nobjs(L1_CACHE_BYTES));
> >> +
> >> +	/* initialize objpool head */
> >> +	memset(head, 0, sizeof(struct objpool_head));
> >> +	head->nr_cpus = nr_cpu_ids;
> >> +	head->obj_size = object_size;
> >> +	head->capacity = nents;
> >> +	head->gfp = gfp & ~__GFP_ZERO;
> >> +	head->context = context;
> >> +	head->release = release;
> >> +
> >> +	/* allocate array for percpu slots */
> >> +	head->cpu_slots = kzalloc(head->nr_cpus * sizeof(void *) +
> >> +			       head->nr_cpus * sizeof(int), head->gfp);
> >> +	if (!head->cpu_slots)
> >> +		return -ENOMEM;
> >> +	head->slot_sizes = (int *)&head->cpu_slots[head->nr_cpus];
> >> +
> >> +	/* initialize per-cpu slots */
> >> +	rc = objpool_init_percpu_slots(head, nr_objs, context, objinit);
> >> +	if (rc)
> >> +		objpool_fini_percpu_slots(head);
> >> +	else
> >> +		refcount_set(&head->ref, nr_objs + 1);
> >> +
> >> +	return rc;
> >> +}
> >> +EXPORT_SYMBOL_GPL(objpool_init);
> >> +
> >> +/* adding object to slot, abort if the slot was already full */
> >> +static inline int objpool_try_add_slot(void *obj, struct objpool_slot *os)
> >> +{
> >> +	uint32_t *ages = SLOT_AGES(os);
> >> +	void **ents = SLOT_ENTS(os);
> >> +	uint32_t head, tail;
> >> +
> >> +	do {
> >> +		/* perform memory loading for both head and tail */
> >> +		head = READ_ONCE(os->head);
> >> +		tail = READ_ONCE(os->tail);
> >> +		/* just abort if slot is full */
> >> +		if (tail - head > os->mask)
> >> +			return -ENOENT;
> > 
> > Is this really possible? The total number of objects must be less than or equal to
> > the os->mask. If it means a bug, please use WARN_ON_ONCE() here for debug.
> 
> Yes, it's a BUG and the caller's fault. When a user tries pushing a wrong object
> or repeatedly pushing the same object, it could break the objpool's consistency.
> It's a choice between 'bad' and 'worse': better to return an error than to break
> the consistency.
> 
> As you advised, better to crash than to be silently broken. I'll update in the
> next version.
> 
> > 
> >> +		/* try to extend tail by 1 using CAS to avoid races */
> >> +		if (try_cmpxchg_acquire(&os->tail, &tail, tail + 1))
> >> +			break;
> >> +	} while (1);
> > 
> > "if(cond) ~ break; } while(1)" should be "} (!cond);"
> 
> I see. Just to make the code look more "balanced" with the comments :)
> 
> > 
> > And this seems to be buggy since tail++ can be 0, then "tail - head" < 0.
> > 
> > if (tail < head) {
> > 	if (WARN_ON_ONCE(tail + (UINT32_MAX - head) > os->mask))
> > 		return -ENOENT;
> > } else {
> > 	if (WARN_ON_ONCE(tail - head > os->mask))
> > 		return -ENOENT;
> > }
> 
> tail and head are unsigned, so "tail - head" is computed modulo 2^32 and always
> gives the actual number of free objects held in the objpool_slot, even when
> tail has wrapped around.
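> 
> For illustration only (a stand-alone user-space sketch, not from the patch),
> the unsigned subtraction stays correct even across a 32-bit wrap of tail:
> 
>   #include <stdint.h>
>   #include <stdio.h>
> 
>   int main(void)
>   {
>   	/* hypothetical snapshot: tail has wrapped past UINT32_MAX, head has not */
>   	uint32_t head = 0xfffffffeU;
>   	uint32_t tail = 0x00000002U;	/* i.e. 0xfffffffe + 4, wrapped */
> 
>   	/* unsigned subtraction is modulo 2^32, so the distance is still 4 */
>   	printf("%u\n", (unsigned int)(tail - head));
>   	return 0;
>   }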
> 
> >> +
> >> +	/* the tail-th of slot is reserved for the given obj */
> >> +	WRITE_ONCE(ents[tail & os->mask], obj);
> >> +	/* update epoch id to make this object available for pop() */
> >> +	smp_store_release(&ages[tail & os->mask], tail);
> > 
> > Note: since the ages array size is the power of 2, this is just a
> > (32 - os->bits) loop counter. :)
> > 
> >> +	return 0;
> >> +}
> >> +
> >> +/* reclaim an object to object pool */
> >> +int objpool_push(void *obj, struct objpool_head *oh)
> >> +{
> >> +	unsigned long flags;
> >> +	int cpu, rc;
> >> +
> >> +	/* disable local irq to avoid preemption & interruption */
> >> +	raw_local_irq_save(flags);
> >> +	cpu = raw_smp_processor_id();
> >> +	do {
> >> +		rc = objpool_try_add_slot(obj, oh->cpu_slots[cpu]);
> >> +		if (!rc)
> >> +			break;
> >> +		cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
> >> +	} while (1);
> > 
> > Hmm, as I said, head->capacity >= nr_all_obj, this must not happen,
> > we can always push it on this CPU's slot, right?
> 
> Right. If it happens, that means the user made a mistake. I'll refine
> the code.
> 
> > 
> >> +	raw_local_irq_restore(flags);
> >> +
> >> +	return rc;
> >> +}
> >> +EXPORT_SYMBOL_GPL(objpool_push);
> >> +
> >> +/* drop the allocated object, rather reclaim it to objpool */
> >> +int objpool_drop(void *obj, struct objpool_head *head)
> >> +{
> >> +	if (!obj || !head)
> >> +		return -EINVAL;
> >> +
> >> +	if (refcount_dec_and_test(&head->ref)) {
> >> +		objpool_free(head);
> >> +		return 0;
> >> +	}
> >> +
> >> +	return -EAGAIN;
> >> +}
> >> +EXPORT_SYMBOL_GPL(objpool_drop);
> >> +
> >> +/* try to retrieve object from slot */
> >> +static inline void *objpool_try_get_slot(struct objpool_slot *os)
> >> +{
> >> +	uint32_t *ages = SLOT_AGES(os);
> >> +	void **ents = SLOT_ENTS(os);
> >> +	/* do memory load of head to local head */
> >> +	uint32_t head = smp_load_acquire(&os->head);
> >> +
> >> +	/* loop if slot isn't empty */
> >> +	while (head != READ_ONCE(os->tail)) {
> >> +		uint32_t id = head & os->mask, prev = head;
> >> +
> >> +		/* do prefetching of object ents */
> >> +		prefetch(&ents[id]);
> >> +
> >> +		/* check whether this item was ready for retrieval */
> >> +		if (smp_load_acquire(&ages[id]) == head) {
> > 
> > We may not need this check, since we know head != tail and the
> > sizeof(ages) >= nr_all_objs.
> > 
> > Hmm, I guess we can remove ages[] from the code.
> 
> Just do a quick peek to avoid an unnecessary call of try_cmpxchg_release.
> try_cmpxchg_release is implemented with a heavy "LOCK"-prefixed instruction,
> which could cause cache invalidation among CPU nodes.

OK, I understand what this ages[] does. This is a nestable commit table
for the ring array.

> 
> > 
> >> +			/* node must have been udpated by push() */
> >> +			void *node = READ_ONCE(ents[id]);
> > 
> > Please use the same word for the same object.
> > I mean this is not 'node' but 'object'.
> 
> Got it.
> 
> > 
> >> +			/* commit and move forward head of the slot */
> >> +			if (try_cmpxchg_release(&os->head, &head, head + 1))
> >> +				return node;
> >> +			/* head was already updated by others */
> >> +		}
> >> +
> >> +		/* re-load head from memory and continue trying */
> >> +		head = READ_ONCE(os->head);
> >> +		/*
> >> +		 * head stays unchanged, so it's very likely there's an
> >> +		 * ongoing push() on other cpu nodes but yet not update
> >> +		 * ages[] to mark it's completion
> >> +		 */
> >> +		if (head == prev)
> >> +			break;
> > 
> > This is OK. If we always push() on the current CPU slot, and pop() from
> > any cpus, we can try again here if this slot is not current CPU. But that
> > may be too much :P
> 
> Yes. In most cases, every CPU should only touch its own objpool_slot.
> 
> > Thank you,
> 
> Thanks for your time.

Thank you for your reply!


> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v9 1/5] lib: objpool added: ring-array based lockless MPMC
  2023-10-09  9:23     ` wuqiang
  2023-10-09 13:51       ` Masami Hiramatsu
@ 2023-10-12 14:02       ` Masami Hiramatsu
  2023-10-12 17:36         ` wuqiang.matt
  1 sibling, 1 reply; 25+ messages in thread
From: Masami Hiramatsu @ 2023-10-12 14:02 UTC (permalink / raw)
  To: wuqiang
  Cc: linux-trace-kernel, davem, anil.s.keshavamurthy, naveen.n.rao,
	rostedt, peterz, akpm, sander, ebiggers, dan.j.williams,
	jpoimboe, linux-kernel, lkp, mattwu

Hi Wuqiang,

On Mon, 9 Oct 2023 17:23:34 +0800
wuqiang <wuqiang.matt@bytedance.com> wrote:

> Hello Masami,
> 
> Just got time for the new patch and noticed that ages[] was removed. ages[] is
> introduced as a kind of 2-phase commit to keep consistency and must be kept.
> 
> Think of the following 2 cases where two cpu nodes are operating on the same
> objpool_slot simultaneously:
> 
> Case 1:
> 
>    NODE 1:                  NODE 2:
>    push to an empty slot    pop will get wrong value
> 
>    try_cmpxchg_acquire(&slot->tail, &tail, next)
>                             try_cmpxchg_release(&slot->head, &head, head + 1)
>                             return slot->entries[head & slot->mask]
>    WRITE_ONCE(slot->entries[tail & slot->mask], obj)

Today, I rethought the algorithm.

For this case, we can just add a `commit` field to the slot for committing the
tail index instead of using the ages[].

CPU1                                       CPU2
push to an empty slot                      pop from the same slot

commit = tail = slot->tail;
next = tail + 1;
try_cmpxchg_acquire(&slot->tail,
                    &tail, next);
                                           while (head != commit) -> NG1
WRITE_ONCE(slot->entries[tail & slot->mask],
            obj)
                                           while (head != commit) -> NG2
WRITE_ONCE(&slot->commit, next);
                                           while (head != commit) -> OK
                                           try_cmpxchg_acquire(&slot->head,
                                                               &head, next);
                                           return slot->entries[head & slot->mask]

So at the NG1 and NG2 timings, the pop will fail.

This does not handle the nested "push" case, because the reserve-commit block
runs with interrupts disabled. This doesn't support NMI, but that is OK.
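
In C, the non-NMI variant would look roughly like the sketch below. This is
only an illustration of the idea, not a tested patch: it assumes a new
'commit' member in struct objpool_slot and that push() runs with irqs
disabled on the CPU owning the slot.

---
struct objpool_slot {
	uint32_t	head;
	uint32_t	tail;
	uint32_t	commit;		/* hypothetical new member */
	uint32_t	mask;
	void		*entries[];
};

static int objpool_try_add_slot(void *obj, struct objpool_slot *slot)
{
	uint32_t tail, next;

	/* reserve one entry; no nesting because irqs are disabled */
	do {
		tail = READ_ONCE(slot->tail);
		next = tail + 1;
	} while (!try_cmpxchg_acquire(&slot->tail, &tail, next));

	/* fill the reserved entry, then publish it via 'commit' */
	WRITE_ONCE(slot->entries[tail & slot->mask], obj);
	smp_store_release(&slot->commit, next);
	return 0;
}

static void *objpool_try_get_slot(struct objpool_slot *slot)
{
	uint32_t head = smp_load_acquire(&slot->head);
	uint32_t commit = smp_load_acquire(&slot->commit);

	/* only entries below 'commit' are guaranteed to be filled */
	while (head != commit) {
		if (try_cmpxchg_release(&slot->head, &head, head + 1))
			return slot->entries[head & slot->mask];
		/* head was moved by a concurrent pop(); reload commit too */
		commit = smp_load_acquire(&slot->commit);
	}
	return NULL;	/* NG1/NG2 window: nothing committed yet */
}
---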

If we need to support NMI or remove the irq-disable, we can do the following
(please remember that the push operation only happens on the slot owned by that
CPU, so this is a per-cpu process):

---
do {
  commit = tail = slot->tail;
  next = tail + 1;
} while (!try_cmpxchg_acquire(&slot->tail, &tail, next));

WRITE_ONCE(slot->entries[tail & slot->mask], obj);

// At this point, "commit == slot->tail - 1" at this nesting level.
// Only in the outermost (non-nested) case does "commit == slot->commit".
if (commit != slot->commit)
  return; // do nothing; it must be updated by the outermost one.

// catch up commit to tail.
do {
  commit = READ_ONCE(slot->tail);
  WRITE_ONCE(slot->commit, commit);
   // note that someone can interrupt this loop too.
} while (commit != READ_ONCE(slot->tail));
---

For the rethook this may be too much.

Thank you,

> 
> 
> Case 2:
> 
>    NODE 1:                  NODE 2
>    push to slot w/ 1 obj    pop will get wrong value
> 
>                             try_cmpxchg_release(&slot->head, &head, head + 1)
>    try_cmpxchg_acquire(&slot->tail, &tail, next)
>    WRITE_ONCE(slot->entries[tail & slot->mask], obj)
>                             return slot->entries[head & slot->mask]
> 
> 
> Regards,
> wuqiang
> 
> On 2023/9/25 17:42, Masami Hiramatsu (Google) wrote:
> > Hi Wuqiang,
> > 
> > On Tue,  5 Sep 2023 09:52:51 +0800
> > "wuqiang.matt" <wuqiang.matt@bytedance.com> wrote:
> > 
> >> The object pool is a scalable implementation of high performance queue
> >> for object allocation and reclamation, such as kretprobe instances.
> >>
> >> With leveraging percpu ring-array to mitigate the hot spot of memory
> >> contention, it could deliver near-linear scalability for high parallel
> >> scenarios. The objpool is best suited for following cases:
> >> 1) Memory allocation or reclamation are prohibited or too expensive
> >> 2) Consumers are of different priorities, such as irqs and threads
> >>
> >> Limitations:
> >> 1) Maximum objects (capacity) is determined during pool initializing
> >>     and can't be modified (extended) after objpool creation
> >> 2) The memory of objects won't be freed until objpool is finalized
> >> 3) Object allocation (pop) may fail after trying all cpu slots
> > 
> > I made a simplifying patch for this by (mainly) removing the ages array.
> > I also renamed local variables to use more readable names, like slot,
> > pool, and obj.
> > 
> > Here are the results from running test_objpool.ko.
> > 
> > Original:
> > [   50.500235] Summary of testcases:
> > [   50.503139]     duration: 1027135us 	hits:   30628293 	miss:          0 	sync: percpu objpool
> > [   50.510416]     duration: 1047977us 	hits:   30453023 	miss:          0 	sync: percpu objpool from vmalloc
> > [   50.517421]     duration: 1047098us 	hits:   31145034 	miss:          0 	sync & hrtimer: percpu objpool
> > [   50.524671]     duration: 1053233us 	hits:   30919640 	miss:          0 	sync & hrtimer: percpu objpool from vmalloc
> > [   50.531382]     duration: 1055822us 	hits:    3407221 	miss:     830219 	sync overrun: percpu objpool
> > [   50.538135]     duration: 1055998us 	hits:    3404624 	miss:     854160 	sync overrun: percpu objpool from vmalloc
> > [   50.546686]     duration: 1046104us 	hits:   19464798 	miss:          0 	async: percpu objpool
> > [   50.552989]     duration: 1033054us 	hits:   18957836 	miss:          0 	async: percpu objpool from vmalloc
> > [   50.560289]     duration: 1041778us 	hits:   33806868 	miss:          0 	async & hrtimer: percpu objpool
> > [   50.567425]     duration: 1048901us 	hits:   34211860 	miss:          0 	async & hrtimer: percpu objpool from vmalloc
> > 
> > Simplified:
> > [   48.393236] Summary of testcases:
> > [   48.395384]     duration: 1013002us 	hits:   29661448 	miss:          0 	sync: percpu objpool
> > [   48.400351]     duration: 1057231us 	hits:   31035578 	miss:          0 	sync: percpu objpool from vmalloc
> > [   48.405660]     duration: 1043196us 	hits:   30546652 	miss:          0 	sync & hrtimer: percpu objpool
> > [   48.411216]     duration: 1047128us 	hits:   30601306 	miss:          0 	sync & hrtimer: percpu objpool from vmalloc
> > [   48.417231]     duration: 1051695us 	hits:    3468287 	miss:     892881 	sync overrun: percpu objpool
> > [   48.422405]     duration: 1054003us 	hits:    3549491 	miss:     898141 	sync overrun: percpu objpool from vmalloc
> > [   48.428425]     duration: 1052946us 	hits:   19005228 	miss:          0 	async: percpu objpool
> > [   48.433597]     duration: 1051757us 	hits:   19670866 	miss:          0 	async: percpu objpool from vmalloc
> > [   48.439280]     duration: 1042951us 	hits:   37135332 	miss:          0 	async & hrtimer: percpu objpool
> > [   48.445085]     duration: 1029803us 	hits:   37093248 	miss:          0 	async & hrtimer: percpu objpool from vmalloc
> > 
> > Can you test it too?
> > 
> > Thanks,
> > 
> >  From f1f442ff653e329839e5452b8b88463a80a12ff3 Mon Sep 17 00:00:00 2001
> > From: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
> > Date: Mon, 25 Sep 2023 16:07:12 +0900
> > Subject: [PATCH] objpool: Simplify objpool by removing ages array
> > 
> > Simplify the objpool code by removing ages array from per-cpu slot.
> > It chooses a large enough capacity (the power of 2 rounded up from
> > nr_objects + 1) for the entries array, so the tail never catches up to
> > the head in a per-cpu slot. Thus tail == head means the slot is empty.
> > 
> > This also uses consistent local variable names for pool, slot and obj.
> > 
> > Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > ---
> >   include/linux/objpool.h |  61 ++++----
> >   lib/objpool.c           | 310 ++++++++++++++++------------------------
> >   2 files changed, 147 insertions(+), 224 deletions(-)
> > 
> > diff --git a/include/linux/objpool.h b/include/linux/objpool.h
> > index 33c832216b98..ecd5ecaffcd3 100644
> > --- a/include/linux/objpool.h
> > +++ b/include/linux/objpool.h
> > @@ -38,33 +38,23 @@
> >    * struct objpool_slot - percpu ring array of objpool
> >    * @head: head of the local ring array (to retrieve at)
> >    * @tail: tail of the local ring array (to append at)
> > - * @bits: log2 of capacity (for bitwise operations)
> > - * @mask: capacity - 1
> > + * @mask: capacity of entries - 1
> > + * @entries: object entries on this slot.
> >    *
> >    * Represents a cpu-local array-based ring buffer, its size is specialized
> >    * during initialization of object pool. The percpu objpool slot is to be
> >    * allocated from local memory for NUMA system, and to be kept compact in
> > - * continuous memory: ages[] is stored just after the body of objpool_slot,
> > - * and then entries[]. ages[] describes revision of each item, solely used
> > - * to avoid ABA; entries[] contains pointers of the actual objects
> > - *
> > - * Layout of objpool_slot in memory:
> > - *
> > - * 64bit:
> > - *        4      8      12     16        32                 64
> > - * | head | tail | bits | mask | ages[4] | ents[4]: (8 * 4) | objects
> > - *
> > - * 32bit:
> > - *        4      8      12     16        32        48       64
> > - * | head | tail | bits | mask | ages[4] | ents[4] | unused | objects
> > + * continuous memory: the CPU-assigned number of objects is stored just after
> > + * the body of objpool_slot.
> >    *
> >    */
> >   struct objpool_slot {
> > -	uint32_t                head;
> > -	uint32_t                tail;
> > -	uint32_t                bits;
> > -	uint32_t                mask;
> > -} __packed;
> > +	uint32_t	head;
> > +	uint32_t	tail;
> > +	uint32_t	mask;
> > +	uint32_t	dummyr;
> > +	void *		entries[];
> > +};
> >   
> >   struct objpool_head;
> >   
> > @@ -82,12 +72,11 @@ typedef int (*objpool_fini_cb)(struct objpool_head *head, void *context);
> >    * @obj_size:   object & element size
> >    * @nr_objs:    total objs (to be pre-allocated)
> >    * @nr_cpus:    nr_cpu_ids
> > - * @capacity:   max objects per cpuslot
> > + * @capacity:   max objects on percpu slot
> >    * @gfp:        gfp flags for kmalloc & vmalloc
> >    * @ref:        refcount for objpool
> >    * @flags:      flags for objpool management
> >    * @cpu_slots:  array of percpu slots
> > - * @slot_sizes:	size in bytes of slots
> >    * @release:    resource cleanup callback
> >    * @context:    caller-provided context
> >    */
> > @@ -100,7 +89,6 @@ struct objpool_head {
> >   	refcount_t              ref;
> >   	unsigned long           flags;
> >   	struct objpool_slot   **cpu_slots;
> > -	int                    *slot_sizes;
> >   	objpool_fini_cb         release;
> >   	void                   *context;
> >   };
> > @@ -108,9 +96,12 @@ struct objpool_head {
> >   #define OBJPOOL_FROM_VMALLOC	(0x800000000)	/* objpool allocated from vmalloc area */
> >   #define OBJPOOL_HAVE_OBJECTS	(0x400000000)	/* objects allocated along with objpool */
> >   
> > +#define OBJPOOL_NR_OBJECT_MAX	(1 << 24)
> > +#define OBJPOOL_OBJECT_SIZE_MAX	(1 << 16)
> > +
> >   /**
> >    * objpool_init() - initialize objpool and pre-allocated objects
> > - * @head:    the object pool to be initialized, declared by caller
> > + * @pool:    the object pool to be initialized, declared by caller
> >    * @nr_objs: total objects to be pre-allocated by this object pool
> >    * @object_size: size of an object (should be > 0)
> >    * @gfp:     flags for memory allocation (via kmalloc or vmalloc)
> > @@ -128,47 +119,47 @@ struct objpool_head {
> >    * pop (object allocation) or do clearance before each push (object
> >    * reclamation).
> >    */
> > -int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
> > +int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
> >   		 gfp_t gfp, void *context, objpool_init_obj_cb objinit,
> >   		 objpool_fini_cb release);
> >   
> >   /**
> >    * objpool_pop() - allocate an object from objpool
> > - * @head: object pool
> > + * @pool: object pool
> >    *
> >    * return value: object ptr or NULL if failed
> >    */
> > -void *objpool_pop(struct objpool_head *head);
> > +void *objpool_pop(struct objpool_head *pool);
> >   
> >   /**
> >    * objpool_push() - reclaim the object and return back to objpool
> >    * @obj:  object ptr to be pushed to objpool
> > - * @head: object pool
> > + * @pool: object pool
> >    *
> >    * return: 0 or error code (it fails only when user tries to push
> >    * the same object multiple times or wrong "objects" into objpool)
> >    */
> > -int objpool_push(void *obj, struct objpool_head *head);
> > +int objpool_push(void *obj, struct objpool_head *pool);
> >   
> >   /**
> >    * objpool_drop() - discard the object and deref objpool
> >    * @obj:  object ptr to be discarded
> > - * @head: object pool
> > + * @pool: object pool
> >    *
> >    * return: 0 if objpool was released or error code
> >    */
> > -int objpool_drop(void *obj, struct objpool_head *head);
> > +int objpool_drop(void *obj, struct objpool_head *pool);
> >   
> >   /**
> >    * objpool_free() - release objpool forcely (all objects to be freed)
> > - * @head: object pool to be released
> > + * @pool: object pool to be released
> >    */
> > -void objpool_free(struct objpool_head *head);
> > +void objpool_free(struct objpool_head *pool);
> >   
> >   /**
> >    * objpool_fini() - deref object pool (also releasing unused objects)
> > - * @head: object pool to be dereferenced
> > + * @pool: object pool to be dereferenced
> >    */
> > -void objpool_fini(struct objpool_head *head);
> > +void objpool_fini(struct objpool_head *pool);
> >   
> >   #endif /* _LINUX_OBJPOOL_H */
> > diff --git a/lib/objpool.c b/lib/objpool.c
> > index 22e752371820..f8e8f70d7253 100644
> > --- a/lib/objpool.c
> > +++ b/lib/objpool.c
> > @@ -15,104 +15,55 @@
> >    * Copyright: wuqiang.matt@bytedance.com
> >    */
> >   
> > -#define SLOT_AGES(s) ((uint32_t *)((char *)(s) + sizeof(struct objpool_slot)))
> > -#define SLOT_ENTS(s) ((void **)((char *)(s) + sizeof(struct objpool_slot) + \
> > -			(sizeof(uint32_t) << (s)->bits)))
> > -#define SLOT_OBJS(s) ((void *)((char *)(s) + sizeof(struct objpool_slot) + \
> > -			((sizeof(uint32_t) + sizeof(void *)) << (s)->bits)))
> > -#define SLOT_CORE(n) cpumask_nth((n) % num_possible_cpus(), cpu_possible_mask)
> > -
> > -/* compute the suitable num of objects to be managed per slot */
> > -static int objpool_nobjs(int size)
> > -{
> > -	return rounddown_pow_of_two((size - sizeof(struct objpool_slot)) /
> > -			(sizeof(uint32_t) + sizeof(void *)));
> > -}
> > -
> >   /* allocate and initialize percpu slots */
> >   static int
> > -objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
> > -			void *context, objpool_init_obj_cb objinit)
> > +objpool_init_percpu_slots(struct objpool_head *pool, void *context,
> > +			  objpool_init_obj_cb objinit)
> >   {
> > -	int i, j, n, size, objsz, cpu = 0, nents = head->capacity;
> > -
> > -	/* aligned object size by sizeof(void *) */
> > -	objsz = ALIGN(head->obj_size, sizeof(void *));
> > -	/* shall we allocate objects along with percpu-slot */
> > -	if (objsz)
> > -		head->flags |= OBJPOOL_HAVE_OBJECTS;
> > -
> > -	/* vmalloc is used in default to allocate percpu-slots */
> > -	if (!(head->gfp & GFP_ATOMIC))
> > -		head->flags |= OBJPOOL_FROM_VMALLOC;
> > -
> > -	for (i = 0; i < head->nr_cpus; i++) {
> > -		struct objpool_slot *os;
> > +	int i, j, n, size, slot_size, cpu_count = 0;
> > +	struct objpool_slot *slot;
> >   
> > +	for (i = 0; i < pool->nr_cpus; i++) {
> >   		/* skip the cpus which could never be present */
> >   		if (!cpu_possible(i))
> >   			continue;
> >   
> >   		/* compute how many objects to be managed by this slot */
> > -		n = nobjs / num_possible_cpus();
> > -		if (cpu < (nobjs % num_possible_cpus()))
> > +		n = pool->nr_objs / num_possible_cpus();
> > +		if (cpu_count < (pool->nr_objs % num_possible_cpus()))
> >   			n++;
> > -		size = sizeof(struct objpool_slot) + sizeof(void *) * nents +
> > -		       sizeof(uint32_t) * nents + objsz * n;
> > +		cpu_count++;
> > +
> > +		slot_size = struct_size(slot, entries, pool->capacity);
> > +		size = slot_size + pool->obj_size * n;
> >   
> >   		/*
> >   		 * here we allocate percpu-slot & objects together in a single
> > -		 * allocation, taking advantage of warm caches and TLB hits as
> > -		 * vmalloc always aligns the request size to pages
> > +		 * allocation, taking advantage of NUMA.
> >   		 */
> > -		if (head->flags & OBJPOOL_FROM_VMALLOC)
> > -			os = __vmalloc_node(size, sizeof(void *), head->gfp,
> > +		if (pool->flags & OBJPOOL_FROM_VMALLOC)
> > +			slot = __vmalloc_node(size, sizeof(void *), pool->gfp,
> >   				cpu_to_node(i), __builtin_return_address(0));
> >   		else
> > -			os = kmalloc_node(size, head->gfp, cpu_to_node(i));
> > -		if (!os)
> > +			slot = kmalloc_node(size, pool->gfp, cpu_to_node(i));
> > +		if (!slot)
> >   			return -ENOMEM;
> >   
> >   		/* initialize percpu slot for the i-th slot */
> > -		memset(os, 0, size);
> > -		os->bits = ilog2(head->capacity);
> > -		os->mask = head->capacity - 1;
> > -		head->cpu_slots[i] = os;
> > -		head->slot_sizes[i] = size;
> > -		cpu = cpu + 1;
> > -
> > -		/*
> > -		 * manually set head & tail to avoid possible conflict:
> > -		 * We assume that the head item is ready for retrieval
> > -		 * iff head is equal to ages[head & mask]. but ages is
> > -		 * initialized as 0, so in view of the caller of pop(),
> > -		 * the 1st item (0th) is always ready, but the reality
> > -		 * could be: push() is stalled before the final update,
> > -		 * thus the item being inserted will be lost forever
> > -		 */
> > -		os->head = os->tail = head->capacity;
> > -
> > -		if (!objsz)
> > -			continue;
> > +		memset(slot, 0, size);
> > +		slot->mask = pool->capacity - 1;
> > +		pool->cpu_slots[i] = slot;
> >   
> >   		for (j = 0; j < n; j++) {
> > -			uint32_t *ages = SLOT_AGES(os);
> > -			void **ents = SLOT_ENTS(os);
> > -			void *obj = SLOT_OBJS(os) + j * objsz;
> > -			uint32_t ie = os->tail & os->mask;
> > +			void *obj = (void *)slot + slot_size + pool->obj_size * j;
> >   
> > -			/* perform object initialization */
> >   			if (objinit) {
> >   				int rc = objinit(obj, context);
> >   				if (rc)
> >   					return rc;
> >   			}
> > -
> > -			/* add obj into the ring array */
> > -			ents[ie] = obj;
> > -			ages[ie] = os->tail;
> > -			os->tail++;
> > -			head->nr_objs++;
> > +			slot->entries[j] = obj;
> > +			slot->tail++;
> >   		}
> >   	}
> >   
> > @@ -120,164 +71,135 @@ objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
> >   }
> >   
> >   /* cleanup all percpu slots of the object pool */
> > -static void objpool_fini_percpu_slots(struct objpool_head *head)
> > +static void objpool_fini_percpu_slots(struct objpool_head *pool)
> >   {
> >   	int i;
> >   
> > -	if (!head->cpu_slots)
> > +	if (!pool->cpu_slots)
> >   		return;
> >   
> > -	for (i = 0; i < head->nr_cpus; i++) {
> > -		if (!head->cpu_slots[i])
> > -			continue;
> > -		if (head->flags & OBJPOOL_FROM_VMALLOC)
> > -			vfree(head->cpu_slots[i]);
> > -		else
> > -			kfree(head->cpu_slots[i]);
> > -	}
> > -	kfree(head->cpu_slots);
> > -	head->cpu_slots = NULL;
> > -	head->slot_sizes = NULL;
> > +	for (i = 0; i < pool->nr_cpus; i++)
> > +		kvfree(pool->cpu_slots[i]);
> > +	kfree(pool->cpu_slots);
> >   }
> >   
> >   /* initialize object pool and pre-allocate objects */
> > -int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
> > +int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
> >   		gfp_t gfp, void *context, objpool_init_obj_cb objinit,
> >   		objpool_fini_cb release)
> >   {
> >   	int nents, rc;
> >   
> >   	/* check input parameters */
> > -	if (nr_objs <= 0 || object_size <= 0)
> > +	if (nr_objs <= 0 || object_size <= 0 ||
> > +	    nr_objs > OBJPOOL_NR_OBJECT_MAX || object_size > OBJPOOL_OBJECT_SIZE_MAX)
> > +		return -EINVAL;
> > +
> > +	/* Align up to unsigned long size */
> > +	object_size = ALIGN(object_size, sizeof(unsigned long));
> > +
> > +	/*
> > +	 * To avoid filling up the entries array in the per-cpu slot,
> > +	 * use a power of 2 which is more than N + 1. Thus, the tail never
> > +	 * catches up to the head, and head/tail can be used as the sequential
> > +	 * number.
> > +	 */
> > +	nents = roundup_pow_of_two(nr_objs + 1);
> > +	if (!nents)
> >   		return -EINVAL;
> >   
> > -	/* calculate percpu slot size (rounded to pow of 2) */
> > -	nents = max_t(int, roundup_pow_of_two(nr_objs),
> > -			objpool_nobjs(L1_CACHE_BYTES));
> > -
> > -	/* initialize objpool head */
> > -	memset(head, 0, sizeof(struct objpool_head));
> > -	head->nr_cpus = nr_cpu_ids;
> > -	head->obj_size = object_size;
> > -	head->capacity = nents;
> > -	head->gfp = gfp & ~__GFP_ZERO;
> > -	head->context = context;
> > -	head->release = release;
> > -
> > -	/* allocate array for percpu slots */
> > -	head->cpu_slots = kzalloc(head->nr_cpus * sizeof(void *) +
> > -			       head->nr_cpus * sizeof(int), head->gfp);
> > -	if (!head->cpu_slots)
> > +	/* initialize objpool pool */
> > +	memset(pool, 0, sizeof(struct objpool_head));
> > +	pool->nr_cpus = nr_cpu_ids;
> > +	pool->obj_size = object_size;
> > +	pool->nr_objs = nr_objs;
> > +	/* just prevent the per-cpu ring array from ever being completely full */
> > +	pool->capacity = nents;
> > +	pool->gfp = gfp & ~__GFP_ZERO;
> > +	pool->context = context;
> > +	pool->release = release;
> > +	/* vmalloc is used by default to allocate percpu-slots */
> > +	if (!(pool->gfp & GFP_ATOMIC))
> > +		pool->flags |= OBJPOOL_FROM_VMALLOC;
> > +
> > +	pool->cpu_slots = kzalloc(pool->nr_cpus * sizeof(void *), pool->gfp);
> > +	if (!pool->cpu_slots)
> >   		return -ENOMEM;
> > -	head->slot_sizes = (int *)&head->cpu_slots[head->nr_cpus];
> >   
> >   	/* initialize per-cpu slots */
> > -	rc = objpool_init_percpu_slots(head, nr_objs, context, objinit);
> > +	rc = objpool_init_percpu_slots(pool, context, objinit);
> >   	if (rc)
> > -		objpool_fini_percpu_slots(head);
> > +		objpool_fini_percpu_slots(pool);
> >   	else
> > -		refcount_set(&head->ref, nr_objs + 1);
> > +		refcount_set(&pool->ref, nr_objs + 1);
> >   
> >   	return rc;
> >   }
> >   EXPORT_SYMBOL_GPL(objpool_init);
> >   
> >   /* adding object to slot, abort if the slot was already full */
> > -static inline int objpool_try_add_slot(void *obj, struct objpool_slot *os)
> > +static inline int objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu)
> >   {
> > -	uint32_t *ages = SLOT_AGES(os);
> > -	void **ents = SLOT_ENTS(os);
> > -	uint32_t head, tail;
> > +	struct objpool_slot *slot = pool->cpu_slots[cpu];
> > +	uint32_t tail, next;
> >   
> >   	do {
> > -		/* perform memory loading for both head and tail */
> > -		head = READ_ONCE(os->head);
> > -		tail = READ_ONCE(os->tail);
> > -		/* just abort if slot is full */
> > -		if (tail - head > os->mask)
> > -			return -ENOENT;
> > -		/* try to extend tail by 1 using CAS to avoid races */
> > -		if (try_cmpxchg_acquire(&os->tail, &tail, tail + 1))
> > -			break;
> > -	} while (1);
> > +		uint32_t head = READ_ONCE(slot->head);
> >   
> > -	/* the tail-th of slot is reserved for the given obj */
> > -	WRITE_ONCE(ents[tail & os->mask], obj);
> > -	/* update epoch id to make this object available for pop() */
> > -	smp_store_release(&ages[tail & os->mask], tail);
> > +		tail = READ_ONCE(slot->tail);
> > +		next = tail + 1;
> > +
> > +		/* This must never happen because capacity >= N + 1 */
> > +		if (WARN_ON_ONCE((next > head && next - head > pool->nr_objs) ||
> > +				 (next < head && next > head + pool->nr_objs)))
> > +			return -EINVAL;
> > +
> > +	} while (!try_cmpxchg_acquire(&slot->tail, &tail, next));
> > +
> > +	WRITE_ONCE(slot->entries[tail & slot->mask], obj);
> >   	return 0;
> >   }
> >   
> >   /* reclaim an object to object pool */
> > -int objpool_push(void *obj, struct objpool_head *oh)
> > +int objpool_push(void *obj, struct objpool_head *pool)
> >   {
> >   	unsigned long flags;
> > -	int cpu, rc;
> > +	int rc;
> >   
> >   	/* disable local irq to avoid preemption & interruption */
> >   	raw_local_irq_save(flags);
> > -	cpu = raw_smp_processor_id();
> > -	do {
> > -		rc = objpool_try_add_slot(obj, oh->cpu_slots[cpu]);
> > -		if (!rc)
> > -			break;
> > -		cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
> > -	} while (1);
> > +	rc = objpool_try_add_slot(obj, pool, raw_smp_processor_id());
> >   	raw_local_irq_restore(flags);
> >   
> >   	return rc;
> >   }
> >   EXPORT_SYMBOL_GPL(objpool_push);
> >   
> > -/* drop the allocated object, rather reclaim it to objpool */
> > -int objpool_drop(void *obj, struct objpool_head *head)
> > -{
> > -	if (!obj || !head)
> > -		return -EINVAL;
> > -
> > -	if (refcount_dec_and_test(&head->ref)) {
> > -		objpool_free(head);
> > -		return 0;
> > -	}
> > -
> > -	return -EAGAIN;
> > -}
> > -EXPORT_SYMBOL_GPL(objpool_drop);
> > -
> >   /* try to retrieve object from slot */
> > -static inline void *objpool_try_get_slot(struct objpool_slot *os)
> > +static inline void *objpool_try_get_slot(struct objpool_slot *slot)
> >   {
> > -	uint32_t *ages = SLOT_AGES(os);
> > -	void **ents = SLOT_ENTS(os);
> >   	/* do memory load of head to local head */
> > -	uint32_t head = smp_load_acquire(&os->head);
> > +	uint32_t head = smp_load_acquire(&slot->head);
> >   
> >   	/* loop if slot isn't empty */
> > -	while (head != READ_ONCE(os->tail)) {
> > -		uint32_t id = head & os->mask, prev = head;
> > +	while (head != READ_ONCE(slot->tail)) {
> >   
> >   		/* do prefetching of object ents */
> > -		prefetch(&ents[id]);
> > -
> > -		/* check whether this item was ready for retrieval */
> > -		if (smp_load_acquire(&ages[id]) == head) {
> > -			/* node must have been udpated by push() */
> > -			void *node = READ_ONCE(ents[id]);
> > -			/* commit and move forward head of the slot */
> > -			if (try_cmpxchg_release(&os->head, &head, head + 1))
> > -				return node;
> > -			/* head was already updated by others */
> > -		}
> > +		prefetch(&slot->entries[head & slot->mask]);
> > +
> > +		/* commit and move forward head of the slot */
> > +		if (try_cmpxchg_release(&slot->head, &head, head + 1))
> > +			/*
> > +			 * TBD: check whether the tail/head counter wraps around
> > +			 * and warn if it is broken. But this happens only if this
> > +			 * process slows down a lot and another CPU updates
> > +			 * the head/tail just 2^32 + 1 times, and this slot
> > +			 * is empty.
> > +			 */
> > +			return slot->entries[head & slot->mask];
> >   
> >   		/* re-load head from memory and continue trying */
> > -		head = READ_ONCE(os->head);
> > -		/*
> > -		 * head stays unchanged, so it's very likely there's an
> > -		 * ongoing push() on other cpu nodes but yet not update
> > -		 * ages[] to mark it's completion
> > -		 */
> > -		if (head == prev)
> > -			break;
> > +		head = READ_ONCE(slot->head);
> >   	}
> >   
> >   	return NULL;
> > @@ -307,32 +229,42 @@ void *objpool_pop(struct objpool_head *head)
> >   EXPORT_SYMBOL_GPL(objpool_pop);
> >   
> >   /* release whole objpool forcely */
> > -void objpool_free(struct objpool_head *head)
> > +void objpool_free(struct objpool_head *pool)
> >   {
> > -	if (!head->cpu_slots)
> > +	if (!pool->cpu_slots)
> >   		return;
> >   
> >   	/* release percpu slots */
> > -	objpool_fini_percpu_slots(head);
> > +	objpool_fini_percpu_slots(pool);
> >   
> >   	/* call user's cleanup callback if provided */
> > -	if (head->release)
> > -		head->release(head, head->context);
> > +	if (pool->release)
> > +		pool->release(pool, pool->context);
> >   }
> >   EXPORT_SYMBOL_GPL(objpool_free);
> >   
> > -/* drop unused objects and defref objpool for releasing */
> > -void objpool_fini(struct objpool_head *head)
> > +/* drop the allocated object rather than reclaiming it to the objpool */
> > +int objpool_drop(void *obj, struct objpool_head *pool)
> >   {
> > -	void *nod;
> > +	if (!obj || !pool)
> > +		return -EINVAL;
> >   
> > -	do {
> > -		/* grab object from objpool and drop it */
> > -		nod = objpool_pop(head);
> > +	if (refcount_dec_and_test(&pool->ref)) {
> > +		objpool_free(pool);
> > +		return 0;
> > +	}
> > +
> > +	return -EAGAIN;
> > +}
> > +EXPORT_SYMBOL_GPL(objpool_drop);
> > +
> > +/* drop unused objects and deref the objpool for releasing */
> > +void objpool_fini(struct objpool_head *pool)
> > +{
> > +	void *obj;
> >   
> > -		/* drop the extra ref of objpool */
> > -		if (refcount_dec_and_test(&head->ref))
> > -			objpool_free(head);
> > -	} while (nod);
> > +	/* grab object from objpool and drop it */
> > +	while ((obj = objpool_pop(pool)))
> > +		objpool_drop(obj, pool);
> >   }
> >   EXPORT_SYMBOL_GPL(objpool_fini);
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v9 1/5] lib: objpool added: ring-array based lockless MPMC
  2023-10-09 14:19       ` Masami Hiramatsu
@ 2023-10-12 16:16         ` wuqiang.matt
  0 siblings, 0 replies; 25+ messages in thread
From: wuqiang.matt @ 2023-10-12 16:16 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: linux-trace-kernel, davem, anil.s.keshavamurthy, naveen.n.rao,
	rostedt, peterz, akpm, sander, ebiggers, dan.j.williams,
	jpoimboe, linux-kernel, lkp, mattwu

Hello Masami,

I've updated the objpool patch and did some functional testing for x86_64 and
ARM64. Later I'll do the performance testing and more regression tests.

Here is the changelog:
1) new struct objpool_node added to represent the real percpu ring array
    and struct objpool_slot now represents the expansion of objpool_node.
    ages[] and entries[] are now managed by objpool_slot (which is managed
    by objpool_head)
2) ages[] added back to objpool_try_add_slot and objpool_try_get_slot
3) unnecessary OBJPOOL_FLAG definitions are removed
4) unnecessary head/tail loading removed since try_cmpxchg_acquire and
    try_cmpxchg_release have inherent memory loading embedded
5) objpool_fini refined to make sure the extra refcount is released

The new version is attached in this mail for your review, and I will
prepare the full patch after the regression testing.
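
For reference, below is a minimal usage sketch of the API as declared in
the attached header. It is not part of the patch: the struct my_obj type,
the my_obj_init() callback and the object count are made up purely for
illustration.

#include <linux/objpool.h>
#include <linux/gfp.h>

/* hypothetical object type, for illustration only */
struct my_obj {
	int id;
};

/* one-time setup callback, called once per pre-allocated object */
static int my_obj_init(void *obj, void *context)
{
	((struct my_obj *)obj)->id = 0;
	return 0;
}

static struct objpool_head my_pool;

static int my_pool_example(void)
{
	struct my_obj *obj;
	int rc;

	/* pre-allocate 128 zeroed objects across the per-cpu slots */
	rc = objpool_init(&my_pool, 128, sizeof(struct my_obj),
			  GFP_KERNEL, NULL, my_obj_init, NULL);
	if (rc)
		return rc;

	/* allocate an object, use it, then reclaim it to the pool */
	obj = objpool_pop(&my_pool);
	if (obj)
		objpool_push(obj, &my_pool);

	/* drop free objects and the extra reference taken by objpool_init() */
	objpool_fini(&my_pool);
	return 0;
}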

Best regards,
wuqiang


diff --git a/include/linux/objpool.h b/include/linux/objpool.h
new file mode 100644
index 000000000000..f3e066601df2
--- /dev/null
+++ b/include/linux/objpool.h
@@ -0,0 +1,182 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _LINUX_OBJPOOL_H
+#define _LINUX_OBJPOOL_H
+
+#include <linux/types.h>
+#include <linux/refcount.h>
+
+/*
+ * objpool: ring-array based lockless MPMC queue
+ *
+ * Copyright: wuqiang.matt@bytedance.com,mhiramat@kernel.org
+ *
+ * The object pool is a scalable implementation of a high-performance queue
+ * for object allocation and reclamation, such as kretprobe instances.
+ *
+ * By leveraging a per-cpu ring-array to mitigate the hot spots of memory
+ * contention, it can deliver near-linear scalability for highly parallel
+ * scenarios. objpool is best suited for the following cases:
+ * 1) Memory allocation or reclamation are prohibited or too expensive
+ * 2) Consumers are of different priorities, such as irqs and threads
+ *
+ * Limitations:
+ * 1) Maximum objects (capacity) is determined during pool initializing
+ * 2) The memory of objects won't be freed until the pool is finalized
+ * 3) Object allocation (pop) may fail after trying all cpu slots
+ */
+
+/**
+ * struct objpool_node - percpu ring array of objpool
+ * @head: head sequence of the local ring array
+ * @tail: tail sequence of the local ring array
+ *
+ * Represents a cpu-local array-based ring buffer; its size is specified
+ * during initialization of the object pool. The percpu objpool node is
+ * allocated from local memory on NUMA systems, and kept compact in
+ * contiguous memory: the CPU's assigned objects are stored just after
+ * the body of objpool_node.
+ *
+ * The real size of the ring array is far smaller than the value range of
+ * head and tail, typed as uint32_t: [0, 2^32), so only the lower bits of
+ * head and tail are used as the actual position in the ring array. In
+ * general the ring array acts like a small sliding window, which keeps
+ * moving forward in the loop of [0, 2^32).
+struct objpool_node {
+	uint32_t            head;
+	uint32_t            tail;
+} __packed;
+
+/**
+ * struct objpool_slot - the expansion of percpu objpool_node
+ * @node: the pointer of percpu objpool_node
+ * @ages: unique sequence number to avoid ABA
+ * @entries: object entries on this slot
+ */
+struct objpool_slot {
+	struct objpool_node *node;
+	uint32_t            *ages;
+	void *              *entries;
+};
+
+struct objpool_head;
+
+/*
+ * caller-specified callback for object initial setup, it's only called
+ * once for each object (just after the memory allocation of the object)
+ */
+typedef int (*objpool_init_obj_cb)(void *obj, void *context);
+
+/* caller-specified cleanup callback for objpool destruction */
+typedef int (*objpool_fini_cb)(struct objpool_head *head, void *context);
+
+/**
+ * struct objpool_head - object pooling metadata
+ * @obj_size:   object size, aligned to sizeof(void *)
+ * @nr_objs:    total objs (to be pre-allocated with objpool)
+ * @nr_cpus:    local copy of nr_cpu_ids
+ * @capacity:   max objs can be managed by one objpool_slot
+ * @gfp:        gfp flags for kmalloc & vmalloc
+ * @ref:        refcount of objpool
+ * @flags:      flags for objpool management
+ * @cpu_slots:  pointer to the array of objpool_slot
+ * @release:    resource cleanup callback
+ * @context:    caller-provided context
+ */
+struct objpool_head {
+	int                     obj_size;
+	int                     nr_objs;
+	int                     nr_cpus;
+	int                     capacity;
+	gfp_t                   gfp;
+	refcount_t              ref;
+	unsigned long           flags;
+	struct objpool_slot    *cpu_slots;
+	objpool_fini_cb         release;
+	void                   *context;
+};
+
+#define OBJPOOL_NR_OBJECT_MAX	(1UL << 24) /* maximum numbers of total objects */
+#define OBJPOOL_OBJECT_SIZE_MAX	(1UL << 16) /* maximum size of an object */
+
+/**
+ * objpool_init() - initialize objpool and pre-allocated objects
+ * @pool:    the object pool to be initialized, declared by caller
+ * @nr_objs: total objects to be pre-allocated by this object pool
+ * @object_size: size of an object (should be > 0)
+ * @gfp:     flags for memory allocation (via kmalloc or vmalloc)
+ * @context: user context for object initialization callback
+ * @objinit: object initialization callback for extra setup
+ * @release: cleanup callback for extra cleanup task
+ *
+ * return value: 0 for success, otherwise error code
+ *
+ * All pre-allocated objects are to be zeroed after memory allocation.
+ * Caller could do extra initialization in objinit callback. objinit()
+ * will be called just after slot allocation and will be only once for
+ * each object. Since then the objpool won't touch any content of the
+ * objects. It's caller's duty to perform reinitialization after each
+ * pop (object allocation) or do clearance before each push (object
+ * reclamation).
+ */
+int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
+		 gfp_t gfp, void *context, objpool_init_obj_cb objinit,
+		 objpool_fini_cb release);
+
+/**
+ * objpool_pop() - allocate an object from objpool
+ * @pool: object pool
+ *
+ * return value: object ptr or NULL if failed
+ */
+void *objpool_pop(struct objpool_head *pool);
+
+/**
+ * objpool_push() - reclaim the object and return back to objpool
+ * @obj:  object ptr to be pushed to objpool
+ * @pool: object pool
+ *
+ * return: 0 or error code (it fails only when the user tries to push
+ * the same object multiple times or wrong "objects" into the objpool)
+ */
+int objpool_push(void *obj, struct objpool_head *pool);
+
+/**
+ * objpool_drop() - discard the object and deref objpool
+ * @obj:  object ptr to be discarded
+ * @pool: object pool
+ *
+ * return: 0 if objpool was released; -EAGAIN if there are still
+ *         outstanding objects
+ *
+ * objpool_drop is normally for the release of outstanding objects
+ * after objpool cleanup (objpool_fini). Consider this example:
+ * a kretprobe is unregistered and objpool_fini() is called to release
+ * all remaining objects, but there are still objects being used by
+ * unfinished kretprobes (e.g. inside a blockable function such as
+ * sys_accept). Only when the last outstanding object is dropped via
+ * objpool_drop() can the whole objpool be released.
+ */
+int objpool_drop(void *obj, struct objpool_head *pool);
+
+/**
+ * objpool_free() - release objpool forcibly (all objects to be freed)
+ * @pool: object pool to be released
+ */
+void objpool_free(struct objpool_head *pool);
+
+/**
+ * objpool_fini() - deref object pool (also releasing unused objects)
+ * @pool: object pool to be dereferenced
+ *
+ * objpool_fini() will try to release all remaining free objects and
+ * then drop the extra reference of the objpool. So if all objects are
+ * already returned to the objpool, the objpool will be freed too. But
+ * if there are still outstanding objects (blockable kretprobes),
+ * the objpool won't be released until all the outstanding objects
+ * are dropped
+ */
+void objpool_fini(struct objpool_head *pool);
+
+#endif /* _LINUX_OBJPOOL_H */
diff --git a/lib/objpool.c b/lib/objpool.c
new file mode 100644
index 000000000000..628993d93638
--- /dev/null
+++ b/lib/objpool.c
@@ -0,0 +1,329 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/objpool.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/atomic.h>
+#include <linux/prefetch.h>
+#include <linux/irqflags.h>
+#include <linux/cpumask.h>
+#include <linux/log2.h>
+
+/*
+ * objpool: ring-array based lockless MPMC/FIFO queues
+ *
+ * Copyright: wuqiang.matt@bytedance.com,mhiramat@kernel.org
+ */
+
+#define NODE_AGES(p, n) (uint32_t *)((char *)(n) + sizeof(struct objpool_node))
+#define NODE_ENTS(p, n) (void **)((char *)(n) + sizeof(struct objpool_node) + \
+			sizeof(uint32_t) * (p)->capacity)
+#define NODE_OBJS(p, n) (void *)((char *)(n) + sizeof(struct objpool_node) + \
+			(sizeof(uint32_t) + sizeof(void *)) * (p)->capacity)
+
+/* initialize percpu objpool_slot */
+static int
+objpool_init_percpu_slot(struct objpool_head *pool,
+			 struct objpool_slot *slot,
+			 struct objpool_node *node,
+			 int nodes, void *context,
+			 objpool_init_obj_cb objinit)
+{
+	uint32_t mask = pool->capacity - 1;
+	int i;
+
+	/* initialize percpu objpool_slot */
+	slot->node = node;
+	slot->ages = NODE_AGES(pool, node);
+	slot->entries = NODE_ENTS(pool, node);
+
+	/*
+	 * manually set head & tail to avoid possible conflict:
+	 * We assume that the head item is ready for retrieval
+	 * iff head is equal to ages[head & mask]. but ages is
+	 * iff head is equal to ages[head & mask]. But ages is
+	 * initialized as 0, so from the view of the caller of pop(),
+	 * could be: push() is stalled before the final update,
+	 * thus the item being inserted will be lost forever
+	 */
+	node->head = node->tail = pool->capacity;
+
+	/* initialize ages and entries of this objpool_slot */
+	for (i = 0; i < nodes; i++) {
+		void *obj = NODE_OBJS(pool, node) + i * pool->obj_size;
+		if (objinit) {
+			int rc = objinit(obj, context);
+			if (rc)
+				return rc;
+		}
+		slot->ages[node->tail & mask] = node->tail;
+		slot->entries[node->tail & mask] = obj;
+		node->tail++;
+		pool->nr_objs++;
+	}
+
+	return 0;
+}
+
+/* allocate and initialize percpu slots */
+static int
+objpool_init_percpu_slots(struct objpool_head *pool, int nr_objs,
+			  void *context, objpool_init_obj_cb objinit)
+{
+	int i, cpu_count = 0;
+
+	for (i = 0; i < pool->nr_cpus; i++) {
+
+		struct objpool_node *node;
+		int nodes, size, rc;
+
+		/* skip the cpu node which could never be present */
+		if (!cpu_possible(i))
+			continue;
+
+		/* compute how many objects to be allocated with this slot */
+		nodes = nr_objs / num_possible_cpus();
+		if (cpu_count < (nr_objs % num_possible_cpus()))
+			nodes++;
+		cpu_count++;
+
+		size = pool->obj_size * nodes + sizeof(struct objpool_node) +
+		       (sizeof(void *) + sizeof(uint32_t)) * pool->capacity;
+
+		/*
+		 * here we allocate percpu-slot & objs together in a single
+		 * allocation to make it more compact, taking advantage of
+		 * warm caches and TLB hits. By default vmalloc is used to
+		 * reduce the pressure on the kernel slab system. As we know,
+		 * the minimal size of a vmalloc allocation is one page, since
+		 * vmalloc always aligns the requested size to page size
+		 */
+		if (pool->gfp & GFP_ATOMIC)
+			node = kmalloc_node(size, pool->gfp, cpu_to_node(i));
+		else
+			node = __vmalloc_node(size, sizeof(void *), pool->gfp,
+				cpu_to_node(i), __builtin_return_address(0));
+		if (!node)
+			return -ENOMEM;
+		memset(node, 0, size);
+
+		/* initialize the objpool_slot of cpu node i */
+		rc = objpool_init_percpu_slot(pool, &pool->cpu_slots[i],
+					node, nodes, context, objinit);
+		if (rc)
+			return rc;
+	}
+
+	return 0;
+}
+
+/* cleanup all percpu slots of the object pool */
+static void objpool_fini_percpu_slots(struct objpool_head *pool)
+{
+	int i;
+
+	if (!pool->cpu_slots)
+		return;
+
+	for (i = 0; i < pool->nr_cpus; i++)
+		kvfree(pool->cpu_slots[i].node);
+	kfree(pool->cpu_slots);
+}
+
+/* initialize object pool and pre-allocate objects */
+int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
+		gfp_t gfp, void *context, objpool_init_obj_cb objinit,
+		objpool_fini_cb release)
+{
+	int rc, capacity, slot_size;
+
+	/* check input parameters */
+	if (nr_objs <= 0 || nr_objs > OBJPOOL_NR_OBJECT_MAX ||
+	    object_size <= 0 || object_size > OBJPOOL_OBJECT_SIZE_MAX)
+		return -EINVAL;
+
+	/* align up to unsigned long size */
+	object_size = ALIGN(object_size, sizeof(long));
+
+	/* calculate capacity of percpu objpool_slot */
+	capacity = roundup_pow_of_two(nr_objs);
+	if (!capacity)
+		return -EINVAL;
+
+	/* initialize objpool pool */
+	memset(pool, 0, sizeof(struct objpool_head));
+	pool->nr_cpus = nr_cpu_ids;
+	pool->obj_size = object_size;
+	pool->capacity = capacity;
+	pool->gfp = gfp & ~__GFP_ZERO;
+	pool->context = context;
+	pool->release = release;
+	slot_size = pool->nr_cpus * sizeof(struct objpool_slot);
+	pool->cpu_slots = kzalloc(slot_size, pool->gfp);
+	if (!pool->cpu_slots)
+		return -ENOMEM;
+
+	/* initialize per-cpu slots */
+	rc = objpool_init_percpu_slots(pool, nr_objs, context, objinit);
+	if (rc)
+		objpool_fini_percpu_slots(pool);
+	else
+		refcount_set(&pool->ref, pool->nr_objs + 1);
+
+	return rc;
+}
+EXPORT_SYMBOL_GPL(objpool_init);
+
+/* adding object to slot, abort if the slot was already full */
+static inline int
+objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu)
+{
+	struct objpool_slot *slot = &pool->cpu_slots[cpu];
+	struct objpool_node *node = slot->node;
+	uint32_t head, tail, mask = pool->capacity - 1;
+
+	/* loading tail and head as a local snapshot, tail first */
+	tail = READ_ONCE(node->tail);
+
+	do {
+		head = READ_ONCE(node->head);
+		/* fault caught: something must be wrong */
+		WARN_ON_ONCE(tail - head > pool->nr_objs);
+	} while (!try_cmpxchg_acquire(&node->tail, &tail, tail + 1));
+
+	/* now the tail position is reserved for the given obj */
+	WRITE_ONCE(slot->entries[tail & mask], obj);
+	/* update sequence to make this obj available for pop() */
+	smp_store_release(&slot->ages[tail & mask], tail);
+
+	return 0;
+}
+
+/* reclaim an object to object pool */
+int objpool_push(void *obj, struct objpool_head *pool)
+{
+	unsigned long flags;
+	int rc;
+
+	/* disable local irq to avoid preemption & interruption */
+	raw_local_irq_save(flags);
+	rc = objpool_try_add_slot(obj, pool, raw_smp_processor_id());
+	raw_local_irq_restore(flags);
+
+	return rc;
+}
+EXPORT_SYMBOL_GPL(objpool_push);
+
+/* try to retrieve object from slot */
+static inline void *objpool_try_get_slot(struct objpool_head *pool, int cpu)
+{
+	struct objpool_slot *slot = &pool->cpu_slots[cpu];
+	struct objpool_node *node = slot->node;
+	uint32_t head, mask = pool->capacity - 1;
+
+	/* load node->head and save to local head */
+	head = smp_load_acquire(&node->head);
+
+	while (head != READ_ONCE(node->tail)) {
+		uint32_t pos = head & mask, prev = head;
+
+		/* do prefetching of the object pointer */
+		prefetch(&slot->entries[pos]);
+
+		/* check whether the object is ready for retrieval */
+		if (smp_load_acquire(&slot->ages[pos]) == head) {
+			/* obj must've been updated by its push() */
+			void *obj = READ_ONCE(slot->entries[pos]);
+			/* try to commit and move forward by 1 */
+			if (try_cmpxchg_release(&node->head, &head, head + 1))
+				return obj;
+			/* head mismatch: was consumed by other nodes */
+		} else {
+			/* refresh head from memory and retry */
+			head = READ_ONCE(node->head);
+			/*
+			 * head stays unchanged, so it's very likely there's
+			 * an ongoing push() on another cpu node that has not
+			 * yet updated ages[] to mark its completion
+			 */
+			if (head == prev)
+				break;
+		}
+	}
+
+	return NULL;
+}
+
+/* allocate an object from object pool */
+void *objpool_pop(struct objpool_head *pool)
+{
+	void *obj = NULL;
+	unsigned long flags;
+	int i, cpu;
+
+	/* disable local irq to avoid preemption & interruption */
+	raw_local_irq_save(flags);
+
+	cpu = raw_smp_processor_id();
+	for (i = 0; i < num_possible_cpus(); i++) {
+		obj = objpool_try_get_slot(pool, cpu);
+		if (obj)
+			break;
+		cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
+	}
+	raw_local_irq_restore(flags);
+
+	return obj;
+}
+EXPORT_SYMBOL_GPL(objpool_pop);
+
+/* release the whole objpool forcibly */
+void objpool_free(struct objpool_head *pool)
+{
+	if (!pool->cpu_slots)
+		return;
+
+	/* release percpu slots */
+	objpool_fini_percpu_slots(pool);
+
+	/* call user's cleanup callback if provided */
+	if (pool->release)
+		pool->release(pool, pool->context);
+}
+EXPORT_SYMBOL_GPL(objpool_free);
+
+/* drop the allocated object rather than reclaiming it to the objpool */
+int objpool_drop(void *obj, struct objpool_head *pool)
+{
+	if (!obj || !pool)
+		return -EINVAL;
+
+	if (refcount_dec_and_test(&pool->ref)) {
+		objpool_free(pool);
+		return 0;
+	}
+
+	return -EAGAIN;
+}
+EXPORT_SYMBOL_GPL(objpool_drop);
+
+/* drop unused objects and deref the objpool for releasing */
+void objpool_fini(struct objpool_head *pool)
+{
+	void *obj;
+
+	do {
+		/* grab object from objpool and drop it */
+		obj = objpool_pop(pool);
+
+		/*
+		 * drop reference of objpool anyway even if
+		 * the obj is NULL, since one extra ref upon
+		 * objpool was already grabbed during pool
+		 * initialization in objpool_init()
+		 */
+		if (refcount_dec_and_test(&pool->ref))
+			objpool_free(pool);
+	} while (obj);
+}
+EXPORT_SYMBOL_GPL(objpool_fini);
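
As a side note on the head/tail comment in the header above, the following
stand-alone user-space snippet (not part of the patch; the concrete numbers
are arbitrary) shows why the uint32_t sequence numbers keep working across
wrap-around: unsigned subtraction still yields the number of queued items,
and only the low bits select the actual ring position.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t capacity = 4, mask = capacity - 1;
	/* sequence numbers straddling the uint32_t wrap-around */
	uint32_t head = 0xfffffffeu, tail = 0x00000001u;

	printf("queued: %u\n", tail - head);			/* 3 */
	printf("positions: %u %u\n", head & mask, tail & mask);	/* 2 1 */
	return 0;
}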


On 2023/10/9 22:19, Masami Hiramatsu (Google) wrote:
> Hi,
> 
> On Mon, 9 Oct 2023 02:40:53 +0800
> wuqiang <wuqiang.matt@bytedance.com> wrote:
> 
>> On 2023/9/23 17:48, Masami Hiramatsu (Google) wrote:
>>> Hi Wuqiang,
>>>
>>> Sorry for replying later.
>>>
>>> On Tue,  5 Sep 2023 09:52:51 +0800
>>> "wuqiang.matt" <wuqiang.matt@bytedance.com> wrote:
>>>
>>>> The object pool is a scalable implementaion of high performance queue
>>>> for object allocation and reclamation, such as kretprobe instances.
>>>>
>>>> With leveraging percpu ring-array to mitigate the hot spot of memory
>>>> contention, it could deliver near-linear scalability for high parallel
>>>> scenarios. The objpool is best suited for following cases:
>>>> 1) Memory allocation or reclamation are prohibited or too expensive
>>>> 2) Consumers are of different priorities, such as irqs and threads
>>>>
>>>> Limitations:
>>>> 1) Maximum objects (capacity) is determined during pool initializing
>>>>      and can't be modified (extended) after objpool creation
>>>
>>> So the pool size is fixed in initialization.
>>
>> Right. The array size will be rounded up to a power of 2, but the
>> actual number of objects (to be allocated) is the exact value specified
>> by the user.
> 
> Yeah, this makes easy to use the seq-number as index.
> 
>>
>>>
>>>> 2) The memory of objects won't be freed until objpool is finalized
>>>> 3) Object allocation (pop) may fail after trying all cpu slots
>>>
>>> This means that object allocation will fail if the all pools are empty,
>>> right?
>>
>> Yes, pop() will return NULL for this case. pop() does the checking for
>> only 1 round of all cpu nodes.
>>
>> The objpool might not be empty since new objects could be inserted back
>> in the meantime by other nodes, which is natural for lockless queues.
> 
> OK.
> 
>>
>>>
>>>>
>>>> Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com>
>>>> ---
>>>>    include/linux/objpool.h | 174 +++++++++++++++++++++
>>>>    lib/Makefile            |   2 +-
>>>>    lib/objpool.c           | 338 ++++++++++++++++++++++++++++++++++++++++
>>>>    3 files changed, 513 insertions(+), 1 deletion(-)
>>>>    create mode 100644 include/linux/objpool.h
>>>>    create mode 100644 lib/objpool.c
>>>>
>>>> diff --git a/include/linux/objpool.h b/include/linux/objpool.h
>>>> new file mode 100644
>>>> index 000000000000..33c832216b98
>>>> --- /dev/null
>>>> +++ b/include/linux/objpool.h
>>>> @@ -0,0 +1,174 @@
>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>> +
>>>> +#ifndef _LINUX_OBJPOOL_H
>>>> +#define _LINUX_OBJPOOL_H
>>>> +
>>>> +#include <linux/types.h>
>>>> +#include <linux/refcount.h>
>>>> +
>>>> +/*
>>>> + * objpool: ring-array based lockless MPMC queue
>>>> + *
>>>> + * Copyright: wuqiang.matt@bytedance.com
>>>> + *
>>>> + * The object pool is a scalable implementaion of high performance queue
>>>> + * for objects allocation and reclamation, such as kretprobe instances.
>>>> + *
>>>> + * With leveraging per-cpu ring-array to mitigate the hot spots of memory
>>>> + * contention, it could deliver near-linear scalability for high parallel
>>>> + * scenarios. The ring-array is compactly managed in a single cache-line
>>>> + * to benefit from warmed L1 cache for most cases (<= 4 objects per-core).
>>>> + * The body of pre-allocated objects is stored in continuous cache-lines
>>>> + * just after the ring-array.
>>>
>>> I consider the size of entries may be big if we have larger number of
>>> CPU cores, like 72-cores. And if user specifies (2^n) + 1 entries.
>>> In this case, each CPU has (2^n - 1)/72 objects, but has 2^(n + 1)
>>> entries in ring buffer. So it should be noted.
>>
>> Yes for the array size since it's rounded up to a power of 2, but the
>> actual number of pre-allocated objects will stay the same as the user specified.
>>
>>>> + *
>>>> + * The object pool is interrupt safe. Both allocation and reclamation
>>>> + * (object pop and push operations) can be preemptible or interruptable.
>>>
>>> You've added raw_spinlock_disable/enable(), so it is not preemptible
>>> or interruptible anymore. (Anyway, caller doesn't take care of that)
>>
>> Sure, this description is improper and unnecessary. Will be removed.
>>
>>>> + *
>>>> + * It's best suited for following cases:
>>>> + * 1) Memory allocation or reclamation are prohibited or too expensive
>>>> + * 2) Consumers are of different priorities, such as irqs and threads
>>>> + *
>>>> + * Limitations:
>>>> + * 1) Maximum objects (capacity) is determined during pool initializing
>>>> + * 2) The memory of objects won't be freed until the pool is finalized
>>>> + * 3) Object allocation (pop) may fail after trying all cpu slots
>>>> + */
>>>> +
>>>> +/**
>>>> + * struct objpool_slot - percpu ring array of objpool
>>>> + * @head: head of the local ring array (to retrieve at)
>>>> + * @tail: tail of the local ring array (to append at)
>>>> + * @bits: log2 of capacity (for bitwise operations)
>>>> + * @mask: capacity - 1
>>>
>>> These description does not give idea what those roles are.
>>
>> I'll refine the description. objpool_slot is totally internal to objpool.
>>
>>>
>>>> + *
>>>> + * Represents a cpu-local array-based ring buffer, its size is specialized
>>>> + * during initialization of object pool. The percpu objpool slot is to be
>>>> + * allocated from local memory for NUMA system, and to be kept compact in
>>>> + * continuous memory: ages[] is stored just after the body of objpool_slot,
>>>> + * and then entries[]. ages[] describes revision of each item, solely used
>>>> + * to avoid ABA; entries[] contains pointers of the actual objects
>>>> + *
>>>> + * Layout of objpool_slot in memory:
>>>> + *
>>>> + * 64bit:
>>>> + *        4      8      12     16        32                 64
>>>> + * | head | tail | bits | mask | ages[4] | ents[4]: (8 * 4) | objects
>>>> + *
>>>> + * 32bit:
>>>> + *        4      8      12     16        32        48       64
>>>> + * | head | tail | bits | mask | ages[4] | ents[4] | unused | objects
>>>
>>> Hm, the '4' here means number of objects after this objpool_slot?
>>> I don't recommend you to allocate several arraies after the header, instead,
>>> using another data structure like this;
>>>
>>> |head|tail|bits|mask|ents[N]{age:4|offs:4}|padding|objects
>>>
>>> here N means the number of total objects.
>>
>> Sorry for the confusion, I will make it more clear. Here 4/8/.../64 are offset
>> in bytes. The above is an example with the objpool_slot compacted in a single
>> cache line.
> 
> But in that case, the entry number may not be enough for storing all object.
> (or limit the number of objects)
> 
> Actually, since the rethook needs to make a shadow stack list per task not
> per cpu, the (safe) required number of object is usually proportional to the
> number of active tasks. kretprobes sets the default number of nodes according
> to the CPUs but it is minimum requirement. This is because,
> - most of the kernel functions are not nested, thus it is called once on each
>   thread in the kernel.
> - the thread can be scheduled or slept, thus the function return hook also is
>    not done until the thread comes back.
> So, usually, the recommended number of node (obj) will be 100-200 (depends on
> the system.) If it is a server, it may be more than 1000.
> 
>>
>>>
>>> struct objpool_entry {
>>> 	uint32_t	age;
>>> 	void *	ptr;
>>> } __attribute__((packed,aligned(8))) ;
>>>
>>>> + *
>>>> + */
>>>> +struct objpool_slot {
>>>> +	uint32_t                head;
>>>> +	uint32_t                tail;
>>>> +	uint32_t                bits;
>>>> +	uint32_t                mask;
>>>
>>> 	struct objpool_entry	entries[];
>>>
>>>> +} __packed;
>>>
>>> Then, you don't need complex macros to access object, but you need only one
>>> inline function to get the actual object address.
>>>
>>> static inline void *objpool_slot_object(struct objpool_slot *slot, int nth)
>>> {
>>> 	if (nth > (1 << bits))
>>> 		return NULL;
>>>
>>> 	return (void *)((unsigned long)slot + slot.entries[nth].offs);
>>> }
>>
>> The reason for these macros is to compact objpool_slot/ages/ents into hot cache
>> lines and also minimize the memory footprint.
> 
> Hmm, at this moment, I don't recommend you to stick on the cache line but
> easier to read. If you have any number, you can add optimize patch afterwards.
> But the initial patch should take care about the readability.
> 
>>
>> objpool_head could be a better place to manage these pointers, similarly as
>> cpu_slots. I'll recheck the overhead.
>>
>>
>>>> +
>>>> +struct objpool_head;
>>>> +
>>>> +/*
>>>> + * caller-specified callback for object initial setup, it's only called
>>>> + * once for each object (just after the memory allocation of the object)
>>>> + */
>>>> +typedef int (*objpool_init_obj_cb)(void *obj, void *context);
>>>> +
>>>> +/* caller-specified cleanup callback for objpool destruction */
>>>> +typedef int (*objpool_fini_cb)(struct objpool_head *head, void *context);
>>>> +
>>>> +/**
>>>> + * struct objpool_head - object pooling metadata
>>>> + * @obj_size:   object & element size
>>>
>>> What does the 'element' mean?
>>
>> "object size" should be enough. "element" means object, so it's unnecessary.
>>
>>>
>>>> + * @nr_objs:    total objs (to be pre-allocated)
>>>
>>> but all objects must be pre-allocated, right? then it is simply
>>
>> Yes, all objects are pre-allocated for this implementation.
>>
>>>
>>> @nr_objs: the total number of objects in this objpool.
>>>
>>>> + * @nr_cpus:    nr_cpu_ids
>>>
>>> would we have to save it? or just use 'nr_cpu_ids'?
>>
>> Yes, it's just a local save of nr_cpu_ids, just to make the members of
>> objpool_head aligned by 64 bits (there could be a 4-byte hole anyway).
>> And a possible benefit from hot TLB caches?
> 
> Unless you pack the data structure, you don't need to care about
> the cache. And the compiler works better than human for initial work.
> At this moment, it is more important to reduce the members as simple
> as possible.
> 
>>
>>>
>>>> + * @capacity:   max objects per cpuslot
>>>
>>> what is 'cpuslot'?
>>> This seems the size of objpool_entry array in objpool_slot.
>>
>> Yes, should be "capacity per objpool_slot", i.e. "maximum objects could be
>> stored in a objpool_slot".
>>
>>>> + * @gfp:        gfp flags for kmalloc & vmalloc
>>>> + * @ref:        refcount for objpool
>>>> + * @flags:      flags for objpool management
>>>> + * @cpu_slots:  array of percpu slots
>>>> + * @slot_sizes:	size in bytes of slots
>>>> + * @release:    resource cleanup callback
>>>> + * @context:    caller-provided context
>>>> + */
>>>> +struct objpool_head {
>>>> +	int                     obj_size;
>>>> +	int                     nr_objs;
>>>> +	int                     nr_cpus;
>>>> +	int                     capacity;
>>>> +	gfp_t                   gfp;
>>>> +	refcount_t              ref;
>>>> +	unsigned long           flags;
>>>> +	struct objpool_slot   **cpu_slots;
>>>> +	int                    *slot_sizes;
>>>> +	objpool_fini_cb         release;
>>>> +	void                   *context;
>>>> +};
>>>> +
>>>> +#define OBJPOOL_FROM_VMALLOC	(0x800000000)	/* objpool allocated from vmalloc area */
>>>> +#define OBJPOOL_HAVE_OBJECTS	(0x400000000)	/* objects allocated along with objpool */
>>>> +
>>>> +/**
>>>> + * objpool_init() - initialize objpool and pre-allocated objects
>>>> + * @head:    the object pool to be initialized, declared by caller
>>>> + * @nr_objs: total objects to be pre-allocated by this object pool
>>>> + * @object_size: size of an object (should be > 0)
>>>> + * @gfp:     flags for memory allocation (via kmalloc or vmalloc)
>>>> + * @context: user context for object initialization callback
>>>> + * @objinit: object initialization callback for extra setup
>>>> + * @release: cleanup callback for extra cleanup task
>>>> + *
>>>> + * return value: 0 for success, otherwise error code
>>>> + *
>>>> + * All pre-allocated objects are to be zeroed after memory allocation.
>>>> + * Caller could do extra initialization in objinit callback. objinit()
>>>> + * will be called just after slot allocation and will be only once for
>>>> + * each object. Since then the objpool won't touch any content of the
>>>> + * objects. It's caller's duty to perform reinitialization after each
>>>> + * pop (object allocation) or do clearance before each push (object
>>>> + * reclamation).
>>>> + */
>>>> +int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
>>>> +		 gfp_t gfp, void *context, objpool_init_obj_cb objinit,
>>>> +		 objpool_fini_cb release);
>>>> +
>>>> +/**
>>>> + * objpool_pop() - allocate an object from objpool
>>>> + * @head: object pool
>>>> + *
>>>> + * return value: object ptr or NULL if failed
>>>> + */
>>>> +void *objpool_pop(struct objpool_head *head);
>>>> +
>>>> +/**
>>>> + * objpool_push() - reclaim the object and return back to objpool
>>>> + * @obj:  object ptr to be pushed to objpool
>>>> + * @head: object pool
>>>> + *
>>>> + * return: 0 or error code (it fails only when user tries to push
>>>> + * the same object multiple times or wrong "objects" into objpool)
>>>> + */
>>>> +int objpool_push(void *obj, struct objpool_head *head);
>>>> +
>>>> +/**
>>>> + * objpool_drop() - discard the object and deref objpool
>>>> + * @obj:  object ptr to be discarded
>>>> + * @head: object pool
>>>> + *
>>>> + * return: 0 if objpool was released or error code
>>>> + */
>>>> +int objpool_drop(void *obj, struct objpool_head *head);
>>>> +
>>>> +/**
>>>> + * objpool_free() - release objpool forcely (all objects to be freed)
>>>> + * @head: object pool to be released
>>>> + */
>>>> +void objpool_free(struct objpool_head *head);
>>>> +
>>>> +/**
>>>> + * objpool_fini() - deref object pool (also releasing unused objects)
>>>> + * @head: object pool to be dereferenced
>>>> + */
>>>> +void objpool_fini(struct objpool_head *head);
>>>> +
>>>> +#endif /* _LINUX_OBJPOOL_H */
>>>> diff --git a/lib/Makefile b/lib/Makefile
>>>> index 1ffae65bb7ee..7a84c922d9ff 100644
>>>> --- a/lib/Makefile
>>>> +++ b/lib/Makefile
>>>> @@ -34,7 +34,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
>>>>    	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
>>>>    	 earlycpio.o seq_buf.o siphash.o dec_and_lock.o \
>>>>    	 nmi_backtrace.o win_minmax.o memcat_p.o \
>>>> -	 buildid.o
>>>> +	 buildid.o objpool.o
>>>>    
>>>>    lib-$(CONFIG_PRINTK) += dump_stack.o
>>>>    lib-$(CONFIG_SMP) += cpumask.o
>>>> diff --git a/lib/objpool.c b/lib/objpool.c
>>>> new file mode 100644
>>>> index 000000000000..22e752371820
>>>> --- /dev/null
>>>> +++ b/lib/objpool.c
>>>> @@ -0,0 +1,338 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +
>>>> +#include <linux/objpool.h>
>>>> +#include <linux/slab.h>
>>>> +#include <linux/vmalloc.h>
>>>> +#include <linux/atomic.h>
>>>> +#include <linux/prefetch.h>
>>>> +#include <linux/irqflags.h>
>>>> +#include <linux/cpumask.h>
>>>> +#include <linux/log2.h>
>>>> +
>>>> +/*
>>>> + * objpool: ring-array based lockless MPMC/FIFO queues
>>>> + *
>>>> + * Copyright: wuqiang.matt@bytedance.com
>>>> + */
>>>> +
>>>> +#define SLOT_AGES(s) ((uint32_t *)((char *)(s) + sizeof(struct objpool_slot)))
>>>> +#define SLOT_ENTS(s) ((void **)((char *)(s) + sizeof(struct objpool_slot) + \
>>>> +			(sizeof(uint32_t) << (s)->bits)))
>>>> +#define SLOT_OBJS(s) ((void *)((char *)(s) + sizeof(struct objpool_slot) + \
>>>> +			((sizeof(uint32_t) + sizeof(void *)) << (s)->bits)))
>>>> +#define SLOT_CORE(n) cpumask_nth((n) % num_possible_cpus(), cpu_possible_mask)
>>>> +
>>>> +/* compute the suitable num of objects to be managed per slot */
>>>> +static int objpool_nobjs(int size)
>>>> +{
>>>> +	return rounddown_pow_of_two((size - sizeof(struct objpool_slot)) /
>>>> +			(sizeof(uint32_t) + sizeof(void *)));
>>>> +}
>>>> +
>>>> +/* allocate and initialize percpu slots */
>>>
>>> @head: the objpool_head for managing this objpool
>>> @nobjs: the total number of objects in this objpool
>>> @context: context data for @objinit
>>> @objinit: initialize callback for each object.
>>
>> Got it. I didn't since objpool_init_percpu_slots is not public.
>>
>>>> +static int
>>>> +objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
>>>> +			void *context, objpool_init_obj_cb objinit)
>>>> +{
>>>> +	int i, j, n, size, objsz, cpu = 0, nents = head->capacity;
>>>
>>> 'nents' is *round up to the power of 2* of the total number of objects.
>>>
>>>> +
>>>> +	/* aligned object size by sizeof(void *) */
>>>> +	objsz = ALIGN(head->obj_size, sizeof(void *));
>>>> +	/* shall we allocate objects along with percpu-slot */
>>>> +	if (objsz)
>>>> +		head->flags |= OBJPOOL_HAVE_OBJECTS;
>>>
>>> Is there any chance that objsz == 0?
>>
>> No chance. We always require non-zero objsz. Will update in the next version.
>>
>>>
>>>> +
>>>> +	/* vmalloc is used in default to allocate percpu-slots */
>>>> +	if (!(head->gfp & GFP_ATOMIC))
>>>> +		head->flags |= OBJPOOL_FROM_VMALLOC;
>>>> +
>>>> +	for (i = 0; i < head->nr_cpus; i++) {
>>>> +		struct objpool_slot *os;
>>>> +
>>>> +		/* skip the cpus which could never be present */
>>>> +		if (!cpu_possible(i))
>>>> +			continue;
>>>> +
>>>> +		/* compute how many objects to be managed by this slot */
>>>
>>> "to be managed"? or "to be allocated with"?
>>> It seems all objects are possible to be managed by each slot.
>>
>> Right. "to be allocated with" is preferable. Thanks.
>>
>>>> +		n = nobjs / num_possible_cpus();
>>>> +		if (cpu < (nobjs % num_possible_cpus()))
>>>> +			n++;
>>>> +		size = sizeof(struct objpool_slot) + sizeof(void *) * nents +
>>>> +		       sizeof(uint32_t) * nents + objsz * n;
>>>> +
>>>> +		/*
>>>> +		 * here we allocate percpu-slot & objects together in a single
>>>> +		 * allocation, taking advantage of warm caches and TLB hits as
>>>> +		 * vmalloc always aligns the request size to pages
>>>
>>> "Since the objpool_entry array in the slot is mostly accessed from the
>>>    i-th CPU, it should be allocated from the memory node for that CPU."
>>>
>>> I think the reason of the memory node allocation is mainly for reducing the
>>> penalty of the cache-miss, since it will be bigger if running on NUMA.
>>
>> Right, NUMA is addressed by objpool_slot. The above description is to explain
>> why a single memory allocation (not multiple). I'll try to make it more clear.
>>
>>>
>>>> +		 */
>>>> +		if (head->flags & OBJPOOL_FROM_VMALLOC)
>>>> +			os = __vmalloc_node(size, sizeof(void *), head->gfp,
>>>> +				cpu_to_node(i), __builtin_return_address(0));
>>>> +		else
>>>> +			os = kmalloc_node(size, head->gfp, cpu_to_node(i));
>>>> +		if (!os)
>>>> +			return -ENOMEM;
>>>> +
>>>> +		/* initialize percpu slot for the i-th slot */
>>>> +		memset(os, 0, size);
>>>> +		os->bits = ilog2(head->capacity);
>>>> +		os->mask = head->capacity - 1;
>>>> +		head->cpu_slots[i] = os;
>>>> +		head->slot_sizes[i] = size;
>>>> +		cpu = cpu + 1;
>>>> +
>>>> +		/*
>>>> +		 * manually set head & tail to avoid possible conflict:
>>>> +		 * We assume that the head item is ready for retrieval
>>>> +		 * iff head is equal to ages[head & mask]. but ages is
>>>> +		 * initialized as 0, so in view of the caller of pop(),
>>>> +		 * the 1st item (0th) is always ready, but the reality
>>>> +		 * could be: push() is stalled before the final update,
>>>> +		 * thus the item being inserted will be lost forever
>>>> +		 */
>>>> +		os->head = os->tail = head->capacity;
>>>> +
>>>> +		if (!objsz)
>>>> +			continue;
>>>
>>> Is it possible? and for what?
>>
>> Will be removed in next version.
>>
>>>
>>>> +
>>>> +		for (j = 0; j < n; j++) {
>>>> +			uint32_t *ages = SLOT_AGES(os);
>>>> +			void **ents = SLOT_ENTS(os);
>>>> +			void *obj = SLOT_OBJS(os) + j * objsz;
>>>> +			uint32_t ie = os->tail & os->mask;
>>>> +
>>>> +			/* perform object initialization */
>>>> +			if (objinit) {
>>>> +				int rc = objinit(obj, context);
>>>> +				if (rc)
>>>> +					return rc;
>>>> +			}
>>>> +
>>>> +			/* add obj into the ring array */
>>>> +			ents[ie] = obj;
>>>> +			ages[ie] = os->tail;
>>>> +			os->tail++;
>>>> +			head->nr_objs++;
>>>> +		}
>>>
>>> To simplify the code, this loop should be another static function.
>>
>> I'll reconsider the implementation. And the multiple computations of ages/ents
>> should be avoided too.
>>
>>>
>>>> +	}
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +/* cleanup all percpu slots of the object pool */
>>>> +static void objpool_fini_percpu_slots(struct objpool_head *head)
>>>> +{
>>>> +	int i;
>>>> +
>>>> +	if (!head->cpu_slots)
>>>> +		return;
>>>> +
>>>> +	for (i = 0; i < head->nr_cpus; i++) {
>>>> +		if (!head->cpu_slots[i])
>>>> +			continue;
>>>> +		if (head->flags & OBJPOOL_FROM_VMALLOC)
>>>> +			vfree(head->cpu_slots[i]);
>>>> +		else
>>>> +			kfree(head->cpu_slots[i]);
>>>> +	}
>>>> +	kfree(head->cpu_slots);
>>>> +	head->cpu_slots = NULL;
>>>> +	head->slot_sizes = NULL;
>>>> +}
>>>> +
>>>> +/* initialize object pool and pre-allocate objects */
>>>> +int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
>>>> +		gfp_t gfp, void *context, objpool_init_obj_cb objinit,
>>>> +		objpool_fini_cb release)
>>>> +{
>>>> +	int nents, rc;
>>>> +
>>>> +	/* check input parameters */
>>>> +	if (nr_objs <= 0 || object_size <= 0)
>>>> +		return -EINVAL;
>>>> +
>>>> +	/* calculate percpu slot size (rounded to pow of 2) */
>>>> +	nents = max_t(int, roundup_pow_of_two(nr_objs),
>>>> +			objpool_nobjs(L1_CACHE_BYTES));
>>>> +
>>>> +	/* initialize objpool head */
>>>> +	memset(head, 0, sizeof(struct objpool_head));
>>>> +	head->nr_cpus = nr_cpu_ids;
>>>> +	head->obj_size = object_size;
>>>> +	head->capacity = nents;
>>>> +	head->gfp = gfp & ~__GFP_ZERO;
>>>> +	head->context = context;
>>>> +	head->release = release;
>>>> +
>>>> +	/* allocate array for percpu slots */
>>>> +	head->cpu_slots = kzalloc(head->nr_cpus * sizeof(void *) +
>>>> +			       head->nr_cpus * sizeof(int), head->gfp);
>>>> +	if (!head->cpu_slots)
>>>> +		return -ENOMEM;
>>>> +	head->slot_sizes = (int *)&head->cpu_slots[head->nr_cpus];
>>>> +
>>>> +	/* initialize per-cpu slots */
>>>> +	rc = objpool_init_percpu_slots(head, nr_objs, context, objinit);
>>>> +	if (rc)
>>>> +		objpool_fini_percpu_slots(head);
>>>> +	else
>>>> +		refcount_set(&head->ref, nr_objs + 1);
>>>> +
>>>> +	return rc;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(objpool_init);
>>>> +
>>>> +/* adding object to slot, abort if the slot was already full */
>>>> +static inline int objpool_try_add_slot(void *obj, struct objpool_slot *os)
>>>> +{
>>>> +	uint32_t *ages = SLOT_AGES(os);
>>>> +	void **ents = SLOT_ENTS(os);
>>>> +	uint32_t head, tail;
>>>> +
>>>> +	do {
>>>> +		/* perform memory loading for both head and tail */
>>>> +		head = READ_ONCE(os->head);
>>>> +		tail = READ_ONCE(os->tail);
>>>> +		/* just abort if slot is full */
>>>> +		if (tail - head > os->mask)
>>>> +			return -ENOENT;
>>>
>>> Is this really possible? The total number of objects must be less euqal to
>>> the os->mask. If it means a bug, please use WARN_ON_ONCE() here for debug.
>>
>> Yes, it's a BUG and the caller's fault. When the user tries pushing a wrong object
>> or repeatedly pushing the same object, it could break the objpool's consistency.
>> It's a choice between 'bad' and 'worse': rather return an error than break
>> the consistency.
>>
>> As you advised, better to crash than to be problematic. I'll update in the next version.
>>
>>>
>>>> +		/* try to extend tail by 1 using CAS to avoid races */
>>>> +		if (try_cmpxchg_acquire(&os->tail, &tail, tail + 1))
>>>> +			break;
>>>> +	} while (1);
>>>
>>> "if(cond) ~ break; } while(1)" should be "} (!cond);"
>>
>> I see. Just to make the codes more "balanced" with comments :)
>>
>>>
>>> And this seems to be buggy since tail++ can be 0, then "tail - head" < 0.
>>>
>>> if (tail < head)
>>> 	if (WARN_ON_ONCE(tail + (UINT32_MAX - head) > os->mask))
>>> 		return -ENOENT;
>>> else
>>> 	if (WARN_ON_ONCE(tail - head > os->mask))
>>> 		return -ENOENT;
>>
>> tail and head are unsigned, so "tail - head" is unsigned and should always
>> be the actual number of free objects in the objpool_slot.
>>
>>>> +
>>>> +	/* the tail-th of slot is reserved for the given obj */
>>>> +	WRITE_ONCE(ents[tail & os->mask], obj);
>>>> +	/* update epoch id to make this object available for pop() */
>>>> +	smp_store_release(&ages[tail & os->mask], tail);
>>>
>>> Note: since the ages array size is the power of 2, this is just a
>>> (32 - os->bits) loop counter. :)
>>>
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +/* reclaim an object to object pool */
>>>> +int objpool_push(void *obj, struct objpool_head *oh)
>>>> +{
>>>> +	unsigned long flags;
>>>> +	int cpu, rc;
>>>> +
>>>> +	/* disable local irq to avoid preemption & interruption */
>>>> +	raw_local_irq_save(flags);
>>>> +	cpu = raw_smp_processor_id();
>>>> +	do {
>>>> +		rc = objpool_try_add_slot(obj, oh->cpu_slots[cpu]);
>>>> +		if (!rc)
>>>> +			break;
>>>> +		cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
>>>> +	} while (1);
>>>
>>> Hmm, as I said, head->capacity >= nr_all_obj, this must not happen,
>>> we can always push it on this CPU's slot, right?
>>
>> Right. If it happens, that means the user made mistakes. I'll refine
>> the codes.
>>
>>>
>>>> +	raw_local_irq_restore(flags);
>>>> +
>>>> +	return rc;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(objpool_push);
>>>> +
>>>> +/* drop the allocated object, rather reclaim it to objpool */
>>>> +int objpool_drop(void *obj, struct objpool_head *head)
>>>> +{
>>>> +	if (!obj || !head)
>>>> +		return -EINVAL;
>>>> +
>>>> +	if (refcount_dec_and_test(&head->ref)) {
>>>> +		objpool_free(head);
>>>> +		return 0;
>>>> +	}
>>>> +
>>>> +	return -EAGAIN;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(objpool_drop);
>>>> +
>>>> +/* try to retrieve object from slot */
>>>> +static inline void *objpool_try_get_slot(struct objpool_slot *os)
>>>> +{
>>>> +	uint32_t *ages = SLOT_AGES(os);
>>>> +	void **ents = SLOT_ENTS(os);
>>>> +	/* do memory load of head to local head */
>>>> +	uint32_t head = smp_load_acquire(&os->head);
>>>> +
>>>> +	/* loop if slot isn't empty */
>>>> +	while (head != READ_ONCE(os->tail)) {
>>>> +		uint32_t id = head & os->mask, prev = head;
>>>> +
>>>> +		/* do prefetching of object ents */
>>>> +		prefetch(&ents[id]);
>>>> +
>>>> +		/* check whether this item was ready for retrieval */
>>>> +		if (smp_load_acquire(&ages[id]) == head) {
>>>
>>> We may not need this check, since we know head != tail and the
>>> sizeof(ages) >= nr_all_objs.
>>>
>>> Hmm, I guess we can remove ages[] from the code.
>>
>> Just do a quick peek to avoid an unnecessary call of try_cmpxchg_release.
>> try_cmpxchg_release is implemented by a heavy instruction with a "LOCK" prefix,
>> which could cause cache invalidation across CPU nodes.
> 
> OK, I understand what this ages[] does. This is a nestable commit table
> for the ring array.
> 
>>
>>>
>>>> +			/* node must have been udpated by push() */
>>>> +			void *node = READ_ONCE(ents[id]);
>>>
>>> Please use the same word for the same object.
>>> I mean this is not 'node' but 'object'.
>>
>> Got it.
>>
>>>
>>>> +			/* commit and move forward head of the slot */
>>>> +			if (try_cmpxchg_release(&os->head, &head, head + 1))
>>>> +				return node;
>>>> +			/* head was already updated by others */
>>>> +		}
>>>> +
>>>> +		/* re-load head from memory and continue trying */
>>>> +		head = READ_ONCE(os->head);
>>>> +		/*
>>>> +		 * head stays unchanged, so it's very likely there's an
>>>> +		 * ongoing push() on other cpu nodes but yet not update
>>>> +		 * ages[] to mark it's completion
>>>> +		 */
>>>> +		if (head == prev)
>>>> +			break;
>>>
>>> This is OK. If we always push() on the current CPU slot, and pop() from
>>> any cpus, we can try again here if this slot is not current CPU. But that
>>> maybe to much :P
>>
>> Yes. For most cases, every CPU should only touch its own objpool_slot.
>>
>>> Thank you,
>>
>> Thanks for your time.
> 
> Thank you for your reply!
> 


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v9 1/5] lib: objpool added: ring-array based lockless MPMC
  2023-10-12 14:02       ` Masami Hiramatsu
@ 2023-10-12 17:36         ` wuqiang.matt
  2023-10-13  1:59           ` Masami Hiramatsu
  0 siblings, 1 reply; 25+ messages in thread
From: wuqiang.matt @ 2023-10-12 17:36 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: linux-trace-kernel, davem, anil.s.keshavamurthy, naveen.n.rao,
	rostedt, peterz, akpm, sander, ebiggers, dan.j.williams,
	jpoimboe, linux-kernel, lkp, mattwu

On 2023/10/12 22:02, Masami Hiramatsu (Google) wrote:
> Hi Wuqiang,
> 
> On Mon, 9 Oct 2023 17:23:34 +0800
> wuqiang <wuqiang.matt@bytedance.com> wrote:
> 
>> Hello Masami,
>>
>> Just got time for the new patch and noticed that ages[] was removed. ages[] is
>> introduced like a 2-phase commit to keep consistency and must be kept.
>>
>> Thinking of the following 2 cases that two cpu nodes are operating the same
>> objpool_slot simultaneously:
>>
>> Case 1:
>>
>>     NODE 1:                  NODE 2:
>>     push to an empty slot    pop will get wrong value
>>
>>     try_cmpxchg_acquire(&slot->tail, &tail, next)
>>                              try_cmpxchg_release(&slot->head, &head, head + 1)
>>                              return slot->entries[head & slot->mask]
>>     WRITE_ONCE(slot->entries[tail & slot->mask], obj)
> 
> Today, I rethink the algorithm.
> 
> For this case, we can just add a `commit` to the slot for committing the tail
> commit instead of the ages[].
> 
> CPU1                                       CPU2
> push to an empty slot                      pop from the same slot
> 
> commit = tail = slot->tail;
> next = tail + 1;
> try_cmpxchg_acquire(&slot->tail,
>                      &tail, next);
>                                             while (head != commit) -> NG1
> WRITE_ONCE(slot->entries[tail & slot->mask],
>              obj)
>                                             while (head != commit) -> NG2
> WRITE_ONCE(&slot->commit, next);
>                                             while (head != commit) -> OK
>                                             try_cmpxchg_acquire(&slot->head,
>                                                                 &head, next);
>                                             return slot->entries[head & slot->mask]
> 
> So the NG1 and NG2 timing, the pop will fail.
> 
> This does not expect the nested "push" case because the reserve-commit block
> will be interrupt disabled. This doesn't support NMI but that is OK.

If 2 pushes are performed in a row, slot->commit will be 'tail + 2', so CPU2
won't meet the condition "head == commit".

Since slot->commit is always synced to slot->tail after a successful push,
should pop check "tail != commit"? In that case, when a push is ongoing on
the slot, pop() has to wait for its completion even if there are objects in
the same slot.
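
To make that alternative concrete, here is a rough sketch of such a pop()
(only my reading of the idea being discussed, not the posted patch; it
assumes the slot carries head, tail, commit, mask and entries[] as in your
example, and simply gives up while commit still lags behind tail):

static void *objpool_try_get_slot_sketch(struct objpool_slot *slot)
{
	uint32_t head = smp_load_acquire(&slot->head);

	while (head != READ_ONCE(slot->tail)) {
		/* a push has reserved a position but not yet published it */
		if (READ_ONCE(slot->commit) != READ_ONCE(slot->tail))
			return NULL;

		/* commit caught up with tail: the entry at head is published */
		if (try_cmpxchg_release(&slot->head, &head, head + 1))
			return slot->entries[head & slot->mask];
		/* head was advanced by a concurrent pop(), retry */
	}

	return NULL;
}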

> If we need to support such NMI or remove irq-disable, we can do following
> (please remember the push operation only happens on the slot owned by that
>   CPU. So this is per-cpu process)
> 
> ---
> do {
>    commit = tail = slot->tail;
>    next = tail + 1;
> } while (!try_cmpxchg_acquire(&slot->tail, &tail, next));
> 
> WRITE_ONCE(slot->entries[tail & slot->mask], obj);
> 
> // At this point, "commit == slot->tail - 1" in this nest level.
> // Only outmost (non-nexted) case, "commit == slot->commit"
> if (commit != slot->commit)
>    return; // do nothing. it must be updated by outmost one.
> 
> // catch up commit to tail.
> do {
>    commit = READ_ONCE(slot->tail);
>    WRITE_ONCE(slot->commit, commit);
>     // note that someone can interrupt this loop too.
> } while (commit != READ_ONCE(slot->tail));
> ---

Yes, I see what you mean: push can only happen on the local node and the
outermost attempt has the right to extend 'commit'. It does resolve nested
push attempts, and preemption must be disabled.

The overflow/ABA issue can still happen if the cpu is interrupted for a
long time.
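
To put a number on that residual window (a stand-alone user-space snippet,
values purely illustrative): the 32-bit compare in the cmpxchg cannot
distinguish "no progress" from "exactly 2^32 operations of progress" made
while this cpu was stalled.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t head_seen = 5;				/* loaded before the stall */
	uint64_t real_progress = 5 + (1ULL << 32);	/* work done meanwhile */

	/* prints 1: the stale 32-bit snapshot still "matches" */
	printf("%d\n", (uint32_t)real_progress == head_seen);
	return 0;
}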

> For the rethook this may be too much.
> 
> Thank you,
> 
>>
>>
>> Case 2:
>>
>>     NODE 1:                  NODE 2
>>     push to slot w/ 1 obj    pop will get wrong value
>>
>>                              try_cmpxchg_release(&slot->head, &head, head + 1)
>>     try_cmpxchg_acquire(&slot->tail, &tail, next)
>>     WRITE_ONCE(slot->entries[tail & slot->mask], obj)
>>                              return slot->entries[head & slot->mask]

The pre-condition should be: CPU 1 tries to push to a full slot; in this case
tail = head + capacity, but tail & mask == head & mask.
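
For a concrete instance (numbers chosen only for illustration): with
capacity = 4 and mask = 3, a full slot could have head = 8 and tail = 12,
so tail - head equals the capacity, yet tail & mask and head & mask both
resolve to position 0, which is exactly the collision described above.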

>>
>> Regards,
>> wuqiang
>>
>> On 2023/9/25 17:42, Masami Hiramatsu (Google) wrote:
>>> Hi Wuqiang,
>>>
>>> On Tue,  5 Sep 2023 09:52:51 +0800
>>> "wuqiang.matt" <wuqiang.matt@bytedance.com> wrote:
>>>
>>>> The object pool is a scalable implementaion of high performance queue
>>>> for object allocation and reclamation, such as kretprobe instances.
>>>>
>>>> With leveraging percpu ring-array to mitigate the hot spot of memory
>>>> contention, it could deliver near-linear scalability for high parallel
>>>> scenarios. The objpool is best suited for following cases:
>>>> 1) Memory allocation or reclamation are prohibited or too expensive
>>>> 2) Consumers are of different priorities, such as irqs and threads
>>>>
>>>> Limitations:
>>>> 1) Maximum objects (capacity) is determined during pool initializing
>>>>      and can't be modified (extended) after objpool creation
>>>> 2) The memory of objects won't be freed until objpool is finalized
>>>> 3) Object allocation (pop) may fail after trying all cpu slots
>>>
>>> I made a simplifying patch on this by (mainly) removing ages array.
>>> I also rename local variable to use more readable names, like slot,
>>> pool, and obj.
>>>
>>> Here the results which I run the test_objpool.ko.
>>>
>>> Original:
>>> [   50.500235] Summary of testcases:
>>> [   50.503139]     duration: 1027135us 	hits:   30628293 	miss:          0 	sync: percpu objpool
>>> [   50.510416]     duration: 1047977us 	hits:   30453023 	miss:          0 	sync: percpu objpool from vmalloc
>>> [   50.517421]     duration: 1047098us 	hits:   31145034 	miss:          0 	sync & hrtimer: percpu objpool
>>> [   50.524671]     duration: 1053233us 	hits:   30919640 	miss:          0 	sync & hrtimer: percpu objpool from vmalloc
>>> [   50.531382]     duration: 1055822us 	hits:    3407221 	miss:     830219 	sync overrun: percpu objpool
>>> [   50.538135]     duration: 1055998us 	hits:    3404624 	miss:     854160 	sync overrun: percpu objpool from vmalloc
>>> [   50.546686]     duration: 1046104us 	hits:   19464798 	miss:          0 	async: percpu objpool
>>> [   50.552989]     duration: 1033054us 	hits:   18957836 	miss:          0 	async: percpu objpool from vmalloc
>>> [   50.560289]     duration: 1041778us 	hits:   33806868 	miss:          0 	async & hrtimer: percpu objpool
>>> [   50.567425]     duration: 1048901us 	hits:   34211860 	miss:          0 	async & hrtimer: percpu objpool from vmalloc
>>>
>>> Simplified:
>>> [   48.393236] Summary of testcases:
>>> [   48.395384]     duration: 1013002us 	hits:   29661448 	miss:          0 	sync: percpu objpool
>>> [   48.400351]     duration: 1057231us 	hits:   31035578 	miss:          0 	sync: percpu objpool from vmalloc
>>> [   48.405660]     duration: 1043196us 	hits:   30546652 	miss:          0 	sync & hrtimer: percpu objpool
>>> [   48.411216]     duration: 1047128us 	hits:   30601306 	miss:          0 	sync & hrtimer: percpu objpool from vmalloc
>>> [   48.417231]     duration: 1051695us 	hits:    3468287 	miss:     892881 	sync overrun: percpu objpool
>>> [   48.422405]     duration: 1054003us 	hits:    3549491 	miss:     898141 	sync overrun: percpu objpool from vmalloc
>>> [   48.428425]     duration: 1052946us 	hits:   19005228 	miss:          0 	async: percpu objpool
>>> [   48.433597]     duration: 1051757us 	hits:   19670866 	miss:          0 	async: percpu objpool from vmalloc
>>> [   48.439280]     duration: 1042951us 	hits:   37135332 	miss:          0 	async & hrtimer: percpu objpool
>>> [   48.445085]     duration: 1029803us 	hits:   37093248 	miss:          0 	async & hrtimer: percpu objpool from vmalloc
>>>
>>> Can you test it too?
>>>
>>> Thanks,
>>>
>>>   From f1f442ff653e329839e5452b8b88463a80a12ff3 Mon Sep 17 00:00:00 2001
>>> From: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
>>> Date: Mon, 25 Sep 2023 16:07:12 +0900
>>> Subject: [PATCH] objpool: Simplify objpool by removing ages array
>>>
>>> Simplify the objpool code by removing ages array from per-cpu slot.
>>> It chooses enough big capacity (which is a rounded up power of 2 value
>>> of nr_objects + 1) for the entries array, the tail never catch up to
>>> the head in per-cpu slot. Thus tail == head means the slot is empty.
>>>
>>> This also uses consistent local variable names for pool, slot and obj.
>>>
>>> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
>>> ---
>>>    include/linux/objpool.h |  61 ++++----
>>>    lib/objpool.c           | 310 ++++++++++++++++------------------------
>>>    2 files changed, 147 insertions(+), 224 deletions(-)
>>>
>>> diff --git a/include/linux/objpool.h b/include/linux/objpool.h
>>> index 33c832216b98..ecd5ecaffcd3 100644
>>> --- a/include/linux/objpool.h
>>> +++ b/include/linux/objpool.h
>>> @@ -38,33 +38,23 @@
>>>     * struct objpool_slot - percpu ring array of objpool
>>>     * @head: head of the local ring array (to retrieve at)
>>>     * @tail: tail of the local ring array (to append at)
>>> - * @bits: log2 of capacity (for bitwise operations)
>>> - * @mask: capacity - 1
>>> + * @mask: capacity of entries - 1
>>> + * @entries: object entries on this slot.
>>>     *
>>>     * Represents a cpu-local array-based ring buffer, its size is specialized
>>>     * during initialization of object pool. The percpu objpool slot is to be
>>>     * allocated from local memory for NUMA system, and to be kept compact in
>>> - * continuous memory: ages[] is stored just after the body of objpool_slot,
>>> - * and then entries[]. ages[] describes revision of each item, solely used
>>> - * to avoid ABA; entries[] contains pointers of the actual objects
>>> - *
>>> - * Layout of objpool_slot in memory:
>>> - *
>>> - * 64bit:
>>> - *        4      8      12     16        32                 64
>>> - * | head | tail | bits | mask | ages[4] | ents[4]: (8 * 4) | objects
>>> - *
>>> - * 32bit:
>>> - *        4      8      12     16        32        48       64
>>> - * | head | tail | bits | mask | ages[4] | ents[4] | unused | objects
>>> + * continuous memory: CPU assigned number of objects are stored just after
>>> + * the body of objpool_slot.
>>>     *
>>>     */
>>>    struct objpool_slot {
>>> -	uint32_t                head;
>>> -	uint32_t                tail;
>>> -	uint32_t                bits;
>>> -	uint32_t                mask;
>>> -} __packed;
>>> +	uint32_t	head;
>>> +	uint32_t	tail;
>>> +	uint32_t	mask;
>>> +	uint32_t	dummyr;
>>> +	void *		entries[];
>>> +};
>>>    
>>>    struct objpool_head;
>>>    
>>> @@ -82,12 +72,11 @@ typedef int (*objpool_fini_cb)(struct objpool_head *head, void *context);
>>>     * @obj_size:   object & element size
>>>     * @nr_objs:    total objs (to be pre-allocated)
>>>     * @nr_cpus:    nr_cpu_ids
>>> - * @capacity:   max objects per cpuslot
>>> + * @capacity:   max objects on percpu slot
>>>     * @gfp:        gfp flags for kmalloc & vmalloc
>>>     * @ref:        refcount for objpool
>>>     * @flags:      flags for objpool management
>>>     * @cpu_slots:  array of percpu slots
>>> - * @slot_sizes:	size in bytes of slots
>>>     * @release:    resource cleanup callback
>>>     * @context:    caller-provided context
>>>     */
>>> @@ -100,7 +89,6 @@ struct objpool_head {
>>>    	refcount_t              ref;
>>>    	unsigned long           flags;
>>>    	struct objpool_slot   **cpu_slots;
>>> -	int                    *slot_sizes;
>>>    	objpool_fini_cb         release;
>>>    	void                   *context;
>>>    };
>>> @@ -108,9 +96,12 @@ struct objpool_head {
>>>    #define OBJPOOL_FROM_VMALLOC	(0x800000000)	/* objpool allocated from vmalloc area */
>>>    #define OBJPOOL_HAVE_OBJECTS	(0x400000000)	/* objects allocated along with objpool */
>>>    
>>> +#define OBJPOOL_NR_OBJECT_MAX	(1 << 24)
>>> +#define OBJPOOL_OBJECT_SIZE_MAX	(1 << 16)
>>> +
>>>    /**
>>>     * objpool_init() - initialize objpool and pre-allocated objects
>>> - * @head:    the object pool to be initialized, declared by caller
>>> + * @pool:    the object pool to be initialized, declared by caller
>>>     * @nr_objs: total objects to be pre-allocated by this object pool
>>>     * @object_size: size of an object (should be > 0)
>>>     * @gfp:     flags for memory allocation (via kmalloc or vmalloc)
>>> @@ -128,47 +119,47 @@ struct objpool_head {
>>>     * pop (object allocation) or do clearance before each push (object
>>>     * reclamation).
>>>     */
>>> -int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
>>> +int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
>>>    		 gfp_t gfp, void *context, objpool_init_obj_cb objinit,
>>>    		 objpool_fini_cb release);
>>>    
>>>    /**
>>>     * objpool_pop() - allocate an object from objpool
>>> - * @head: object pool
>>> + * @pool: object pool
>>>     *
>>>     * return value: object ptr or NULL if failed
>>>     */
>>> -void *objpool_pop(struct objpool_head *head);
>>> +void *objpool_pop(struct objpool_head *pool);
>>>    
>>>    /**
>>>     * objpool_push() - reclaim the object and return back to objpool
>>>     * @obj:  object ptr to be pushed to objpool
>>> - * @head: object pool
>>> + * @pool: object pool
>>>     *
>>>     * return: 0 or error code (it fails only when user tries to push
>>>     * the same object multiple times or wrong "objects" into objpool)
>>>     */
>>> -int objpool_push(void *obj, struct objpool_head *head);
>>> +int objpool_push(void *obj, struct objpool_head *pool);
>>>    
>>>    /**
>>>     * objpool_drop() - discard the object and deref objpool
>>>     * @obj:  object ptr to be discarded
>>> - * @head: object pool
>>> + * @pool: object pool
>>>     *
>>>     * return: 0 if objpool was released or error code
>>>     */
>>> -int objpool_drop(void *obj, struct objpool_head *head);
>>> +int objpool_drop(void *obj, struct objpool_head *pool);
>>>    
>>>    /**
>>>     * objpool_free() - release objpool forcely (all objects to be freed)
>>> - * @head: object pool to be released
>>> + * @pool: object pool to be released
>>>     */
>>> -void objpool_free(struct objpool_head *head);
>>> +void objpool_free(struct objpool_head *pool);
>>>    
>>>    /**
>>>     * objpool_fini() - deref object pool (also releasing unused objects)
>>> - * @head: object pool to be dereferenced
>>> + * @pool: object pool to be dereferenced
>>>     */
>>> -void objpool_fini(struct objpool_head *head);
>>> +void objpool_fini(struct objpool_head *pool);
>>>    
>>>    #endif /* _LINUX_OBJPOOL_H */
>>> diff --git a/lib/objpool.c b/lib/objpool.c
>>> index 22e752371820..f8e8f70d7253 100644
>>> --- a/lib/objpool.c
>>> +++ b/lib/objpool.c
>>> @@ -15,104 +15,55 @@
>>>     * Copyright: wuqiang.matt@bytedance.com
>>>     */
>>>    
>>> -#define SLOT_AGES(s) ((uint32_t *)((char *)(s) + sizeof(struct objpool_slot)))
>>> -#define SLOT_ENTS(s) ((void **)((char *)(s) + sizeof(struct objpool_slot) + \
>>> -			(sizeof(uint32_t) << (s)->bits)))
>>> -#define SLOT_OBJS(s) ((void *)((char *)(s) + sizeof(struct objpool_slot) + \
>>> -			((sizeof(uint32_t) + sizeof(void *)) << (s)->bits)))
>>> -#define SLOT_CORE(n) cpumask_nth((n) % num_possible_cpus(), cpu_possible_mask)
>>> -
>>> -/* compute the suitable num of objects to be managed per slot */
>>> -static int objpool_nobjs(int size)
>>> -{
>>> -	return rounddown_pow_of_two((size - sizeof(struct objpool_slot)) /
>>> -			(sizeof(uint32_t) + sizeof(void *)));
>>> -}
>>> -
>>>    /* allocate and initialize percpu slots */
>>>    static int
>>> -objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
>>> -			void *context, objpool_init_obj_cb objinit)
>>> +objpool_init_percpu_slots(struct objpool_head *pool, void *context,
>>> +			  objpool_init_obj_cb objinit)
>>>    {
>>> -	int i, j, n, size, objsz, cpu = 0, nents = head->capacity;
>>> -
>>> -	/* aligned object size by sizeof(void *) */
>>> -	objsz = ALIGN(head->obj_size, sizeof(void *));
>>> -	/* shall we allocate objects along with percpu-slot */
>>> -	if (objsz)
>>> -		head->flags |= OBJPOOL_HAVE_OBJECTS;
>>> -
>>> -	/* vmalloc is used in default to allocate percpu-slots */
>>> -	if (!(head->gfp & GFP_ATOMIC))
>>> -		head->flags |= OBJPOOL_FROM_VMALLOC;
>>> -
>>> -	for (i = 0; i < head->nr_cpus; i++) {
>>> -		struct objpool_slot *os;
>>> +	int i, j, n, size, slot_size, cpu_count = 0;
>>> +	struct objpool_slot *slot;
>>>    
>>> +	for (i = 0; i < pool->nr_cpus; i++) {
>>>    		/* skip the cpus which could never be present */
>>>    		if (!cpu_possible(i))
>>>    			continue;
>>>    
>>>    		/* compute how many objects to be managed by this slot */
>>> -		n = nobjs / num_possible_cpus();
>>> -		if (cpu < (nobjs % num_possible_cpus()))
>>> +		n = pool->nr_objs / num_possible_cpus();
>>> +		if (cpu_count < (pool->nr_objs % num_possible_cpus()))
>>>    			n++;
>>> -		size = sizeof(struct objpool_slot) + sizeof(void *) * nents +
>>> -		       sizeof(uint32_t) * nents + objsz * n;
>>> +		cpu_count++;
>>> +
>>> +		slot_size = struct_size(slot, entries, pool->capacity);
>>> +		size = slot_size + pool->obj_size * n;
>>>    
>>>    		/*
>>>    		 * here we allocate percpu-slot & objects together in a single
>>> -		 * allocation, taking advantage of warm caches and TLB hits as
>>> -		 * vmalloc always aligns the request size to pages
>>> +		 * allocation, taking advantage on NUMA.
>>>    		 */
>>> -		if (head->flags & OBJPOOL_FROM_VMALLOC)
>>> -			os = __vmalloc_node(size, sizeof(void *), head->gfp,
>>> +		if (pool->flags & OBJPOOL_FROM_VMALLOC)
>>> +			slot = __vmalloc_node(size, sizeof(void *), pool->gfp,
>>>    				cpu_to_node(i), __builtin_return_address(0));
>>>    		else
>>> -			os = kmalloc_node(size, head->gfp, cpu_to_node(i));
>>> -		if (!os)
>>> +			slot = kmalloc_node(size, pool->gfp, cpu_to_node(i));
>>> +		if (!slot)
>>>    			return -ENOMEM;
>>>    
>>>    		/* initialize percpu slot for the i-th slot */
>>> -		memset(os, 0, size);
>>> -		os->bits = ilog2(head->capacity);
>>> -		os->mask = head->capacity - 1;
>>> -		head->cpu_slots[i] = os;
>>> -		head->slot_sizes[i] = size;
>>> -		cpu = cpu + 1;
>>> -
>>> -		/*
>>> -		 * manually set head & tail to avoid possible conflict:
>>> -		 * We assume that the head item is ready for retrieval
>>> -		 * iff head is equal to ages[head & mask]. but ages is
>>> -		 * initialized as 0, so in view of the caller of pop(),
>>> -		 * the 1st item (0th) is always ready, but the reality
>>> -		 * could be: push() is stalled before the final update,
>>> -		 * thus the item being inserted will be lost forever
>>> -		 */
>>> -		os->head = os->tail = head->capacity;
>>> -
>>> -		if (!objsz)
>>> -			continue;
>>> +		memset(slot, 0, size);
>>> +		slot->mask = pool->capacity - 1;
>>> +		pool->cpu_slots[i] = slot;
>>>    
>>>    		for (j = 0; j < n; j++) {
>>> -			uint32_t *ages = SLOT_AGES(os);
>>> -			void **ents = SLOT_ENTS(os);
>>> -			void *obj = SLOT_OBJS(os) + j * objsz;
>>> -			uint32_t ie = os->tail & os->mask;
>>> +			void *obj = (void *)slot + slot_size + pool->obj_size * j;
>>>    
>>> -			/* perform object initialization */
>>>    			if (objinit) {
>>>    				int rc = objinit(obj, context);
>>>    				if (rc)
>>>    					return rc;
>>>    			}
>>> -
>>> -			/* add obj into the ring array */
>>> -			ents[ie] = obj;
>>> -			ages[ie] = os->tail;
>>> -			os->tail++;
>>> -			head->nr_objs++;
>>> +			slot->entries[j] = obj;
>>> +			slot->tail++;
>>>    		}
>>>    	}
>>>    
>>> @@ -120,164 +71,135 @@ objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
>>>    }
>>>    
>>>    /* cleanup all percpu slots of the object pool */
>>> -static void objpool_fini_percpu_slots(struct objpool_head *head)
>>> +static void objpool_fini_percpu_slots(struct objpool_head *pool)
>>>    {
>>>    	int i;
>>>    
>>> -	if (!head->cpu_slots)
>>> +	if (!pool->cpu_slots)
>>>    		return;
>>>    
>>> -	for (i = 0; i < head->nr_cpus; i++) {
>>> -		if (!head->cpu_slots[i])
>>> -			continue;
>>> -		if (head->flags & OBJPOOL_FROM_VMALLOC)
>>> -			vfree(head->cpu_slots[i]);
>>> -		else
>>> -			kfree(head->cpu_slots[i]);
>>> -	}
>>> -	kfree(head->cpu_slots);
>>> -	head->cpu_slots = NULL;
>>> -	head->slot_sizes = NULL;
>>> +	for (i = 0; i < pool->nr_cpus; i++)
>>> +		kvfree(pool->cpu_slots[i]);
>>> +	kfree(pool->cpu_slots);
>>>    }
>>>    
>>>    /* initialize object pool and pre-allocate objects */
>>> -int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
>>> +int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
>>>    		gfp_t gfp, void *context, objpool_init_obj_cb objinit,
>>>    		objpool_fini_cb release)
>>>    {
>>>    	int nents, rc;
>>>    
>>>    	/* check input parameters */
>>> -	if (nr_objs <= 0 || object_size <= 0)
>>> +	if (nr_objs <= 0 || object_size <= 0 ||
>>> +	    nr_objs > OBJPOOL_NR_OBJECT_MAX || object_size > OBJPOOL_OBJECT_SIZE_MAX)
>>> +		return -EINVAL;
>>> +
>>> +	/* Align up to unsigned long size */
>>> +	object_size = ALIGN(object_size, sizeof(unsigned long));
>>> +
>>> +	/*
>>> +	 * To avoid filling up the entries array in the per-cpu slot,
>>> +	 * use the power of 2 which is more than N + 1. Thus, the tail never
>>> +	 * catch up to the pool, and able to use pool/tail as the sequencial
>>> +	 * number.
>>> +	 */
>>> +	nents = roundup_pow_of_two(nr_objs + 1);
>>> +	if (!nents)
>>>    		return -EINVAL;
>>>    
>>> -	/* calculate percpu slot size (rounded to pow of 2) */
>>> -	nents = max_t(int, roundup_pow_of_two(nr_objs),
>>> -			objpool_nobjs(L1_CACHE_BYTES));
>>> -
>>> -	/* initialize objpool head */
>>> -	memset(head, 0, sizeof(struct objpool_head));
>>> -	head->nr_cpus = nr_cpu_ids;
>>> -	head->obj_size = object_size;
>>> -	head->capacity = nents;
>>> -	head->gfp = gfp & ~__GFP_ZERO;
>>> -	head->context = context;
>>> -	head->release = release;
>>> -
>>> -	/* allocate array for percpu slots */
>>> -	head->cpu_slots = kzalloc(head->nr_cpus * sizeof(void *) +
>>> -			       head->nr_cpus * sizeof(int), head->gfp);
>>> -	if (!head->cpu_slots)
>>> +	/* initialize objpool pool */
>>> +	memset(pool, 0, sizeof(struct objpool_head));
>>> +	pool->nr_cpus = nr_cpu_ids;
>>> +	pool->obj_size = object_size;
>>> +	pool->nr_objs = nr_objs;
>>> +	/* just prevent to fullfill the per-cpu ring array */
>>> +	pool->capacity = nents;
>>> +	pool->gfp = gfp & ~__GFP_ZERO;
>>> +	pool->context = context;
>>> +	pool->release = release;
>>> +	/* vmalloc is used in default to allocate percpu-slots */
>>> +	if (!(pool->gfp & GFP_ATOMIC))
>>> +		pool->flags |= OBJPOOL_FROM_VMALLOC;
>>> +
>>> +	pool->cpu_slots = kzalloc(pool->nr_cpus * sizeof(void *), pool->gfp);
>>> +	if (!pool->cpu_slots)
>>>    		return -ENOMEM;
>>> -	head->slot_sizes = (int *)&head->cpu_slots[head->nr_cpus];
>>>    
>>>    	/* initialize per-cpu slots */
>>> -	rc = objpool_init_percpu_slots(head, nr_objs, context, objinit);
>>> +	rc = objpool_init_percpu_slots(pool, context, objinit);
>>>    	if (rc)
>>> -		objpool_fini_percpu_slots(head);
>>> +		objpool_fini_percpu_slots(pool);
>>>    	else
>>> -		refcount_set(&head->ref, nr_objs + 1);
>>> +		refcount_set(&pool->ref, nr_objs + 1);
>>>    
>>>    	return rc;
>>>    }
>>>    EXPORT_SYMBOL_GPL(objpool_init);
>>>    
>>>    /* adding object to slot, abort if the slot was already full */
>>> -static inline int objpool_try_add_slot(void *obj, struct objpool_slot *os)
>>> +static inline int objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu)
>>>    {
>>> -	uint32_t *ages = SLOT_AGES(os);
>>> -	void **ents = SLOT_ENTS(os);
>>> -	uint32_t head, tail;
>>> +	struct objpool_slot *slot = pool->cpu_slots[cpu];
>>> +	uint32_t tail, next;
>>>    
>>>    	do {
>>> -		/* perform memory loading for both head and tail */
>>> -		head = READ_ONCE(os->head);
>>> -		tail = READ_ONCE(os->tail);
>>> -		/* just abort if slot is full */
>>> -		if (tail - head > os->mask)
>>> -			return -ENOENT;
>>> -		/* try to extend tail by 1 using CAS to avoid races */
>>> -		if (try_cmpxchg_acquire(&os->tail, &tail, tail + 1))
>>> -			break;
>>> -	} while (1);
>>> +		uint32_t head = READ_ONCE(slot->head);
>>>    
>>> -	/* the tail-th of slot is reserved for the given obj */
>>> -	WRITE_ONCE(ents[tail & os->mask], obj);
>>> -	/* update epoch id to make this object available for pop() */
>>> -	smp_store_release(&ages[tail & os->mask], tail);
>>> +		tail = READ_ONCE(slot->tail);
>>> +		next = tail + 1;
>>> +
>>> +		/* This must never happen because capacity >= N + 1 */
>>> +		if (WARN_ON_ONCE((next > head && next - head > pool->nr_objs) ||
>>> +				 (next < head && next > head + pool->nr_objs)))
>>> +			return -EINVAL;
>>> +
>>> +	} while (!try_cmpxchg_acquire(&slot->tail, &tail, next));
>>> +
>>> +	WRITE_ONCE(slot->entries[tail & slot->mask], obj);
>>>    	return 0;
>>>    }
>>>    
>>>    /* reclaim an object to object pool */
>>> -int objpool_push(void *obj, struct objpool_head *oh)
>>> +int objpool_push(void *obj, struct objpool_head *pool)
>>>    {
>>>    	unsigned long flags;
>>> -	int cpu, rc;
>>> +	int rc;
>>>    
>>>    	/* disable local irq to avoid preemption & interruption */
>>>    	raw_local_irq_save(flags);
>>> -	cpu = raw_smp_processor_id();
>>> -	do {
>>> -		rc = objpool_try_add_slot(obj, oh->cpu_slots[cpu]);
>>> -		if (!rc)
>>> -			break;
>>> -		cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
>>> -	} while (1);
>>> +	rc = objpool_try_add_slot(obj, pool, raw_smp_processor_id());
>>>    	raw_local_irq_restore(flags);
>>>    
>>>    	return rc;
>>>    }
>>>    EXPORT_SYMBOL_GPL(objpool_push);
>>>    
>>> -/* drop the allocated object, rather reclaim it to objpool */
>>> -int objpool_drop(void *obj, struct objpool_head *head)
>>> -{
>>> -	if (!obj || !head)
>>> -		return -EINVAL;
>>> -
>>> -	if (refcount_dec_and_test(&head->ref)) {
>>> -		objpool_free(head);
>>> -		return 0;
>>> -	}
>>> -
>>> -	return -EAGAIN;
>>> -}
>>> -EXPORT_SYMBOL_GPL(objpool_drop);
>>> -
>>>    /* try to retrieve object from slot */
>>> -static inline void *objpool_try_get_slot(struct objpool_slot *os)
>>> +static inline void *objpool_try_get_slot(struct objpool_slot *slot)
>>>    {
>>> -	uint32_t *ages = SLOT_AGES(os);
>>> -	void **ents = SLOT_ENTS(os);
>>>    	/* do memory load of head to local head */
>>> -	uint32_t head = smp_load_acquire(&os->head);
>>> +	uint32_t head = smp_load_acquire(&slot->head);
>>>    
>>>    	/* loop if slot isn't empty */
>>> -	while (head != READ_ONCE(os->tail)) {
>>> -		uint32_t id = head & os->mask, prev = head;
>>> +	while (head != READ_ONCE(slot->tail)) {
>>>    
>>>    		/* do prefetching of object ents */
>>> -		prefetch(&ents[id]);
>>> -
>>> -		/* check whether this item was ready for retrieval */
>>> -		if (smp_load_acquire(&ages[id]) == head) {
>>> -			/* node must have been udpated by push() */
>>> -			void *node = READ_ONCE(ents[id]);
>>> -			/* commit and move forward head of the slot */
>>> -			if (try_cmpxchg_release(&os->head, &head, head + 1))
>>> -				return node;
>>> -			/* head was already updated by others */
>>> -		}
>>> +		prefetch(&slot->entries[head & slot->mask]);
>>> +
>>> +		/* commit and move forward head of the slot */
>>> +		if (try_cmpxchg_release(&slot->head, &head, head + 1))
>>> +			/*
>>> +			 * TBD: check overwrap the tail/head counter and warn
>>> +			 * if it is broken. But this happens only if this
>>> +			 * process slows down a lot and another CPU updates
>>> +			 * the haed/tail just 2^32 + 1 times, and this slot
>>> +			 * is empty.
>>> +			 */
>>> +			return slot->entries[head & slot->mask];
>>>    
>>>    		/* re-load head from memory and continue trying */
>>> -		head = READ_ONCE(os->head);
>>> -		/*
>>> -		 * head stays unchanged, so it's very likely there's an
>>> -		 * ongoing push() on other cpu nodes but yet not update
>>> -		 * ages[] to mark it's completion
>>> -		 */
>>> -		if (head == prev)
>>> -			break;
>>> +		head = READ_ONCE(slot->head);
>>>    	}
>>>    
>>>    	return NULL;
>>> @@ -307,32 +229,42 @@ void *objpool_pop(struct objpool_head *head)
>>>    EXPORT_SYMBOL_GPL(objpool_pop);
>>>    
>>>    /* release whole objpool forcely */
>>> -void objpool_free(struct objpool_head *head)
>>> +void objpool_free(struct objpool_head *pool)
>>>    {
>>> -	if (!head->cpu_slots)
>>> +	if (!pool->cpu_slots)
>>>    		return;
>>>    
>>>    	/* release percpu slots */
>>> -	objpool_fini_percpu_slots(head);
>>> +	objpool_fini_percpu_slots(pool);
>>>    
>>>    	/* call user's cleanup callback if provided */
>>> -	if (head->release)
>>> -		head->release(head, head->context);
>>> +	if (pool->release)
>>> +		pool->release(pool, pool->context);
>>>    }
>>>    EXPORT_SYMBOL_GPL(objpool_free);
>>>    
>>> -/* drop unused objects and defref objpool for releasing */
>>> -void objpool_fini(struct objpool_head *head)
>>> +/* drop the allocated object, rather reclaim it to objpool */
>>> +int objpool_drop(void *obj, struct objpool_head *pool)
>>>    {
>>> -	void *nod;
>>> +	if (!obj || !pool)
>>> +		return -EINVAL;
>>>    
>>> -	do {
>>> -		/* grab object from objpool and drop it */
>>> -		nod = objpool_pop(head);
>>> +	if (refcount_dec_and_test(&pool->ref)) {
>>> +		objpool_free(pool);
>>> +		return 0;
>>> +	}
>>> +
>>> +	return -EAGAIN;
>>> +}
>>> +EXPORT_SYMBOL_GPL(objpool_drop);
>>> +
>>> +/* drop unused objects and defref objpool for releasing */
>>> +void objpool_fini(struct objpool_head *pool)
>>> +{
>>> +	void *obj;
>>>    
>>> -		/* drop the extra ref of objpool */
>>> -		if (refcount_dec_and_test(&head->ref))
>>> -			objpool_free(head);
>>> -	} while (nod);
>>> +	/* grab object from objpool and drop it */
>>> +	while ((obj = objpool_pop(pool)))
>>> +		objpool_drop(obj, pool);
>>>    }
>>>    EXPORT_SYMBOL_GPL(objpool_fini);
>>
> 
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v9 1/5] lib: objpool added: ring-array based lockless MPMC
  2023-10-12 17:36         ` wuqiang.matt
@ 2023-10-13  1:59           ` Masami Hiramatsu
  2023-10-13  3:03             ` wuqiang.matt
  0 siblings, 1 reply; 25+ messages in thread
From: Masami Hiramatsu @ 2023-10-13  1:59 UTC (permalink / raw)
  To: wuqiang.matt
  Cc: linux-trace-kernel, davem, anil.s.keshavamurthy, naveen.n.rao,
	rostedt, peterz, akpm, sander, ebiggers, dan.j.williams,
	jpoimboe, linux-kernel, lkp, mattwu

On Fri, 13 Oct 2023 01:36:05 +0800
"wuqiang.matt" <wuqiang.matt@bytedance.com> wrote:

> On 2023/10/12 22:02, Masami Hiramatsu (Google) wrote:
> > Hi Wuqiang,
> > 
> > On Mon, 9 Oct 2023 17:23:34 +0800
> > wuqiang <wuqiang.matt@bytedance.com> wrote:
> > 
> >> Hello Masami,
> >>
> >> Just got time for the new patch and got that ages[] was removed. ages[] is
> >> introduced the way like 2-phase commit to keep consitency and must be kept.
> >>
> >> Thinking of the following 2 cases that two cpu nodes are operating the same
> >> objpool_slot simultaneously:
> >>
> >> Case 1:
> >>
> >>     NODE 1:                  NODE 2:
> >>     push to an empty slot    pop will get wrong value
> >>
> >>     try_cmpxchg_acquire(&slot->tail, &tail, next)
> >>                              try_cmpxchg_release(&slot->head, &head, head + 1)
> >>                              return slot->entries[head & slot->mask]
> >>     WRITE_ONCE(slot->entries[tail & slot->mask], obj)
> > 
> > Today, I rethink the algorithm.
> > 
> > For this case, we can just add a `commit` to the slot for committing the tail
> > commit instead of the ages[].
> > 
> > CPU1                                       CPU2
> > push to an empty slot                      pop from the same slot
> > 
> > commit = tail = slot->tail;
> > next = tail + 1;
> > try_cmpxchg_acquire(&slot->tail,
> >                      &tail, next);
> >                                             while (head != commit) -> NG1
> > WRITE_ONCE(slot->entries[tail & slot->mask],
> >              obj)
> >                                             while (head != commit) -> NG2
> > WRITE_ONCE(&slot->commit, next);
> >                                             while (head != commit) -> OK
> >                                             try_cmpxchg_acquire(&slot->head,
> >                                                                 &head, next);
> >                                             return slot->entries[head & slot->mask]
> > 
> > So the NG1 and NG2 timing, the pop will fail.
> > 
> > This does not expect the nested "push" case because the reserve-commit block
> > will be interrupt disabled. This doesn't support NMI but that is OK.
> 
> If 2 pushes are performed in a row, slot->commit will be 'tail + 2', so CPU2
> won't meet the condition "head == commit".

No, it doesn't happen, because pop() may fetch an object from another CPU's
slot, but push() only happens on the CPU's own slot. So pushes on the same
slot never run in parallel.

> 
> Since slot->commit is always synced to slot->tail after a successful push,
> should pop check "tail != commit"?

No, pop() only needs to check "head != commit". If "head == commit", the slot
is either empty or the slot owner is in the middle of a push() (not committed
yet).
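
For example, the pop side could look like this (rough sketch only; it assumes
the new 'commit' field in struct objpool_slot and leaves the memory-ordering
details out):

---
static inline void *objpool_try_get_slot(struct objpool_slot *slot)
{
	uint32_t head = smp_load_acquire(&slot->head);

	/* head == commit: empty, or the owner has not committed a push yet */
	while (head != READ_ONCE(slot->commit)) {
		void *obj = READ_ONCE(slot->entries[head & slot->mask]);

		/* move head forward; on failure 'head' is reloaded by cmpxchg */
		if (try_cmpxchg_release(&slot->head, &head, head + 1))
			return obj;
	}

	return NULL;
}
---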

> In that case, when a push is ongoing on the slot, pop() would have to wait
> for its completion even if there are objects available in the same slot.

No, pop() will move on to other slots to find an available object.
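
I.e. something like this at the pop() level (simplified sketch, not the exact
patch code):

---
void *objpool_pop(struct objpool_head *pool)
{
	unsigned long flags;
	void *obj = NULL;
	int i, cpu;

	raw_local_irq_save(flags);
	cpu = raw_smp_processor_id();
	for (i = 0; i < num_possible_cpus(); i++) {
		obj = objpool_try_get_slot(pool->cpu_slots[cpu]);
		if (obj)
			break;
		/* unlike push(), pop() may steal from the other cpus' slots */
		cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
	}
	raw_local_irq_restore(flags);

	return obj;
}
---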

> 
> > If we need to support such NMI or remove irq-disable, we can do following
> > (please remember the push operation only happens on the slot owned by that
> >   CPU. So this is per-cpu process)
> > 
> > ---
> > do {
> >    commit = tail = slot->tail;
> >    next = tail + 1;
> > } while (!try_cmpxchg_acquire(&slot->tail, &tail, next));
> > 
> > WRITE_ONCE(slot->entries[tail & slot->mask], obj);
> > 
> > // At this point, "commit == slot->tail - 1" in this nest level.
> > // Only outmost (non-nexted) case, "commit == slot->commit"
> > if (commit != slot->commit)
> >    return; // do nothing. it must be updated by outmost one.
> > 
> > // catch up commit to tail.
> > do {
> >    commit = READ_ONCE(slot->tail);
> >    WRITE_ONCE(slot->commit, commit);
> >     // note that someone can interrupt this loop too.
> > } while (commit != READ_ONCE(slot->tail));
> > ---
> 
> Yes, I got it: push can only happen on the local node, and the outermost
> attempt has the right to advance 'commit'. It does resolve nested push
> attempts, and preemption must be disabled.

Yes.

> 
> The overflow/ABA issue can still happen if the cpu is interrupted for a
> long time.

Yes, but ages[] cannot avoid that either. (Note that head, tail and commit
are 32bit counters, and the per-cpu slot ring is big enough to store the
pointers of all objects.)
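
With small numbers, the reason the tail never catches up to the head
(illustration only, not from the patch):

---
/*
 *   nr_objs  = 4
 *   capacity = roundup_pow_of_two(nr_objs + 1) = 8, mask = 7
 *
 * At most nr_objs objects can ever sit in one slot, so
 *
 *   tail - head <= 4 < 8 == capacity
 *
 * and (tail & mask) never collides with (head & mask) while the slot
 * holds objects; "head == tail" (or "head == commit") is an unambiguous
 * "empty" test without the ages[] array.
 */
---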

Thank you,

> 
> > For the rethook this may be too much.
> > 
> > Thank you,
> > 
> >>
> >>
> >> Case 2:
> >>
> >>     NODE 1:                  NODE 2
> >>     push to slot w/ 1 obj    pop will get wrong value
> >>
> >>                              try_cmpxchg_release(&slot->head, &head, head + 1)
> >>     try_cmpxchg_acquire(&slot->tail, &tail, next)
> >>     WRITE_ONCE(slot->entries[tail & slot->mask], obj)
> >>                              return slot->entries[head & slot->mask]
> 
> The pre-condition should be: CPU 1 tries to push to a full slot; in this case
> tail == head + capacity, but tail & mask == head & mask.
> 
> >>
> >> Regards,
> >> wuqiang
> >>
> >> On 2023/9/25 17:42, Masami Hiramatsu (Google) wrote:
> >>> Hi Wuqiang,
> >>>
> >>> On Tue,  5 Sep 2023 09:52:51 +0800
> >>> "wuqiang.matt" <wuqiang.matt@bytedance.com> wrote:
> >>>
> >>>> The object pool is a scalable implementaion of high performance queue
> >>>> for object allocation and reclamation, such as kretprobe instances.
> >>>>
> >>>> With leveraging percpu ring-array to mitigate the hot spot of memory
> >>>> contention, it could deliver near-linear scalability for high parallel
> >>>> scenarios. The objpool is best suited for following cases:
> >>>> 1) Memory allocation or reclamation are prohibited or too expensive
> >>>> 2) Consumers are of different priorities, such as irqs and threads
> >>>>
> >>>> Limitations:
> >>>> 1) Maximum objects (capacity) is determined during pool initializing
> >>>>      and can't be modified (extended) after objpool creation
> >>>> 2) The memory of objects won't be freed until objpool is finalized
> >>>> 3) Object allocation (pop) may fail after trying all cpu slots
> >>>
> >>> I made a simplifying patch on this by (mainly) removing ages array.
> >>> I also rename local variable to use more readable names, like slot,
> >>> pool, and obj.
> >>>
> >>> Here the results which I run the test_objpool.ko.
> >>>
> >>> Original:
> >>> [   50.500235] Summary of testcases:
> >>> [   50.503139]     duration: 1027135us 	hits:   30628293 	miss:          0 	sync: percpu objpool
> >>> [   50.510416]     duration: 1047977us 	hits:   30453023 	miss:          0 	sync: percpu objpool from vmalloc
> >>> [   50.517421]     duration: 1047098us 	hits:   31145034 	miss:          0 	sync & hrtimer: percpu objpool
> >>> [   50.524671]     duration: 1053233us 	hits:   30919640 	miss:          0 	sync & hrtimer: percpu objpool from vmalloc
> >>> [   50.531382]     duration: 1055822us 	hits:    3407221 	miss:     830219 	sync overrun: percpu objpool
> >>> [   50.538135]     duration: 1055998us 	hits:    3404624 	miss:     854160 	sync overrun: percpu objpool from vmalloc
> >>> [   50.546686]     duration: 1046104us 	hits:   19464798 	miss:          0 	async: percpu objpool
> >>> [   50.552989]     duration: 1033054us 	hits:   18957836 	miss:          0 	async: percpu objpool from vmalloc
> >>> [   50.560289]     duration: 1041778us 	hits:   33806868 	miss:          0 	async & hrtimer: percpu objpool
> >>> [   50.567425]     duration: 1048901us 	hits:   34211860 	miss:          0 	async & hrtimer: percpu objpool from vmalloc
> >>>
> >>> Simplified:
> >>> [   48.393236] Summary of testcases:
> >>> [   48.395384]     duration: 1013002us 	hits:   29661448 	miss:          0 	sync: percpu objpool
> >>> [   48.400351]     duration: 1057231us 	hits:   31035578 	miss:          0 	sync: percpu objpool from vmalloc
> >>> [   48.405660]     duration: 1043196us 	hits:   30546652 	miss:          0 	sync & hrtimer: percpu objpool
> >>> [   48.411216]     duration: 1047128us 	hits:   30601306 	miss:          0 	sync & hrtimer: percpu objpool from vmalloc
> >>> [   48.417231]     duration: 1051695us 	hits:    3468287 	miss:     892881 	sync overrun: percpu objpool
> >>> [   48.422405]     duration: 1054003us 	hits:    3549491 	miss:     898141 	sync overrun: percpu objpool from vmalloc
> >>> [   48.428425]     duration: 1052946us 	hits:   19005228 	miss:          0 	async: percpu objpool
> >>> [   48.433597]     duration: 1051757us 	hits:   19670866 	miss:          0 	async: percpu objpool from vmalloc
> >>> [   48.439280]     duration: 1042951us 	hits:   37135332 	miss:          0 	async & hrtimer: percpu objpool
> >>> [   48.445085]     duration: 1029803us 	hits:   37093248 	miss:          0 	async & hrtimer: percpu objpool from vmalloc
> >>>
> >>> Can you test it too?
> >>>
> >>> Thanks,
> >>>
> >>>   From f1f442ff653e329839e5452b8b88463a80a12ff3 Mon Sep 17 00:00:00 2001
> >>> From: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
> >>> Date: Mon, 25 Sep 2023 16:07:12 +0900
> >>> Subject: [PATCH] objpool: Simplify objpool by removing ages array
> >>>
> >>> Simplify the objpool code by removing ages array from per-cpu slot.
> >>> It chooses enough big capacity (which is a rounded up power of 2 value
> >>> of nr_objects + 1) for the entries array, the tail never catch up to
> >>> the head in per-cpu slot. Thus tail == head means the slot is empty.
> >>>
> >>> This also uses consistent local variable names for pool, slot and obj.
> >>>
> >>> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> >>> ---
> >>>    include/linux/objpool.h |  61 ++++----
> >>>    lib/objpool.c           | 310 ++++++++++++++++------------------------
> >>>    2 files changed, 147 insertions(+), 224 deletions(-)
> >>>
> >>> diff --git a/include/linux/objpool.h b/include/linux/objpool.h
> >>> index 33c832216b98..ecd5ecaffcd3 100644
> >>> --- a/include/linux/objpool.h
> >>> +++ b/include/linux/objpool.h
> >>> @@ -38,33 +38,23 @@
> >>>     * struct objpool_slot - percpu ring array of objpool
> >>>     * @head: head of the local ring array (to retrieve at)
> >>>     * @tail: tail of the local ring array (to append at)
> >>> - * @bits: log2 of capacity (for bitwise operations)
> >>> - * @mask: capacity - 1
> >>> + * @mask: capacity of entries - 1
> >>> + * @entries: object entries on this slot.
> >>>     *
> >>>     * Represents a cpu-local array-based ring buffer, its size is specialized
> >>>     * during initialization of object pool. The percpu objpool slot is to be
> >>>     * allocated from local memory for NUMA system, and to be kept compact in
> >>> - * continuous memory: ages[] is stored just after the body of objpool_slot,
> >>> - * and then entries[]. ages[] describes revision of each item, solely used
> >>> - * to avoid ABA; entries[] contains pointers of the actual objects
> >>> - *
> >>> - * Layout of objpool_slot in memory:
> >>> - *
> >>> - * 64bit:
> >>> - *        4      8      12     16        32                 64
> >>> - * | head | tail | bits | mask | ages[4] | ents[4]: (8 * 4) | objects
> >>> - *
> >>> - * 32bit:
> >>> - *        4      8      12     16        32        48       64
> >>> - * | head | tail | bits | mask | ages[4] | ents[4] | unused | objects
> >>> + * continuous memory: CPU assigned number of objects are stored just after
> >>> + * the body of objpool_slot.
> >>>     *
> >>>     */
> >>>    struct objpool_slot {
> >>> -	uint32_t                head;
> >>> -	uint32_t                tail;
> >>> -	uint32_t                bits;
> >>> -	uint32_t                mask;
> >>> -} __packed;
> >>> +	uint32_t	head;
> >>> +	uint32_t	tail;
> >>> +	uint32_t	mask;
> >>> +	uint32_t	dummyr;
> >>> +	void *		entries[];
> >>> +};
> >>>    
> >>>    struct objpool_head;
> >>>    
> >>> @@ -82,12 +72,11 @@ typedef int (*objpool_fini_cb)(struct objpool_head *head, void *context);
> >>>     * @obj_size:   object & element size
> >>>     * @nr_objs:    total objs (to be pre-allocated)
> >>>     * @nr_cpus:    nr_cpu_ids
> >>> - * @capacity:   max objects per cpuslot
> >>> + * @capacity:   max objects on percpu slot
> >>>     * @gfp:        gfp flags for kmalloc & vmalloc
> >>>     * @ref:        refcount for objpool
> >>>     * @flags:      flags for objpool management
> >>>     * @cpu_slots:  array of percpu slots
> >>> - * @slot_sizes:	size in bytes of slots
> >>>     * @release:    resource cleanup callback
> >>>     * @context:    caller-provided context
> >>>     */
> >>> @@ -100,7 +89,6 @@ struct objpool_head {
> >>>    	refcount_t              ref;
> >>>    	unsigned long           flags;
> >>>    	struct objpool_slot   **cpu_slots;
> >>> -	int                    *slot_sizes;
> >>>    	objpool_fini_cb         release;
> >>>    	void                   *context;
> >>>    };
> >>> @@ -108,9 +96,12 @@ struct objpool_head {
> >>>    #define OBJPOOL_FROM_VMALLOC	(0x800000000)	/* objpool allocated from vmalloc area */
> >>>    #define OBJPOOL_HAVE_OBJECTS	(0x400000000)	/* objects allocated along with objpool */
> >>>    
> >>> +#define OBJPOOL_NR_OBJECT_MAX	(1 << 24)
> >>> +#define OBJPOOL_OBJECT_SIZE_MAX	(1 << 16)
> >>> +
> >>>    /**
> >>>     * objpool_init() - initialize objpool and pre-allocated objects
> >>> - * @head:    the object pool to be initialized, declared by caller
> >>> + * @pool:    the object pool to be initialized, declared by caller
> >>>     * @nr_objs: total objects to be pre-allocated by this object pool
> >>>     * @object_size: size of an object (should be > 0)
> >>>     * @gfp:     flags for memory allocation (via kmalloc or vmalloc)
> >>> @@ -128,47 +119,47 @@ struct objpool_head {
> >>>     * pop (object allocation) or do clearance before each push (object
> >>>     * reclamation).
> >>>     */
> >>> -int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
> >>> +int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
> >>>    		 gfp_t gfp, void *context, objpool_init_obj_cb objinit,
> >>>    		 objpool_fini_cb release);
> >>>    
> >>>    /**
> >>>     * objpool_pop() - allocate an object from objpool
> >>> - * @head: object pool
> >>> + * @pool: object pool
> >>>     *
> >>>     * return value: object ptr or NULL if failed
> >>>     */
> >>> -void *objpool_pop(struct objpool_head *head);
> >>> +void *objpool_pop(struct objpool_head *pool);
> >>>    
> >>>    /**
> >>>     * objpool_push() - reclaim the object and return back to objpool
> >>>     * @obj:  object ptr to be pushed to objpool
> >>> - * @head: object pool
> >>> + * @pool: object pool
> >>>     *
> >>>     * return: 0 or error code (it fails only when user tries to push
> >>>     * the same object multiple times or wrong "objects" into objpool)
> >>>     */
> >>> -int objpool_push(void *obj, struct objpool_head *head);
> >>> +int objpool_push(void *obj, struct objpool_head *pool);
> >>>    
> >>>    /**
> >>>     * objpool_drop() - discard the object and deref objpool
> >>>     * @obj:  object ptr to be discarded
> >>> - * @head: object pool
> >>> + * @pool: object pool
> >>>     *
> >>>     * return: 0 if objpool was released or error code
> >>>     */
> >>> -int objpool_drop(void *obj, struct objpool_head *head);
> >>> +int objpool_drop(void *obj, struct objpool_head *pool);
> >>>    
> >>>    /**
> >>>     * objpool_free() - release objpool forcely (all objects to be freed)
> >>> - * @head: object pool to be released
> >>> + * @pool: object pool to be released
> >>>     */
> >>> -void objpool_free(struct objpool_head *head);
> >>> +void objpool_free(struct objpool_head *pool);
> >>>    
> >>>    /**
> >>>     * objpool_fini() - deref object pool (also releasing unused objects)
> >>> - * @head: object pool to be dereferenced
> >>> + * @pool: object pool to be dereferenced
> >>>     */
> >>> -void objpool_fini(struct objpool_head *head);
> >>> +void objpool_fini(struct objpool_head *pool);
> >>>    
> >>>    #endif /* _LINUX_OBJPOOL_H */
> >>> diff --git a/lib/objpool.c b/lib/objpool.c
> >>> index 22e752371820..f8e8f70d7253 100644
> >>> --- a/lib/objpool.c
> >>> +++ b/lib/objpool.c
> >>> @@ -15,104 +15,55 @@
> >>>     * Copyright: wuqiang.matt@bytedance.com
> >>>     */
> >>>    
> >>> -#define SLOT_AGES(s) ((uint32_t *)((char *)(s) + sizeof(struct objpool_slot)))
> >>> -#define SLOT_ENTS(s) ((void **)((char *)(s) + sizeof(struct objpool_slot) + \
> >>> -			(sizeof(uint32_t) << (s)->bits)))
> >>> -#define SLOT_OBJS(s) ((void *)((char *)(s) + sizeof(struct objpool_slot) + \
> >>> -			((sizeof(uint32_t) + sizeof(void *)) << (s)->bits)))
> >>> -#define SLOT_CORE(n) cpumask_nth((n) % num_possible_cpus(), cpu_possible_mask)
> >>> -
> >>> -/* compute the suitable num of objects to be managed per slot */
> >>> -static int objpool_nobjs(int size)
> >>> -{
> >>> -	return rounddown_pow_of_two((size - sizeof(struct objpool_slot)) /
> >>> -			(sizeof(uint32_t) + sizeof(void *)));
> >>> -}
> >>> -
> >>>    /* allocate and initialize percpu slots */
> >>>    static int
> >>> -objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
> >>> -			void *context, objpool_init_obj_cb objinit)
> >>> +objpool_init_percpu_slots(struct objpool_head *pool, void *context,
> >>> +			  objpool_init_obj_cb objinit)
> >>>    {
> >>> -	int i, j, n, size, objsz, cpu = 0, nents = head->capacity;
> >>> -
> >>> -	/* aligned object size by sizeof(void *) */
> >>> -	objsz = ALIGN(head->obj_size, sizeof(void *));
> >>> -	/* shall we allocate objects along with percpu-slot */
> >>> -	if (objsz)
> >>> -		head->flags |= OBJPOOL_HAVE_OBJECTS;
> >>> -
> >>> -	/* vmalloc is used in default to allocate percpu-slots */
> >>> -	if (!(head->gfp & GFP_ATOMIC))
> >>> -		head->flags |= OBJPOOL_FROM_VMALLOC;
> >>> -
> >>> -	for (i = 0; i < head->nr_cpus; i++) {
> >>> -		struct objpool_slot *os;
> >>> +	int i, j, n, size, slot_size, cpu_count = 0;
> >>> +	struct objpool_slot *slot;
> >>>    
> >>> +	for (i = 0; i < pool->nr_cpus; i++) {
> >>>    		/* skip the cpus which could never be present */
> >>>    		if (!cpu_possible(i))
> >>>    			continue;
> >>>    
> >>>    		/* compute how many objects to be managed by this slot */
> >>> -		n = nobjs / num_possible_cpus();
> >>> -		if (cpu < (nobjs % num_possible_cpus()))
> >>> +		n = pool->nr_objs / num_possible_cpus();
> >>> +		if (cpu_count < (pool->nr_objs % num_possible_cpus()))
> >>>    			n++;
> >>> -		size = sizeof(struct objpool_slot) + sizeof(void *) * nents +
> >>> -		       sizeof(uint32_t) * nents + objsz * n;
> >>> +		cpu_count++;
> >>> +
> >>> +		slot_size = struct_size(slot, entries, pool->capacity);
> >>> +		size = slot_size + pool->obj_size * n;
> >>>    
> >>>    		/*
> >>>    		 * here we allocate percpu-slot & objects together in a single
> >>> -		 * allocation, taking advantage of warm caches and TLB hits as
> >>> -		 * vmalloc always aligns the request size to pages
> >>> +		 * allocation, taking advantage on NUMA.
> >>>    		 */
> >>> -		if (head->flags & OBJPOOL_FROM_VMALLOC)
> >>> -			os = __vmalloc_node(size, sizeof(void *), head->gfp,
> >>> +		if (pool->flags & OBJPOOL_FROM_VMALLOC)
> >>> +			slot = __vmalloc_node(size, sizeof(void *), pool->gfp,
> >>>    				cpu_to_node(i), __builtin_return_address(0));
> >>>    		else
> >>> -			os = kmalloc_node(size, head->gfp, cpu_to_node(i));
> >>> -		if (!os)
> >>> +			slot = kmalloc_node(size, pool->gfp, cpu_to_node(i));
> >>> +		if (!slot)
> >>>    			return -ENOMEM;
> >>>    
> >>>    		/* initialize percpu slot for the i-th slot */
> >>> -		memset(os, 0, size);
> >>> -		os->bits = ilog2(head->capacity);
> >>> -		os->mask = head->capacity - 1;
> >>> -		head->cpu_slots[i] = os;
> >>> -		head->slot_sizes[i] = size;
> >>> -		cpu = cpu + 1;
> >>> -
> >>> -		/*
> >>> -		 * manually set head & tail to avoid possible conflict:
> >>> -		 * We assume that the head item is ready for retrieval
> >>> -		 * iff head is equal to ages[head & mask]. but ages is
> >>> -		 * initialized as 0, so in view of the caller of pop(),
> >>> -		 * the 1st item (0th) is always ready, but the reality
> >>> -		 * could be: push() is stalled before the final update,
> >>> -		 * thus the item being inserted will be lost forever
> >>> -		 */
> >>> -		os->head = os->tail = head->capacity;
> >>> -
> >>> -		if (!objsz)
> >>> -			continue;
> >>> +		memset(slot, 0, size);
> >>> +		slot->mask = pool->capacity - 1;
> >>> +		pool->cpu_slots[i] = slot;
> >>>    
> >>>    		for (j = 0; j < n; j++) {
> >>> -			uint32_t *ages = SLOT_AGES(os);
> >>> -			void **ents = SLOT_ENTS(os);
> >>> -			void *obj = SLOT_OBJS(os) + j * objsz;
> >>> -			uint32_t ie = os->tail & os->mask;
> >>> +			void *obj = (void *)slot + slot_size + pool->obj_size * j;
> >>>    
> >>> -			/* perform object initialization */
> >>>    			if (objinit) {
> >>>    				int rc = objinit(obj, context);
> >>>    				if (rc)
> >>>    					return rc;
> >>>    			}
> >>> -
> >>> -			/* add obj into the ring array */
> >>> -			ents[ie] = obj;
> >>> -			ages[ie] = os->tail;
> >>> -			os->tail++;
> >>> -			head->nr_objs++;
> >>> +			slot->entries[j] = obj;
> >>> +			slot->tail++;
> >>>    		}
> >>>    	}
> >>>    
> >>> @@ -120,164 +71,135 @@ objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
> >>>    }
> >>>    
> >>>    /* cleanup all percpu slots of the object pool */
> >>> -static void objpool_fini_percpu_slots(struct objpool_head *head)
> >>> +static void objpool_fini_percpu_slots(struct objpool_head *pool)
> >>>    {
> >>>    	int i;
> >>>    
> >>> -	if (!head->cpu_slots)
> >>> +	if (!pool->cpu_slots)
> >>>    		return;
> >>>    
> >>> -	for (i = 0; i < head->nr_cpus; i++) {
> >>> -		if (!head->cpu_slots[i])
> >>> -			continue;
> >>> -		if (head->flags & OBJPOOL_FROM_VMALLOC)
> >>> -			vfree(head->cpu_slots[i]);
> >>> -		else
> >>> -			kfree(head->cpu_slots[i]);
> >>> -	}
> >>> -	kfree(head->cpu_slots);
> >>> -	head->cpu_slots = NULL;
> >>> -	head->slot_sizes = NULL;
> >>> +	for (i = 0; i < pool->nr_cpus; i++)
> >>> +		kvfree(pool->cpu_slots[i]);
> >>> +	kfree(pool->cpu_slots);
> >>>    }
> >>>    
> >>>    /* initialize object pool and pre-allocate objects */
> >>> -int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
> >>> +int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
> >>>    		gfp_t gfp, void *context, objpool_init_obj_cb objinit,
> >>>    		objpool_fini_cb release)
> >>>    {
> >>>    	int nents, rc;
> >>>    
> >>>    	/* check input parameters */
> >>> -	if (nr_objs <= 0 || object_size <= 0)
> >>> +	if (nr_objs <= 0 || object_size <= 0 ||
> >>> +	    nr_objs > OBJPOOL_NR_OBJECT_MAX || object_size > OBJPOOL_OBJECT_SIZE_MAX)
> >>> +		return -EINVAL;
> >>> +
> >>> +	/* Align up to unsigned long size */
> >>> +	object_size = ALIGN(object_size, sizeof(unsigned long));
> >>> +
> >>> +	/*
> >>> +	 * To avoid filling up the entries array in the per-cpu slot,
> >>> +	 * use the power of 2 which is more than N + 1. Thus, the tail never
> >>> +	 * catch up to the pool, and able to use pool/tail as the sequencial
> >>> +	 * number.
> >>> +	 */
> >>> +	nents = roundup_pow_of_two(nr_objs + 1);
> >>> +	if (!nents)
> >>>    		return -EINVAL;
> >>>    
> >>> -	/* calculate percpu slot size (rounded to pow of 2) */
> >>> -	nents = max_t(int, roundup_pow_of_two(nr_objs),
> >>> -			objpool_nobjs(L1_CACHE_BYTES));
> >>> -
> >>> -	/* initialize objpool head */
> >>> -	memset(head, 0, sizeof(struct objpool_head));
> >>> -	head->nr_cpus = nr_cpu_ids;
> >>> -	head->obj_size = object_size;
> >>> -	head->capacity = nents;
> >>> -	head->gfp = gfp & ~__GFP_ZERO;
> >>> -	head->context = context;
> >>> -	head->release = release;
> >>> -
> >>> -	/* allocate array for percpu slots */
> >>> -	head->cpu_slots = kzalloc(head->nr_cpus * sizeof(void *) +
> >>> -			       head->nr_cpus * sizeof(int), head->gfp);
> >>> -	if (!head->cpu_slots)
> >>> +	/* initialize objpool pool */
> >>> +	memset(pool, 0, sizeof(struct objpool_head));
> >>> +	pool->nr_cpus = nr_cpu_ids;
> >>> +	pool->obj_size = object_size;
> >>> +	pool->nr_objs = nr_objs;
> >>> +	/* just prevent to fullfill the per-cpu ring array */
> >>> +	pool->capacity = nents;
> >>> +	pool->gfp = gfp & ~__GFP_ZERO;
> >>> +	pool->context = context;
> >>> +	pool->release = release;
> >>> +	/* vmalloc is used in default to allocate percpu-slots */
> >>> +	if (!(pool->gfp & GFP_ATOMIC))
> >>> +		pool->flags |= OBJPOOL_FROM_VMALLOC;
> >>> +
> >>> +	pool->cpu_slots = kzalloc(pool->nr_cpus * sizeof(void *), pool->gfp);
> >>> +	if (!pool->cpu_slots)
> >>>    		return -ENOMEM;
> >>> -	head->slot_sizes = (int *)&head->cpu_slots[head->nr_cpus];
> >>>    
> >>>    	/* initialize per-cpu slots */
> >>> -	rc = objpool_init_percpu_slots(head, nr_objs, context, objinit);
> >>> +	rc = objpool_init_percpu_slots(pool, context, objinit);
> >>>    	if (rc)
> >>> -		objpool_fini_percpu_slots(head);
> >>> +		objpool_fini_percpu_slots(pool);
> >>>    	else
> >>> -		refcount_set(&head->ref, nr_objs + 1);
> >>> +		refcount_set(&pool->ref, nr_objs + 1);
> >>>    
> >>>    	return rc;
> >>>    }
> >>>    EXPORT_SYMBOL_GPL(objpool_init);
> >>>    
> >>>    /* adding object to slot, abort if the slot was already full */
> >>> -static inline int objpool_try_add_slot(void *obj, struct objpool_slot *os)
> >>> +static inline int objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu)
> >>>    {
> >>> -	uint32_t *ages = SLOT_AGES(os);
> >>> -	void **ents = SLOT_ENTS(os);
> >>> -	uint32_t head, tail;
> >>> +	struct objpool_slot *slot = pool->cpu_slots[cpu];
> >>> +	uint32_t tail, next;
> >>>    
> >>>    	do {
> >>> -		/* perform memory loading for both head and tail */
> >>> -		head = READ_ONCE(os->head);
> >>> -		tail = READ_ONCE(os->tail);
> >>> -		/* just abort if slot is full */
> >>> -		if (tail - head > os->mask)
> >>> -			return -ENOENT;
> >>> -		/* try to extend tail by 1 using CAS to avoid races */
> >>> -		if (try_cmpxchg_acquire(&os->tail, &tail, tail + 1))
> >>> -			break;
> >>> -	} while (1);
> >>> +		uint32_t head = READ_ONCE(slot->head);
> >>>    
> >>> -	/* the tail-th of slot is reserved for the given obj */
> >>> -	WRITE_ONCE(ents[tail & os->mask], obj);
> >>> -	/* update epoch id to make this object available for pop() */
> >>> -	smp_store_release(&ages[tail & os->mask], tail);
> >>> +		tail = READ_ONCE(slot->tail);
> >>> +		next = tail + 1;
> >>> +
> >>> +		/* This must never happen because capacity >= N + 1 */
> >>> +		if (WARN_ON_ONCE((next > head && next - head > pool->nr_objs) ||
> >>> +				 (next < head && next > head + pool->nr_objs)))
> >>> +			return -EINVAL;
> >>> +
> >>> +	} while (!try_cmpxchg_acquire(&slot->tail, &tail, next));
> >>> +
> >>> +	WRITE_ONCE(slot->entries[tail & slot->mask], obj);
> >>>    	return 0;
> >>>    }
> >>>    
> >>>    /* reclaim an object to object pool */
> >>> -int objpool_push(void *obj, struct objpool_head *oh)
> >>> +int objpool_push(void *obj, struct objpool_head *pool)
> >>>    {
> >>>    	unsigned long flags;
> >>> -	int cpu, rc;
> >>> +	int rc;
> >>>    
> >>>    	/* disable local irq to avoid preemption & interruption */
> >>>    	raw_local_irq_save(flags);
> >>> -	cpu = raw_smp_processor_id();
> >>> -	do {
> >>> -		rc = objpool_try_add_slot(obj, oh->cpu_slots[cpu]);
> >>> -		if (!rc)
> >>> -			break;
> >>> -		cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
> >>> -	} while (1);
> >>> +	rc = objpool_try_add_slot(obj, pool, raw_smp_processor_id());
> >>>    	raw_local_irq_restore(flags);
> >>>    
> >>>    	return rc;
> >>>    }
> >>>    EXPORT_SYMBOL_GPL(objpool_push);
> >>>    
> >>> -/* drop the allocated object, rather reclaim it to objpool */
> >>> -int objpool_drop(void *obj, struct objpool_head *head)
> >>> -{
> >>> -	if (!obj || !head)
> >>> -		return -EINVAL;
> >>> -
> >>> -	if (refcount_dec_and_test(&head->ref)) {
> >>> -		objpool_free(head);
> >>> -		return 0;
> >>> -	}
> >>> -
> >>> -	return -EAGAIN;
> >>> -}
> >>> -EXPORT_SYMBOL_GPL(objpool_drop);
> >>> -
> >>>    /* try to retrieve object from slot */
> >>> -static inline void *objpool_try_get_slot(struct objpool_slot *os)
> >>> +static inline void *objpool_try_get_slot(struct objpool_slot *slot)
> >>>    {
> >>> -	uint32_t *ages = SLOT_AGES(os);
> >>> -	void **ents = SLOT_ENTS(os);
> >>>    	/* do memory load of head to local head */
> >>> -	uint32_t head = smp_load_acquire(&os->head);
> >>> +	uint32_t head = smp_load_acquire(&slot->head);
> >>>    
> >>>    	/* loop if slot isn't empty */
> >>> -	while (head != READ_ONCE(os->tail)) {
> >>> -		uint32_t id = head & os->mask, prev = head;
> >>> +	while (head != READ_ONCE(slot->tail)) {
> >>>    
> >>>    		/* do prefetching of object ents */
> >>> -		prefetch(&ents[id]);
> >>> -
> >>> -		/* check whether this item was ready for retrieval */
> >>> -		if (smp_load_acquire(&ages[id]) == head) {
> >>> -			/* node must have been udpated by push() */
> >>> -			void *node = READ_ONCE(ents[id]);
> >>> -			/* commit and move forward head of the slot */
> >>> -			if (try_cmpxchg_release(&os->head, &head, head + 1))
> >>> -				return node;
> >>> -			/* head was already updated by others */
> >>> -		}
> >>> +		prefetch(&slot->entries[head & slot->mask]);
> >>> +
> >>> +		/* commit and move forward head of the slot */
> >>> +		if (try_cmpxchg_release(&slot->head, &head, head + 1))
> >>> +			/*
> >>> +			 * TBD: check overwrap the tail/head counter and warn
> >>> +			 * if it is broken. But this happens only if this
> >>> +			 * process slows down a lot and another CPU updates
> >>> +			 * the haed/tail just 2^32 + 1 times, and this slot
> >>> +			 * is empty.
> >>> +			 */
> >>> +			return slot->entries[head & slot->mask];
> >>>    
> >>>    		/* re-load head from memory and continue trying */
> >>> -		head = READ_ONCE(os->head);
> >>> -		/*
> >>> -		 * head stays unchanged, so it's very likely there's an
> >>> -		 * ongoing push() on other cpu nodes but yet not update
> >>> -		 * ages[] to mark it's completion
> >>> -		 */
> >>> -		if (head == prev)
> >>> -			break;
> >>> +		head = READ_ONCE(slot->head);
> >>>    	}
> >>>    
> >>>    	return NULL;
> >>> @@ -307,32 +229,42 @@ void *objpool_pop(struct objpool_head *head)
> >>>    EXPORT_SYMBOL_GPL(objpool_pop);
> >>>    
> >>>    /* release whole objpool forcely */
> >>> -void objpool_free(struct objpool_head *head)
> >>> +void objpool_free(struct objpool_head *pool)
> >>>    {
> >>> -	if (!head->cpu_slots)
> >>> +	if (!pool->cpu_slots)
> >>>    		return;
> >>>    
> >>>    	/* release percpu slots */
> >>> -	objpool_fini_percpu_slots(head);
> >>> +	objpool_fini_percpu_slots(pool);
> >>>    
> >>>    	/* call user's cleanup callback if provided */
> >>> -	if (head->release)
> >>> -		head->release(head, head->context);
> >>> +	if (pool->release)
> >>> +		pool->release(pool, pool->context);
> >>>    }
> >>>    EXPORT_SYMBOL_GPL(objpool_free);
> >>>    
> >>> -/* drop unused objects and defref objpool for releasing */
> >>> -void objpool_fini(struct objpool_head *head)
> >>> +/* drop the allocated object, rather reclaim it to objpool */
> >>> +int objpool_drop(void *obj, struct objpool_head *pool)
> >>>    {
> >>> -	void *nod;
> >>> +	if (!obj || !pool)
> >>> +		return -EINVAL;
> >>>    
> >>> -	do {
> >>> -		/* grab object from objpool and drop it */
> >>> -		nod = objpool_pop(head);
> >>> +	if (refcount_dec_and_test(&pool->ref)) {
> >>> +		objpool_free(pool);
> >>> +		return 0;
> >>> +	}
> >>> +
> >>> +	return -EAGAIN;
> >>> +}
> >>> +EXPORT_SYMBOL_GPL(objpool_drop);
> >>> +
> >>> +/* drop unused objects and defref objpool for releasing */
> >>> +void objpool_fini(struct objpool_head *pool)
> >>> +{
> >>> +	void *obj;
> >>>    
> >>> -		/* drop the extra ref of objpool */
> >>> -		if (refcount_dec_and_test(&head->ref))
> >>> -			objpool_free(head);
> >>> -	} while (nod);
> >>> +	/* grab object from objpool and drop it */
> >>> +	while ((obj = objpool_pop(pool)))
> >>> +		objpool_drop(obj, pool);
> >>>    }
> >>>    EXPORT_SYMBOL_GPL(objpool_fini);
> >>
> > 
> > 
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v9 1/5] lib: objpool added: ring-array based lockless MPMC
  2023-10-13  1:59           ` Masami Hiramatsu
@ 2023-10-13  3:03             ` wuqiang.matt
  0 siblings, 0 replies; 25+ messages in thread
From: wuqiang.matt @ 2023-10-13  3:03 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: linux-trace-kernel, davem, anil.s.keshavamurthy, naveen.n.rao,
	rostedt, peterz, akpm, sander, ebiggers, dan.j.williams,
	jpoimboe, linux-kernel, lkp, mattwu

On 2023/10/13 09:59, Masami Hiramatsu (Google) wrote:
> On Fri, 13 Oct 2023 01:36:05 +0800
> "wuqiang.matt" <wuqiang.matt@bytedance.com> wrote:
> 
>> On 2023/10/12 22:02, Masami Hiramatsu (Google) wrote:
>>> Hi Wuqiang,
>>>
>>> On Mon, 9 Oct 2023 17:23:34 +0800
>>> wuqiang <wuqiang.matt@bytedance.com> wrote:
>>>
>>>> Hello Masami,
>>>>
>>>> I just got time to look at the new patch and noticed that ages[] was removed.
>>>> ages[] was introduced as a 2-phase-commit-like mechanism to keep consistency
>>>> and must be kept.
>>>>
>>>> Consider the following 2 cases where two cpu nodes are operating on the same
>>>> objpool_slot simultaneously:
>>>>
>>>> Case 1:
>>>>
>>>>      NODE 1:                  NODE 2:
>>>>      push to an empty slot    pop will get wrong value
>>>>
>>>>      try_cmpxchg_acquire(&slot->tail, &tail, next)
>>>>                               try_cmpxchg_release(&slot->head, &head, head + 1)
>>>>                               return slot->entries[head & slot->mask]
>>>>      WRITE_ONCE(slot->entries[tail & slot->mask], obj)
>>>
>>> Today, I rethink the algorithm.
>>>
>>> For this case, we can just add a `commit` field to the slot for committing the
>>> tail update instead of the ages[].
>>>
>>> CPU1                                       CPU2
>>> push to an empty slot                      pop from the same slot
>>>
>>> commit = tail = slot->tail;
>>> next = tail + 1;
>>> try_cmpxchg_acquire(&slot->tail,
>>>                       &tail, next);
>>>                                              while (head != commit) -> NG1
>>> WRITE_ONCE(slot->entries[tail & slot->mask],
>>>               obj)
>>>                                              while (head != commit) -> NG2
>>> WRITE_ONCE(&slot->commit, next);
>>>                                              while (head != commit) -> OK
>>>                                              try_cmpxchg_acquire(&slot->head,
>>>                                                                  &head, next);
>>>                                              return slot->entries[head & slot->mask]
>>>
>>> So at the NG1 and NG2 timings, the pop will fail.
>>>
>>> This does not cover the nested "push" case because the reserve-commit block
>>> runs with interrupts disabled. This doesn't support NMI, but that is OK.
>>
>> If 2 pushes are performed in a row, slot->commit will be 'tail + 2', so CPU2
>> won't meet the condition "head == commit".
> 
> No, that doesn't happen, because pop() may fetch the object from another CPU's
> slot, but push() only happens on its own slot. So push() on the same slot never
> runs in parallel.

I got it: 'commit' is now the actual tail. Then "head != commit" means there
are free objects ready for pop.

Great idea indeed. I'll give it a try.
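
To make sure I read the commit scheme correctly, here is a minimal userspace
sketch of the simple (irq-disabled, non-nested) variant. It only models the
head/tail/commit bookkeeping in a single thread and is not the kernel code;
the names (model_slot, model_push, model_pop) and the tiny capacity are mine:

/*
 * Single-threaded model of the commit-based slot: push reserves a tail
 * position, stores the object, then publishes it via 'commit'; pop only
 * consumes entries below 'commit'.  Build: gcc -std=c11 -O2 model.c
 */
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define CAP 8	/* power of 2 and > number of objects, so tail never laps head */

struct model_slot {
	_Atomic uint32_t head;		/* next index to pop from */
	_Atomic uint32_t tail;		/* next index reserved by push */
	_Atomic uint32_t commit;	/* published tail; pop only trusts this */
	void *entries[CAP];
};

/* push: reserve a tail position, store the object, then publish via commit */
static void model_push(struct model_slot *s, void *obj)
{
	uint32_t tail = atomic_load_explicit(&s->tail, memory_order_relaxed);

	while (!atomic_compare_exchange_weak_explicit(&s->tail, &tail, tail + 1,
			memory_order_acquire, memory_order_relaxed))
		;	/* 'tail' is refreshed by the failed CAS */

	s->entries[tail & (CAP - 1)] = obj;
	/* publish: pop() may now consume everything below tail + 1 */
	atomic_store_explicit(&s->commit, tail + 1, memory_order_release);
}

/* pop: advance head only while it is behind the committed tail */
static void *model_pop(struct model_slot *s)
{
	uint32_t head = atomic_load_explicit(&s->head, memory_order_acquire);

	while (head != atomic_load_explicit(&s->commit, memory_order_acquire)) {
		if (atomic_compare_exchange_weak_explicit(&s->head, &head, head + 1,
				memory_order_release, memory_order_relaxed))
			return s->entries[head & (CAP - 1)];
		/* 'head' is refreshed by the failed CAS; retry */
	}
	return NULL;	/* slot empty, or the owner's push is not committed yet */
}

int main(void)
{
	static struct model_slot slot;
	int objs[3];

	assert(!model_pop(&slot));			/* empty at start */
	for (int i = 0; i < 3; i++)
		model_push(&slot, &objs[i]);
	for (int i = 0; i < 3; i++)
		assert(model_pop(&slot) == &objs[i]);	/* FIFO within one slot */
	assert(!model_pop(&slot));
	printf("commit model: OK\n");
	return 0;
}

If this matches your intent, the same commit bookkeeping should map directly
onto objpool_try_add_slot()/objpool_try_get_slot().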

>> Since slot->commit is always synced to slot->tail after a successful push,
>> should pop check "tail != commit"?
> 
> No, pop() only needs to check "head != commit". If "head == commit", this slot
> is empty or the slot owner is in the middle of a push() (not committed yet).
> 
>> In this case, when a push is ongoing on
>> the slot, pop() has to wait for its completion even if there are objects in
>> the same slot.
> 
> No, pop() will move on to other slots to find an available object.
> 
>>
>>> If we need to support NMI or remove the irq-disable, we can do the following
>>> (please remember that the push operation only happens on the slot owned by
>>>    that CPU, so this is a per-cpu process)
>>>
>>> ---
>>> do {
>>>     commit = tail = slot->tail;
>>>     next = tail + 1;
>>> } while (!try_cmpxchg_acquire(&slot->tail, &tail, next));
>>>
>>> WRITE_ONCE(slot->entries[tail & slot->mask], obj);
>>>
>>> // At this point, "commit == slot->tail - 1" in this nest level.
>>> // Only in the outermost (non-nested) case, "commit == slot->commit"
>>> if (commit != slot->commit)
>>>     return; // do nothing; it will be updated by the outermost one.
>>>
>>> // catch up commit to tail.
>>> do {
>>>     commit = READ_ONCE(slot->tail);
>>>     WRITE_ONCE(slot->commit, commit);
>>>      // note that someone can interrupt this loop too.
>>> } while (commit != READ_ONCE(slot->tail));
>>> ---
>>
>> Yes, I see: push can only happen on the local node, and only the outermost
>> attempt has the right to extend 'commit'. It does resolve nested push attempts,
>> and preemption must be disabled.
> 
> Yes.
> 
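For the nest-safe variant, I also put together a tiny single-threaded model of
the catch-up logic, faking the "interrupt" by running the two pushes' steps out
of order (again the names are mine, not the patch code):

/*
 * reserve_fill() models the reserve + store phase, catch_up() models the
 * commit catch-up; only the outermost push (snapshot == slot->commit)
 * publishes.  Build: gcc -std=c11 -O2 nested.c
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define CAP 8			/* power of 2, larger than the object count */

struct nest_slot {
	uint32_t head;
	uint32_t tail;
	uint32_t commit;
	void *entries[CAP];
};

/* phase 1: reserve one tail position and store the object there */
static uint32_t reserve_fill(struct nest_slot *s, void *obj)
{
	uint32_t tail = s->tail++;	/* stands in for try_cmpxchg_acquire on tail */

	s->entries[tail & (CAP - 1)] = obj;
	return tail;			/* the 'commit' snapshot of this nest level */
}

/* phase 2: only the outermost push (snapshot == slot->commit) publishes */
static void catch_up(struct nest_slot *s, uint32_t snapshot)
{
	if (snapshot != s->commit)
		return;			/* nested push: leave it to the outermost one */

	do {
		snapshot = s->tail;
		s->commit = snapshot;	/* publish everything reserved so far */
	} while (snapshot != s->tail);	/* re-check in case another push slipped in */
}

static void *nest_pop(struct nest_slot *s)
{
	if (s->head == s->commit)	/* nothing committed yet */
		return NULL;
	return s->entries[s->head++ & (CAP - 1)];
}

int main(void)
{
	static struct nest_slot slot;
	int outer, nested;
	uint32_t snap_outer, snap_nested;

	/* the outer push reserves its position ... */
	snap_outer = reserve_fill(&slot, &outer);
	/* ... gets "interrupted" by a nested push that runs to completion ... */
	snap_nested = reserve_fill(&slot, &nested);
	catch_up(&slot, snap_nested);		/* skipped: snapshot != commit */
	assert(!nest_pop(&slot));		/* nothing is visible yet */
	/* ... then the outer push resumes and publishes both entries */
	catch_up(&slot, snap_outer);
	assert(nest_pop(&slot) == &outer);
	assert(nest_pop(&slot) == &nested);
	printf("nested commit model: OK\n");
	return 0;
}
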
>>
>> The overflow/ABA issue can still happen if the cpu is interrupted for a
>> long time.
> 
> Yes, but ages[] cannot avoid that either. (Note that head, tail and commit are
> 32-bit counters, and the per-cpu slot ring is big enough to store the pointers
> of all objects.)
> 
> Thank you,
> 
>>
>>> For the rethook this may be too much.
>>>
>>> Thank you,
>>>
>>>>
>>>>
>>>> Case 2:
>>>>
>>>>      NODE 1:                  NODE 2
>>>>      push to slot w/ 1 obj    pop will get wrong value
>>>>
>>>>                               try_cmpxchg_release(&slot->head, &head, head + 1)
>>>>      try_cmpxchg_acquire(&slot->tail, &tail, next)
>>>>      WRITE_ONCE(slot->entries[tail & slot->mask], obj)
>>>>                               return slot->entries[head & slot->mask]
>>
>> The pre-condition should be: CPU 1 tries to push to a full slot; in this case
>> tail = head + capacity, but tail & mask == head & mask.
>>
>>>>
>>>> Regards,
>>>> wuqiang
>>>>
>>>> On 2023/9/25 17:42, Masami Hiramatsu (Google) wrote:
>>>>> Hi Wuqiang,
>>>>>
>>>>> On Tue,  5 Sep 2023 09:52:51 +0800
>>>>> "wuqiang.matt" <wuqiang.matt@bytedance.com> wrote:
>>>>>
>>>>>> The object pool is a scalable implementation of a high-performance queue
>>>>>> for object allocation and reclamation, such as kretprobe instances.
>>>>>>
>>>>>> By leveraging a percpu ring-array to mitigate the hot spot of memory
>>>>>> contention, it can deliver near-linear scalability for highly parallel
>>>>>> scenarios. The objpool is best suited for the following cases:
>>>>>> 1) Memory allocation or reclamation is prohibited or too expensive
>>>>>> 2) Consumers are of different priorities, such as irqs and threads
>>>>>>
>>>>>> Limitations:
>>>>>> 1) The maximum number of objects (capacity) is determined during pool
>>>>>>       initialization and can't be modified (extended) after objpool creation
>>>>>> 2) The memory of objects won't be freed until objpool is finalized
>>>>>> 3) Object allocation (pop) may fail after trying all cpu slots
>>>>>
>>>>> I made a simplifying patch on this by (mainly) removing the ages array.
>>>>> I also renamed local variables to use more readable names, like slot,
>>>>> pool, and obj.
>>>>>
>>>>> Here are the results from running test_objpool.ko.
>>>>>
>>>>> Original:
>>>>> [   50.500235] Summary of testcases:
>>>>> [   50.503139]     duration: 1027135us 	hits:   30628293 	miss:          0 	sync: percpu objpool
>>>>> [   50.510416]     duration: 1047977us 	hits:   30453023 	miss:          0 	sync: percpu objpool from vmalloc
>>>>> [   50.517421]     duration: 1047098us 	hits:   31145034 	miss:          0 	sync & hrtimer: percpu objpool
>>>>> [   50.524671]     duration: 1053233us 	hits:   30919640 	miss:          0 	sync & hrtimer: percpu objpool from vmalloc
>>>>> [   50.531382]     duration: 1055822us 	hits:    3407221 	miss:     830219 	sync overrun: percpu objpool
>>>>> [   50.538135]     duration: 1055998us 	hits:    3404624 	miss:     854160 	sync overrun: percpu objpool from vmalloc
>>>>> [   50.546686]     duration: 1046104us 	hits:   19464798 	miss:          0 	async: percpu objpool
>>>>> [   50.552989]     duration: 1033054us 	hits:   18957836 	miss:          0 	async: percpu objpool from vmalloc
>>>>> [   50.560289]     duration: 1041778us 	hits:   33806868 	miss:          0 	async & hrtimer: percpu objpool
>>>>> [   50.567425]     duration: 1048901us 	hits:   34211860 	miss:          0 	async & hrtimer: percpu objpool from vmalloc
>>>>>
>>>>> Simplified:
>>>>> [   48.393236] Summary of testcases:
>>>>> [   48.395384]     duration: 1013002us 	hits:   29661448 	miss:          0 	sync: percpu objpool
>>>>> [   48.400351]     duration: 1057231us 	hits:   31035578 	miss:          0 	sync: percpu objpool from vmalloc
>>>>> [   48.405660]     duration: 1043196us 	hits:   30546652 	miss:          0 	sync & hrtimer: percpu objpool
>>>>> [   48.411216]     duration: 1047128us 	hits:   30601306 	miss:          0 	sync & hrtimer: percpu objpool from vmalloc
>>>>> [   48.417231]     duration: 1051695us 	hits:    3468287 	miss:     892881 	sync overrun: percpu objpool
>>>>> [   48.422405]     duration: 1054003us 	hits:    3549491 	miss:     898141 	sync overrun: percpu objpool from vmalloc
>>>>> [   48.428425]     duration: 1052946us 	hits:   19005228 	miss:          0 	async: percpu objpool
>>>>> [   48.433597]     duration: 1051757us 	hits:   19670866 	miss:          0 	async: percpu objpool from vmalloc
>>>>> [   48.439280]     duration: 1042951us 	hits:   37135332 	miss:          0 	async & hrtimer: percpu objpool
>>>>> [   48.445085]     duration: 1029803us 	hits:   37093248 	miss:          0 	async & hrtimer: percpu objpool from vmalloc
>>>>>
>>>>> Can you test it too?
>>>>>
>>>>> Thanks,
>>>>>
>>>>>    From f1f442ff653e329839e5452b8b88463a80a12ff3 Mon Sep 17 00:00:00 2001
>>>>> From: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
>>>>> Date: Mon, 25 Sep 2023 16:07:12 +0900
>>>>> Subject: [PATCH] objpool: Simplify objpool by removing ages array
>>>>>
>>>>> Simplify the objpool code by removing the ages array from the per-cpu slot.
>>>>> It chooses a big enough capacity (nr_objects + 1 rounded up to a power of 2)
>>>>> for the entries array, so the tail never catches up to the head in a per-cpu
>>>>> slot. Thus tail == head means the slot is empty.
>>>>>
>>>>> This also uses consistent local variable names for pool, slot and obj.
>>>>>
>>>>> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
>>>>> ---
>>>>>     include/linux/objpool.h |  61 ++++----
>>>>>     lib/objpool.c           | 310 ++++++++++++++++------------------------
>>>>>     2 files changed, 147 insertions(+), 224 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/objpool.h b/include/linux/objpool.h
>>>>> index 33c832216b98..ecd5ecaffcd3 100644
>>>>> --- a/include/linux/objpool.h
>>>>> +++ b/include/linux/objpool.h
>>>>> @@ -38,33 +38,23 @@
>>>>>      * struct objpool_slot - percpu ring array of objpool
>>>>>      * @head: head of the local ring array (to retrieve at)
>>>>>      * @tail: tail of the local ring array (to append at)
>>>>> - * @bits: log2 of capacity (for bitwise operations)
>>>>> - * @mask: capacity - 1
>>>>> + * @mask: capacity of entries - 1
>>>>> + * @entries: object entries on this slot.
>>>>>      *
>>>>>      * Represents a cpu-local array-based ring buffer, its size is specialized
>>>>>      * during initialization of object pool. The percpu objpool slot is to be
>>>>>      * allocated from local memory for NUMA system, and to be kept compact in
>>>>> - * continuous memory: ages[] is stored just after the body of objpool_slot,
>>>>> - * and then entries[]. ages[] describes revision of each item, solely used
>>>>> - * to avoid ABA; entries[] contains pointers of the actual objects
>>>>> - *
>>>>> - * Layout of objpool_slot in memory:
>>>>> - *
>>>>> - * 64bit:
>>>>> - *        4      8      12     16        32                 64
>>>>> - * | head | tail | bits | mask | ages[4] | ents[4]: (8 * 4) | objects
>>>>> - *
>>>>> - * 32bit:
>>>>> - *        4      8      12     16        32        48       64
>>>>> - * | head | tail | bits | mask | ages[4] | ents[4] | unused | objects
>>>>> + * contiguous memory: the objects assigned to this CPU are stored just
>>>>> + * after the body of objpool_slot.
>>>>>      *
>>>>>      */
>>>>>     struct objpool_slot {
>>>>> -	uint32_t                head;
>>>>> -	uint32_t                tail;
>>>>> -	uint32_t                bits;
>>>>> -	uint32_t                mask;
>>>>> -} __packed;
>>>>> +	uint32_t	head;
>>>>> +	uint32_t	tail;
>>>>> +	uint32_t	mask;
>>>>> +	uint32_t	dummyr;
>>>>> +	void *		entries[];
>>>>> +};
>>>>>     
>>>>>     struct objpool_head;
>>>>>     
>>>>> @@ -82,12 +72,11 @@ typedef int (*objpool_fini_cb)(struct objpool_head *head, void *context);
>>>>>      * @obj_size:   object & element size
>>>>>      * @nr_objs:    total objs (to be pre-allocated)
>>>>>      * @nr_cpus:    nr_cpu_ids
>>>>> - * @capacity:   max objects per cpuslot
>>>>> + * @capacity:   max objects on percpu slot
>>>>>      * @gfp:        gfp flags for kmalloc & vmalloc
>>>>>      * @ref:        refcount for objpool
>>>>>      * @flags:      flags for objpool management
>>>>>      * @cpu_slots:  array of percpu slots
>>>>> - * @slot_sizes:	size in bytes of slots
>>>>>      * @release:    resource cleanup callback
>>>>>      * @context:    caller-provided context
>>>>>      */
>>>>> @@ -100,7 +89,6 @@ struct objpool_head {
>>>>>     	refcount_t              ref;
>>>>>     	unsigned long           flags;
>>>>>     	struct objpool_slot   **cpu_slots;
>>>>> -	int                    *slot_sizes;
>>>>>     	objpool_fini_cb         release;
>>>>>     	void                   *context;
>>>>>     };
>>>>> @@ -108,9 +96,12 @@ struct objpool_head {
>>>>>     #define OBJPOOL_FROM_VMALLOC	(0x800000000)	/* objpool allocated from vmalloc area */
>>>>>     #define OBJPOOL_HAVE_OBJECTS	(0x400000000)	/* objects allocated along with objpool */
>>>>>     
>>>>> +#define OBJPOOL_NR_OBJECT_MAX	(1 << 24)
>>>>> +#define OBJPOOL_OBJECT_SIZE_MAX	(1 << 16)
>>>>> +
>>>>>     /**
>>>>>      * objpool_init() - initialize objpool and pre-allocated objects
>>>>> - * @head:    the object pool to be initialized, declared by caller
>>>>> + * @pool:    the object pool to be initialized, declared by caller
>>>>>      * @nr_objs: total objects to be pre-allocated by this object pool
>>>>>      * @object_size: size of an object (should be > 0)
>>>>>      * @gfp:     flags for memory allocation (via kmalloc or vmalloc)
>>>>> @@ -128,47 +119,47 @@ struct objpool_head {
>>>>>      * pop (object allocation) or do clearance before each push (object
>>>>>      * reclamation).
>>>>>      */
>>>>> -int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
>>>>> +int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
>>>>>     		 gfp_t gfp, void *context, objpool_init_obj_cb objinit,
>>>>>     		 objpool_fini_cb release);
>>>>>     
>>>>>     /**
>>>>>      * objpool_pop() - allocate an object from objpool
>>>>> - * @head: object pool
>>>>> + * @pool: object pool
>>>>>      *
>>>>>      * return value: object ptr or NULL if failed
>>>>>      */
>>>>> -void *objpool_pop(struct objpool_head *head);
>>>>> +void *objpool_pop(struct objpool_head *pool);
>>>>>     
>>>>>     /**
>>>>>      * objpool_push() - reclaim the object and return back to objpool
>>>>>      * @obj:  object ptr to be pushed to objpool
>>>>> - * @head: object pool
>>>>> + * @pool: object pool
>>>>>      *
>>>>>      * return: 0 or error code (it fails only when user tries to push
>>>>>      * the same object multiple times or wrong "objects" into objpool)
>>>>>      */
>>>>> -int objpool_push(void *obj, struct objpool_head *head);
>>>>> +int objpool_push(void *obj, struct objpool_head *pool);
>>>>>     
>>>>>     /**
>>>>>      * objpool_drop() - discard the object and deref objpool
>>>>>      * @obj:  object ptr to be discarded
>>>>> - * @head: object pool
>>>>> + * @pool: object pool
>>>>>      *
>>>>>      * return: 0 if objpool was released or error code
>>>>>      */
>>>>> -int objpool_drop(void *obj, struct objpool_head *head);
>>>>> +int objpool_drop(void *obj, struct objpool_head *pool);
>>>>>     
>>>>>     /**
>>>>>      * objpool_free() - release objpool forcely (all objects to be freed)
>>>>> - * @head: object pool to be released
>>>>> + * @pool: object pool to be released
>>>>>      */
>>>>> -void objpool_free(struct objpool_head *head);
>>>>> +void objpool_free(struct objpool_head *pool);
>>>>>     
>>>>>     /**
>>>>>      * objpool_fini() - deref object pool (also releasing unused objects)
>>>>> - * @head: object pool to be dereferenced
>>>>> + * @pool: object pool to be dereferenced
>>>>>      */
>>>>> -void objpool_fini(struct objpool_head *head);
>>>>> +void objpool_fini(struct objpool_head *pool);
>>>>>     
>>>>>     #endif /* _LINUX_OBJPOOL_H */
>>>>> diff --git a/lib/objpool.c b/lib/objpool.c
>>>>> index 22e752371820..f8e8f70d7253 100644
>>>>> --- a/lib/objpool.c
>>>>> +++ b/lib/objpool.c
>>>>> @@ -15,104 +15,55 @@
>>>>>      * Copyright: wuqiang.matt@bytedance.com
>>>>>      */
>>>>>     
>>>>> -#define SLOT_AGES(s) ((uint32_t *)((char *)(s) + sizeof(struct objpool_slot)))
>>>>> -#define SLOT_ENTS(s) ((void **)((char *)(s) + sizeof(struct objpool_slot) + \
>>>>> -			(sizeof(uint32_t) << (s)->bits)))
>>>>> -#define SLOT_OBJS(s) ((void *)((char *)(s) + sizeof(struct objpool_slot) + \
>>>>> -			((sizeof(uint32_t) + sizeof(void *)) << (s)->bits)))
>>>>> -#define SLOT_CORE(n) cpumask_nth((n) % num_possible_cpus(), cpu_possible_mask)
>>>>> -
>>>>> -/* compute the suitable num of objects to be managed per slot */
>>>>> -static int objpool_nobjs(int size)
>>>>> -{
>>>>> -	return rounddown_pow_of_two((size - sizeof(struct objpool_slot)) /
>>>>> -			(sizeof(uint32_t) + sizeof(void *)));
>>>>> -}
>>>>> -
>>>>>     /* allocate and initialize percpu slots */
>>>>>     static int
>>>>> -objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
>>>>> -			void *context, objpool_init_obj_cb objinit)
>>>>> +objpool_init_percpu_slots(struct objpool_head *pool, void *context,
>>>>> +			  objpool_init_obj_cb objinit)
>>>>>     {
>>>>> -	int i, j, n, size, objsz, cpu = 0, nents = head->capacity;
>>>>> -
>>>>> -	/* aligned object size by sizeof(void *) */
>>>>> -	objsz = ALIGN(head->obj_size, sizeof(void *));
>>>>> -	/* shall we allocate objects along with percpu-slot */
>>>>> -	if (objsz)
>>>>> -		head->flags |= OBJPOOL_HAVE_OBJECTS;
>>>>> -
>>>>> -	/* vmalloc is used in default to allocate percpu-slots */
>>>>> -	if (!(head->gfp & GFP_ATOMIC))
>>>>> -		head->flags |= OBJPOOL_FROM_VMALLOC;
>>>>> -
>>>>> -	for (i = 0; i < head->nr_cpus; i++) {
>>>>> -		struct objpool_slot *os;
>>>>> +	int i, j, n, size, slot_size, cpu_count = 0;
>>>>> +	struct objpool_slot *slot;
>>>>>     
>>>>> +	for (i = 0; i < pool->nr_cpus; i++) {
>>>>>     		/* skip the cpus which could never be present */
>>>>>     		if (!cpu_possible(i))
>>>>>     			continue;
>>>>>     
>>>>>     		/* compute how many objects to be managed by this slot */
>>>>> -		n = nobjs / num_possible_cpus();
>>>>> -		if (cpu < (nobjs % num_possible_cpus()))
>>>>> +		n = pool->nr_objs / num_possible_cpus();
>>>>> +		if (cpu_count < (pool->nr_objs % num_possible_cpus()))
>>>>>     			n++;
>>>>> -		size = sizeof(struct objpool_slot) + sizeof(void *) * nents +
>>>>> -		       sizeof(uint32_t) * nents + objsz * n;
>>>>> +		cpu_count++;
>>>>> +
>>>>> +		slot_size = struct_size(slot, entries, pool->capacity);
>>>>> +		size = slot_size + pool->obj_size * n;
>>>>>     
>>>>>     		/*
>>>>>     		 * here we allocate percpu-slot & objects together in a single
>>>>> -		 * allocation, taking advantage of warm caches and TLB hits as
>>>>> -		 * vmalloc always aligns the request size to pages
>>>>> +		 * allocation, taking advantage of NUMA locality.
>>>>>     		 */
>>>>> -		if (head->flags & OBJPOOL_FROM_VMALLOC)
>>>>> -			os = __vmalloc_node(size, sizeof(void *), head->gfp,
>>>>> +		if (pool->flags & OBJPOOL_FROM_VMALLOC)
>>>>> +			slot = __vmalloc_node(size, sizeof(void *), pool->gfp,
>>>>>     				cpu_to_node(i), __builtin_return_address(0));
>>>>>     		else
>>>>> -			os = kmalloc_node(size, head->gfp, cpu_to_node(i));
>>>>> -		if (!os)
>>>>> +			slot = kmalloc_node(size, pool->gfp, cpu_to_node(i));
>>>>> +		if (!slot)
>>>>>     			return -ENOMEM;
>>>>>     
>>>>>     		/* initialize percpu slot for the i-th slot */
>>>>> -		memset(os, 0, size);
>>>>> -		os->bits = ilog2(head->capacity);
>>>>> -		os->mask = head->capacity - 1;
>>>>> -		head->cpu_slots[i] = os;
>>>>> -		head->slot_sizes[i] = size;
>>>>> -		cpu = cpu + 1;
>>>>> -
>>>>> -		/*
>>>>> -		 * manually set head & tail to avoid possible conflict:
>>>>> -		 * We assume that the head item is ready for retrieval
>>>>> -		 * iff head is equal to ages[head & mask]. but ages is
>>>>> -		 * initialized as 0, so in view of the caller of pop(),
>>>>> -		 * the 1st item (0th) is always ready, but the reality
>>>>> -		 * could be: push() is stalled before the final update,
>>>>> -		 * thus the item being inserted will be lost forever
>>>>> -		 */
>>>>> -		os->head = os->tail = head->capacity;
>>>>> -
>>>>> -		if (!objsz)
>>>>> -			continue;
>>>>> +		memset(slot, 0, size);
>>>>> +		slot->mask = pool->capacity - 1;
>>>>> +		pool->cpu_slots[i] = slot;
>>>>>     
>>>>>     		for (j = 0; j < n; j++) {
>>>>> -			uint32_t *ages = SLOT_AGES(os);
>>>>> -			void **ents = SLOT_ENTS(os);
>>>>> -			void *obj = SLOT_OBJS(os) + j * objsz;
>>>>> -			uint32_t ie = os->tail & os->mask;
>>>>> +			void *obj = (void *)slot + slot_size + pool->obj_size * j;
>>>>>     
>>>>> -			/* perform object initialization */
>>>>>     			if (objinit) {
>>>>>     				int rc = objinit(obj, context);
>>>>>     				if (rc)
>>>>>     					return rc;
>>>>>     			}
>>>>> -
>>>>> -			/* add obj into the ring array */
>>>>> -			ents[ie] = obj;
>>>>> -			ages[ie] = os->tail;
>>>>> -			os->tail++;
>>>>> -			head->nr_objs++;
>>>>> +			slot->entries[j] = obj;
>>>>> +			slot->tail++;
>>>>>     		}
>>>>>     	}
>>>>>     
>>>>> @@ -120,164 +71,135 @@ objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
>>>>>     }
>>>>>     
>>>>>     /* cleanup all percpu slots of the object pool */
>>>>> -static void objpool_fini_percpu_slots(struct objpool_head *head)
>>>>> +static void objpool_fini_percpu_slots(struct objpool_head *pool)
>>>>>     {
>>>>>     	int i;
>>>>>     
>>>>> -	if (!head->cpu_slots)
>>>>> +	if (!pool->cpu_slots)
>>>>>     		return;
>>>>>     
>>>>> -	for (i = 0; i < head->nr_cpus; i++) {
>>>>> -		if (!head->cpu_slots[i])
>>>>> -			continue;
>>>>> -		if (head->flags & OBJPOOL_FROM_VMALLOC)
>>>>> -			vfree(head->cpu_slots[i]);
>>>>> -		else
>>>>> -			kfree(head->cpu_slots[i]);
>>>>> -	}
>>>>> -	kfree(head->cpu_slots);
>>>>> -	head->cpu_slots = NULL;
>>>>> -	head->slot_sizes = NULL;
>>>>> +	for (i = 0; i < pool->nr_cpus; i++)
>>>>> +		kvfree(pool->cpu_slots[i]);
>>>>> +	kfree(pool->cpu_slots);
>>>>>     }
>>>>>     
>>>>>     /* initialize object pool and pre-allocate objects */
>>>>> -int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
>>>>> +int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
>>>>>     		gfp_t gfp, void *context, objpool_init_obj_cb objinit,
>>>>>     		objpool_fini_cb release)
>>>>>     {
>>>>>     	int nents, rc;
>>>>>     
>>>>>     	/* check input parameters */
>>>>> -	if (nr_objs <= 0 || object_size <= 0)
>>>>> +	if (nr_objs <= 0 || object_size <= 0 ||
>>>>> +	    nr_objs > OBJPOOL_NR_OBJECT_MAX || object_size > OBJPOOL_OBJECT_SIZE_MAX)
>>>>> +		return -EINVAL;
>>>>> +
>>>>> +	/* Align up to unsigned long size */
>>>>> +	object_size = ALIGN(object_size, sizeof(unsigned long));
>>>>> +
>>>>> +	/*
>>>>> +	 * To avoid completely filling up the entries array in the per-cpu slot,
>>>>> +	 * use a power of 2 which is not less than N + 1. Thus the tail never
>>>>> +	 * catches up to the head, and head/tail can be used as sequential
>>>>> +	 * numbers.
>>>>> +	 */
>>>>> +	nents = roundup_pow_of_two(nr_objs + 1);
>>>>> +	if (!nents)
>>>>>     		return -EINVAL;
>>>>>     
>>>>> -	/* calculate percpu slot size (rounded to pow of 2) */
>>>>> -	nents = max_t(int, roundup_pow_of_two(nr_objs),
>>>>> -			objpool_nobjs(L1_CACHE_BYTES));
>>>>> -
>>>>> -	/* initialize objpool head */
>>>>> -	memset(head, 0, sizeof(struct objpool_head));
>>>>> -	head->nr_cpus = nr_cpu_ids;
>>>>> -	head->obj_size = object_size;
>>>>> -	head->capacity = nents;
>>>>> -	head->gfp = gfp & ~__GFP_ZERO;
>>>>> -	head->context = context;
>>>>> -	head->release = release;
>>>>> -
>>>>> -	/* allocate array for percpu slots */
>>>>> -	head->cpu_slots = kzalloc(head->nr_cpus * sizeof(void *) +
>>>>> -			       head->nr_cpus * sizeof(int), head->gfp);
>>>>> -	if (!head->cpu_slots)
>>>>> +	/* initialize objpool pool */
>>>>> +	memset(pool, 0, sizeof(struct objpool_head));
>>>>> +	pool->nr_cpus = nr_cpu_ids;
>>>>> +	pool->obj_size = object_size;
>>>>> +	pool->nr_objs = nr_objs;
>>>>> +	/* just to avoid completely filling the per-cpu ring array */
>>>>> +	pool->capacity = nents;
>>>>> +	pool->gfp = gfp & ~__GFP_ZERO;
>>>>> +	pool->context = context;
>>>>> +	pool->release = release;
>>>>> +	/* vmalloc is used in default to allocate percpu-slots */
>>>>> +	if (!(pool->gfp & GFP_ATOMIC))
>>>>> +		pool->flags |= OBJPOOL_FROM_VMALLOC;
>>>>> +
>>>>> +	pool->cpu_slots = kzalloc(pool->nr_cpus * sizeof(void *), pool->gfp);
>>>>> +	if (!pool->cpu_slots)
>>>>>     		return -ENOMEM;
>>>>> -	head->slot_sizes = (int *)&head->cpu_slots[head->nr_cpus];
>>>>>     
>>>>>     	/* initialize per-cpu slots */
>>>>> -	rc = objpool_init_percpu_slots(head, nr_objs, context, objinit);
>>>>> +	rc = objpool_init_percpu_slots(pool, context, objinit);
>>>>>     	if (rc)
>>>>> -		objpool_fini_percpu_slots(head);
>>>>> +		objpool_fini_percpu_slots(pool);
>>>>>     	else
>>>>> -		refcount_set(&head->ref, nr_objs + 1);
>>>>> +		refcount_set(&pool->ref, nr_objs + 1);
>>>>>     
>>>>>     	return rc;
>>>>>     }
>>>>>     EXPORT_SYMBOL_GPL(objpool_init);
>>>>>     
>>>>>     /* adding object to slot, abort if the slot was already full */
>>>>> -static inline int objpool_try_add_slot(void *obj, struct objpool_slot *os)
>>>>> +static inline int objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu)
>>>>>     {
>>>>> -	uint32_t *ages = SLOT_AGES(os);
>>>>> -	void **ents = SLOT_ENTS(os);
>>>>> -	uint32_t head, tail;
>>>>> +	struct objpool_slot *slot = pool->cpu_slots[cpu];
>>>>> +	uint32_t tail, next;
>>>>>     
>>>>>     	do {
>>>>> -		/* perform memory loading for both head and tail */
>>>>> -		head = READ_ONCE(os->head);
>>>>> -		tail = READ_ONCE(os->tail);
>>>>> -		/* just abort if slot is full */
>>>>> -		if (tail - head > os->mask)
>>>>> -			return -ENOENT;
>>>>> -		/* try to extend tail by 1 using CAS to avoid races */
>>>>> -		if (try_cmpxchg_acquire(&os->tail, &tail, tail + 1))
>>>>> -			break;
>>>>> -	} while (1);
>>>>> +		uint32_t head = READ_ONCE(slot->head);
>>>>>     
>>>>> -	/* the tail-th of slot is reserved for the given obj */
>>>>> -	WRITE_ONCE(ents[tail & os->mask], obj);
>>>>> -	/* update epoch id to make this object available for pop() */
>>>>> -	smp_store_release(&ages[tail & os->mask], tail);
>>>>> +		tail = READ_ONCE(slot->tail);
>>>>> +		next = tail + 1;
>>>>> +
>>>>> +		/* This must never happen because capacity >= N + 1 */
>>>>> +		if (WARN_ON_ONCE((next > head && next - head > pool->nr_objs) ||
>>>>> +				 (next < head && next > head + pool->nr_objs)))
>>>>> +			return -EINVAL;
>>>>> +
>>>>> +	} while (!try_cmpxchg_acquire(&slot->tail, &tail, next));
>>>>> +
>>>>> +	WRITE_ONCE(slot->entries[tail & slot->mask], obj);
>>>>>     	return 0;
>>>>>     }
>>>>>     
>>>>>     /* reclaim an object to object pool */
>>>>> -int objpool_push(void *obj, struct objpool_head *oh)
>>>>> +int objpool_push(void *obj, struct objpool_head *pool)
>>>>>     {
>>>>>     	unsigned long flags;
>>>>> -	int cpu, rc;
>>>>> +	int rc;
>>>>>     
>>>>>     	/* disable local irq to avoid preemption & interruption */
>>>>>     	raw_local_irq_save(flags);
>>>>> -	cpu = raw_smp_processor_id();
>>>>> -	do {
>>>>> -		rc = objpool_try_add_slot(obj, oh->cpu_slots[cpu]);
>>>>> -		if (!rc)
>>>>> -			break;
>>>>> -		cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
>>>>> -	} while (1);
>>>>> +	rc = objpool_try_add_slot(obj, pool, raw_smp_processor_id());
>>>>>     	raw_local_irq_restore(flags);
>>>>>     
>>>>>     	return rc;
>>>>>     }
>>>>>     EXPORT_SYMBOL_GPL(objpool_push);
>>>>>     
>>>>> -/* drop the allocated object, rather reclaim it to objpool */
>>>>> -int objpool_drop(void *obj, struct objpool_head *head)
>>>>> -{
>>>>> -	if (!obj || !head)
>>>>> -		return -EINVAL;
>>>>> -
>>>>> -	if (refcount_dec_and_test(&head->ref)) {
>>>>> -		objpool_free(head);
>>>>> -		return 0;
>>>>> -	}
>>>>> -
>>>>> -	return -EAGAIN;
>>>>> -}
>>>>> -EXPORT_SYMBOL_GPL(objpool_drop);
>>>>> -
>>>>>     /* try to retrieve object from slot */
>>>>> -static inline void *objpool_try_get_slot(struct objpool_slot *os)
>>>>> +static inline void *objpool_try_get_slot(struct objpool_slot *slot)
>>>>>     {
>>>>> -	uint32_t *ages = SLOT_AGES(os);
>>>>> -	void **ents = SLOT_ENTS(os);
>>>>>     	/* do memory load of head to local head */
>>>>> -	uint32_t head = smp_load_acquire(&os->head);
>>>>> +	uint32_t head = smp_load_acquire(&slot->head);
>>>>>     
>>>>>     	/* loop if slot isn't empty */
>>>>> -	while (head != READ_ONCE(os->tail)) {
>>>>> -		uint32_t id = head & os->mask, prev = head;
>>>>> +	while (head != READ_ONCE(slot->tail)) {
>>>>>     
>>>>>     		/* do prefetching of object ents */
>>>>> -		prefetch(&ents[id]);
>>>>> -
>>>>> -		/* check whether this item was ready for retrieval */
>>>>> -		if (smp_load_acquire(&ages[id]) == head) {
>>>>> -			/* node must have been udpated by push() */
>>>>> -			void *node = READ_ONCE(ents[id]);
>>>>> -			/* commit and move forward head of the slot */
>>>>> -			if (try_cmpxchg_release(&os->head, &head, head + 1))
>>>>> -				return node;
>>>>> -			/* head was already updated by others */
>>>>> -		}
>>>>> +		prefetch(&slot->entries[head & slot->mask]);
>>>>> +
>>>>> +		/* commit and move forward head of the slot */
>>>>> +		if (try_cmpxchg_release(&slot->head, &head, head + 1))
>>>>> +			/*
>>>>> +			 * TBD: check whether the tail/head counter has wrapped
>>>>> +			 * around and warn if it is broken. But this happens only
>>>>> +			 * if this process slows down a lot and another CPU
>>>>> +			 * updates the head/tail just 2^32 + 1 times while this
>>>>> +			 * slot is empty.
>>>>> +			 */
>>>>> +			return slot->entries[head & slot->mask];
>>>>>     
>>>>>     		/* re-load head from memory and continue trying */
>>>>> -		head = READ_ONCE(os->head);
>>>>> -		/*
>>>>> -		 * head stays unchanged, so it's very likely there's an
>>>>> -		 * ongoing push() on other cpu nodes but yet not update
>>>>> -		 * ages[] to mark it's completion
>>>>> -		 */
>>>>> -		if (head == prev)
>>>>> -			break;
>>>>> +		head = READ_ONCE(slot->head);
>>>>>     	}
>>>>>     
>>>>>     	return NULL;
>>>>> @@ -307,32 +229,42 @@ void *objpool_pop(struct objpool_head *head)
>>>>>     EXPORT_SYMBOL_GPL(objpool_pop);
>>>>>     
>>>>>     /* release whole objpool forcely */
>>>>> -void objpool_free(struct objpool_head *head)
>>>>> +void objpool_free(struct objpool_head *pool)
>>>>>     {
>>>>> -	if (!head->cpu_slots)
>>>>> +	if (!pool->cpu_slots)
>>>>>     		return;
>>>>>     
>>>>>     	/* release percpu slots */
>>>>> -	objpool_fini_percpu_slots(head);
>>>>> +	objpool_fini_percpu_slots(pool);
>>>>>     
>>>>>     	/* call user's cleanup callback if provided */
>>>>> -	if (head->release)
>>>>> -		head->release(head, head->context);
>>>>> +	if (pool->release)
>>>>> +		pool->release(pool, pool->context);
>>>>>     }
>>>>>     EXPORT_SYMBOL_GPL(objpool_free);
>>>>>     
>>>>> -/* drop unused objects and defref objpool for releasing */
>>>>> -void objpool_fini(struct objpool_head *head)
>>>>> +/* drop the allocated object rather than reclaiming it to the objpool */
>>>>> +int objpool_drop(void *obj, struct objpool_head *pool)
>>>>>     {
>>>>> -	void *nod;
>>>>> +	if (!obj || !pool)
>>>>> +		return -EINVAL;
>>>>>     
>>>>> -	do {
>>>>> -		/* grab object from objpool and drop it */
>>>>> -		nod = objpool_pop(head);
>>>>> +	if (refcount_dec_and_test(&pool->ref)) {
>>>>> +		objpool_free(pool);
>>>>> +		return 0;
>>>>> +	}
>>>>> +
>>>>> +	return -EAGAIN;
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(objpool_drop);
>>>>> +
>>>>> +/* drop unused objects and deref the objpool for releasing */
>>>>> +void objpool_fini(struct objpool_head *pool)
>>>>> +{
>>>>> +	void *obj;
>>>>>     
>>>>> -		/* drop the extra ref of objpool */
>>>>> -		if (refcount_dec_and_test(&head->ref))
>>>>> -			objpool_free(head);
>>>>> -	} while (nod);
>>>>> +	/* grab object from objpool and drop it */
>>>>> +	while ((obj = objpool_pop(pool)))
>>>>> +		objpool_drop(obj, pool);
>>>>>     }
>>>>>     EXPORT_SYMBOL_GPL(objpool_fini);
>>>>
>>>
>>>
>>
> 
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2023-10-13  3:03 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-05  1:52 [PATCH v9 0/5] lib,kprobes: kretprobe scalability improvement wuqiang.matt
2023-09-05  1:52 ` [PATCH v9 1/5] lib: objpool added: ring-array based lockless MPMC wuqiang.matt
2023-09-23  9:48   ` Masami Hiramatsu
2023-10-08 18:40     ` wuqiang
2023-10-09 14:19       ` Masami Hiramatsu
2023-10-12 16:16         ` wuqiang.matt
2023-09-24 14:21   ` Masami Hiramatsu
2023-09-25  9:42   ` Masami Hiramatsu
2023-10-08 19:04     ` wuqiang
2023-10-09  9:23     ` wuqiang
2023-10-09 13:51       ` Masami Hiramatsu
2023-10-12 14:02       ` Masami Hiramatsu
2023-10-12 17:36         ` wuqiang.matt
2023-10-13  1:59           ` Masami Hiramatsu
2023-10-13  3:03             ` wuqiang.matt
2023-09-05  1:52 ` [PATCH v9 2/5] lib: objpool test module added wuqiang.matt
2023-09-05  1:52 ` [PATCH v9 3/5] kprobes: kretprobe scalability improvement with objpool wuqiang.matt
2023-10-07  2:02   ` Masami Hiramatsu
2023-10-08 18:31     ` wuqiang
2023-10-08 23:20       ` Masami Hiramatsu
2023-09-05  1:52 ` [PATCH v9 4/5] kprobes: freelist.h removed wuqiang.matt
2023-09-05  1:52 ` [PATCH v9 5/5] MAINTAINERS: objpool added wuqiang.matt
2023-09-23  8:57 ` [PATCH v9 0/5] lib,kprobes: kretprobe scalability improvement Masami Hiramatsu
2023-10-08 18:33   ` wuqiang
2023-10-08 23:17     ` Masami Hiramatsu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).