* [PATCH v2 bpf-next 00/12] bpf: BPF specific memory allocator.
@ 2022-08-17 21:04 Alexei Starovoitov
  2022-08-17 21:04 ` [PATCH v2 bpf-next 01/12] bpf: Introduce any context " Alexei Starovoitov
                   ` (11 more replies)
  0 siblings, 12 replies; 19+ messages in thread
From: Alexei Starovoitov @ 2022-08-17 21:04 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Introduce any context BPF specific memory allocator.

Tracing BPF programs can attach to kprobe and fentry. Hence they
run in unknown context where calling plain kmalloc() might not be safe.
Front-end kmalloc() with per-cpu cache of free elements.
Refill this cache asynchronously from irq_work.

v1->v2:
- Moved unsafe direct call_rcu() from hash map into safe place inside bpf_mem_alloc. Patches 7 and 9.
- Optimized atomic_inc/dec in hash map with percpu_counter. Patch 6.
- Tuned watermarks per allocation size. Patch 8.
- Adopted the same approach for per-cpu allocation. Patch 10.
- Fully converted hash map to bpf_mem_alloc. Patch 11.
- Removed tracing prog restriction on map types. Combination of all patches and final patch 12.

v1 thread:
https://lore.kernel.org/bpf/20220623003230.37497-1-alexei.starovoitov@gmail.com/

LWN article:
https://lwn.net/Articles/899274/

There is a lot more work ahead, but this set is a useful base.
Future work:
- allow sleepable bpf progs to use dynamically allocated maps
- expose bpf_mem_alloc as uapi FD to be used in dynptr_alloc, kptr_alloc
- add a sysctl to force bpf_mem_alloc in hash maps, when it is safe, even if
  pre-allocation was requested, in order to reduce memory consumption
- convert lru map to bpf_mem_alloc

Alexei Starovoitov (12):
  bpf: Introduce any context BPF specific memory allocator.
  bpf: Convert hash map to bpf_mem_alloc.
  selftests/bpf: Improve test coverage of test_maps
  samples/bpf: Reduce syscall overhead in map_perf_test.
  bpf: Relax the requirement to use preallocated hash maps in tracing
    progs.
  bpf: Optimize element count in non-preallocated hash map.
  bpf: Optimize call_rcu in non-preallocated hash map.
  bpf: Adjust low/high watermarks in bpf_mem_cache
  bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
  bpf: Add percpu allocation support to bpf_mem_alloc.
  bpf: Convert percpu hash map to per-cpu bpf_mem_alloc.
  bpf: Remove tracing program restriction on map types

 include/linux/bpf_mem_alloc.h             |  26 +
 kernel/bpf/Makefile                       |   2 +-
 kernel/bpf/hashtab.c                      | 120 ++--
 kernel/bpf/memalloc.c                     | 638 ++++++++++++++++++++++
 kernel/bpf/syscall.c                      |   5 +-
 kernel/bpf/verifier.c                     |  29 -
 samples/bpf/map_perf_test_kern.c          |  44 +-
 samples/bpf/map_perf_test_user.c          |   2 +-
 tools/testing/selftests/bpf/progs/timer.c |  11 -
 tools/testing/selftests/bpf/test_maps.c   |  38 +-
 10 files changed, 811 insertions(+), 104 deletions(-)
 create mode 100644 include/linux/bpf_mem_alloc.h
 create mode 100644 kernel/bpf/memalloc.c

-- 
2.30.2



* [PATCH v2 bpf-next 01/12] bpf: Introduce any context BPF specific memory allocator.
  2022-08-17 21:04 [PATCH v2 bpf-next 00/12] bpf: BPF specific memory allocator Alexei Starovoitov
@ 2022-08-17 21:04 ` Alexei Starovoitov
  2022-08-17 23:51   ` Kumar Kartikeya Dwivedi
  2022-08-17 21:04 ` [PATCH v2 bpf-next 02/12] bpf: Convert hash map to bpf_mem_alloc Alexei Starovoitov
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 19+ messages in thread
From: Alexei Starovoitov @ 2022-08-17 21:04 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Tracing BPF programs can attach to kprobe and fentry. Hence they
run in unknown context where calling plain kmalloc() might not be safe.

Front-end kmalloc() with minimal per-cpu cache of free elements.
Refill this cache asynchronously from irq_work.

BPF programs always run with migration disabled.
It's safe to allocate from cache of the current cpu with irqs disabled.
Freeing is always done into the bucket of the current cpu as well.
irq_work trims extra free elements from buckets with kfree
and refills them with kmalloc, so global kmalloc logic takes care
of freeing objects allocated by one cpu and freed on another.

struct bpf_mem_alloc supports two modes:
- When size != 0 create kmem_cache and bpf_mem_cache for each cpu.
  This is typical bpf hash map use case when all elements have equal size.
- When size == 0 allocate 11 bpf_mem_cache-s for each cpu, then rely on
  kmalloc/kfree. Max allocation size is 4096 in this case.
  This is bpf_dynptr and bpf_kptr use case.

bpf_mem_alloc/bpf_mem_free are bpf specific 'wrappers' of kmalloc/kfree.
bpf_mem_cache_alloc/bpf_mem_cache_free are 'wrappers' of kmem_cache_alloc/kmem_cache_free.

The allocators are NMI-safe from bpf programs only. They are not NMI-safe in general.
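
For reference, below is a minimal usage sketch of the API declared in
bpf_mem_alloc.h. It only shows how the two modes are driven; 'struct my_elem'
and example() are illustrative and not part of the patch, and the sketch
assumes the caller runs with migration disabled, as bpf progs and sys_bpf do.

#include <linux/bpf_mem_alloc.h>

struct my_elem {
        char data[32];
};

static int example(void)
{
        struct bpf_mem_alloc fixed_ma = {}, any_ma = {};
        struct my_elem *e;
        void *p;
        int err;

        /* size != 0: one kmem_cache sized for my_elem (hash map use case) */
        err = bpf_mem_alloc_init(&fixed_ma, sizeof(struct my_elem));
        if (err)
                return err;
        e = bpf_mem_cache_alloc(&fixed_ma);
        if (e)
                bpf_mem_cache_free(&fixed_ma, e);

        /* size == 0: 11 size buckets, kmalloc-like API up to 4096 bytes
         * (bpf_dynptr and bpf_kptr use case)
         */
        err = bpf_mem_alloc_init(&any_ma, 0);
        if (!err) {
                p = bpf_mem_alloc(&any_ma, 64);
                if (p)
                        bpf_mem_free(&any_ma, p);
                bpf_mem_alloc_destroy(&any_ma);
        }
        bpf_mem_alloc_destroy(&fixed_ma);
        return err;
}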

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf_mem_alloc.h |  26 ++
 kernel/bpf/Makefile           |   2 +-
 kernel/bpf/memalloc.c         | 526 ++++++++++++++++++++++++++++++++++
 3 files changed, 553 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/bpf_mem_alloc.h
 create mode 100644 kernel/bpf/memalloc.c

diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
new file mode 100644
index 000000000000..804733070f8d
--- /dev/null
+++ b/include/linux/bpf_mem_alloc.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+#ifndef _BPF_MEM_ALLOC_H
+#define _BPF_MEM_ALLOC_H
+#include <linux/compiler_types.h>
+
+struct bpf_mem_cache;
+struct bpf_mem_caches;
+
+struct bpf_mem_alloc {
+	struct bpf_mem_caches __percpu *caches;
+	struct bpf_mem_cache __percpu *cache;
+};
+
+int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size);
+void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma);
+
+/* kmalloc/kfree equivalent: */
+void *bpf_mem_alloc(struct bpf_mem_alloc *ma, size_t size);
+void bpf_mem_free(struct bpf_mem_alloc *ma, void *ptr);
+
+/* kmem_cache_alloc/free equivalent: */
+void *bpf_mem_cache_alloc(struct bpf_mem_alloc *ma);
+void bpf_mem_cache_free(struct bpf_mem_alloc *ma, void *ptr);
+
+#endif /* _BPF_MEM_ALLOC_H */
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 057ba8e01e70..11fb9220909b 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -13,7 +13,7 @@ obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
 obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
 obj-$(CONFIG_BPF_JIT) += trampoline.o
-obj-$(CONFIG_BPF_SYSCALL) += btf.o
+obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o
 obj-$(CONFIG_BPF_JIT) += dispatcher.o
 ifeq ($(CONFIG_NET),y)
 obj-$(CONFIG_BPF_SYSCALL) += devmap.o
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
new file mode 100644
index 000000000000..8de268922380
--- /dev/null
+++ b/kernel/bpf/memalloc.c
@@ -0,0 +1,526 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+#include <linux/mm.h>
+#include <linux/llist.h>
+#include <linux/bpf.h>
+#include <linux/irq_work.h>
+#include <linux/bpf_mem_alloc.h>
+#include <linux/memcontrol.h>
+
+/* Any context (including NMI) BPF specific memory allocator.
+ *
+ * Tracing BPF programs can attach to kprobe and fentry. Hence they
+ * run in unknown context where calling plain kmalloc() might not be safe.
+ *
+ * Front-end kmalloc() with per-cpu per-bucket cache of free elements.
+ * Refill this cache asynchronously from irq_work.
+ *
+ * CPU_0 buckets
+ * 16 32 64 96 128 196 256 512 1024 2048 4096
+ * ...
+ * CPU_N buckets
+ * 16 32 64 96 128 196 256 512 1024 2048 4096
+ *
+ * The buckets are prefilled at the start.
+ * BPF programs always run with migration disabled.
+ * It's safe to allocate from cache of the current cpu with irqs disabled.
+ * Freeing is always done into the bucket of the current cpu as well.
+ * irq_work trims extra free elements from buckets with kfree
+ * and refills them with kmalloc, so global kmalloc logic takes care
+ * of freeing objects allocated by one cpu and freed on another.
+ *
+ * Every allocated object is padded with extra 8 bytes that contain
+ * struct llist_node.
+ */
+#define LLIST_NODE_SZ sizeof(struct llist_node)
+
+/* similar to kmalloc, but sizeof == 8 bucket is gone */
+static u8 size_index[24] __ro_after_init = {
+	3,	/* 8 */
+	3,	/* 16 */
+	4,	/* 24 */
+	4,	/* 32 */
+	5,	/* 40 */
+	5,	/* 48 */
+	5,	/* 56 */
+	5,	/* 64 */
+	1,	/* 72 */
+	1,	/* 80 */
+	1,	/* 88 */
+	1,	/* 96 */
+	6,	/* 104 */
+	6,	/* 112 */
+	6,	/* 120 */
+	6,	/* 128 */
+	2,	/* 136 */
+	2,	/* 144 */
+	2,	/* 152 */
+	2,	/* 160 */
+	2,	/* 168 */
+	2,	/* 176 */
+	2,	/* 184 */
+	2	/* 192 */
+};
+
+static int bpf_mem_cache_idx(size_t size)
+{
+	if (!size || size > 4096)
+		return -1;
+
+	if (size <= 192)
+		return size_index[(size - 1) / 8] - 1;
+
+	return fls(size - 1) - 1;
+}
+
+#define NUM_CACHES 11
+
+struct bpf_mem_cache {
+	/* per-cpu list of free objects of size 'unit_size'.
+	 * All accesses are done with preemption disabled
+	 * with __llist_add() and __llist_del_first().
+	 */
+	struct llist_head free_llist;
+
+	/* NMI only free list.
+	 * All accesses are NMI-safe llist_add() and llist_del_first().
+	 *
+	 * Each allocated object is either on free_llist or on free_llist_nmi.
+	 * One cpu can allocate it from NMI by doing llist_del_first() from
+	 * free_llist_nmi, while another might free it back from non-NMI by
+	 * doing llist_add() into free_llist.
+	 */
+	struct llist_head free_llist_nmi;
+
+	/* kmem_cache != NULL when bpf_mem_alloc was created for specific
+	 * element size.
+	 */
+	struct kmem_cache *kmem_cache;
+	struct irq_work refill_work;
+	struct obj_cgroup *objcg;
+	int unit_size;
+	/* count of objects in free_llist */
+	int free_cnt;
+	/* count of objects in free_llist_nmi */
+	atomic_t free_cnt_nmi;
+	/* flag to refill nmi list too */
+	bool refill_nmi_list;
+};
+
+struct bpf_mem_caches {
+	struct bpf_mem_cache cache[NUM_CACHES];
+};
+
+static struct llist_node notrace *__llist_del_first(struct llist_head *head)
+{
+	struct llist_node *entry, *next;
+
+	entry = head->first;
+	if (!entry)
+		return NULL;
+	next = entry->next;
+	head->first = next;
+	return entry;
+}
+
+#define BATCH 48
+#define LOW_WATERMARK 32
+#define HIGH_WATERMARK 96
+/* Assuming the average number of elements per bucket is 64, when all buckets
+ * are used the total memory will be: 64*16*32 + 64*32*32 + 64*64*32 + ... +
+ * 64*4096*32 ~ 20Mbyte
+ */
+
+/* extra macro useful for testing by randomizing in_nmi condition */
+#define bpf_in_nmi() in_nmi()
+
+static void *__alloc(struct bpf_mem_cache *c, int node)
+{
+	/* Allocate, but don't deplete atomic reserves that typical
+	 * GFP_ATOMIC would do. irq_work runs on this cpu and kmalloc
+	 * will allocate from the current numa node which is what we
+	 * want here.
+	 */
+	gfp_t flags = GFP_NOWAIT | __GFP_NOWARN | __GFP_ACCOUNT;
+
+	if (c->kmem_cache)
+		return kmem_cache_alloc_node(c->kmem_cache, flags, node);
+
+	return kmalloc_node(c->unit_size, flags, node);
+}
+
+static struct mem_cgroup *get_memcg(const struct bpf_mem_cache *c)
+{
+#ifdef CONFIG_MEMCG_KMEM
+	if (c->objcg)
+		return get_mem_cgroup_from_objcg(c->objcg);
+#endif
+
+#ifdef CONFIG_MEMCG
+	return root_mem_cgroup;
+#else
+	return NULL;
+#endif
+}
+
+/* Mostly runs from irq_work except __init phase. */
+static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
+{
+	struct mem_cgroup *memcg = NULL, *old_memcg;
+	unsigned long flags;
+	void *obj;
+	int i;
+
+	memcg = get_memcg(c);
+	old_memcg = set_active_memcg(memcg);
+	for (i = 0; i < cnt; i++) {
+		obj = __alloc(c, node);
+		if (!obj)
+			break;
+		if (IS_ENABLED(CONFIG_PREEMPT_RT))
+			/* In RT irq_work runs in per-cpu kthread, so we have
+			 * to disable interrupts to avoid race with
+			 * bpf_mem_alloc/free. Note the read of free_cnt in
+			 * bpf_mem_refill is racy in RT. It's ok to do.
+			 */
+			local_irq_save(flags);
+		__llist_add(obj, &c->free_llist);
+		c->free_cnt++;
+		if (IS_ENABLED(CONFIG_PREEMPT_RT))
+			local_irq_restore(flags);
+	}
+	set_active_memcg(old_memcg);
+	mem_cgroup_put(memcg);
+}
+
+/* Refill NMI specific llist. Mostly runs from irq_work except __init phase. */
+static void alloc_bulk_nmi(struct bpf_mem_cache *c, int cnt, int node)
+{
+	struct mem_cgroup *memcg = NULL, *old_memcg;
+	void *obj;
+	int i;
+
+	memcg = get_memcg(c);
+	old_memcg = set_active_memcg(memcg);
+	for (i = 0; i < cnt; i++) {
+		obj = __alloc(c, node);
+		if (!obj)
+			break;
+		llist_add(obj, &c->free_llist_nmi);
+		atomic_inc(&c->free_cnt_nmi);
+	}
+	set_active_memcg(old_memcg);
+	mem_cgroup_put(memcg);
+}
+
+static void free_one(struct bpf_mem_cache *c, void *obj)
+{
+	if (c->kmem_cache)
+		kmem_cache_free(c->kmem_cache, obj);
+	else
+		kfree(obj);
+}
+
+static void free_bulk(struct bpf_mem_cache *c)
+{
+	struct llist_node *llnode;
+	unsigned long flags;
+	int cnt;
+
+	do {
+		if (IS_ENABLED(CONFIG_PREEMPT_RT))
+			local_irq_save(flags);
+		llnode = __llist_del_first(&c->free_llist);
+		if (llnode)
+			cnt = --c->free_cnt;
+		else
+			cnt = 0;
+		if (IS_ENABLED(CONFIG_PREEMPT_RT))
+			local_irq_restore(flags);
+		free_one(c, llnode);
+	} while (cnt > (HIGH_WATERMARK + LOW_WATERMARK) / 2);
+}
+
+static void free_bulk_nmi(struct bpf_mem_cache *c)
+{
+	struct llist_node *llnode;
+	int cnt;
+
+	do {
+		llnode = llist_del_first(&c->free_llist_nmi);
+		if (llnode)
+			cnt = atomic_dec_return(&c->free_cnt_nmi);
+		else
+			cnt = 0;
+		free_one(c, llnode);
+	} while (cnt > (HIGH_WATERMARK + LOW_WATERMARK) / 2);
+}
+
+static void bpf_mem_refill(struct irq_work *work)
+{
+	struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache, refill_work);
+	int cnt;
+
+	cnt = c->free_cnt;
+	if (cnt < LOW_WATERMARK)
+		/* irq_work runs on this cpu and kmalloc will allocate
+		 * from the current numa node which is what we want here.
+		 */
+		alloc_bulk(c, BATCH, NUMA_NO_NODE);
+	else if (cnt > HIGH_WATERMARK)
+		free_bulk(c);
+
+	if (!c->refill_nmi_list)
+		/* don't refill NMI specific freelist
+		 * until alloc/free from NMI.
+		 */
+		return;
+	cnt = atomic_read(&c->free_cnt_nmi);
+	if (cnt < LOW_WATERMARK)
+		alloc_bulk_nmi(c, BATCH, NUMA_NO_NODE);
+	else if (cnt > HIGH_WATERMARK)
+		free_bulk_nmi(c);
+	c->refill_nmi_list = false;
+}
+
+static void notrace irq_work_raise(struct bpf_mem_cache *c, bool in_nmi)
+{
+	if (in_nmi)
+		/* Raise the flag only if in_nmi. Cannot assign it
+		 * unconditionally since subsequent non-nmi irq_work_raise
+		 * might clear it.
+		 */
+		c->refill_nmi_list = in_nmi;
+	irq_work_queue(&c->refill_work);
+}
+
+static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu)
+{
+	init_irq_work(&c->refill_work, bpf_mem_refill);
+	/* To avoid consuming memory assume that 1st run of bpf
+	 * prog won't be doing more than 4 map_update_elem from
+	 * irq disabled region
+	 */
+	alloc_bulk(c, c->unit_size < 256 ? 4 : 1, cpu_to_node(cpu));
+
+	/* NMI progs are rare. Assume they have one map_update
+	 * per prog at the very beginning.
+	 */
+	alloc_bulk_nmi(c, 1, cpu_to_node(cpu));
+}
+
+/* When size != 0 create kmem_cache and bpf_mem_cache for each cpu.
+ * This is typical bpf hash map use case when all elements have equal size.
+ *
+ * When size == 0 allocate 11 bpf_mem_cache-s for each cpu, then rely on
+ * kmalloc/kfree. Max allocation size is 4096 in this case.
+ * This is bpf_dynptr and bpf_kptr use case.
+ */
+int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size)
+{
+	static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
+	struct bpf_mem_caches *cc, __percpu *pcc;
+	struct bpf_mem_cache *c, __percpu *pc;
+	struct kmem_cache *kmem_cache;
+	struct obj_cgroup *objcg = NULL;
+	char buf[32];
+	int cpu, i;
+
+	if (size) {
+		pc = __alloc_percpu_gfp(sizeof(*pc), 8, GFP_KERNEL);
+		if (!pc)
+			return -ENOMEM;
+		size += LLIST_NODE_SZ; /* room for llist_node */
+		snprintf(buf, sizeof(buf), "bpf-%u", size);
+		kmem_cache = kmem_cache_create(buf, size, 8, 0, NULL);
+		if (!kmem_cache) {
+			free_percpu(pc);
+			return -ENOMEM;
+		}
+#ifdef CONFIG_MEMCG_KMEM
+		objcg = get_obj_cgroup_from_current();
+#endif
+		for_each_possible_cpu(cpu) {
+			c = per_cpu_ptr(pc, cpu);
+			c->kmem_cache = kmem_cache;
+			c->unit_size = size;
+			c->objcg = objcg;
+			prefill_mem_cache(c, cpu);
+		}
+		ma->cache = pc;
+		return 0;
+	}
+
+	pcc = __alloc_percpu_gfp(sizeof(*cc), 8, GFP_KERNEL);
+	if (!pcc)
+		return -ENOMEM;
+#ifdef CONFIG_MEMCG_KMEM
+	objcg = get_obj_cgroup_from_current();
+#endif
+	for_each_possible_cpu(cpu) {
+		cc = per_cpu_ptr(pcc, cpu);
+		for (i = 0; i < NUM_CACHES; i++) {
+			c = &cc->cache[i];
+			c->unit_size = sizes[i];
+			c->objcg = objcg;
+			prefill_mem_cache(c, cpu);
+		}
+	}
+	ma->caches = pcc;
+	return 0;
+}
+
+static void drain_mem_cache(struct bpf_mem_cache *c)
+{
+	struct llist_node *llnode;
+
+	while ((llnode = llist_del_first(&c->free_llist_nmi)))
+		free_one(c, llnode);
+	while ((llnode = __llist_del_first(&c->free_llist)))
+		free_one(c, llnode);
+}
+
+void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
+{
+	struct bpf_mem_caches *cc;
+	struct bpf_mem_cache *c;
+	int cpu, i;
+
+	if (ma->cache) {
+		for_each_possible_cpu(cpu) {
+			c = per_cpu_ptr(ma->cache, cpu);
+			drain_mem_cache(c);
+		}
+		/* kmem_cache and memcg are the same across cpus */
+		kmem_cache_destroy(c->kmem_cache);
+		if (c->objcg)
+			obj_cgroup_put(c->objcg);
+		free_percpu(ma->cache);
+		ma->cache = NULL;
+	}
+	if (ma->caches) {
+		for_each_possible_cpu(cpu) {
+			cc = per_cpu_ptr(ma->caches, cpu);
+			for (i = 0; i < NUM_CACHES; i++) {
+				c = &cc->cache[i];
+				drain_mem_cache(c);
+			}
+		}
+		if (c->objcg)
+			obj_cgroup_put(c->objcg);
+		free_percpu(ma->caches);
+		ma->caches = NULL;
+	}
+}
+
+/* notrace is necessary here and in other functions to make sure
+ * bpf programs cannot attach to them and cause llist corruptions.
+ */
+static void notrace *unit_alloc(struct bpf_mem_cache *c)
+{
+	bool in_nmi = bpf_in_nmi();
+	struct llist_node *llnode;
+	unsigned long flags;
+	int cnt = 0;
+
+	if (unlikely(in_nmi)) {
+		llnode = llist_del_first(&c->free_llist_nmi);
+		if (llnode)
+			cnt = atomic_dec_return(&c->free_cnt_nmi);
+	} else {
+		/* Disable irqs to prevent the following race:
+		 * bpf_prog_A
+		 *   bpf_mem_alloc
+		 *      preemption or irq -> bpf_prog_B
+		 *        bpf_mem_alloc
+		 */
+		local_irq_save(flags);
+		llnode = __llist_del_first(&c->free_llist);
+		if (llnode)
+			cnt = --c->free_cnt;
+		local_irq_restore(flags);
+	}
+	WARN_ON(cnt < 0);
+
+	if (cnt < LOW_WATERMARK)
+		irq_work_raise(c, in_nmi);
+	return llnode;
+}
+
+/* Though 'ptr' object could have been allocated on a different cpu
+ * add it to the free_llist of the current cpu.
+ * Let kfree() logic deal with it when it's later called from irq_work.
+ */
+static void notrace unit_free(struct bpf_mem_cache *c, void *ptr)
+{
+	struct llist_node *llnode = ptr - LLIST_NODE_SZ;
+	bool in_nmi = bpf_in_nmi();
+	unsigned long flags;
+	int cnt;
+
+	BUILD_BUG_ON(LLIST_NODE_SZ > 8);
+
+	if (unlikely(in_nmi)) {
+		llist_add(llnode, &c->free_llist_nmi);
+		cnt = atomic_inc_return(&c->free_cnt_nmi);
+	} else {
+		local_irq_save(flags);
+		__llist_add(llnode, &c->free_llist);
+		cnt = ++c->free_cnt;
+		local_irq_restore(flags);
+	}
+	WARN_ON(cnt <= 0);
+
+	if (cnt > HIGH_WATERMARK)
+		/* free few objects from current cpu into global kmalloc pool */
+		irq_work_raise(c, in_nmi);
+}
+
+/* Called from BPF program or from sys_bpf syscall.
+ * In both cases migration is disabled.
+ */
+void notrace *bpf_mem_alloc(struct bpf_mem_alloc *ma, size_t size)
+{
+	int idx;
+	void *ret;
+
+	if (!size)
+		return ZERO_SIZE_PTR;
+
+	idx = bpf_mem_cache_idx(size + LLIST_NODE_SZ);
+	if (idx < 0)
+		return NULL;
+
+	ret = unit_alloc(this_cpu_ptr(ma->caches)->cache + idx);
+	return !ret ? NULL : ret + LLIST_NODE_SZ;
+}
+
+void notrace bpf_mem_free(struct bpf_mem_alloc *ma, void *ptr)
+{
+	int idx;
+
+	if (!ptr)
+		return;
+
+	idx = bpf_mem_cache_idx(__ksize(ptr - LLIST_NODE_SZ));
+	if (idx < 0)
+		return;
+
+	unit_free(this_cpu_ptr(ma->caches)->cache + idx, ptr);
+}
+
+void notrace *bpf_mem_cache_alloc(struct bpf_mem_alloc *ma)
+{
+	void *ret;
+
+	ret = unit_alloc(this_cpu_ptr(ma->cache));
+	return !ret ? NULL : ret + LLIST_NODE_SZ;
+}
+
+void notrace bpf_mem_cache_free(struct bpf_mem_alloc *ma, void *ptr)
+{
+	if (!ptr)
+		return;
+
+	unit_free(this_cpu_ptr(ma->cache), ptr);
+}
-- 
2.30.2



* [PATCH v2 bpf-next 02/12] bpf: Convert hash map to bpf_mem_alloc.
  2022-08-17 21:04 [PATCH v2 bpf-next 00/12] bpf: BPF specific memory allocator Alexei Starovoitov
  2022-08-17 21:04 ` [PATCH v2 bpf-next 01/12] bpf: Introduce any context " Alexei Starovoitov
@ 2022-08-17 21:04 ` Alexei Starovoitov
  2022-08-17 21:04 ` [PATCH v2 bpf-next 03/12] selftests/bpf: Improve test coverage of test_maps Alexei Starovoitov
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Alexei Starovoitov @ 2022-08-17 21:04 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Convert bpf hash map to use bpf memory allocator.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/hashtab.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 8392f7f8a8ac..6c0db430507a 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -14,6 +14,7 @@
 #include "percpu_freelist.h"
 #include "bpf_lru_list.h"
 #include "map_in_map.h"
+#include <linux/bpf_mem_alloc.h>
 
 #define HTAB_CREATE_FLAG_MASK						\
 	(BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE |	\
@@ -92,6 +93,7 @@ struct bucket {
 
 struct bpf_htab {
 	struct bpf_map map;
+	struct bpf_mem_alloc ma;
 	struct bucket *buckets;
 	void *elems;
 	union {
@@ -567,6 +569,10 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 			if (err)
 				goto free_prealloc;
 		}
+	} else {
+		err = bpf_mem_alloc_init(&htab->ma, htab->elem_size);
+		if (err)
+			goto free_map_locked;
 	}
 
 	return &htab->map;
@@ -577,6 +583,7 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 	for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
 		free_percpu(htab->map_locked[i]);
 	bpf_map_area_free(htab->buckets);
+	bpf_mem_alloc_destroy(&htab->ma);
 free_htab:
 	lockdep_unregister_key(&htab->lockdep_key);
 	bpf_map_area_free(htab);
@@ -853,7 +860,7 @@ static void htab_elem_free(struct bpf_htab *htab, struct htab_elem *l)
 	if (htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH)
 		free_percpu(htab_elem_get_ptr(l, htab->map.key_size));
 	check_and_free_fields(htab, l);
-	kfree(l);
+	bpf_mem_cache_free(&htab->ma, l);
 }
 
 static void htab_elem_free_rcu(struct rcu_head *head)
@@ -977,9 +984,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 				l_new = ERR_PTR(-E2BIG);
 				goto dec_count;
 			}
-		l_new = bpf_map_kmalloc_node(&htab->map, htab->elem_size,
-					     GFP_NOWAIT | __GFP_NOWARN,
-					     htab->map.numa_node);
+		l_new = bpf_mem_cache_alloc(&htab->ma);
 		if (!l_new) {
 			l_new = ERR_PTR(-ENOMEM);
 			goto dec_count;
@@ -998,7 +1003,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 			pptr = bpf_map_alloc_percpu(&htab->map, size, 8,
 						    GFP_NOWAIT | __GFP_NOWARN);
 			if (!pptr) {
-				kfree(l_new);
+				bpf_mem_cache_free(&htab->ma, l_new);
 				l_new = ERR_PTR(-ENOMEM);
 				goto dec_count;
 			}
@@ -1493,6 +1498,7 @@ static void htab_map_free(struct bpf_map *map)
 	bpf_map_free_kptr_off_tab(map);
 	free_percpu(htab->extra_elems);
 	bpf_map_area_free(htab->buckets);
+	bpf_mem_alloc_destroy(&htab->ma);
 	for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
 		free_percpu(htab->map_locked[i]);
 	lockdep_unregister_key(&htab->lockdep_key);
-- 
2.30.2



* [PATCH v2 bpf-next 03/12] selftests/bpf: Improve test coverage of test_maps
  2022-08-17 21:04 [PATCH v2 bpf-next 00/12] bpf: BPF specific memory allocator Alexei Starovoitov
  2022-08-17 21:04 ` [PATCH v2 bpf-next 01/12] bpf: Introduce any context " Alexei Starovoitov
  2022-08-17 21:04 ` [PATCH v2 bpf-next 02/12] bpf: Convert hash map to bpf_mem_alloc Alexei Starovoitov
@ 2022-08-17 21:04 ` Alexei Starovoitov
  2022-08-17 21:04 ` [PATCH v2 bpf-next 04/12] samples/bpf: Reduce syscall overhead in map_perf_test Alexei Starovoitov
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Alexei Starovoitov @ 2022-08-17 21:04 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Make test_maps more stressful with more parallelism in
update/delete/lookup/walk, including different value sizes.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/testing/selftests/bpf/test_maps.c | 38 ++++++++++++++++---------
 1 file changed, 24 insertions(+), 14 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index cbebfaa7c1e8..d1ffc76814d9 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -264,10 +264,11 @@ static void test_hashmap_percpu(unsigned int task, void *data)
 	close(fd);
 }
 
+#define VALUE_SIZE 3
 static int helper_fill_hashmap(int max_entries)
 {
 	int i, fd, ret;
-	long long key, value;
+	long long key, value[VALUE_SIZE] = {};
 
 	fd = bpf_map_create(BPF_MAP_TYPE_HASH, NULL, sizeof(key), sizeof(value),
 			    max_entries, &map_opts);
@@ -276,8 +277,8 @@ static int helper_fill_hashmap(int max_entries)
 	      "err: %s, flags: 0x%x\n", strerror(errno), map_opts.map_flags);
 
 	for (i = 0; i < max_entries; i++) {
-		key = i; value = key;
-		ret = bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST);
+		key = i; value[0] = key;
+		ret = bpf_map_update_elem(fd, &key, value, BPF_NOEXIST);
 		CHECK(ret != 0,
 		      "can't update hashmap",
 		      "err: %s\n", strerror(ret));
@@ -288,8 +289,8 @@ static int helper_fill_hashmap(int max_entries)
 
 static void test_hashmap_walk(unsigned int task, void *data)
 {
-	int fd, i, max_entries = 1000;
-	long long key, value, next_key;
+	int fd, i, max_entries = 10000;
+	long long key, value[VALUE_SIZE], next_key;
 	bool next_key_valid = true;
 
 	fd = helper_fill_hashmap(max_entries);
@@ -297,7 +298,7 @@ static void test_hashmap_walk(unsigned int task, void *data)
 	for (i = 0; bpf_map_get_next_key(fd, !i ? NULL : &key,
 					 &next_key) == 0; i++) {
 		key = next_key;
-		assert(bpf_map_lookup_elem(fd, &key, &value) == 0);
+		assert(bpf_map_lookup_elem(fd, &key, value) == 0);
 	}
 
 	assert(i == max_entries);
@@ -305,9 +306,9 @@ static void test_hashmap_walk(unsigned int task, void *data)
 	assert(bpf_map_get_next_key(fd, NULL, &key) == 0);
 	for (i = 0; next_key_valid; i++) {
 		next_key_valid = bpf_map_get_next_key(fd, &key, &next_key) == 0;
-		assert(bpf_map_lookup_elem(fd, &key, &value) == 0);
-		value++;
-		assert(bpf_map_update_elem(fd, &key, &value, BPF_EXIST) == 0);
+		assert(bpf_map_lookup_elem(fd, &key, value) == 0);
+		value[0]++;
+		assert(bpf_map_update_elem(fd, &key, value, BPF_EXIST) == 0);
 		key = next_key;
 	}
 
@@ -316,8 +317,8 @@ static void test_hashmap_walk(unsigned int task, void *data)
 	for (i = 0; bpf_map_get_next_key(fd, !i ? NULL : &key,
 					 &next_key) == 0; i++) {
 		key = next_key;
-		assert(bpf_map_lookup_elem(fd, &key, &value) == 0);
-		assert(value - 1 == key);
+		assert(bpf_map_lookup_elem(fd, &key, value) == 0);
+		assert(value[0] - 1 == key);
 	}
 
 	assert(i == max_entries);
@@ -1371,16 +1372,16 @@ static void __run_parallel(unsigned int tasks,
 
 static void test_map_stress(void)
 {
+	run_parallel(100, test_hashmap_walk, NULL);
 	run_parallel(100, test_hashmap, NULL);
 	run_parallel(100, test_hashmap_percpu, NULL);
 	run_parallel(100, test_hashmap_sizes, NULL);
-	run_parallel(100, test_hashmap_walk, NULL);
 
 	run_parallel(100, test_arraymap, NULL);
 	run_parallel(100, test_arraymap_percpu, NULL);
 }
 
-#define TASKS 1024
+#define TASKS 100
 
 #define DO_UPDATE 1
 #define DO_DELETE 0
@@ -1432,6 +1433,8 @@ static void test_update_delete(unsigned int fn, void *data)
 	int fd = ((int *)data)[0];
 	int i, key, value, err;
 
+	if (fn & 1)
+		test_hashmap_walk(fn, NULL);
 	for (i = fn; i < MAP_SIZE; i += TASKS) {
 		key = value = i;
 
@@ -1455,7 +1458,7 @@ static void test_update_delete(unsigned int fn, void *data)
 
 static void test_map_parallel(void)
 {
-	int i, fd, key = 0, value = 0;
+	int i, fd, key = 0, value = 0, j = 0;
 	int data[2];
 
 	fd = bpf_map_create(BPF_MAP_TYPE_HASH, NULL, sizeof(key), sizeof(value),
@@ -1466,6 +1469,7 @@ static void test_map_parallel(void)
 		exit(1);
 	}
 
+again:
 	/* Use the same fd in children to add elements to this map:
 	 * child_0 adds key=0, key=1024, key=2048, ...
 	 * child_1 adds key=1, key=1025, key=2049, ...
@@ -1502,6 +1506,12 @@ static void test_map_parallel(void)
 	key = -1;
 	assert(bpf_map_get_next_key(fd, NULL, &key) < 0 && errno == ENOENT);
 	assert(bpf_map_get_next_key(fd, &key, &key) < 0 && errno == ENOENT);
+
+	key = 0;
+	bpf_map_delete_elem(fd, &key);
+	if (j++ < 5)
+		goto again;
+	close(fd);
 }
 
 static void test_map_rdonly(void)
-- 
2.30.2



* [PATCH v2 bpf-next 04/12] samples/bpf: Reduce syscall overhead in map_perf_test.
  2022-08-17 21:04 [PATCH v2 bpf-next 00/12] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (2 preceding siblings ...)
  2022-08-17 21:04 ` [PATCH v2 bpf-next 03/12] selftests/bpf: Improve test coverage of test_maps Alexei Starovoitov
@ 2022-08-17 21:04 ` Alexei Starovoitov
  2022-08-17 21:04 ` [PATCH v2 bpf-next 05/12] bpf: Relax the requirement to use preallocated hash maps in tracing progs Alexei Starovoitov
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Alexei Starovoitov @ 2022-08-17 21:04 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Make map_perf_test for preallocated and non-preallocated hash maps
spend more time inside the bpf program, to focus performance analysis
on the speed of update/lookup/delete operations performed by the bpf program.

It makes 'perf report' of bpf_mem_alloc look like:
 11.76%  map_perf_test    [k] _raw_spin_lock_irqsave
 11.26%  map_perf_test    [k] htab_map_update_elem
  9.70%  map_perf_test    [k] _raw_spin_lock
  9.47%  map_perf_test    [k] htab_map_delete_elem
  8.57%  map_perf_test    [k] memcpy_erms
  5.58%  map_perf_test    [k] alloc_htab_elem
  4.09%  map_perf_test    [k] __htab_map_lookup_elem
  3.44%  map_perf_test    [k] syscall_exit_to_user_mode
  3.13%  map_perf_test    [k] lookup_nulls_elem_raw
  3.05%  map_perf_test    [k] migrate_enable
  3.04%  map_perf_test    [k] memcmp
  2.67%  map_perf_test    [k] unit_free
  2.39%  map_perf_test    [k] lookup_elem_raw

Reduce default iteration count as well to make 'map_perf_test' quick enough
even on debug kernels.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 samples/bpf/map_perf_test_kern.c | 44 ++++++++++++++++++++------------
 samples/bpf/map_perf_test_user.c |  2 +-
 2 files changed, 29 insertions(+), 17 deletions(-)

diff --git a/samples/bpf/map_perf_test_kern.c b/samples/bpf/map_perf_test_kern.c
index 8773f22b6a98..7342c5b2f278 100644
--- a/samples/bpf/map_perf_test_kern.c
+++ b/samples/bpf/map_perf_test_kern.c
@@ -108,11 +108,14 @@ int stress_hmap(struct pt_regs *ctx)
 	u32 key = bpf_get_current_pid_tgid();
 	long init_val = 1;
 	long *value;
+	int i;
 
-	bpf_map_update_elem(&hash_map, &key, &init_val, BPF_ANY);
-	value = bpf_map_lookup_elem(&hash_map, &key);
-	if (value)
-		bpf_map_delete_elem(&hash_map, &key);
+	for (i = 0; i < 10; i++) {
+		bpf_map_update_elem(&hash_map, &key, &init_val, BPF_ANY);
+		value = bpf_map_lookup_elem(&hash_map, &key);
+		if (value)
+			bpf_map_delete_elem(&hash_map, &key);
+	}
 
 	return 0;
 }
@@ -123,11 +126,14 @@ int stress_percpu_hmap(struct pt_regs *ctx)
 	u32 key = bpf_get_current_pid_tgid();
 	long init_val = 1;
 	long *value;
+	int i;
 
-	bpf_map_update_elem(&percpu_hash_map, &key, &init_val, BPF_ANY);
-	value = bpf_map_lookup_elem(&percpu_hash_map, &key);
-	if (value)
-		bpf_map_delete_elem(&percpu_hash_map, &key);
+	for (i = 0; i < 10; i++) {
+		bpf_map_update_elem(&percpu_hash_map, &key, &init_val, BPF_ANY);
+		value = bpf_map_lookup_elem(&percpu_hash_map, &key);
+		if (value)
+			bpf_map_delete_elem(&percpu_hash_map, &key);
+	}
 	return 0;
 }
 
@@ -137,11 +143,14 @@ int stress_hmap_alloc(struct pt_regs *ctx)
 	u32 key = bpf_get_current_pid_tgid();
 	long init_val = 1;
 	long *value;
+	int i;
 
-	bpf_map_update_elem(&hash_map_alloc, &key, &init_val, BPF_ANY);
-	value = bpf_map_lookup_elem(&hash_map_alloc, &key);
-	if (value)
-		bpf_map_delete_elem(&hash_map_alloc, &key);
+	for (i = 0; i < 10; i++) {
+		bpf_map_update_elem(&hash_map_alloc, &key, &init_val, BPF_ANY);
+		value = bpf_map_lookup_elem(&hash_map_alloc, &key);
+		if (value)
+			bpf_map_delete_elem(&hash_map_alloc, &key);
+	}
 	return 0;
 }
 
@@ -151,11 +160,14 @@ int stress_percpu_hmap_alloc(struct pt_regs *ctx)
 	u32 key = bpf_get_current_pid_tgid();
 	long init_val = 1;
 	long *value;
+	int i;
 
-	bpf_map_update_elem(&percpu_hash_map_alloc, &key, &init_val, BPF_ANY);
-	value = bpf_map_lookup_elem(&percpu_hash_map_alloc, &key);
-	if (value)
-		bpf_map_delete_elem(&percpu_hash_map_alloc, &key);
+	for (i = 0; i < 10; i++) {
+		bpf_map_update_elem(&percpu_hash_map_alloc, &key, &init_val, BPF_ANY);
+		value = bpf_map_lookup_elem(&percpu_hash_map_alloc, &key);
+		if (value)
+			bpf_map_delete_elem(&percpu_hash_map_alloc, &key);
+	}
 	return 0;
 }
 
diff --git a/samples/bpf/map_perf_test_user.c b/samples/bpf/map_perf_test_user.c
index b6fc174ab1f2..1bb53f4b29e1 100644
--- a/samples/bpf/map_perf_test_user.c
+++ b/samples/bpf/map_perf_test_user.c
@@ -72,7 +72,7 @@ static int test_flags = ~0;
 static uint32_t num_map_entries;
 static uint32_t inner_lru_hash_size;
 static int lru_hash_lookup_test_entries = 32;
-static uint32_t max_cnt = 1000000;
+static uint32_t max_cnt = 10000;
 
 static int check_test_flags(enum test_type t)
 {
-- 
2.30.2



* [PATCH v2 bpf-next 05/12] bpf: Relax the requirement to use preallocated hash maps in tracing progs.
  2022-08-17 21:04 [PATCH v2 bpf-next 00/12] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (3 preceding siblings ...)
  2022-08-17 21:04 ` [PATCH v2 bpf-next 04/12] samples/bpf: Reduce syscall overhead in map_perf_test Alexei Starovoitov
@ 2022-08-17 21:04 ` Alexei Starovoitov
  2022-08-17 21:04 ` [PATCH v2 bpf-next 06/12] bpf: Optimize element count in non-preallocated hash map Alexei Starovoitov
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Alexei Starovoitov @ 2022-08-17 21:04 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Since the bpf hash map was converted to use bpf_mem_alloc, it is safe to use
from tracing programs and in RT kernels.
But the per-cpu hash map is still using dynamic allocation for per-cpu map
values, hence keep the warning for this map type.
In the future alloc_percpu_gfp can be front-ended with bpf_mem_cache
and this restriction will be completely lifted.

perf_event (NMI) bpf programs have to use preallocated hash maps,
because free_htab_elem() is using call_rcu, which might crash if re-entered.

Sleepable bpf programs have to use preallocated hash maps, because the
lifetime of the map elements is not protected by rcu_read_lock/unlock.
This restriction can be lifted in the future as well.
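
To illustrate what this relaxation permits, here is a minimal BPF-side sketch
of a tracing program using a non-preallocated (BPF_F_NO_PREALLOC) hash map.
The map sizes and the kprobe attach point are chosen for illustration only;
this is not part of the patch.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(map_flags, BPF_F_NO_PREALLOC);
        __uint(max_entries, 1000);
        __type(key, __u32);
        __type(value, __u64);
} counts SEC(".maps");

SEC("kprobe/do_nanosleep")
int count_sleeps(void *ctx)
{
        __u32 key = bpf_get_smp_processor_id();
        __u64 one = 1, *val;

        val = bpf_map_lookup_elem(&counts, &key);
        if (val)
                __sync_fetch_and_add(val, 1);
        else
                /* the element is allocated via bpf_mem_alloc, not plain kmalloc */
                bpf_map_update_elem(&counts, &key, &one, BPF_NOEXIST);
        return 0;
}

char _license[] SEC("license") = "GPL";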

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/verifier.c | 31 ++++++++++++++++++++++---------
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 2c1f8069f7b7..d785f29047d7 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -12605,10 +12605,12 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
 	 * For programs attached to PERF events this is mandatory as the
 	 * perf NMI can hit any arbitrary code sequence.
 	 *
-	 * All other trace types using preallocated hash maps are unsafe as
-	 * well because tracepoint or kprobes can be inside locked regions
-	 * of the memory allocator or at a place where a recursion into the
-	 * memory allocator would see inconsistent state.
+	 * All other trace types using non-preallocated per-cpu hash maps are
+	 * unsafe as well because tracepoint or kprobes can be inside locked
+	 * regions of the per-cpu memory allocator or at a place where a
+	 * recursion into the per-cpu memory allocator would see inconsistent
+	 * state. Non per-cpu hash maps are using bpf_mem_alloc-tor which is
+	 * safe to use from kprobe/fentry and in RT.
 	 *
 	 * On RT enabled kernels run-time allocation of all trace type
 	 * programs is strictly prohibited due to lock type constraints. On
@@ -12618,15 +12620,26 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
 	 */
 	if (is_tracing_prog_type(prog_type) && !is_preallocated_map(map)) {
 		if (prog_type == BPF_PROG_TYPE_PERF_EVENT) {
+			/* perf_event bpf progs have to use preallocated hash maps
+			 * because non-prealloc is still relying on call_rcu to free
+			 * elements.
+			 */
 			verbose(env, "perf_event programs can only use preallocated hash map\n");
 			return -EINVAL;
 		}
-		if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
-			verbose(env, "trace type programs can only use preallocated hash map\n");
-			return -EINVAL;
+		if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
+		    (map->inner_map_meta &&
+		     map->inner_map_meta->map_type == BPF_MAP_TYPE_PERCPU_HASH)) {
+			if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
+				verbose(env,
+					"trace type programs can only use preallocated per-cpu hash map\n");
+				return -EINVAL;
+			}
+			WARN_ONCE(1, "trace type BPF program uses run-time allocation\n");
+			verbose(env,
+				"trace type programs with run-time allocated per-cpu hash maps are unsafe."
+				" Switch to preallocated hash maps.\n");
 		}
-		WARN_ONCE(1, "trace type BPF program uses run-time allocation\n");
-		verbose(env, "trace type programs with run-time allocated hash maps are unsafe. Switch to preallocated hash maps.\n");
 	}
 
 	if (map_value_has_spin_lock(map)) {
-- 
2.30.2



* [PATCH v2 bpf-next 06/12] bpf: Optimize element count in non-preallocated hash map.
  2022-08-17 21:04 [PATCH v2 bpf-next 00/12] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (4 preceding siblings ...)
  2022-08-17 21:04 ` [PATCH v2 bpf-next 05/12] bpf: Relax the requirement to use preallocated hash maps in tracing progs Alexei Starovoitov
@ 2022-08-17 21:04 ` Alexei Starovoitov
  2022-08-17 21:04 ` [PATCH v2 bpf-next 07/12] bpf: Optimize call_rcu " Alexei Starovoitov
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Alexei Starovoitov @ 2022-08-17 21:04 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

The atomic_inc/dec might cause extreme cache line bouncing when multiple cpus
access the same bpf map. Based on the specified max_entries for the hash map,
calculate when percpu_counter becomes faster than atomic_t and use it for such
maps. For example, samples/bpf/map_perf_test uses a hash map with max_entries
1000. On a system with 16 cpus the 'map_perf_test 4' shows 14k events per
second using atomic_t. On a system with 15 cpus it shows 100k events per second
using percpu_counter. map_perf_test is an extreme case where all cpus collide on
the same atomic_t, which causes extreme cache bouncing. Note that the slow path of
percpu_counter is 5k events per second vs 14k for atomic_t, so the heuristic is
necessary. See the comment in the code for why the heuristic is based on
num_online_cpus().
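
A small stand-alone illustration of the heuristic with the numbers quoted
above; this is only a user-space model of the check done in htab_map_alloc()
below, not kernel code.

#include <stdbool.h>
#include <stdio.h>

#define PERCPU_COUNTER_BATCH 32

static bool use_percpu_counter(unsigned int max_entries, unsigned int online_cpus)
{
        return max_entries / 2 > online_cpus * PERCPU_COUNTER_BATCH;
}

int main(void)
{
        /* map_perf_test uses max_entries = 1000:
         * 16 cpus: 500 > 512 is false -> atomic_t       (~14k events/sec)
         * 15 cpus: 500 > 480 is true  -> percpu_counter (~100k events/sec)
         */
        printf("16 cpus -> percpu_counter: %d\n", use_percpu_counter(1000, 16));
        printf("15 cpus -> percpu_counter: %d\n", use_percpu_counter(1000, 15));
        return 0;
}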

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/hashtab.c | 70 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 62 insertions(+), 8 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 6c0db430507a..65ebe5a719f5 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -101,7 +101,12 @@ struct bpf_htab {
 		struct bpf_lru lru;
 	};
 	struct htab_elem *__percpu *extra_elems;
-	atomic_t count;	/* number of elements in this hashtable */
+	/* number of elements in non-preallocated hashtable are kept
+	 * in either pcount or count
+	 */
+	struct percpu_counter pcount;
+	atomic_t count;
+	bool use_percpu_counter;
 	u32 n_buckets;	/* number of hash buckets */
 	u32 elem_size;	/* size of each element in bytes */
 	u32 hashrnd;
@@ -556,6 +561,29 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 
 	htab_init_buckets(htab);
 
+/* compute_batch_value() computes batch value as num_online_cpus() * 2
+ * and __percpu_counter_compare() needs
+ * htab->max_entries - cur_number_of_elems to be more than batch * num_online_cpus()
+ * for percpu_counter to be faster than atomic_t. In practice the average bpf
+ * hash map size is 10k, which means that a system with 64 cpus will fill
+ * hashmap to 20% of 10k before percpu_counter becomes ineffective. Therefore
+ * define our own batch count as 32 then 10k hash map can be filled up to 80%:
+ * 10k - 8k > 32 _batch_ * 64 _cpus_
+ * and __percpu_counter_compare() will still be fast. At that point hash map
+ * collisions will dominate its performance anyway. Assume that hash map filled
+ * to 50+% isn't going to be O(1) and use the following formula to choose
+ * between percpu_counter and atomic_t.
+ */
+#define PERCPU_COUNTER_BATCH 32
+	if (attr->max_entries / 2 > num_online_cpus() * PERCPU_COUNTER_BATCH)
+		htab->use_percpu_counter = true;
+
+	if (htab->use_percpu_counter) {
+		err = percpu_counter_init(&htab->pcount, 0, GFP_KERNEL);
+		if (err)
+			goto free_map_locked;
+	}
+
 	if (prealloc) {
 		err = prealloc_init(htab);
 		if (err)
@@ -882,6 +910,31 @@ static void htab_put_fd_value(struct bpf_htab *htab, struct htab_elem *l)
 	}
 }
 
+static bool is_map_full(struct bpf_htab *htab)
+{
+	if (htab->use_percpu_counter)
+		return __percpu_counter_compare(&htab->pcount, htab->map.max_entries,
+						PERCPU_COUNTER_BATCH) >= 0;
+	return atomic_read(&htab->count) >= htab->map.max_entries;
+}
+
+static void inc_elem_count(struct bpf_htab *htab)
+{
+	if (htab->use_percpu_counter)
+		percpu_counter_add_batch(&htab->pcount, 1, PERCPU_COUNTER_BATCH);
+	else
+		atomic_inc(&htab->count);
+}
+
+static void dec_elem_count(struct bpf_htab *htab)
+{
+	if (htab->use_percpu_counter)
+		percpu_counter_add_batch(&htab->pcount, -1, PERCPU_COUNTER_BATCH);
+	else
+		atomic_dec(&htab->count);
+}
+
+
 static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
 {
 	htab_put_fd_value(htab, l);
@@ -890,7 +943,7 @@ static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
 		check_and_free_fields(htab, l);
 		__pcpu_freelist_push(&htab->freelist, &l->fnode);
 	} else {
-		atomic_dec(&htab->count);
+		dec_elem_count(htab);
 		l->htab = htab;
 		call_rcu(&l->rcu, htab_elem_free_rcu);
 	}
@@ -974,16 +1027,15 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 			l_new = container_of(l, struct htab_elem, fnode);
 		}
 	} else {
-		if (atomic_inc_return(&htab->count) > htab->map.max_entries)
-			if (!old_elem) {
+		if (is_map_full(htab))
+			if (!old_elem)
 				/* when map is full and update() is replacing
 				 * old element, it's ok to allocate, since
 				 * old element will be freed immediately.
 				 * Otherwise return an error
 				 */
-				l_new = ERR_PTR(-E2BIG);
-				goto dec_count;
-			}
+				return ERR_PTR(-E2BIG);
+		inc_elem_count(htab);
 		l_new = bpf_mem_cache_alloc(&htab->ma);
 		if (!l_new) {
 			l_new = ERR_PTR(-ENOMEM);
@@ -1025,7 +1077,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 	l_new->hash = hash;
 	return l_new;
 dec_count:
-	atomic_dec(&htab->count);
+	dec_elem_count(htab);
 	return l_new;
 }
 
@@ -1499,6 +1551,8 @@ static void htab_map_free(struct bpf_map *map)
 	free_percpu(htab->extra_elems);
 	bpf_map_area_free(htab->buckets);
 	bpf_mem_alloc_destroy(&htab->ma);
+	if (htab->use_percpu_counter)
+		percpu_counter_destroy(&htab->pcount);
 	for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
 		free_percpu(htab->map_locked[i]);
 	lockdep_unregister_key(&htab->lockdep_key);
-- 
2.30.2



* [PATCH v2 bpf-next 07/12] bpf: Optimize call_rcu in non-preallocated hash map.
  2022-08-17 21:04 [PATCH v2 bpf-next 00/12] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (5 preceding siblings ...)
  2022-08-17 21:04 ` [PATCH v2 bpf-next 06/12] bpf: Optimize element count in non-preallocated hash map Alexei Starovoitov
@ 2022-08-17 21:04 ` Alexei Starovoitov
  2022-08-17 21:04 ` [PATCH v2 bpf-next 08/12] bpf: Adjust low/high watermarks in bpf_mem_cache Alexei Starovoitov
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Alexei Starovoitov @ 2022-08-17 21:04 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Doing call_rcu() a million times a second becomes a bottleneck.
Convert the non-preallocated hash map from call_rcu to SLAB_TYPESAFE_BY_RCU.
An rcu critical section is no longer observed for each individual htab element,
which makes the non-preallocated hash map behave just like the preallocated one.
The map elements are released back to kernel memory after an rcu
grace period is observed.
This improves 'map_perf_test 4' performance from 100k events per second
to 250k events per second.

bpf_mem_alloc + percpu_counter + typesafe_by_rcu provide a 10x performance
boost to the non-preallocated hash map and bring it within a few % of the
preallocated map while consuming a fraction of the memory.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/hashtab.c                      |  8 ++++++--
 kernel/bpf/memalloc.c                     |  2 +-
 tools/testing/selftests/bpf/progs/timer.c | 11 -----------
 3 files changed, 7 insertions(+), 14 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 65ebe5a719f5..3c1d15fd052a 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -944,8 +944,12 @@ static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
 		__pcpu_freelist_push(&htab->freelist, &l->fnode);
 	} else {
 		dec_elem_count(htab);
-		l->htab = htab;
-		call_rcu(&l->rcu, htab_elem_free_rcu);
+		if (htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH) {
+			l->htab = htab;
+			call_rcu(&l->rcu, htab_elem_free_rcu);
+		} else {
+			htab_elem_free(htab, l);
+		}
 	}
 }
 
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 8de268922380..a43630371b9f 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -332,7 +332,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size)
 			return -ENOMEM;
 		size += LLIST_NODE_SZ; /* room for llist_node */
 		snprintf(buf, sizeof(buf), "bpf-%u", size);
-		kmem_cache = kmem_cache_create(buf, size, 8, 0, NULL);
+		kmem_cache = kmem_cache_create(buf, size, 8, SLAB_TYPESAFE_BY_RCU, NULL);
 		if (!kmem_cache) {
 			free_percpu(pc);
 			return -ENOMEM;
diff --git a/tools/testing/selftests/bpf/progs/timer.c b/tools/testing/selftests/bpf/progs/timer.c
index 5f5309791649..0053c5402173 100644
--- a/tools/testing/selftests/bpf/progs/timer.c
+++ b/tools/testing/selftests/bpf/progs/timer.c
@@ -208,17 +208,6 @@ static int timer_cb2(void *map, int *key, struct hmap_elem *val)
 		 */
 		bpf_map_delete_elem(map, key);
 
-		/* in non-preallocated hashmap both 'key' and 'val' are RCU
-		 * protected and still valid though this element was deleted
-		 * from the map. Arm this timer for ~35 seconds. When callback
-		 * finishes the call_rcu will invoke:
-		 * htab_elem_free_rcu
-		 *   check_and_free_timer
-		 *     bpf_timer_cancel_and_free
-		 * to cancel this 35 second sleep and delete the timer for real.
-		 */
-		if (bpf_timer_start(&val->timer, 1ull << 35, 0) != 0)
-			err |= 256;
 		ok |= 4;
 	}
 	return 0;
-- 
2.30.2



* [PATCH v2 bpf-next 08/12] bpf: Adjust low/high watermarks in bpf_mem_cache
  2022-08-17 21:04 [PATCH v2 bpf-next 00/12] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (6 preceding siblings ...)
  2022-08-17 21:04 ` [PATCH v2 bpf-next 07/12] bpf: Optimize call_rcu " Alexei Starovoitov
@ 2022-08-17 21:04 ` Alexei Starovoitov
  2022-08-17 21:04 ` [PATCH v2 bpf-next 09/12] bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU Alexei Starovoitov
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Alexei Starovoitov @ 2022-08-17 21:04 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Using the same low/high watermarks for every bucket in bpf_mem_cache consumes
a significant amount of memory. Preallocating 64 PAGE_SIZE-sized elements
on the free list is not efficient.
Make the low/high watermarks and the batching value depend on the element size.
This change brings significant memory savings.
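
Below is a stand-alone sketch that only reproduces the per-size arithmetic
used in prefill_mem_cache() in this patch, for a few example sizes; it is an
illustration, not kernel code.

#include <stdio.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

static void watermarks(int unit_size)
{
        int low, high, batch;

        if (unit_size <= 256) {
                low = 32;
                high = 96;
        } else {
                low = MAX(32 * 256 / unit_size, 1);
                high = MAX(96 * 256 / unit_size, 3);
        }
        batch = MAX((high - low) / 4 * 3, 1);
        printf("size %4d: low %2d high %2d batch %2d\n", unit_size, low, high, batch);
}

int main(void)
{
        watermarks(64);   /* low 32, high 96, batch 48: same as the old constants */
        watermarks(512);  /* low 16, high 48, batch 24 */
        watermarks(4096); /* low 2, high 6, batch 3: matches the comment in the patch */
        return 0;
}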

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/memalloc.c | 64 ++++++++++++++++++++++++++++++-------------
 1 file changed, 45 insertions(+), 19 deletions(-)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index a43630371b9f..be8262f5c9ec 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -105,6 +105,7 @@ struct bpf_mem_cache {
 	atomic_t free_cnt_nmi;
 	/* flag to refill nmi list too */
 	bool refill_nmi_list;
+	int low_watermark, high_watermark, batch;
 };
 
 struct bpf_mem_caches {
@@ -123,14 +124,6 @@ static struct llist_node notrace *__llist_del_first(struct llist_head *head)
 	return entry;
 }
 
-#define BATCH 48
-#define LOW_WATERMARK 32
-#define HIGH_WATERMARK 96
-/* Assuming the average number of elements per bucket is 64, when all buckets
- * are used the total memory will be: 64*16*32 + 64*32*32 + 64*64*32 + ... +
- * 64*4096*32 ~ 20Mbyte
- */
-
 /* extra macro useful for testing by randomizing in_nmi condition */
 #define bpf_in_nmi() in_nmi()
 
@@ -238,7 +231,7 @@ static void free_bulk(struct bpf_mem_cache *c)
 		if (IS_ENABLED(CONFIG_PREEMPT_RT))
 			local_irq_restore(flags);
 		free_one(c, llnode);
-	} while (cnt > (HIGH_WATERMARK + LOW_WATERMARK) / 2);
+	} while (cnt > (c->high_watermark + c->low_watermark) / 2);
 }
 
 static void free_bulk_nmi(struct bpf_mem_cache *c)
@@ -253,7 +246,7 @@ static void free_bulk_nmi(struct bpf_mem_cache *c)
 		else
 			cnt = 0;
 		free_one(c, llnode);
-	} while (cnt > (HIGH_WATERMARK + LOW_WATERMARK) / 2);
+	} while (cnt > (c->high_watermark + c->low_watermark) / 2);
 }
 
 static void bpf_mem_refill(struct irq_work *work)
@@ -262,12 +255,12 @@ static void bpf_mem_refill(struct irq_work *work)
 	int cnt;
 
 	cnt = c->free_cnt;
-	if (cnt < LOW_WATERMARK)
+	if (cnt < c->low_watermark)
 		/* irq_work runs on this cpu and kmalloc will allocate
 		 * from the current numa node which is what we want here.
 		 */
-		alloc_bulk(c, BATCH, NUMA_NO_NODE);
-	else if (cnt > HIGH_WATERMARK)
+		alloc_bulk(c, c->batch, NUMA_NO_NODE);
+	else if (cnt > c->high_watermark)
 		free_bulk(c);
 
 	if (!c->refill_nmi_list)
@@ -276,9 +269,9 @@ static void bpf_mem_refill(struct irq_work *work)
 		 */
 		return;
 	cnt = atomic_read(&c->free_cnt_nmi);
-	if (cnt < LOW_WATERMARK)
-		alloc_bulk_nmi(c, BATCH, NUMA_NO_NODE);
-	else if (cnt > HIGH_WATERMARK)
+	if (cnt < c->low_watermark)
+		alloc_bulk_nmi(c, c->batch, NUMA_NO_NODE);
+	else if (cnt > c->high_watermark)
 		free_bulk_nmi(c);
 	c->refill_nmi_list = false;
 }
@@ -294,14 +287,47 @@ static void notrace irq_work_raise(struct bpf_mem_cache *c, bool in_nmi)
 	irq_work_queue(&c->refill_work);
 }
 
+/* For typical bpf map case that uses bpf_mem_cache_alloc and single bucket
+ * the freelist cache will be elem_size * 64 (or less) on each cpu.
+ *
+ * For bpf programs that don't have statically known allocation sizes and
+ * assuming (low_mark + high_mark) / 2 as an average number of elements per
+ * bucket and all buckets are used the total amount of memory in freelists
+ * on each cpu will be:
+ * 64*16 + 64*32 + 64*64 + 64*96 + 64*128 + 64*196 + 64*256 + 32*512 + 16*1024 + 8*2048 + 4*4096
+ * + nmi's reserves
+ * 1*16 + 1*32 + 1*64 + 1*96 + 1*128 + 1*196 + 1*256 + 1*512 + 1*1024 + 1*2048 + 1*4096
+ * == ~ 122 Kbyte using below heuristic.
+ * In unlikely worst case where bpf progs used all allocations sizes from
+ * non-NMI and from NMI too: ~ 227 Kbyte per cpu.
+ * Initialized, but unused bpf allocator (not bpf map specific one) will
+ * consume ~ 19 Kbyte per cpu.
+ * Typical case will be between 19K and 122K closer to 19K.
+ * bpf progs can and should share bpf_mem_cache when possible.
+ */
+
 static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu)
 {
 	init_irq_work(&c->refill_work, bpf_mem_refill);
+	if (c->unit_size <= 256) {
+		c->low_watermark = 32;
+		c->high_watermark = 96;
+	} else {
+		/* When page_size == 4k, order-0 cache will have low_mark == 2
+		 * and high_mark == 6 with batch alloc of 3 individual pages at
+		 * a time.
+		 * 8k allocs and above low == 1, high == 3, batch == 1.
+		 */
+		c->low_watermark = max(32 * 256 / c->unit_size, 1);
+		c->high_watermark = max(96 * 256 / c->unit_size, 3);
+	}
+	c->batch = max((c->high_watermark - c->low_watermark) / 4 * 3, 1);
+
 	/* To avoid consuming memory assume that 1st run of bpf
 	 * prog won't be doing more than 4 map_update_elem from
 	 * irq disabled region
 	 */
-	alloc_bulk(c, c->unit_size < 256 ? 4 : 1, cpu_to_node(cpu));
+	alloc_bulk(c, c->unit_size <= 256 ? 4 : 1, cpu_to_node(cpu));
 
 	/* NMI progs are rare. Assume they have one map_update
 	 * per prog at the very beginning.
@@ -442,7 +468,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
 	}
 	WARN_ON(cnt < 0);
 
-	if (cnt < LOW_WATERMARK)
+	if (cnt < c->low_watermark)
 		irq_work_raise(c, in_nmi);
 	return llnode;
 }
@@ -471,7 +497,7 @@ static void notrace unit_free(struct bpf_mem_cache *c, void *ptr)
 	}
 	WARN_ON(cnt <= 0);
 
-	if (cnt > HIGH_WATERMARK)
+	if (cnt > c->high_watermark)
 		/* free few objects from current cpu into global kmalloc pool */
 		irq_work_raise(c, in_nmi);
 }
-- 
2.30.2



* [PATCH v2 bpf-next 09/12] bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
  2022-08-17 21:04 [PATCH v2 bpf-next 00/12] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (7 preceding siblings ...)
  2022-08-17 21:04 ` [PATCH v2 bpf-next 08/12] bpf: Adjust low/high watermarks in bpf_mem_cache Alexei Starovoitov
@ 2022-08-17 21:04 ` Alexei Starovoitov
  2022-08-17 21:04 ` [PATCH v2 bpf-next 10/12] bpf: Add percpu allocation support to bpf_mem_alloc Alexei Starovoitov
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Alexei Starovoitov @ 2022-08-17 21:04 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

SLAB_TYPESAFE_BY_RCU makes kmem_caches non-mergeable and slows down
kmem_cache_destroy. All bpf_mem_cache-s are safe to share across different maps
and programs. Convert SLAB_TYPESAFE_BY_RCU to batched call_rcu. This change
solves the memory consumption issue, avoids the kmem_cache_destroy latency and
keeps bpf hash map performance the same.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/memalloc.c | 58 ++++++++++++++++++++++++++++++++++++++++---
 kernel/bpf/syscall.c  |  5 +++-
 2 files changed, 59 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index be8262f5c9ec..ae4cdc9493c3 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -106,6 +106,11 @@ struct bpf_mem_cache {
 	/* flag to refill nmi list too */
 	bool refill_nmi_list;
 	int low_watermark, high_watermark, batch;
+
+	struct rcu_head rcu;
+	struct llist_head free_by_rcu;
+	struct llist_head waiting_for_gp;
+	atomic_t call_rcu_in_progress;
 };
 
 struct bpf_mem_caches {
@@ -214,6 +219,39 @@ static void free_one(struct bpf_mem_cache *c, void *obj)
 		kfree(obj);
 }
 
+static void __free_rcu(struct rcu_head *head)
+{
+	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
+	struct llist_node *llnode = __llist_del_all(&c->waiting_for_gp);
+	struct llist_node *pos, *t;
+
+	llist_for_each_safe(pos, t, llnode)
+		free_one(c, pos);
+	atomic_set(&c->call_rcu_in_progress, 0);
+}
+
+static void enque_to_free(struct bpf_mem_cache *c, void *obj)
+{
+	struct llist_node *llnode = obj;
+
+	/* bpf_mem_cache is a per-cpu object. Freeing happens in irq_work.
+	 * Nothing races to add to free_by_rcu list.
+	 */
+	__llist_add(llnode, &c->free_by_rcu);
+}
+
+static void do_call_rcu(struct bpf_mem_cache *c)
+{
+	struct llist_node *llnode, *t;
+
+	if (atomic_xchg(&c->call_rcu_in_progress, 1))
+		return;
+
+	llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu))
+		__llist_add(llnode, &c->waiting_for_gp);
+	call_rcu(&c->rcu, __free_rcu);
+}
+
 static void free_bulk(struct bpf_mem_cache *c)
 {
 	struct llist_node *llnode;
@@ -230,8 +268,9 @@ static void free_bulk(struct bpf_mem_cache *c)
 			cnt = 0;
 		if (IS_ENABLED(CONFIG_PREEMPT_RT))
 			local_irq_restore(flags);
-		free_one(c, llnode);
+		enque_to_free(c, llnode);
 	} while (cnt > (c->high_watermark + c->low_watermark) / 2);
+	do_call_rcu(c);
 }
 
 static void free_bulk_nmi(struct bpf_mem_cache *c)
@@ -245,8 +284,9 @@ static void free_bulk_nmi(struct bpf_mem_cache *c)
 			cnt = atomic_dec_return(&c->free_cnt_nmi);
 		else
 			cnt = 0;
-		free_one(c, llnode);
+		enque_to_free(c, llnode);
 	} while (cnt > (c->high_watermark + c->low_watermark) / 2);
+	do_call_rcu(c);
 }
 
 static void bpf_mem_refill(struct irq_work *work)
@@ -358,7 +398,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size)
 			return -ENOMEM;
 		size += LLIST_NODE_SZ; /* room for llist_node */
 		snprintf(buf, sizeof(buf), "bpf-%u", size);
-		kmem_cache = kmem_cache_create(buf, size, 8, SLAB_TYPESAFE_BY_RCU, NULL);
+		kmem_cache = kmem_cache_create(buf, size, 8, 0, NULL);
 		if (!kmem_cache) {
 			free_percpu(pc);
 			return -ENOMEM;
@@ -400,6 +440,18 @@ static void drain_mem_cache(struct bpf_mem_cache *c)
 {
 	struct llist_node *llnode;
 
+	/* The caller has done rcu_barrier() and no progs are using this
+	 * bpf_mem_cache, but htab_map_free() called bpf_mem_cache_free() for
+	 * all remaining elements and they can be in free_by_rcu or in
+	 * waiting_for_gp lists, so drain accumulating free_by_rcu list and
+	 * optionally wait for callbacks to finish.
+	 */
+	while ((llnode = __llist_del_first(&c->free_by_rcu)))
+		free_one(c, llnode);
+	if (atomic_xchg(&c->call_rcu_in_progress, 1))
+		rcu_barrier();
+	WARN_ON_ONCE(!llist_empty(&c->waiting_for_gp));
+
 	while ((llnode = llist_del_first(&c->free_llist_nmi)))
 		free_one(c, llnode);
 	while ((llnode = __llist_del_first(&c->free_llist)))
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 83c7136c5788..eeef64b27683 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -638,7 +638,10 @@ static void __bpf_map_put(struct bpf_map *map, bool do_idr_lock)
 		bpf_map_free_id(map, do_idr_lock);
 		btf_put(map->btf);
 		INIT_WORK(&map->work, bpf_map_free_deferred);
-		schedule_work(&map->work);
+		/* Avoid spawning kworkers, since they all might contend
+		 * for the same mutex like slab_mutex.
+		 */
+		queue_work(system_unbound_wq, &map->work);
 	}
 }
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 bpf-next 10/12] bpf: Add percpu allocation support to bpf_mem_alloc.
  2022-08-17 21:04 [PATCH v2 bpf-next 00/12] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (8 preceding siblings ...)
  2022-08-17 21:04 ` [PATCH v2 bpf-next 09/12] bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU Alexei Starovoitov
@ 2022-08-17 21:04 ` Alexei Starovoitov
  2022-08-17 21:04 ` [PATCH v2 bpf-next 11/12] bpf: Convert percpu hash map to per-cpu bpf_mem_alloc Alexei Starovoitov
  2022-08-17 21:04 ` [PATCH v2 bpf-next 12/12] bpf: Remove tracing program restriction on map types Alexei Starovoitov
  11 siblings, 0 replies; 19+ messages in thread
From: Alexei Starovoitov @ 2022-08-17 21:04 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Extend bpf_mem_alloc to cache a free list of fixed size per-cpu allocations.
Once such a cache is created, bpf_mem_cache_alloc() will return per-cpu objects.
bpf_mem_cache_free() will free them back into the global per-cpu pool after
observing an RCU grace period.
The per-cpu flavor of bpf_mem_alloc is going to be used by per-cpu hash maps.

The free list cache consists of tuples { llist_node, per-cpu pointer }.
Unlike alloc_percpu(), which returns a per-cpu pointer,
bpf_mem_cache_alloc() returns a pointer to a per-cpu pointer and
bpf_mem_cache_free() expects to receive it back.
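
For illustration only (not part of the patch), the cached element and the
calling convention can be sketched like this ('struct pcpu_elem' and 'ma'
are made-up names):

  struct pcpu_elem {                    /* the 16-byte kmem_cache object */
          struct llist_node node;       /* free list linkage */
          void __percpu *pptr;          /* from __alloc_percpu_gfp() */
  };

  void *p = bpf_mem_cache_alloc(ma);            /* p points at ->pptr */
  void __percpu *pptr = *(void __percpu **)p;   /* the real per-cpu data */
  /* ... use per_cpu_ptr(pptr, cpu) ... */
  bpf_mem_cache_free(ma, p);                    /* give p back, not pptr */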

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf_mem_alloc.h |  2 +-
 kernel/bpf/hashtab.c          |  2 +-
 kernel/bpf/memalloc.c         | 44 +++++++++++++++++++++++++++++++----
 3 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
index 804733070f8d..653ed1584a03 100644
--- a/include/linux/bpf_mem_alloc.h
+++ b/include/linux/bpf_mem_alloc.h
@@ -12,7 +12,7 @@ struct bpf_mem_alloc {
 	struct bpf_mem_cache __percpu *cache;
 };
 
-int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size);
+int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu);
 void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma);
 
 /* kmalloc/kfree equivalent: */
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 3c1d15fd052a..bf20c45002fe 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -598,7 +598,7 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 				goto free_prealloc;
 		}
 	} else {
-		err = bpf_mem_alloc_init(&htab->ma, htab->elem_size);
+		err = bpf_mem_alloc_init(&htab->ma, htab->elem_size, false);
 		if (err)
 			goto free_map_locked;
 	}
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index ae4cdc9493c3..633e7eb9ba62 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -105,6 +105,7 @@ struct bpf_mem_cache {
 	atomic_t free_cnt_nmi;
 	/* flag to refill nmi list too */
 	bool refill_nmi_list;
+	bool percpu;
 	int low_watermark, high_watermark, batch;
 
 	struct rcu_head rcu;
@@ -141,6 +142,19 @@ static void *__alloc(struct bpf_mem_cache *c, int node)
 	 */
 	gfp_t flags = GFP_NOWAIT | __GFP_NOWARN | __GFP_ACCOUNT;
 
+	if (c->percpu) {
+		void **obj = kmem_cache_alloc_node(c->kmem_cache, flags, node);
+		void *pptr = __alloc_percpu_gfp(c->unit_size, 8, flags);
+
+		if (!obj || !pptr) {
+			free_percpu(pptr);
+			kfree(obj);
+			return NULL;
+		}
+		obj[1] = pptr;
+		return obj;
+	}
+
 	if (c->kmem_cache)
 		return kmem_cache_alloc_node(c->kmem_cache, flags, node);
 
@@ -213,6 +227,12 @@ static void alloc_bulk_nmi(struct bpf_mem_cache *c, int cnt, int node)
 
 static void free_one(struct bpf_mem_cache *c, void *obj)
 {
+	if (c->percpu) {
+		free_percpu(((void **)obj)[1]);
+		kmem_cache_free(c->kmem_cache, obj);
+		return;
+	}
+
 	if (c->kmem_cache)
 		kmem_cache_free(c->kmem_cache, obj);
 	else
@@ -382,21 +402,30 @@ static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu)
  * kmalloc/kfree. Max allocation size is 4096 in this case.
  * This is bpf_dynptr and bpf_kptr use case.
  */
-int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size)
+int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
 {
 	static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
 	struct bpf_mem_caches *cc, __percpu *pcc;
 	struct bpf_mem_cache *c, __percpu *pc;
-	struct kmem_cache *kmem_cache;
+	struct kmem_cache *kmem_cache = NULL;
 	struct obj_cgroup *objcg = NULL;
 	char buf[32];
-	int cpu, i;
+	int cpu, i, unit_size;
 
 	if (size) {
 		pc = __alloc_percpu_gfp(sizeof(*pc), 8, GFP_KERNEL);
 		if (!pc)
 			return -ENOMEM;
-		size += LLIST_NODE_SZ; /* room for llist_node */
+
+		if (percpu) {
+			unit_size = size;
+			/* room for llist_node and per-cpu pointer */
+			size = LLIST_NODE_SZ + sizeof(void *);
+		} else {
+			size += LLIST_NODE_SZ; /* room for llist_node */
+			unit_size = size;
+		}
+
 		snprintf(buf, sizeof(buf), "bpf-%u", size);
 		kmem_cache = kmem_cache_create(buf, size, 8, 0, NULL);
 		if (!kmem_cache) {
@@ -409,14 +438,19 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size)
 		for_each_possible_cpu(cpu) {
 			c = per_cpu_ptr(pc, cpu);
 			c->kmem_cache = kmem_cache;
-			c->unit_size = size;
+			c->unit_size = unit_size;
 			c->objcg = objcg;
+			c->percpu = percpu;
 			prefill_mem_cache(c, cpu);
 		}
 		ma->cache = pc;
 		return 0;
 	}
 
+	/* size == 0 && percpu is an invalid combination */
+	if (WARN_ON_ONCE(percpu))
+		return -EINVAL;
+
 	pcc = __alloc_percpu_gfp(sizeof(*cc), 8, GFP_KERNEL);
 	if (!pcc)
 		return -ENOMEM;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 bpf-next 11/12] bpf: Convert percpu hash map to per-cpu bpf_mem_alloc.
  2022-08-17 21:04 [PATCH v2 bpf-next 00/12] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (9 preceding siblings ...)
  2022-08-17 21:04 ` [PATCH v2 bpf-next 10/12] bpf: Add percpu allocation support to bpf_mem_alloc Alexei Starovoitov
@ 2022-08-17 21:04 ` Alexei Starovoitov
  2022-08-17 21:04 ` [PATCH v2 bpf-next 12/12] bpf: Remove tracing program restriction on map types Alexei Starovoitov
  11 siblings, 0 replies; 19+ messages in thread
From: Alexei Starovoitov @ 2022-08-17 21:04 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Convert dynamic allocations in the percpu hash map from alloc_percpu() to
bpf_mem_cache_alloc() from a per-cpu bpf_mem_alloc. Since bpf_mem_alloc frees
objects after an RCU grace period, the call_rcu() is removed.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/hashtab.c | 38 ++++++++++++++++----------------------
 1 file changed, 16 insertions(+), 22 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index bf20c45002fe..921f6fa9dc1b 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -94,6 +94,7 @@ struct bucket {
 struct bpf_htab {
 	struct bpf_map map;
 	struct bpf_mem_alloc ma;
+	struct bpf_mem_alloc pcpu_ma;
 	struct bucket *buckets;
 	void *elems;
 	union {
@@ -121,14 +122,14 @@ struct htab_elem {
 		struct {
 			void *padding;
 			union {
-				struct bpf_htab *htab;
 				struct pcpu_freelist_node fnode;
 				struct htab_elem *batch_flink;
 			};
 		};
 	};
 	union {
-		struct rcu_head rcu;
+		/* pointer to per-cpu pointer */
+		void *ptr_to_pptr;
 		struct bpf_lru_node lru_node;
 	};
 	u32 hash;
@@ -439,8 +440,6 @@ static int htab_map_alloc_check(union bpf_attr *attr)
 	bool zero_seed = (attr->map_flags & BPF_F_ZERO_SEED);
 	int numa_node = bpf_map_attr_numa_node(attr);
 
-	BUILD_BUG_ON(offsetof(struct htab_elem, htab) !=
-		     offsetof(struct htab_elem, hash_node.pprev));
 	BUILD_BUG_ON(offsetof(struct htab_elem, fnode.next) !=
 		     offsetof(struct htab_elem, hash_node.pprev));
 
@@ -601,6 +600,12 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 		err = bpf_mem_alloc_init(&htab->ma, htab->elem_size, false);
 		if (err)
 			goto free_map_locked;
+		if (percpu) {
+			err = bpf_mem_alloc_init(&htab->pcpu_ma,
+						 round_up(htab->map.value_size, 8), true);
+			if (err)
+				goto free_map_locked;
+		}
 	}
 
 	return &htab->map;
@@ -611,6 +616,7 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 	for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
 		free_percpu(htab->map_locked[i]);
 	bpf_map_area_free(htab->buckets);
+	bpf_mem_alloc_destroy(&htab->pcpu_ma);
 	bpf_mem_alloc_destroy(&htab->ma);
 free_htab:
 	lockdep_unregister_key(&htab->lockdep_key);
@@ -886,19 +892,11 @@ static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 static void htab_elem_free(struct bpf_htab *htab, struct htab_elem *l)
 {
 	if (htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH)
-		free_percpu(htab_elem_get_ptr(l, htab->map.key_size));
+		bpf_mem_cache_free(&htab->pcpu_ma, l->ptr_to_pptr);
 	check_and_free_fields(htab, l);
 	bpf_mem_cache_free(&htab->ma, l);
 }
 
-static void htab_elem_free_rcu(struct rcu_head *head)
-{
-	struct htab_elem *l = container_of(head, struct htab_elem, rcu);
-	struct bpf_htab *htab = l->htab;
-
-	htab_elem_free(htab, l);
-}
-
 static void htab_put_fd_value(struct bpf_htab *htab, struct htab_elem *l)
 {
 	struct bpf_map *map = &htab->map;
@@ -944,12 +942,7 @@ static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
 		__pcpu_freelist_push(&htab->freelist, &l->fnode);
 	} else {
 		dec_elem_count(htab);
-		if (htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH) {
-			l->htab = htab;
-			call_rcu(&l->rcu, htab_elem_free_rcu);
-		} else {
-			htab_elem_free(htab, l);
-		}
+		htab_elem_free(htab, l);
 	}
 }
 
@@ -1051,18 +1044,18 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 
 	memcpy(l_new->key, key, key_size);
 	if (percpu) {
-		size = round_up(size, 8);
 		if (prealloc) {
 			pptr = htab_elem_get_ptr(l_new, key_size);
 		} else {
 			/* alloc_percpu zero-fills */
-			pptr = bpf_map_alloc_percpu(&htab->map, size, 8,
-						    GFP_NOWAIT | __GFP_NOWARN);
+			pptr = bpf_mem_cache_alloc(&htab->pcpu_ma);
 			if (!pptr) {
 				bpf_mem_cache_free(&htab->ma, l_new);
 				l_new = ERR_PTR(-ENOMEM);
 				goto dec_count;
 			}
+			l_new->ptr_to_pptr = pptr;
+			pptr = *(void **)pptr;
 		}
 
 		pcpu_init_value(htab, pptr, value, onallcpus);
@@ -1554,6 +1547,7 @@ static void htab_map_free(struct bpf_map *map)
 	bpf_map_free_kptr_off_tab(map);
 	free_percpu(htab->extra_elems);
 	bpf_map_area_free(htab->buckets);
+	bpf_mem_alloc_destroy(&htab->pcpu_ma);
 	bpf_mem_alloc_destroy(&htab->ma);
 	if (htab->use_percpu_counter)
 		percpu_counter_destroy(&htab->pcount);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 bpf-next 12/12] bpf: Remove tracing program restriction on map types
  2022-08-17 21:04 [PATCH v2 bpf-next 00/12] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (10 preceding siblings ...)
  2022-08-17 21:04 ` [PATCH v2 bpf-next 11/12] bpf: Convert percpu hash map to per-cpu bpf_mem_alloc Alexei Starovoitov
@ 2022-08-17 21:04 ` Alexei Starovoitov
  11 siblings, 0 replies; 19+ messages in thread
From: Alexei Starovoitov @ 2022-08-17 21:04 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

The hash map is now fully converted to bpf_mem_alloc. Its implementation no
longer allocates synchronously and no longer calls call_rcu() directly. It's now
safe to use non-preallocated hash maps in all types of tracing programs,
including BPF_PROG_TYPE_PERF_EVENT, which can run in NMI context.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/verifier.c | 42 ------------------------------------------
 1 file changed, 42 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d785f29047d7..a1ada707c57c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -12599,48 +12599,6 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
 
 {
 	enum bpf_prog_type prog_type = resolve_prog_type(prog);
-	/*
-	 * Validate that trace type programs use preallocated hash maps.
-	 *
-	 * For programs attached to PERF events this is mandatory as the
-	 * perf NMI can hit any arbitrary code sequence.
-	 *
-	 * All other trace types using non-preallocated per-cpu hash maps are
-	 * unsafe as well because tracepoint or kprobes can be inside locked
-	 * regions of the per-cpu memory allocator or at a place where a
-	 * recursion into the per-cpu memory allocator would see inconsistent
-	 * state. Non per-cpu hash maps are using bpf_mem_alloc-tor which is
-	 * safe to use from kprobe/fentry and in RT.
-	 *
-	 * On RT enabled kernels run-time allocation of all trace type
-	 * programs is strictly prohibited due to lock type constraints. On
-	 * !RT kernels it is allowed for backwards compatibility reasons for
-	 * now, but warnings are emitted so developers are made aware of
-	 * the unsafety and can fix their programs before this is enforced.
-	 */
-	if (is_tracing_prog_type(prog_type) && !is_preallocated_map(map)) {
-		if (prog_type == BPF_PROG_TYPE_PERF_EVENT) {
-			/* perf_event bpf progs have to use preallocated hash maps
-			 * because non-prealloc is still relying on call_rcu to free
-			 * elements.
-			 */
-			verbose(env, "perf_event programs can only use preallocated hash map\n");
-			return -EINVAL;
-		}
-		if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
-		    (map->inner_map_meta &&
-		     map->inner_map_meta->map_type == BPF_MAP_TYPE_PERCPU_HASH)) {
-			if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
-				verbose(env,
-					"trace type programs can only use preallocated per-cpu hash map\n");
-				return -EINVAL;
-			}
-			WARN_ONCE(1, "trace type BPF program uses run-time allocation\n");
-			verbose(env,
-				"trace type programs with run-time allocated per-cpu hash maps are unsafe."
-				" Switch to preallocated hash maps.\n");
-		}
-	}
 
 	if (map_value_has_spin_lock(map)) {
 		if (prog_type == BPF_PROG_TYPE_SOCKET_FILTER) {
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 bpf-next 01/12] bpf: Introduce any context BPF specific memory allocator.
  2022-08-17 21:04 ` [PATCH v2 bpf-next 01/12] bpf: Introduce any context " Alexei Starovoitov
@ 2022-08-17 23:51   ` Kumar Kartikeya Dwivedi
  2022-08-18  0:39     ` Alexei Starovoitov
  0 siblings, 1 reply; 19+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-17 23:51 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: davem, daniel, andrii, tj, delyank, linux-mm, bpf, kernel-team

On Wed, 17 Aug 2022 at 23:04, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> Tracing BPF programs can attach to kprobe and fentry. Hence they
> run in unknown context where calling plain kmalloc() might not be safe.
>
> Front-end kmalloc() with minimal per-cpu cache of free elements.
> Refill this cache asynchronously from irq_work.
>
> BPF programs always run with migration disabled.
> It's safe to allocate from cache of the current cpu with irqs disabled.
> Free-ing is always done into bucket of the current cpu as well.
> irq_work trims extra free elements from buckets with kfree
> and refills them with kmalloc, so global kmalloc logic takes care
> of freeing objects allocated by one cpu and freed on another.
>
> struct bpf_mem_alloc supports two modes:
> - When size != 0 create kmem_cache and bpf_mem_cache for each cpu.
>   This is typical bpf hash map use case when all elements have equal size.
> - When size == 0 allocate 11 bpf_mem_cache-s for each cpu, then rely on
>   kmalloc/kfree. Max allocation size is 4096 in this case.
>   This is bpf_dynptr and bpf_kptr use case.
>
> bpf_mem_alloc/bpf_mem_free are bpf specific 'wrappers' of kmalloc/kfree.
> bpf_mem_cache_alloc/bpf_mem_cache_free are 'wrappers' of kmem_cache_alloc/kmem_cache_free.
>
> The allocators are NMI-safe from bpf programs only. They are not NMI-safe in general.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  include/linux/bpf_mem_alloc.h |  26 ++
>  kernel/bpf/Makefile           |   2 +-
>  kernel/bpf/memalloc.c         | 526 ++++++++++++++++++++++++++++++++++
>  3 files changed, 553 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/bpf_mem_alloc.h
>  create mode 100644 kernel/bpf/memalloc.c
>
> diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
> new file mode 100644
> index 000000000000..804733070f8d
> --- /dev/null
> +++ b/include/linux/bpf_mem_alloc.h
> @@ -0,0 +1,26 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
> +#ifndef _BPF_MEM_ALLOC_H
> +#define _BPF_MEM_ALLOC_H
> +#include <linux/compiler_types.h>
> +
> +struct bpf_mem_cache;
> +struct bpf_mem_caches;
> +
> +struct bpf_mem_alloc {
> +       struct bpf_mem_caches __percpu *caches;
> +       struct bpf_mem_cache __percpu *cache;
> +};
> +
> +int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size);
> +void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma);
> +
> +/* kmalloc/kfree equivalent: */
> +void *bpf_mem_alloc(struct bpf_mem_alloc *ma, size_t size);
> +void bpf_mem_free(struct bpf_mem_alloc *ma, void *ptr);
> +
> +/* kmem_cache_alloc/free equivalent: */
> +void *bpf_mem_cache_alloc(struct bpf_mem_alloc *ma);
> +void bpf_mem_cache_free(struct bpf_mem_alloc *ma, void *ptr);
> +
> +#endif /* _BPF_MEM_ALLOC_H */
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 057ba8e01e70..11fb9220909b 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -13,7 +13,7 @@ obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
>  obj-${CONFIG_BPF_LSM}    += bpf_inode_storage.o
>  obj-$(CONFIG_BPF_SYSCALL) += disasm.o
>  obj-$(CONFIG_BPF_JIT) += trampoline.o
> -obj-$(CONFIG_BPF_SYSCALL) += btf.o
> +obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o
>  obj-$(CONFIG_BPF_JIT) += dispatcher.o
>  ifeq ($(CONFIG_NET),y)
>  obj-$(CONFIG_BPF_SYSCALL) += devmap.o
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> new file mode 100644
> index 000000000000..8de268922380
> --- /dev/null
> +++ b/kernel/bpf/memalloc.c
> @@ -0,0 +1,526 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
> +#include <linux/mm.h>
> +#include <linux/llist.h>
> +#include <linux/bpf.h>
> +#include <linux/irq_work.h>
> +#include <linux/bpf_mem_alloc.h>
> +#include <linux/memcontrol.h>
> +
> +/* Any context (including NMI) BPF specific memory allocator.
> + *
> + * Tracing BPF programs can attach to kprobe and fentry. Hence they
> + * run in unknown context where calling plain kmalloc() might not be safe.
> + *
> + * Front-end kmalloc() with per-cpu per-bucket cache of free elements.
> + * Refill this cache asynchronously from irq_work.
> + *
> + * CPU_0 buckets
> + * 16 32 64 96 128 196 256 512 1024 2048 4096
> + * ...
> + * CPU_N buckets
> + * 16 32 64 96 128 196 256 512 1024 2048 4096
> + *
> + * The buckets are prefilled at the start.
> + * BPF programs always run with migration disabled.
> + * It's safe to allocate from cache of the current cpu with irqs disabled.
> + * Free-ing is always done into bucket of the current cpu as well.
> + * irq_work trims extra free elements from buckets with kfree
> + * and refills them with kmalloc, so global kmalloc logic takes care
> + * of freeing objects allocated by one cpu and freed on another.
> + *
> + * Every allocated object is padded with extra 8 bytes that contain
> + * struct llist_node.
> + */
> +#define LLIST_NODE_SZ sizeof(struct llist_node)
> +
> +/* similar to kmalloc, but sizeof == 8 bucket is gone */
> +static u8 size_index[24] __ro_after_init = {
> +       3,      /* 8 */
> +       3,      /* 16 */
> +       4,      /* 24 */
> +       4,      /* 32 */
> +       5,      /* 40 */
> +       5,      /* 48 */
> +       5,      /* 56 */
> +       5,      /* 64 */
> +       1,      /* 72 */
> +       1,      /* 80 */
> +       1,      /* 88 */
> +       1,      /* 96 */
> +       6,      /* 104 */
> +       6,      /* 112 */
> +       6,      /* 120 */
> +       6,      /* 128 */
> +       2,      /* 136 */
> +       2,      /* 144 */
> +       2,      /* 152 */
> +       2,      /* 160 */
> +       2,      /* 168 */
> +       2,      /* 176 */
> +       2,      /* 184 */
> +       2       /* 192 */
> +};
> +
> +static int bpf_mem_cache_idx(size_t size)
> +{
> +       if (!size || size > 4096)
> +               return -1;
> +
> +       if (size <= 192)
> +               return size_index[(size - 1) / 8] - 1;
> +
> +       return fls(size - 1) - 1;
> +}
> +
> +#define NUM_CACHES 11
> +
> +struct bpf_mem_cache {
> +       /* per-cpu list of free objects of size 'unit_size'.
> +        * All accesses are done with preemption disabled
> +        * with __llist_add() and __llist_del_first().
> +        */
> +       struct llist_head free_llist;
> +
> +       /* NMI only free list.
> +        * All accesses are NMI-safe llist_add() and llist_del_first().
> +        *
> +        * Each allocated object is either on free_llist or on free_llist_nmi.
> +        * One cpu can allocate it from NMI by doing llist_del_first() from
> +        * free_llist_nmi, while another might free it back from non-NMI by
> +        * doing llist_add() into free_llist.
> +        */
> +       struct llist_head free_llist_nmi;
> +
> +       /* kmem_cache != NULL when bpf_mem_alloc was created for specific
> +        * element size.
> +        */
> +       struct kmem_cache *kmem_cache;
> +       struct irq_work refill_work;
> +       struct obj_cgroup *objcg;
> +       int unit_size;
> +       /* count of objects in free_llist */
> +       int free_cnt;
> +       /* count of objects in free_llist_nmi */
> +       atomic_t free_cnt_nmi;
> +       /* flag to refill nmi list too */
> +       bool refill_nmi_list;
> +};
> +
> +struct bpf_mem_caches {
> +       struct bpf_mem_cache cache[NUM_CACHES];
> +};
> +
> +static struct llist_node notrace *__llist_del_first(struct llist_head *head)
> +{
> +       struct llist_node *entry, *next;
> +
> +       entry = head->first;
> +       if (!entry)
> +               return NULL;
> +       next = entry->next;
> +       head->first = next;
> +       return entry;
> +}
> +
> +#define BATCH 48
> +#define LOW_WATERMARK 32
> +#define HIGH_WATERMARK 96
> +/* Assuming the average number of elements per bucket is 64, when all buckets
> + * are used the total memory will be: 64*16*32 + 64*32*32 + 64*64*32 + ... +
> + * 64*4096*32 ~ 20Mbyte
> + */
> +
> +/* extra macro useful for testing by randomizing in_nmi condition */
> +#define bpf_in_nmi() in_nmi()
> +
> +static void *__alloc(struct bpf_mem_cache *c, int node)
> +{
> +       /* Allocate, but don't deplete atomic reserves that typical
> +        * GFP_ATOMIC would do. irq_work runs on this cpu and kmalloc
> +        * will allocate from the current numa node which is what we
> +        * want here.
> +        */
> +       gfp_t flags = GFP_NOWAIT | __GFP_NOWARN | __GFP_ACCOUNT;
> +
> +       if (c->kmem_cache)
> +               return kmem_cache_alloc_node(c->kmem_cache, flags, node);
> +
> +       return kmalloc_node(c->unit_size, flags, node);
> +}
> +
> +static struct mem_cgroup *get_memcg(const struct bpf_mem_cache *c)
> +{
> +#ifdef CONFIG_MEMCG_KMEM
> +       if (c->objcg)
> +               return get_mem_cgroup_from_objcg(c->objcg);
> +#endif
> +
> +#ifdef CONFIG_MEMCG
> +       return root_mem_cgroup;
> +#else
> +       return NULL;
> +#endif
> +}
> +
> +/* Mostly runs from irq_work except __init phase. */
> +static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
> +{
> +       struct mem_cgroup *memcg = NULL, *old_memcg;
> +       unsigned long flags;
> +       void *obj;
> +       int i;
> +
> +       memcg = get_memcg(c);
> +       old_memcg = set_active_memcg(memcg);
> +       for (i = 0; i < cnt; i++) {
> +               obj = __alloc(c, node);
> +               if (!obj)
> +                       break;
> +               if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +                       /* In RT irq_work runs in per-cpu kthread, so we have
> +                        * to disable interrupts to avoid race with
> +                        * bpf_mem_alloc/free. Note the read of free_cnt in
> +                        * bpf_mem_refill is racy in RT. It's ok to do.
> +                        */
> +                       local_irq_save(flags);
> +               __llist_add(obj, &c->free_llist);
> +               c->free_cnt++;
> +               if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +                       local_irq_restore(flags);
> +       }
> +       set_active_memcg(old_memcg);
> +       mem_cgroup_put(memcg);
> +}
> +
> +/* Refill NMI specific llist. Mostly runs from irq_work except __init phase. */
> +static void alloc_bulk_nmi(struct bpf_mem_cache *c, int cnt, int node)
> +{
> +       struct mem_cgroup *memcg = NULL, *old_memcg;
> +       void *obj;
> +       int i;
> +
> +       memcg = get_memcg(c);
> +       old_memcg = set_active_memcg(memcg);
> +       for (i = 0; i < cnt; i++) {
> +               obj = __alloc(c, node);
> +               if (!obj)
> +                       break;
> +               llist_add(obj, &c->free_llist_nmi);
> +               atomic_inc(&c->free_cnt_nmi);
> +       }
> +       set_active_memcg(old_memcg);
> +       mem_cgroup_put(memcg);
> +}
> +
> +static void free_one(struct bpf_mem_cache *c, void *obj)
> +{
> +       if (c->kmem_cache)
> +               kmem_cache_free(c->kmem_cache, obj);
> +       else
> +               kfree(obj);
> +}
> +
> +static void free_bulk(struct bpf_mem_cache *c)
> +{
> +       struct llist_node *llnode;
> +       unsigned long flags;
> +       int cnt;
> +
> +       do {
> +               if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +                       local_irq_save(flags);
> +               llnode = __llist_del_first(&c->free_llist);
> +               if (llnode)
> +                       cnt = --c->free_cnt;
> +               else
> +                       cnt = 0;
> +               if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +                       local_irq_restore(flags);
> +               free_one(c, llnode);
> +       } while (cnt > (HIGH_WATERMARK + LOW_WATERMARK) / 2);
> +}
> +
> +static void free_bulk_nmi(struct bpf_mem_cache *c)
> +{
> +       struct llist_node *llnode;
> +       int cnt;
> +
> +       do {
> +               llnode = llist_del_first(&c->free_llist_nmi);
> +               if (llnode)
> +                       cnt = atomic_dec_return(&c->free_cnt_nmi);
> +               else
> +                       cnt = 0;
> +               free_one(c, llnode);
> +       } while (cnt > (HIGH_WATERMARK + LOW_WATERMARK) / 2);
> +}
> +
> +static void bpf_mem_refill(struct irq_work *work)
> +{
> +       struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache, refill_work);
> +       int cnt;
> +
> +       cnt = c->free_cnt;
> +       if (cnt < LOW_WATERMARK)
> +               /* irq_work runs on this cpu and kmalloc will allocate
> +                * from the current numa node which is what we want here.
> +                */
> +               alloc_bulk(c, BATCH, NUMA_NO_NODE);
> +       else if (cnt > HIGH_WATERMARK)
> +               free_bulk(c);
> +
> +       if (!c->refill_nmi_list)
> +               /* don't refill NMI specific freelist
> +                * until alloc/free from NMI.
> +                */
> +               return;
> +       cnt = atomic_read(&c->free_cnt_nmi);
> +       if (cnt < LOW_WATERMARK)
> +               alloc_bulk_nmi(c, BATCH, NUMA_NO_NODE);
> +       else if (cnt > HIGH_WATERMARK)
> +               free_bulk_nmi(c);
> +       c->refill_nmi_list = false;
> +}
> +
> +static void notrace irq_work_raise(struct bpf_mem_cache *c, bool in_nmi)
> +{
> +       if (in_nmi)
> +               /* Raise the flag only if in_nmi. Cannot assign it
> +                * unconditionally since subsequent non-nmi irq_work_raise
> +                * might clear it.
> +                */
> +               c->refill_nmi_list = in_nmi;
> +       irq_work_queue(&c->refill_work);
> +}
> +
> +static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu)
> +{
> +       init_irq_work(&c->refill_work, bpf_mem_refill);
> +       /* To avoid consuming memory assume that 1st run of bpf
> +        * prog won't be doing more than 4 map_update_elem from
> +        * irq disabled region
> +        */
> +       alloc_bulk(c, c->unit_size < 256 ? 4 : 1, cpu_to_node(cpu));
> +
> +       /* NMI progs are rare. Assume they have one map_update
> +        * per prog at the very beginning.
> +        */
> +       alloc_bulk_nmi(c, 1, cpu_to_node(cpu));
> +}
> +
> +/* When size != 0 create kmem_cache and bpf_mem_cache for each cpu.
> + * This is typical bpf hash map use case when all elements have equal size.
> + *
> + * When size == 0 allocate 11 bpf_mem_cache-s for each cpu, then rely on
> + * kmalloc/kfree. Max allocation size is 4096 in this case.
> + * This is bpf_dynptr and bpf_kptr use case.
> + */
> +int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size)
> +{
> +       static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
> +       struct bpf_mem_caches *cc, __percpu *pcc;
> +       struct bpf_mem_cache *c, __percpu *pc;
> +       struct kmem_cache *kmem_cache;
> +       struct obj_cgroup *objcg = NULL;
> +       char buf[32];
> +       int cpu, i;
> +
> +       if (size) {
> +               pc = __alloc_percpu_gfp(sizeof(*pc), 8, GFP_KERNEL);
> +               if (!pc)
> +                       return -ENOMEM;
> +               size += LLIST_NODE_SZ; /* room for llist_node */
> +               snprintf(buf, sizeof(buf), "bpf-%u", size);
> +               kmem_cache = kmem_cache_create(buf, size, 8, 0, NULL);
> +               if (!kmem_cache) {
> +                       free_percpu(pc);
> +                       return -ENOMEM;
> +               }
> +#ifdef CONFIG_MEMCG_KMEM
> +               objcg = get_obj_cgroup_from_current();
> +#endif
> +               for_each_possible_cpu(cpu) {
> +                       c = per_cpu_ptr(pc, cpu);
> +                       c->kmem_cache = kmem_cache;
> +                       c->unit_size = size;
> +                       c->objcg = objcg;
> +                       prefill_mem_cache(c, cpu);
> +               }
> +               ma->cache = pc;
> +               return 0;
> +       }
> +
> +       pcc = __alloc_percpu_gfp(sizeof(*cc), 8, GFP_KERNEL);
> +       if (!pcc)
> +               return -ENOMEM;
> +#ifdef CONFIG_MEMCG_KMEM
> +       objcg = get_obj_cgroup_from_current();
> +#endif
> +       for_each_possible_cpu(cpu) {
> +               cc = per_cpu_ptr(pcc, cpu);
> +               for (i = 0; i < NUM_CACHES; i++) {
> +                       c = &cc->cache[i];
> +                       c->unit_size = sizes[i];
> +                       c->objcg = objcg;
> +                       prefill_mem_cache(c, cpu);
> +               }
> +       }
> +       ma->caches = pcc;
> +       return 0;
> +}
> +
> +static void drain_mem_cache(struct bpf_mem_cache *c)
> +{
> +       struct llist_node *llnode;
> +
> +       while ((llnode = llist_del_first(&c->free_llist_nmi)))
> +               free_one(c, llnode);
> +       while ((llnode = __llist_del_first(&c->free_llist)))
> +               free_one(c, llnode);
> +}
> +
> +void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
> +{
> +       struct bpf_mem_caches *cc;
> +       struct bpf_mem_cache *c;
> +       int cpu, i;
> +
> +       if (ma->cache) {
> +               for_each_possible_cpu(cpu) {
> +                       c = per_cpu_ptr(ma->cache, cpu);
> +                       drain_mem_cache(c);
> +               }
> +               /* kmem_cache and memcg are the same across cpus */
> +               kmem_cache_destroy(c->kmem_cache);
> +               if (c->objcg)
> +                       obj_cgroup_put(c->objcg);
> +               free_percpu(ma->cache);
> +               ma->cache = NULL;
> +       }
> +       if (ma->caches) {
> +               for_each_possible_cpu(cpu) {
> +                       cc = per_cpu_ptr(ma->caches, cpu);
> +                       for (i = 0; i < NUM_CACHES; i++) {
> +                               c = &cc->cache[i];
> +                               drain_mem_cache(c);
> +                       }
> +               }
> +               if (c->objcg)
> +                       obj_cgroup_put(c->objcg);
> +               free_percpu(ma->caches);
> +               ma->caches = NULL;
> +       }
> +}
> +
> +/* notrace is necessary here and in other functions to make sure
> + * bpf programs cannot attach to them and cause llist corruptions.
> + */
> +static void notrace *unit_alloc(struct bpf_mem_cache *c)
> +{
> +       bool in_nmi = bpf_in_nmi();
> +       struct llist_node *llnode;
> +       unsigned long flags;
> +       int cnt = 0;
> +
> +       if (unlikely(in_nmi)) {
> +               llnode = llist_del_first(&c->free_llist_nmi);
> +               if (llnode)
> +                       cnt = atomic_dec_return(&c->free_cnt_nmi);

I am trying to understand which case this
atomic_dec_return/atomic_inc_return protects against in the
unit_alloc/unit_free for in_nmi branch. Is it protecting nested NMI
BPF prog interrupting NMI prog?

In case of perf it seems we use bpf_prog_active, so nested NMI prog
won't be invoked while we are interrupted inside a BPF program in NMI
context. Which are the other cases that might cause reentrancy in this
branch such that we need atomics instead of c->free_cnt_nmi--? Or are
you anticipating you might allow this in the future even if it is
disallowed for now?

If programs are allowed to stack like this, and we try to reason about
the safety of llist_del_first operation, the code is:

struct llist_node *llist_del_first(struct llist_head *head)
{
     struct llist_node *entry, *old_entry, *next;

     entry = smp_load_acquire(&head->first);
     for (;;) {
         if (entry == NULL)
             return NULL;
         old_entry = entry;
         next = READ_ONCE(entry->next);
>>>>>>>> Suppose nested NMI comes at this point and BPF prog is invoked.
         entry = cmpxchg(&head->first, old_entry, next);
         if (entry == old_entry)
             break;
     }
     return entry;
}

Assume the current nmi free llist is HEAD -> A -> B -> C -> D -> ...
For our cmpxchg, parameters are going to be cmpxchg(&head->first, A, B);

Now, nested NMI prog does unit_alloc thrice. this does llist_del_first thrice
This makes nmi free llist HEAD -> D -> ...
A, B, C are allocated in prog.
Now it does unit_free of all three, but in order of B, C, A.
unit_free does llist_add, nmi free llist becomes HEAD -> A -> C -> B -> D -> ...

Nested NMI prog exits.
We continue with our cmpxchg(&head->first, A, B); It succeeds, A is
returned, but C will be leaked.

> +       } else {
> +               /* Disable irqs to prevent the following race:
> +                * bpf_prog_A
> +                *   bpf_mem_alloc
> +                *      preemption or irq -> bpf_prog_B
> +                *        bpf_mem_alloc
> +                */
> +               local_irq_save(flags);
> +               llnode = __llist_del_first(&c->free_llist);
> +               if (llnode)
> +                       cnt = --c->free_cnt;
> +               local_irq_restore(flags);
> +       }
> +       WARN_ON(cnt < 0);
> +
> +       if (cnt < LOW_WATERMARK)
> +               irq_work_raise(c, in_nmi);
> +       return llnode;
> +}
> +
> +/* Though 'ptr' object could have been allocated on a different cpu
> + * add it to the free_llist of the current cpu.
> + * Let kfree() logic deal with it when it's later called from irq_work.
> + */
> +static void notrace unit_free(struct bpf_mem_cache *c, void *ptr)
> +{
> +       struct llist_node *llnode = ptr - LLIST_NODE_SZ;
> +       bool in_nmi = bpf_in_nmi();
> +       unsigned long flags;
> +       int cnt;
> +
> +       BUILD_BUG_ON(LLIST_NODE_SZ > 8);
> +
> +       if (unlikely(in_nmi)) {
> +               llist_add(llnode, &c->free_llist_nmi);
> +               cnt = atomic_inc_return(&c->free_cnt_nmi);
> +       } else {
> +               local_irq_save(flags);
> +               __llist_add(llnode, &c->free_llist);
> +               cnt = ++c->free_cnt;
> +               local_irq_restore(flags);
> +       }
> +       WARN_ON(cnt <= 0);
> +
> +       if (cnt > HIGH_WATERMARK)
> +               /* free few objects from current cpu into global kmalloc pool */
> +               irq_work_raise(c, in_nmi);
> +}
> +
> +/* Called from BPF program or from sys_bpf syscall.
> + * In both cases migration is disabled.
> + */
> +void notrace *bpf_mem_alloc(struct bpf_mem_alloc *ma, size_t size)
> +{
> +       int idx;
> +       void *ret;
> +
> +       if (!size)
> +               return ZERO_SIZE_PTR;
> +
> +       idx = bpf_mem_cache_idx(size + LLIST_NODE_SZ);
> +       if (idx < 0)
> +               return NULL;
> +
> +       ret = unit_alloc(this_cpu_ptr(ma->caches)->cache + idx);
> +       return !ret ? NULL : ret + LLIST_NODE_SZ;
> +}
> +
> +void notrace bpf_mem_free(struct bpf_mem_alloc *ma, void *ptr)
> +{
> +       int idx;
> +
> +       if (!ptr)
> +               return;
> +
> +       idx = bpf_mem_cache_idx(__ksize(ptr - LLIST_NODE_SZ));
> +       if (idx < 0)
> +               return;
> +
> +       unit_free(this_cpu_ptr(ma->caches)->cache + idx, ptr);
> +}
> +
> +void notrace *bpf_mem_cache_alloc(struct bpf_mem_alloc *ma)
> +{
> +       void *ret;
> +
> +       ret = unit_alloc(this_cpu_ptr(ma->cache));
> +       return !ret ? NULL : ret + LLIST_NODE_SZ;
> +}
> +
> +void notrace bpf_mem_cache_free(struct bpf_mem_alloc *ma, void *ptr)
> +{
> +       if (!ptr)
> +               return;
> +
> +       unit_free(this_cpu_ptr(ma->cache), ptr);
> +}
> --
> 2.30.2
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 bpf-next 01/12] bpf: Introduce any context BPF specific memory allocator.
  2022-08-17 23:51   ` Kumar Kartikeya Dwivedi
@ 2022-08-18  0:39     ` Alexei Starovoitov
  2022-08-18 12:38       ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 19+ messages in thread
From: Alexei Starovoitov @ 2022-08-18  0:39 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: davem, daniel, andrii, tj, delyank, linux-mm, bpf, kernel-team

On Thu, Aug 18, 2022 at 01:51:39AM +0200, Kumar Kartikeya Dwivedi wrote:
> > +
> > +/* notrace is necessary here and in other functions to make sure
> > + * bpf programs cannot attach to them and cause llist corruptions.
> > + */
> > +static void notrace *unit_alloc(struct bpf_mem_cache *c)
> > +{
> > +       bool in_nmi = bpf_in_nmi();
> > +       struct llist_node *llnode;
> > +       unsigned long flags;
> > +       int cnt = 0;
> > +
> > +       if (unlikely(in_nmi)) {
> > +               llnode = llist_del_first(&c->free_llist_nmi);
> > +               if (llnode)
> > +                       cnt = atomic_dec_return(&c->free_cnt_nmi);
> 
> I am trying to understand which case this
> atomic_dec_return/atomic_inc_return protects against in the
> unit_alloc/unit_free for in_nmi branch. Is it protecting nested NMI
> BPF prog interrupting NMI prog?
> 
> In case of perf it seems we use bpf_prog_active, 

yes, but bpf_prog_active has plenty of downsides and hopefully
will be replaced eventually with cleaner mechanism.
Separate topic.

> so nested NMI prog
> won't be invoked while we are interrupted inside a BPF program in NMI
> context. Which are the other cases that might cause reentrancy in this
> branch such that we need atomics instead of c->free_cnt_nmi--? Or are
> you anticipating you might allow this in the future even if it is
> disallowed for now?
> 
> If programs are allowed to stack like this, and we try to reason about
> the safety of llist_del_first operation, the code is:
> 
> struct llist_node *llist_del_first(struct llist_head *head)
> {
>      struct llist_node *entry, *old_entry, *next;
> 
>      entry = smp_load_acquire(&head->first);
>      for (;;) {
>          if (entry == NULL)
>              return NULL;
>          old_entry = entry;
>          next = READ_ONCE(entry->next);
> >>>>>>>> Suppose nested NMI comes at this point and BPF prog is invoked.

llist_del_first is notrace.
unit_alloc() above is also notrace. See comment before it.
perf event overflow NMI can happen here, but for some other llist.
Hence we're talking about NMI issues only here. fentry progs do not apply here.

> Assume the current nmi free llist is HEAD -> A -> B -> C -> D -> ...
> For our cmpxchg, parameters are going to be cmpxchg(&head->first, A, B);
> 
> Now, nested NMI prog does unit_alloc thrice. this does llist_del_first thrice

Even double llist_del_first on the same llist is bad. That's a known fact.

> This makes nmi free llist HEAD -> D -> ...
> A, B, C are allocated in prog.
> Now it does unit_free of all three, but in order of B, C, A.
> unit_free does llist_add, nmi free llist becomes HEAD -> A -> C -> B -> D -> ...
> 
> Nested NMI prog exits.
> We continue with our cmpxchg(&head->first, A, B); It succeeds, A is
> returned, but C will be leaked.

This exact scenario cannot happen for bpf_mem_cache's freelist.
unit_alloc is doing llist_del_first on per-cpu freelist.
We can have two perf_event bpf progs. Both progs would
share the same hash map and use the same struct bpf_mem_alloc,
and both call unit_alloc() on the same cpu freelist,
but as you noticed bpf_prog_active covers that case.
bpf_prog_active is too coarse as we discussed in the other thread a
month or so ago. It prevents valid and safe execution of bpf progs, lost
events, etc. We will surely come up with a better mechanism.

Going back to your earlier question:

> Which are the other cases that might cause reentrancy in this
> branch such that we need atomics instead of c->free_cnt_nmi--?

It's the case where perf_event bpf prog happened inside bpf_mem_refill in irq_work.
bpf_mem_refill manipulates free_cnt_nmi and nmi bpf prog too through unit_alloc.
Which got me thinking that there is indeed a missing check here.
We need to protect free_bulk_nmi's llist_del_first from unit_alloc's llist_del_first.
bpf_prog_active could be used for that, but let's come up with a cleaner way.
Probably going to add atomic_t flag to bpf_mem_cache and cmpxchg it,
or lock and spin_trylock it. tbd.
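
E.g. something along these lines, purely as a sketch ('free_llist_nmi_busy'
is a made-up field and, as said above, the exact mechanism is tbd):

  /* in free_bulk_nmi() and in the nmi branch of unit_alloc() */
  if (atomic_cmpxchg(&c->free_llist_nmi_busy, 0, 1) == 0) {
          llnode = llist_del_first(&c->free_llist_nmi);
          if (llnode)
                  cnt = atomic_dec_return(&c->free_cnt_nmi);
          atomic_set(&c->free_llist_nmi_busy, 0);
  } else {
          llnode = NULL; /* somebody is inside llist_del_first, bail out */
  }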

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 bpf-next 01/12] bpf: Introduce any context BPF specific memory allocator.
  2022-08-18  0:39     ` Alexei Starovoitov
@ 2022-08-18 12:38       ` Kumar Kartikeya Dwivedi
  2022-08-18 22:30         ` Alexei Starovoitov
  0 siblings, 1 reply; 19+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-18 12:38 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: davem, daniel, andrii, tj, delyank, linux-mm, bpf, kernel-team

On Thu, 18 Aug 2022 at 02:40, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Aug 18, 2022 at 01:51:39AM +0200, Kumar Kartikeya Dwivedi wrote:
> > > +
> > > +/* notrace is necessary here and in other functions to make sure
> > > + * bpf programs cannot attach to them and cause llist corruptions.
> > > + */
> > > +static void notrace *unit_alloc(struct bpf_mem_cache *c)
> > > +{
> > > +       bool in_nmi = bpf_in_nmi();
> > > +       struct llist_node *llnode;
> > > +       unsigned long flags;
> > > +       int cnt = 0;
> > > +
> > > +       if (unlikely(in_nmi)) {
> > > +               llnode = llist_del_first(&c->free_llist_nmi);
> > > +               if (llnode)
> > > +                       cnt = atomic_dec_return(&c->free_cnt_nmi);
> >
> > I am trying to understand which case this
> > atomic_dec_return/atomic_inc_return protects against in the
> > unit_alloc/unit_free for in_nmi branch. Is it protecting nested NMI
> > BPF prog interrupting NMI prog?
> >
> > In case of perf it seems we use bpf_prog_active,
>
> yes, but bpf_prog_active has plenty of downsides and hopefully
> will be replaced eventually with cleaner mechanism.
> Separate topic.

I see.

>
> > so nested NMI prog
> > won't be invoked while we are interrupted inside a BPF program in NMI
> > context. Which are the other cases that might cause reentrancy in this
> > branch such that we need atomics instead of c->free_cnt_nmi--? Or are
> > you anticipating you might allow this in the future even if it is
> > disallowed for now?
> >
> > If programs are allowed to stack like this, and we try to reason about
> > the safety of llist_del_first operation, the code is:
> >
> > struct llist_node *llist_del_first(struct llist_head *head)
> > {
> >      struct llist_node *entry, *old_entry, *next;
> >
> >      entry = smp_load_acquire(&head->first);
> >      for (;;) {
> >          if (entry == NULL)
> >              return NULL;
> >          old_entry = entry;
> >          next = READ_ONCE(entry->next);
> > >>>>>>>> Suppose nested NMI comes at this point and BPF prog is invoked.
>
> llist_del_first is notrace.
> unit_alloc() above is also notrace. See comment before it.
> perf event overflow NMI can happen here, but for some other llist.
> Hence we're talking about NMI issues only here. fentry progs do not apply here.

I was not thinking about fentry progs either. I saw perf overflow
progs can't nest, so I was wondering whether there were any other
cases. But you mentioned bpf_mem_refill in irq_work and a perf prog in NMI can
touch the same lockless list, so the scenario is still possible.

>
> > Assume the current nmi free llist is HEAD -> A -> B -> C -> D -> ...
> > For our cmpxchg, parameters are going to be cmpxchg(&head->first, A, B);
> >
> > Now, nested NMI prog does unit_alloc thrice. this does llist_del_first thrice
>
> Even double llist_del_first on the same llist is bad. That's a known fact.

Well, if you think about it (correct me if I'm wrong), at least in
this kind of nesting scenario on the same CPU, just doing
llist_del_first in NMI prog which interrupts llist_del_first of
bpf_mem_refill isn't a problem. The cmpxchg will fail as head->first
changed. The problem occurs when you combine it with llist_add between
the READ_ONCE(entry->next) and cmpxchg of the interrupted
llist_del_first. The main invariant of llist_del_first is that
entry->next should not change between READ_ONCE and cmpxchg, but if we
construct an ABA scenario like I did in my previous reply, _then_ we
have a problem. Otherwise it will just retry loop on exit if we e.g.
llist_del_first and kptr_xchg the ptr (which won't do llist_add).

>
> > This makes nmi free llist HEAD -> D -> ...
> > A, B, C are allocated in prog.
> > Now it does unit_free of all three, but in order of B, C, A.
> > unit_free does llist_add, nmi free llist becomes HEAD -> A -> C -> B -> D -> ...
> >
> > Nested NMI prog exits.
> > We continue with our cmpxchg(&head->first, A, B); It succeeds, A is
> > returned, but C will be leaked.
>
> This exact scenario cannot happen for bpf_mem_cache's freelist.
> unit_alloc is doing llist_del_first on per-cpu freelist.
> We can have two perf_event bpf progs. Both progs would
> share the same hash map and use the same struct bpf_mem_alloc,
> and both call unit_alloc() on the same cpu freelist,
> but as you noticed bpf_prog_active covers that case.
> bpf_prog_active is too coarse as we discussed in the other thread a
> month or so ago. It prevents valid and safe execution of bpf progs, lost
> events, etc. We will surely come up with a better mechanism.
>
> Going back to your earlier question:
>
> > Which are the other cases that might cause reentrancy in this
> > branch such that we need atomics instead of c->free_cnt_nmi--?
>
> It's the case where perf_event bpf prog happened inside bpf_mem_refill in irq_work.
> bpf_mem_refill manipulates free_cnt_nmi and nmi bpf prog too through unit_alloc.
> Which got me thinking that there is indeed a missing check here.

Aaah, ok, so this is what you wanted to prevent. Makes sense. Even
though NMI nesting won't happen in progs (at least for now), this
irq_work refilling can be interrupted by some perf NMI prog or a raw_tp
tracing prog in NMI context.

> We need to protect free_bulk_nmi's llist_del_first from unit_alloc's llist_del_first.
> bpf_prog_active could be used for that, but let's come up with a cleaner way.
> Probably going to add atomic_t flag to bpf_mem_cache and cmpxchg it,
> or lock and spin_trylock it. tbd.

Hm, can you explain why an atomic flag or lock would be needed, rather
than just a small busy counter like bpf_prog_active for the NMI free
llist? bpf_mem_cache is already per-CPU, so it can just be an int
alongside the llist. You inc it before llist_del_first, and then,
assuming the inc is atomic across the interrupt boundary (which I think
this_cpu_inc_return for bpf_prog_active is already assuming), an NMI
prog will see the llist as busy and will fail its llist_del_first.
llist_add should still be fine to allow.

Technically we can fail llist_add instead, since doing multiple
llist_del_first won't be an issue, but you can't fail bpf_mem_free,
though you can fail bpf_mem_alloc, so it makes sense to protect only
llist_del_first using the counter.

PS: Also, it would be great if you could add a few words about this
around this code, i.e. which contexts can nest and which problems the
atomic_inc/dec and llist_del_first protection is guarding against.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 bpf-next 01/12] bpf: Introduce any context BPF specific memory allocator.
  2022-08-18 12:38       ` Kumar Kartikeya Dwivedi
@ 2022-08-18 22:30         ` Alexei Starovoitov
  2022-08-19 14:31           ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 19+ messages in thread
From: Alexei Starovoitov @ 2022-08-18 22:30 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: davem, daniel, andrii, tj, delyank, linux-mm, bpf, kernel-team

On Thu, Aug 18, 2022 at 02:38:06PM +0200, Kumar Kartikeya Dwivedi wrote:
> >
> > > Assume the current nmi free llist is HEAD -> A -> B -> C -> D -> ...
> > > For our cmpxchg, parameters are going to be cmpxchg(&head->first, A, B);
> > >
> > > Now, nested NMI prog does unit_alloc thrice. this does llist_del_first thrice
> >
> > Even double llist_del_first on the same llist is bad. That's a known fact.
> 
> Well, if you think about it (correct me if I'm wrong), at least in
> this kind of nesting scenario on the same CPU, just doing
> llist_del_first in NMI prog which interrupts llist_del_first of
> bpf_mem_refill isn't a problem. The cmpxchg will fail as head->first
> changed. The problem occurs when you combine it with llist_add between
> the READ_ONCE(entry->next) and cmpxchg of the interrupted
> llist_del_first. The main invariant of llist_del_first is that
> entry->next should not change between READ_ONCE and cmpxchg, but if we
> construct an ABA scenario like I did in my previous reply, _then_ we
> have a problem. Otherwise it will just retry loop on exit if we e.g.
> llist_del_first and kptr_xchg the ptr (which won't do llist_add).

Of course. In some race scenarios the llist will stay sane.
In others there will be leaks. In others crashes.
Like we don't really need 3 llist_del followed by 3 out of order llist_add-s
to observe bad things. 2 llist_del-s and 1 llist_add are just as bad.
That's why the doc says do one llist_del_first at a time and doesn't
specify all possible bad things.

> >
> > > This makes nmi free llist HEAD -> D -> ...
> > > A, B, C are allocated in prog.
> > > Now it does unit_free of all three, but in order of B, C, A.
> > > unit_free does llist_add, nmi free llist becomes HEAD -> A -> C -> B -> D -> ...
> > >
> > > Nested NMI prog exits.
> > > We continue with our cmpxchg(&head->first, A, B); It succeeds, A is
> > > returned, but C will be leaked.
> >
> > This exact scenario cannot happen for bpf_mem_cache's freelist.
> > unit_alloc is doing llist_del_first on per-cpu freelist.
> > We can have two perf_event bpf progs. Both progs would
> > share the same hash map and use the same struct bpf_mem_alloc,
> > and both call unit_alloc() on the same cpu freelist,
> > but as you noticed bpf_prog_active covers that case.
> > bpf_prog_active is too coarse as we discussed in the other thread a
> > month or so ago. It prevents valid and safe execution of bpf progs, lost
> > events, etc. We will surely come up with a better mechanism.
> >
> > Going back to your earlier question:
> >
> > > Which are the other cases that might cause reentrancy in this
> > > branch such that we need atomics instead of c->free_cnt_nmi--?
> >
> > It's the case where perf_event bpf prog happened inside bpf_mem_refill in irq_work.
> > bpf_mem_refill manipulates free_cnt_nmi and nmi bpf prog too through unit_alloc.
> > Which got me thinking that there is indeed a missing check here.
> 
> Aaah, ok, so this is what you wanted to prevent. Makes sense. Even
> though NMI nesting won't happen in progs (at least for now), this
> irq_work refilling can be interrupted by some perf NMI prog or a raw_tp
> tracing prog in NMI context.

Right. It doesn't matter which prog type it is.
in_nmi() is the context that needs special handling,
and it can happen not only in BPF_PROG_TYPE_PERF_EVENT progs.

> > We need to protect free_bulk_nmi's llist_del_first from unit_alloc's llist_del_first.
> > bpf_prog_active could be used for that, but let's come up with a cleaner way.
> > Probably going to add an atomic_t flag to bpf_mem_cache and cmpxchg it,
> > or a lock and spin_trylock it. tbd.
> 
> Hm, can you explain why an atomic flag or lock would be needed, rather
> than just a small busy counter like bpf_prog_active for the NMI
> free llist? bpf_mem_cache is already per-CPU so it can just
> be an int alongside the llist. You inc it before llist_del_first, and
> then, assuming the inc is atomic across the interrupt boundary (which I think
> this_cpu_inc_return for bpf_prog_active is already assuming), the NMI prog
> will see the llist as busy and will fail its llist_del_first.
> llist_add should still be fine to allow.

Good idea. The per-cpu counter is faster and simpler.

> Technically we could fail llist_add instead, since multiple
> llist_del_first calls by themselves aren't an issue, but you can't fail
> bpf_mem_free (though you can fail bpf_mem_alloc), so it makes sense to
> protect only llist_del_first using the counter.

Right. We cannot fail in unit_free().
With a per-cpu counter both unit_alloc() and free_bulk_nmi() would
potentially fail in such an unlikely scenario.
Not a big deal for free_bulk_nmi(): it would pick up the element later.
For unit_alloc() returning NULL is normal,
especially since it's so unlikely for an NMI to hit right in the middle
of llist_del_first().

Since we'll add this per-cpu counter to solve the interrupted llist_del_first(),
it feels like the same counter can be used to protect unit_alloc/free/irq_work.
Then we wouldn't need free_llist_nmi; a single free_llist would be fine,
but unit_free() must not fail. If free_llist cannot be accessed
because the per-cpu counter is busy, we have to push the element somewhere else.
So it seems two lists are still necessary. Maybe it's still better?
Roughly I'm thinking of the following:
unit_alloc()
{
  llnode = NULL;
  local_irq_save(flags);
  if (__this_cpu_inc_return(c->alloc_active) != 1)
     /* we interrupted another user of free_llist on this cpu */
     goto out;
  llnode = __llist_del_first(&c->free_llist);
  if (llnode)
      cnt = --c->free_cnt;
out:
  __this_cpu_dec(c->alloc_active);
  local_irq_restore(flags);
  return llnode;
}
unit_free()
{
  local_irq_save(flags);
  if (__this_cpu_inc_return(c->alloc_active) != 1) {
     /* a free cannot fail, so push to the overflow list */
     llist_add(llnode, &c->free_llist_nmi);
     goto out;
  }
  __llist_add(llnode, &c->free_llist);
  cnt = ++c->free_cnt;
out:
  __this_cpu_dec(c->alloc_active);
  local_irq_restore(flags);
}
alloc_bulk and free_bulk would be protected by alloc_active as well.
alloc_bulk_nmi is gone.
free_bulk_nmi is still there to drain the unlucky unit_free,
but it's now the only caller of llist_del_first() on that list and it just
frees anything that is in free_llist_nmi.
The main advantage is that free_llist_nmi doesn't need to be prefilled.
It will be empty most of the time.
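
The drain side could then be as simple as this sketch (free_one() is just a
placeholder name for whatever hands a single element back to the kernel):
free_bulk_nmi()
{
  /* the only llist_del_first() caller on free_llist_nmi;
   * concurrent unit_free() only llist_add()s to this list,
   * which is safe against a single llist_del_first() user
   */
  while ((llnode = llist_del_first(&c->free_llist_nmi)))
     free_one(c, llnode);
}
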
wdyt?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 bpf-next 01/12] bpf: Introduce any context BPF specific memory allocator.
  2022-08-18 22:30         ` Alexei Starovoitov
@ 2022-08-19 14:31           ` Kumar Kartikeya Dwivedi
  2022-08-19 17:51             ` Alexei Starovoitov
  0 siblings, 1 reply; 19+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-19 14:31 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: davem, daniel, andrii, tj, delyank, linux-mm, bpf, kernel-team

On Fri, 19 Aug 2022 at 00:30, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> Right. We cannot fail in unit_free().
> With a per-cpu counter both unit_alloc() and free_bulk_nmi() would
> potentially fail in such an unlikely scenario.
> Not a big deal for free_bulk_nmi(): it would pick up the element later.
> For unit_alloc() returning NULL is normal,
> especially since it's so unlikely for an NMI to hit right in the middle
> of llist_del_first().
>
> Since we'll add this per-cpu counter to solve the interrupted llist_del_first(),
> it feels like the same counter can be used to protect unit_alloc/free/irq_work.
> Then we wouldn't need free_llist_nmi; a single free_llist would be fine,
> but unit_free() must not fail. If free_llist cannot be accessed
> because the per-cpu counter is busy, we have to push the element somewhere else.
> So it seems two lists are still necessary. Maybe it's still better?
> Roughly I'm thinking of the following:
> unit_alloc()
> {
>   llnode = NULL;
>   local_irq_save(flags);
>   if (__this_cpu_inc_return(c->alloc_active) != 1)
>      /* we interrupted another user of free_llist on this cpu */
>      goto out;
>   llnode = __llist_del_first(&c->free_llist);
>   if (llnode)
>       cnt = --c->free_cnt;
> out:
>   __this_cpu_dec(c->alloc_active);
>   local_irq_restore(flags);
>   return llnode;
> }
> unit_free()
> {
>   local_irq_save(flags);
>   if (__this_cpu_inc_return(c->alloc_active) != 1) {
>      /* a free cannot fail, so push to the overflow list */
>      llist_add(llnode, &c->free_llist_nmi);
>      goto out;
>   }
>   __llist_add(llnode, &c->free_llist);
>   cnt = ++c->free_cnt;
> out:
>   __this_cpu_dec(c->alloc_active);
>   local_irq_restore(flags);
> }
> alloc_bulk and free_bulk would be protected by alloc_active as well.
> alloc_bulk_nmi is gone.
> free_bulk_nmi is still there to drain the unlucky unit_free,
> but it's now the only caller of llist_del_first() on that list and it just
> frees anything that is in free_llist_nmi.
> The main advantage is that free_llist_nmi doesn't need to be prefilled.
> It will be empty most of the time.
> wdyt?

Looks great! The other option would be to not have the overflow
free_llist_nmi list and just allow llist_add to free_llist from the
NMI case even if we interrupt llist_del_first, but then the non-NMI
user needs to use the atomic llist_add version as well (since we may
interrupt it), which won't be great for performance. So having the
extra list is much better.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 bpf-next 01/12] bpf: Introduce any context BPF specific memory allocator.
  2022-08-19 14:31           ` Kumar Kartikeya Dwivedi
@ 2022-08-19 17:51             ` Alexei Starovoitov
  0 siblings, 0 replies; 19+ messages in thread
From: Alexei Starovoitov @ 2022-08-19 17:51 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: davem, daniel, andrii, tj, delyank, linux-mm, bpf, kernel-team

On Fri, Aug 19, 2022 at 04:31:11PM +0200, Kumar Kartikeya Dwivedi wrote:
> On Fri, 19 Aug 2022 at 00:30, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > Right. We cannot fail in unit_free().
> > With a per-cpu counter both unit_alloc() and free_bulk_nmi() would
> > potentially fail in such an unlikely scenario.
> > Not a big deal for free_bulk_nmi(): it would pick up the element later.
> > For unit_alloc() returning NULL is normal,
> > especially since it's so unlikely for an NMI to hit right in the middle
> > of llist_del_first().
> >
> > Since we'll add this per-cpu counter to solve the interrupted llist_del_first(),
> > it feels like the same counter can be used to protect unit_alloc/free/irq_work.
> > Then we wouldn't need free_llist_nmi; a single free_llist would be fine,
> > but unit_free() must not fail. If free_llist cannot be accessed
> > because the per-cpu counter is busy, we have to push the element somewhere else.
> > So it seems two lists are still necessary. Maybe it's still better?
> > Roughly I'm thinking of the following:
> > unit_alloc()
> > {
> >   llnode = NULL;
> >   local_irq_save(flags);
> >   if (__this_cpu_inc_return(c->alloc_active) != 1)
> >      /* we interrupted another user of free_llist on this cpu */
> >      goto out;
> >   llnode = __llist_del_first(&c->free_llist);
> >   if (llnode)
> >       cnt = --c->free_cnt;
> > out:
> >   __this_cpu_dec(c->alloc_active);
> >   local_irq_restore(flags);
> >   return llnode;
> > }
> > unit_free()
> > {
> >   local_irq_save(flags);
> >   if (__this_cpu_inc_return(c->alloc_active) != 1) {
> >      /* a free cannot fail, so push to the overflow list */
> >      llist_add(llnode, &c->free_llist_nmi);
> >      goto out;
> >   }
> >   __llist_add(llnode, &c->free_llist);
> >   cnt = ++c->free_cnt;
> > out:
> >   __this_cpu_dec(c->alloc_active);
> >   local_irq_restore(flags);
> > }
> > alloc_bulk and free_bulk would be protected by alloc_active as well.
> > alloc_bulk_nmi is gone.
> > free_bulk_nmi is still there to drain the unlucky unit_free,
> > but it's now the only caller of llist_del_first() on that list and it just
> > frees anything that is in free_llist_nmi.
> > The main advantage is that free_llist_nmi doesn't need to be prefilled.
> > It will be empty most of the time.
> > wdyt?
> 
> Looks great! The other option would be to not have the overflow
> free_llist_nmi list and just allow llist_add to free_llist from the
> NMI case even if we interrupt llist_del_first, but then the non-NMI
> user needs to use the atomic llist_add version as well (since we may
> interrupt it),

not only llist_add, but unit_alloc would have to use the atomic llist_del_first too.
So every operation on the list would have to go through cmpxchg.
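
For reference, __llist_del_first() is just two plain loads and a store, while
the atomic llist_del_first() has to spin on cmpxchg, roughly like this
(paraphrasing lib/llist.c from memory, not a verbatim copy):

  entry = smp_load_acquire(&head->first);
  for (;;) {
     if (!entry)
        return NULL;
     old = entry;
     next = READ_ONCE(entry->next);
     entry = cmpxchg(&head->first, old, next);
     if (entry == old)
        return old;
     /* head->first changed under us; retry with the value cmpxchg returned */
  }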

> which won't be great for performance.

exactly.

> So having the
> extra list is much better.

yep. same thinking.
I'll refactor the patches and send v3 with this approach.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2022-08-19 18:05 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
2022-08-17 21:04 [PATCH v2 bpf-next 00/12] bpf: BPF specific memory allocator Alexei Starovoitov
2022-08-17 21:04 ` [PATCH v2 bpf-next 01/12] bpf: Introduce any context " Alexei Starovoitov
2022-08-17 23:51   ` Kumar Kartikeya Dwivedi
2022-08-18  0:39     ` Alexei Starovoitov
2022-08-18 12:38       ` Kumar Kartikeya Dwivedi
2022-08-18 22:30         ` Alexei Starovoitov
2022-08-19 14:31           ` Kumar Kartikeya Dwivedi
2022-08-19 17:51             ` Alexei Starovoitov
2022-08-17 21:04 ` [PATCH v2 bpf-next 02/12] bpf: Convert hash map to bpf_mem_alloc Alexei Starovoitov
2022-08-17 21:04 ` [PATCH v2 bpf-next 03/12] selftests/bpf: Improve test coverage of test_maps Alexei Starovoitov
2022-08-17 21:04 ` [PATCH v2 bpf-next 04/12] samples/bpf: Reduce syscall overhead in map_perf_test Alexei Starovoitov
2022-08-17 21:04 ` [PATCH v2 bpf-next 05/12] bpf: Relax the requirement to use preallocated hash maps in tracing progs Alexei Starovoitov
2022-08-17 21:04 ` [PATCH v2 bpf-next 06/12] bpf: Optimize element count in non-preallocated hash map Alexei Starovoitov
2022-08-17 21:04 ` [PATCH v2 bpf-next 07/12] bpf: Optimize call_rcu " Alexei Starovoitov
2022-08-17 21:04 ` [PATCH v2 bpf-next 08/12] bpf: Adjust low/high watermarks in bpf_mem_cache Alexei Starovoitov
2022-08-17 21:04 ` [PATCH v2 bpf-next 09/12] bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU Alexei Starovoitov
2022-08-17 21:04 ` [PATCH v2 bpf-next 10/12] bpf: Add percpu allocation support to bpf_mem_alloc Alexei Starovoitov
2022-08-17 21:04 ` [PATCH v2 bpf-next 11/12] bpf: Convert percpu hash map to per-cpu bpf_mem_alloc Alexei Starovoitov
2022-08-17 21:04 ` [PATCH v2 bpf-next 12/12] bpf: Remove tracing program restriction on map types Alexei Starovoitov
