* [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator.
@ 2022-08-19 21:42 Alexei Starovoitov
  2022-08-19 21:42 ` [PATCH v3 bpf-next 01/15] bpf: Introduce any context " Alexei Starovoitov
                   ` (16 more replies)
  0 siblings, 17 replies; 59+ messages in thread
From: Alexei Starovoitov @ 2022-08-19 21:42 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Introduce any context BPF specific memory allocator.

Tracing BPF programs can attach to kprobe and fentry. Hence they
run in unknown context where calling plain kmalloc() might not be safe.
Front-end kmalloc() with per-cpu cache of free elements.
Refill this cache asynchronously from irq_work.
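
To make the API concrete, here is a minimal usage sketch based on the
interface introduced in patch 1 (names like 'elem_size' are placeholders and
error handling is simplified):

	struct bpf_mem_alloc ma;
	void *obj;

	/* create a kmem_cache plus per-cpu free lists for one element size */
	if (bpf_mem_alloc_init(&ma, elem_size))
		return -ENOMEM;

	/* safe in any context a bpf program runs in, including NMI */
	obj = bpf_mem_cache_alloc(&ma);
	if (obj)
		bpf_mem_cache_free(&ma, obj);

	/* on map destruction */
	bpf_mem_alloc_destroy(&ma);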

Major achievements enabled by bpf_mem_alloc:
- Dynamically allocated hash maps used to be 10 times slower than fully preallocated ones.
  With bpf_mem_alloc and subsequent optimizations the speed of dynamic maps now equals full prealloc.
- Tracing bpf programs can use dynamically allocated hash maps, potentially saving
  a lot of memory, since a typical hash map is sparsely populated.
- Sleepable bpf programs can use dynamically allocated hash maps.

v2->v3:
- Rewrote the free_list algorithm based on discussions with Kumar. Patch 1.
- Allowed sleepable bpf progs to use dynamically allocated maps. Patches 13 and 14.
- Added sysctl to force bpf_mem_alloc in hash map even if pre-alloc is
  requested to reduce memory consumption. Patch 15.
- Fixed zero-fill of percpu allocations.
- Switched to a single rcu_barrier at the end instead of one per cpu during bpf_mem_alloc destruction.

v2 thread:
https://lore.kernel.org/bpf/20220817210419.95560-1-alexei.starovoitov@gmail.com/

v1->v2:
- Moved unsafe direct call_rcu() from hash map into safe place inside bpf_mem_alloc. Patches 7 and 9.
- Optimized atomic_inc/dec in hash map with percpu_counter. Patch 6.
- Tuned watermarks per allocation size. Patch 8.
- Applied this approach to per-cpu allocation. Patch 10.
- Fully converted hash map to bpf_mem_alloc. Patch 11.
- Removed the tracing prog restriction on map types. This is the combined effect of all patches and the final patch 12.

v1 thread:
https://lore.kernel.org/bpf/20220623003230.37497-1-alexei.starovoitov@gmail.com/

LWN article:
https://lwn.net/Articles/899274/

Future work:
- expose bpf_mem_alloc as uapi FD to be used in dynptr_alloc, kptr_alloc
- convert lru map to bpf_mem_alloc

Alexei Starovoitov (15):
  bpf: Introduce any context BPF specific memory allocator.
  bpf: Convert hash map to bpf_mem_alloc.
  selftests/bpf: Improve test coverage of test_maps
  samples/bpf: Reduce syscall overhead in map_perf_test.
  bpf: Relax the requirement to use preallocated hash maps in tracing
    progs.
  bpf: Optimize element count in non-preallocated hash map.
  bpf: Optimize call_rcu in non-preallocated hash map.
  bpf: Adjust low/high watermarks in bpf_mem_cache
  bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
  bpf: Add percpu allocation support to bpf_mem_alloc.
  bpf: Convert percpu hash map to per-cpu bpf_mem_alloc.
  bpf: Remove tracing program restriction on map types
  bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.
  bpf: Remove prealloc-only restriction for sleepable bpf programs.
  bpf: Introduce sysctl kernel.bpf_force_dyn_alloc.

 include/linux/bpf_mem_alloc.h             |  26 +
 include/linux/filter.h                    |   2 +
 kernel/bpf/Makefile                       |   2 +-
 kernel/bpf/core.c                         |   2 +
 kernel/bpf/hashtab.c                      | 132 +++--
 kernel/bpf/memalloc.c                     | 601 ++++++++++++++++++++++
 kernel/bpf/syscall.c                      |  14 +-
 kernel/bpf/verifier.c                     |  52 --
 samples/bpf/map_perf_test_kern.c          |  44 +-
 samples/bpf/map_perf_test_user.c          |   2 +-
 tools/testing/selftests/bpf/progs/timer.c |  11 -
 tools/testing/selftests/bpf/test_maps.c   |  38 +-
 12 files changed, 795 insertions(+), 131 deletions(-)
 create mode 100644 include/linux/bpf_mem_alloc.h
 create mode 100644 kernel/bpf/memalloc.c

-- 
2.30.2



* [PATCH v3 bpf-next 01/15] bpf: Introduce any context BPF specific memory allocator.
  2022-08-19 21:42 [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
@ 2022-08-19 21:42 ` Alexei Starovoitov
  2022-08-19 21:42 ` [PATCH v3 bpf-next 02/15] bpf: Convert hash map to bpf_mem_alloc Alexei Starovoitov
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 59+ messages in thread
From: Alexei Starovoitov @ 2022-08-19 21:42 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Tracing BPF programs can attach to kprobe and fentry. Hence they
run in unknown context where calling plain kmalloc() might not be safe.

Front-end kmalloc() with minimal per-cpu cache of free elements.
Refill this cache asynchronously from irq_work.

BPF programs always run with migration disabled.
It's safe to allocate from cache of the current cpu with irqs disabled.
Free-ing is always done into the bucket of the current cpu as well.
irq_work trims extra free elements from buckets with kfree
and refills them with kmalloc, so global kmalloc logic takes care
of freeing objects allocated by one cpu and freed on another.

struct bpf_mem_alloc supports two modes:
- When size != 0 create kmem_cache and bpf_mem_cache for each cpu.
  This is typical bpf hash map use case when all elements have equal size.
- When size == 0 allocate 11 bpf_mem_cache-s for each cpu, then rely on
  kmalloc/kfree. Max allocation size is 4096 in this case.
  This is bpf_dynptr and bpf_kptr use case.

bpf_mem_alloc/bpf_mem_free are bpf specific 'wrappers' of kmalloc/kfree.
bpf_mem_cache_alloc/bpf_mem_cache_free are 'wrappers' of kmem_cache_alloc/kmem_cache_free.

The allocators are NMI-safe from bpf programs only. They are not NMI-safe in general.
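
For the size == 0 mode a rough sketch of what a caller sees (the sizes here
are only for illustration):

	struct bpf_mem_alloc ma;
	void *ptr;

	/* 11 kmalloc-style buckets of 16..4096 bytes per cpu */
	if (bpf_mem_alloc_init(&ma, 0))
		return -ENOMEM;

	/* a 40-byte request lands in the 64-byte bucket, since 8 bytes of
	 * llist_node padding are added internally
	 */
	ptr = bpf_mem_alloc(&ma, 40);
	if (ptr)
		bpf_mem_free(&ma, ptr);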

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf_mem_alloc.h |  26 ++
 kernel/bpf/Makefile           |   2 +-
 kernel/bpf/memalloc.c         | 475 ++++++++++++++++++++++++++++++++++
 3 files changed, 502 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/bpf_mem_alloc.h
 create mode 100644 kernel/bpf/memalloc.c

diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
new file mode 100644
index 000000000000..804733070f8d
--- /dev/null
+++ b/include/linux/bpf_mem_alloc.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+#ifndef _BPF_MEM_ALLOC_H
+#define _BPF_MEM_ALLOC_H
+#include <linux/compiler_types.h>
+
+struct bpf_mem_cache;
+struct bpf_mem_caches;
+
+struct bpf_mem_alloc {
+	struct bpf_mem_caches __percpu *caches;
+	struct bpf_mem_cache __percpu *cache;
+};
+
+int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size);
+void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma);
+
+/* kmalloc/kfree equivalent: */
+void *bpf_mem_alloc(struct bpf_mem_alloc *ma, size_t size);
+void bpf_mem_free(struct bpf_mem_alloc *ma, void *ptr);
+
+/* kmem_cache_alloc/free equivalent: */
+void *bpf_mem_cache_alloc(struct bpf_mem_alloc *ma);
+void bpf_mem_cache_free(struct bpf_mem_alloc *ma, void *ptr);
+
+#endif /* _BPF_MEM_ALLOC_H */
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 057ba8e01e70..11fb9220909b 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -13,7 +13,7 @@ obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
 obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
 obj-$(CONFIG_BPF_JIT) += trampoline.o
-obj-$(CONFIG_BPF_SYSCALL) += btf.o
+obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o
 obj-$(CONFIG_BPF_JIT) += dispatcher.o
 ifeq ($(CONFIG_NET),y)
 obj-$(CONFIG_BPF_SYSCALL) += devmap.o
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
new file mode 100644
index 000000000000..293380eaea41
--- /dev/null
+++ b/kernel/bpf/memalloc.c
@@ -0,0 +1,475 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+#include <linux/mm.h>
+#include <linux/llist.h>
+#include <linux/bpf.h>
+#include <linux/irq_work.h>
+#include <linux/bpf_mem_alloc.h>
+#include <linux/memcontrol.h>
+
+/* Any context (including NMI) BPF specific memory allocator.
+ *
+ * Tracing BPF programs can attach to kprobe and fentry. Hence they
+ * run in unknown context where calling plain kmalloc() might not be safe.
+ *
+ * Front-end kmalloc() with per-cpu per-bucket cache of free elements.
+ * Refill this cache asynchronously from irq_work.
+ *
+ * CPU_0 buckets
+ * 16 32 64 96 128 192 256 512 1024 2048 4096
+ * ...
+ * CPU_N buckets
+ * 16 32 64 96 128 192 256 512 1024 2048 4096
+ *
+ * The buckets are prefilled at the start.
+ * BPF programs always run with migration disabled.
+ * It's safe to allocate from cache of the current cpu with irqs disabled.
+ * Free-ing is always done into bucket of the current cpu as well.
+ * irq_work trims extra free elements from buckets with kfree
+ * and refills them with kmalloc, so global kmalloc logic takes care
+ * of freeing objects allocated by one cpu and freed on another.
+ *
+ * Every allocated object is padded with extra 8 bytes that contains
+ * struct llist_node.
+ */
+#define LLIST_NODE_SZ sizeof(struct llist_node)
+
+/* similar to kmalloc, but sizeof == 8 bucket is gone */
+static u8 size_index[24] __ro_after_init = {
+	3,	/* 8 */
+	3,	/* 16 */
+	4,	/* 24 */
+	4,	/* 32 */
+	5,	/* 40 */
+	5,	/* 48 */
+	5,	/* 56 */
+	5,	/* 64 */
+	1,	/* 72 */
+	1,	/* 80 */
+	1,	/* 88 */
+	1,	/* 96 */
+	6,	/* 104 */
+	6,	/* 112 */
+	6,	/* 120 */
+	6,	/* 128 */
+	2,	/* 136 */
+	2,	/* 144 */
+	2,	/* 152 */
+	2,	/* 160 */
+	2,	/* 168 */
+	2,	/* 176 */
+	2,	/* 184 */
+	2	/* 192 */
+};
+
+static int bpf_mem_cache_idx(size_t size)
+{
+	if (!size || size > 4096)
+		return -1;
+
+	if (size <= 192)
+		return size_index[(size - 1) / 8] - 1;
+
+	return fls(size - 1) - 1;
+}
+
+#define NUM_CACHES 11
+
+struct bpf_mem_cache {
+	/* per-cpu list of free objects of size 'unit_size'.
+	 * All accesses are done with interrupts disabled and 'active' counter
+	 * protection with __llist_add() and __llist_del_first().
+	 */
+	struct llist_head free_llist;
+	local_t active;
+
+	/* Operations on the free_list from unit_alloc/unit_free/bpf_mem_refill
+	 * are sequenced by per-cpu 'active' counter. But unit_free() cannot
+	 * fail. When 'active' is busy the unit_free() will add an object to
+	 * free_llist_extra.
+	 */
+	struct llist_head free_llist_extra;
+
+	/* kmem_cache != NULL when bpf_mem_alloc was created for specific
+	 * element size.
+	 */
+	struct kmem_cache *kmem_cache;
+	struct irq_work refill_work;
+	struct obj_cgroup *objcg;
+	int unit_size;
+	/* count of objects in free_llist */
+	int free_cnt;
+};
+
+struct bpf_mem_caches {
+	struct bpf_mem_cache cache[NUM_CACHES];
+};
+
+static struct llist_node notrace *__llist_del_first(struct llist_head *head)
+{
+	struct llist_node *entry, *next;
+
+	entry = head->first;
+	if (!entry)
+		return NULL;
+	next = entry->next;
+	head->first = next;
+	return entry;
+}
+
+#define BATCH 48
+#define LOW_WATERMARK 32
+#define HIGH_WATERMARK 96
+/* Assuming the average number of elements per bucket is 64, when all buckets
+ * are used the total memory will be: 64*16*32 + 64*32*32 + 64*64*32 + ... +
+ * 64*4096*32 ~ 20Mbyte
+ */
+
+static void *__alloc(struct bpf_mem_cache *c, int node)
+{
+	/* Allocate, but don't deplete atomic reserves that typical
+	 * GFP_ATOMIC would do. irq_work runs on this cpu and kmalloc
+	 * will allocate from the current numa node which is what we
+	 * want here.
+	 */
+	gfp_t flags = GFP_NOWAIT | __GFP_NOWARN | __GFP_ACCOUNT;
+
+	if (c->kmem_cache)
+		return kmem_cache_alloc_node(c->kmem_cache, flags, node);
+
+	return kmalloc_node(c->unit_size, flags, node);
+}
+
+static struct mem_cgroup *get_memcg(const struct bpf_mem_cache *c)
+{
+#ifdef CONFIG_MEMCG_KMEM
+	if (c->objcg)
+		return get_mem_cgroup_from_objcg(c->objcg);
+#endif
+
+#ifdef CONFIG_MEMCG
+	return root_mem_cgroup;
+#else
+	return NULL;
+#endif
+}
+
+/* Mostly runs from irq_work except __init phase. */
+static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
+{
+	struct mem_cgroup *memcg = NULL, *old_memcg;
+	unsigned long flags;
+	void *obj;
+	int i;
+
+	memcg = get_memcg(c);
+	old_memcg = set_active_memcg(memcg);
+	for (i = 0; i < cnt; i++) {
+		obj = __alloc(c, node);
+		if (!obj)
+			break;
+		if (IS_ENABLED(CONFIG_PREEMPT_RT))
+			/* In RT irq_work runs in per-cpu kthread, so disable
+			 * interrupts to avoid preemption and interrupts and
+			 * reduce the chance of bpf prog executing on this cpu
+			 * when active counter is busy.
+			 */
+			local_irq_save(flags);
+		if (local_inc_return(&c->active) == 1) {
+			__llist_add(obj, &c->free_llist);
+			c->free_cnt++;
+		}
+		local_dec(&c->active);
+		if (IS_ENABLED(CONFIG_PREEMPT_RT))
+			local_irq_restore(flags);
+	}
+	set_active_memcg(old_memcg);
+	mem_cgroup_put(memcg);
+}
+
+static void free_one(struct bpf_mem_cache *c, void *obj)
+{
+	if (c->kmem_cache)
+		kmem_cache_free(c->kmem_cache, obj);
+	else
+		kfree(obj);
+}
+
+static void free_bulk(struct bpf_mem_cache *c)
+{
+	struct llist_node *llnode, *t;
+	unsigned long flags;
+	int cnt;
+
+	do {
+		if (IS_ENABLED(CONFIG_PREEMPT_RT))
+			local_irq_save(flags);
+		if (local_inc_return(&c->active) == 1) {
+			llnode = __llist_del_first(&c->free_llist);
+			if (llnode)
+				cnt = --c->free_cnt;
+			else
+				cnt = 0;
+		}
+		local_dec(&c->active);
+		if (IS_ENABLED(CONFIG_PREEMPT_RT))
+			local_irq_restore(flags);
+		free_one(c, llnode);
+	} while (cnt > (HIGH_WATERMARK + LOW_WATERMARK) / 2);
+
+	/* and drain free_llist_extra */
+	llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra))
+		free_one(c, llnode);
+}
+
+static void bpf_mem_refill(struct irq_work *work)
+{
+	struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache, refill_work);
+	int cnt;
+
+	/* Racy access to free_cnt. It doesn't need to be 100% accurate */
+	cnt = c->free_cnt;
+	if (cnt < LOW_WATERMARK)
+		/* irq_work runs on this cpu and kmalloc will allocate
+		 * from the current numa node which is what we want here.
+		 */
+		alloc_bulk(c, BATCH, NUMA_NO_NODE);
+	else if (cnt > HIGH_WATERMARK)
+		free_bulk(c);
+}
+
+static void notrace irq_work_raise(struct bpf_mem_cache *c)
+{
+	irq_work_queue(&c->refill_work);
+}
+
+static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu)
+{
+	init_irq_work(&c->refill_work, bpf_mem_refill);
+	/* To avoid consuming memory assume that 1st run of bpf
+	 * prog won't be doing more than 4 map_update_elem from
+	 * irq disabled region
+	 */
+	alloc_bulk(c, c->unit_size <= 256 ? 4 : 1, cpu_to_node(cpu));
+}
+
+/* When size != 0 create kmem_cache and bpf_mem_cache for each cpu.
+ * This is typical bpf hash map use case when all elements have equal size.
+ *
+ * When size == 0 allocate 11 bpf_mem_cache-s for each cpu, then rely on
+ * kmalloc/kfree. Max allocation size is 4096 in this case.
+ * This is bpf_dynptr and bpf_kptr use case.
+ */
+int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size)
+{
+	static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
+	struct bpf_mem_caches *cc, __percpu *pcc;
+	struct bpf_mem_cache *c, __percpu *pc;
+	struct kmem_cache *kmem_cache;
+	struct obj_cgroup *objcg = NULL;
+	char buf[32];
+	int cpu, i;
+
+	if (size) {
+		pc = __alloc_percpu_gfp(sizeof(*pc), 8, GFP_KERNEL);
+		if (!pc)
+			return -ENOMEM;
+		size += LLIST_NODE_SZ; /* room for llist_node */
+		snprintf(buf, sizeof(buf), "bpf-%u", size);
+		kmem_cache = kmem_cache_create(buf, size, 8, 0, NULL);
+		if (!kmem_cache) {
+			free_percpu(pc);
+			return -ENOMEM;
+		}
+#ifdef CONFIG_MEMCG_KMEM
+		objcg = get_obj_cgroup_from_current();
+#endif
+		for_each_possible_cpu(cpu) {
+			c = per_cpu_ptr(pc, cpu);
+			c->kmem_cache = kmem_cache;
+			c->unit_size = size;
+			c->objcg = objcg;
+			prefill_mem_cache(c, cpu);
+		}
+		ma->cache = pc;
+		return 0;
+	}
+
+	pcc = __alloc_percpu_gfp(sizeof(*cc), 8, GFP_KERNEL);
+	if (!pcc)
+		return -ENOMEM;
+#ifdef CONFIG_MEMCG_KMEM
+	objcg = get_obj_cgroup_from_current();
+#endif
+	for_each_possible_cpu(cpu) {
+		cc = per_cpu_ptr(pcc, cpu);
+		for (i = 0; i < NUM_CACHES; i++) {
+			c = &cc->cache[i];
+			c->unit_size = sizes[i];
+			c->objcg = objcg;
+			prefill_mem_cache(c, cpu);
+		}
+	}
+	ma->caches = pcc;
+	return 0;
+}
+
+static void drain_mem_cache(struct bpf_mem_cache *c)
+{
+	struct llist_node *llnode, *t;
+
+	llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist))
+		free_one(c, llnode);
+	llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra))
+		free_one(c, llnode);
+}
+
+void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
+{
+	struct bpf_mem_caches *cc;
+	struct bpf_mem_cache *c;
+	int cpu, i;
+
+	if (ma->cache) {
+		for_each_possible_cpu(cpu) {
+			c = per_cpu_ptr(ma->cache, cpu);
+			drain_mem_cache(c);
+		}
+		/* kmem_cache and memcg are the same across cpus */
+		kmem_cache_destroy(c->kmem_cache);
+		if (c->objcg)
+			obj_cgroup_put(c->objcg);
+		free_percpu(ma->cache);
+		ma->cache = NULL;
+	}
+	if (ma->caches) {
+		for_each_possible_cpu(cpu) {
+			cc = per_cpu_ptr(ma->caches, cpu);
+			for (i = 0; i < NUM_CACHES; i++) {
+				c = &cc->cache[i];
+				drain_mem_cache(c);
+			}
+		}
+		if (c->objcg)
+			obj_cgroup_put(c->objcg);
+		free_percpu(ma->caches);
+		ma->caches = NULL;
+	}
+}
+
+/* notrace is necessary here and in other functions to make sure
+ * bpf programs cannot attach to them and cause llist corruptions.
+ */
+static void notrace *unit_alloc(struct bpf_mem_cache *c)
+{
+	struct llist_node *llnode = NULL;
+	unsigned long flags;
+	int cnt = 0;
+
+	/* Disable irqs to prevent the following race for majority of prog types:
+	 * prog_A
+	 *   bpf_mem_alloc
+	 *      preemption or irq -> prog_B
+	 *        bpf_mem_alloc
+	 *
+	 * but prog_B could be a perf_event NMI prog.
+	 * Use per-cpu 'active' counter to order free_list access between
+	 * unit_alloc/unit_free/bpf_mem_refill.
+	 */
+	local_irq_save(flags);
+	if (local_inc_return(&c->active) == 1) {
+		llnode = __llist_del_first(&c->free_llist);
+		if (llnode)
+			cnt = --c->free_cnt;
+	}
+	local_dec(&c->active);
+	local_irq_restore(flags);
+
+	WARN_ON(cnt < 0);
+
+	if (cnt < LOW_WATERMARK)
+		irq_work_raise(c);
+	return llnode;
+}
+
+/* Though 'ptr' object could have been allocated on a different cpu
+ * add it to the free_llist of the current cpu.
+ * Let kfree() logic deal with it when it's later called from irq_work.
+ */
+static void notrace unit_free(struct bpf_mem_cache *c, void *ptr)
+{
+	struct llist_node *llnode = ptr - LLIST_NODE_SZ;
+	unsigned long flags;
+	int cnt = 0;
+
+	BUILD_BUG_ON(LLIST_NODE_SZ > 8);
+
+	local_irq_save(flags);
+	if (local_inc_return(&c->active) == 1) {
+		__llist_add(llnode, &c->free_llist);
+		cnt = ++c->free_cnt;
+	} else {
+		/* unit_free() cannot fail. Therefore add an object to atomic
+		 * llist. free_bulk() will drain it. Though free_llist_extra is
+		 * a per-cpu list we have to use atomic llist_add here, since
+		 * it also can be interrupted by bpf nmi prog that does another
+		 * unit_free() into the same free_llist_extra.
+		 */
+		llist_add(llnode, &c->free_llist_extra);
+	}
+	local_dec(&c->active);
+	local_irq_restore(flags);
+
+	if (cnt > HIGH_WATERMARK)
+		/* free few objects from current cpu into global kmalloc pool */
+		irq_work_raise(c);
+}
+
+/* Called from BPF program or from sys_bpf syscall.
+ * In both cases migration is disabled.
+ */
+void notrace *bpf_mem_alloc(struct bpf_mem_alloc *ma, size_t size)
+{
+	int idx;
+	void *ret;
+
+	if (!size)
+		return ZERO_SIZE_PTR;
+
+	idx = bpf_mem_cache_idx(size + LLIST_NODE_SZ);
+	if (idx < 0)
+		return NULL;
+
+	ret = unit_alloc(this_cpu_ptr(ma->caches)->cache + idx);
+	return !ret ? NULL : ret + LLIST_NODE_SZ;
+}
+
+void notrace bpf_mem_free(struct bpf_mem_alloc *ma, void *ptr)
+{
+	int idx;
+
+	if (!ptr)
+		return;
+
+	idx = bpf_mem_cache_idx(__ksize(ptr - LLIST_NODE_SZ));
+	if (idx < 0)
+		return;
+
+	unit_free(this_cpu_ptr(ma->caches)->cache + idx, ptr);
+}
+
+void notrace *bpf_mem_cache_alloc(struct bpf_mem_alloc *ma)
+{
+	void *ret;
+
+	ret = unit_alloc(this_cpu_ptr(ma->cache));
+	return !ret ? NULL : ret + LLIST_NODE_SZ;
+}
+
+void notrace bpf_mem_cache_free(struct bpf_mem_alloc *ma, void *ptr)
+{
+	if (!ptr)
+		return;
+
+	unit_free(this_cpu_ptr(ma->cache), ptr);
+}
-- 
2.30.2



* [PATCH v3 bpf-next 02/15] bpf: Convert hash map to bpf_mem_alloc.
  2022-08-19 21:42 [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
  2022-08-19 21:42 ` [PATCH v3 bpf-next 01/15] bpf: Introduce any context " Alexei Starovoitov
@ 2022-08-19 21:42 ` Alexei Starovoitov
  2022-08-19 21:42 ` [PATCH v3 bpf-next 03/15] selftests/bpf: Improve test coverage of test_maps Alexei Starovoitov
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 59+ messages in thread
From: Alexei Starovoitov @ 2022-08-19 21:42 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Convert bpf hash map to use bpf memory allocator.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/hashtab.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index b301a63afa2f..bd23c8830d49 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -14,6 +14,7 @@
 #include "percpu_freelist.h"
 #include "bpf_lru_list.h"
 #include "map_in_map.h"
+#include <linux/bpf_mem_alloc.h>
 
 #define HTAB_CREATE_FLAG_MASK						\
 	(BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE |	\
@@ -92,6 +93,7 @@ struct bucket {
 
 struct bpf_htab {
 	struct bpf_map map;
+	struct bpf_mem_alloc ma;
 	struct bucket *buckets;
 	void *elems;
 	union {
@@ -563,6 +565,10 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 			if (err)
 				goto free_prealloc;
 		}
+	} else {
+		err = bpf_mem_alloc_init(&htab->ma, htab->elem_size);
+		if (err)
+			goto free_map_locked;
 	}
 
 	return &htab->map;
@@ -573,6 +579,7 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 	for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
 		free_percpu(htab->map_locked[i]);
 	bpf_map_area_free(htab->buckets);
+	bpf_mem_alloc_destroy(&htab->ma);
 free_htab:
 	lockdep_unregister_key(&htab->lockdep_key);
 	bpf_map_area_free(htab);
@@ -849,7 +856,7 @@ static void htab_elem_free(struct bpf_htab *htab, struct htab_elem *l)
 	if (htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH)
 		free_percpu(htab_elem_get_ptr(l, htab->map.key_size));
 	check_and_free_fields(htab, l);
-	kfree(l);
+	bpf_mem_cache_free(&htab->ma, l);
 }
 
 static void htab_elem_free_rcu(struct rcu_head *head)
@@ -973,9 +980,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 				l_new = ERR_PTR(-E2BIG);
 				goto dec_count;
 			}
-		l_new = bpf_map_kmalloc_node(&htab->map, htab->elem_size,
-					     GFP_NOWAIT | __GFP_NOWARN,
-					     htab->map.numa_node);
+		l_new = bpf_mem_cache_alloc(&htab->ma);
 		if (!l_new) {
 			l_new = ERR_PTR(-ENOMEM);
 			goto dec_count;
@@ -994,7 +999,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 			pptr = bpf_map_alloc_percpu(&htab->map, size, 8,
 						    GFP_NOWAIT | __GFP_NOWARN);
 			if (!pptr) {
-				kfree(l_new);
+				bpf_mem_cache_free(&htab->ma, l_new);
 				l_new = ERR_PTR(-ENOMEM);
 				goto dec_count;
 			}
@@ -1489,6 +1494,7 @@ static void htab_map_free(struct bpf_map *map)
 	bpf_map_free_kptr_off_tab(map);
 	free_percpu(htab->extra_elems);
 	bpf_map_area_free(htab->buckets);
+	bpf_mem_alloc_destroy(&htab->ma);
 	for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
 		free_percpu(htab->map_locked[i]);
 	lockdep_unregister_key(&htab->lockdep_key);
-- 
2.30.2



* [PATCH v3 bpf-next 03/15] selftests/bpf: Improve test coverage of test_maps
  2022-08-19 21:42 [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
  2022-08-19 21:42 ` [PATCH v3 bpf-next 01/15] bpf: Introduce any context " Alexei Starovoitov
  2022-08-19 21:42 ` [PATCH v3 bpf-next 02/15] bpf: Convert hash map to bpf_mem_alloc Alexei Starovoitov
@ 2022-08-19 21:42 ` Alexei Starovoitov
  2022-08-19 21:42 ` [PATCH v3 bpf-next 04/15] samples/bpf: Reduce syscall overhead in map_perf_test Alexei Starovoitov
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 59+ messages in thread
From: Alexei Starovoitov @ 2022-08-19 21:42 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Make test_maps more stressful with more parallelism in
update/delete/lookup/walk including different value sizes.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/testing/selftests/bpf/test_maps.c | 38 ++++++++++++++++---------
 1 file changed, 24 insertions(+), 14 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index cbebfaa7c1e8..d1ffc76814d9 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -264,10 +264,11 @@ static void test_hashmap_percpu(unsigned int task, void *data)
 	close(fd);
 }
 
+#define VALUE_SIZE 3
 static int helper_fill_hashmap(int max_entries)
 {
 	int i, fd, ret;
-	long long key, value;
+	long long key, value[VALUE_SIZE] = {};
 
 	fd = bpf_map_create(BPF_MAP_TYPE_HASH, NULL, sizeof(key), sizeof(value),
 			    max_entries, &map_opts);
@@ -276,8 +277,8 @@ static int helper_fill_hashmap(int max_entries)
 	      "err: %s, flags: 0x%x\n", strerror(errno), map_opts.map_flags);
 
 	for (i = 0; i < max_entries; i++) {
-		key = i; value = key;
-		ret = bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST);
+		key = i; value[0] = key;
+		ret = bpf_map_update_elem(fd, &key, value, BPF_NOEXIST);
 		CHECK(ret != 0,
 		      "can't update hashmap",
 		      "err: %s\n", strerror(ret));
@@ -288,8 +289,8 @@ static int helper_fill_hashmap(int max_entries)
 
 static void test_hashmap_walk(unsigned int task, void *data)
 {
-	int fd, i, max_entries = 1000;
-	long long key, value, next_key;
+	int fd, i, max_entries = 10000;
+	long long key, value[VALUE_SIZE], next_key;
 	bool next_key_valid = true;
 
 	fd = helper_fill_hashmap(max_entries);
@@ -297,7 +298,7 @@ static void test_hashmap_walk(unsigned int task, void *data)
 	for (i = 0; bpf_map_get_next_key(fd, !i ? NULL : &key,
 					 &next_key) == 0; i++) {
 		key = next_key;
-		assert(bpf_map_lookup_elem(fd, &key, &value) == 0);
+		assert(bpf_map_lookup_elem(fd, &key, value) == 0);
 	}
 
 	assert(i == max_entries);
@@ -305,9 +306,9 @@ static void test_hashmap_walk(unsigned int task, void *data)
 	assert(bpf_map_get_next_key(fd, NULL, &key) == 0);
 	for (i = 0; next_key_valid; i++) {
 		next_key_valid = bpf_map_get_next_key(fd, &key, &next_key) == 0;
-		assert(bpf_map_lookup_elem(fd, &key, &value) == 0);
-		value++;
-		assert(bpf_map_update_elem(fd, &key, &value, BPF_EXIST) == 0);
+		assert(bpf_map_lookup_elem(fd, &key, value) == 0);
+		value[0]++;
+		assert(bpf_map_update_elem(fd, &key, value, BPF_EXIST) == 0);
 		key = next_key;
 	}
 
@@ -316,8 +317,8 @@ static void test_hashmap_walk(unsigned int task, void *data)
 	for (i = 0; bpf_map_get_next_key(fd, !i ? NULL : &key,
 					 &next_key) == 0; i++) {
 		key = next_key;
-		assert(bpf_map_lookup_elem(fd, &key, &value) == 0);
-		assert(value - 1 == key);
+		assert(bpf_map_lookup_elem(fd, &key, value) == 0);
+		assert(value[0] - 1 == key);
 	}
 
 	assert(i == max_entries);
@@ -1371,16 +1372,16 @@ static void __run_parallel(unsigned int tasks,
 
 static void test_map_stress(void)
 {
+	run_parallel(100, test_hashmap_walk, NULL);
 	run_parallel(100, test_hashmap, NULL);
 	run_parallel(100, test_hashmap_percpu, NULL);
 	run_parallel(100, test_hashmap_sizes, NULL);
-	run_parallel(100, test_hashmap_walk, NULL);
 
 	run_parallel(100, test_arraymap, NULL);
 	run_parallel(100, test_arraymap_percpu, NULL);
 }
 
-#define TASKS 1024
+#define TASKS 100
 
 #define DO_UPDATE 1
 #define DO_DELETE 0
@@ -1432,6 +1433,8 @@ static void test_update_delete(unsigned int fn, void *data)
 	int fd = ((int *)data)[0];
 	int i, key, value, err;
 
+	if (fn & 1)
+		test_hashmap_walk(fn, NULL);
 	for (i = fn; i < MAP_SIZE; i += TASKS) {
 		key = value = i;
 
@@ -1455,7 +1458,7 @@ static void test_update_delete(unsigned int fn, void *data)
 
 static void test_map_parallel(void)
 {
-	int i, fd, key = 0, value = 0;
+	int i, fd, key = 0, value = 0, j = 0;
 	int data[2];
 
 	fd = bpf_map_create(BPF_MAP_TYPE_HASH, NULL, sizeof(key), sizeof(value),
@@ -1466,6 +1469,7 @@ static void test_map_parallel(void)
 		exit(1);
 	}
 
+again:
 	/* Use the same fd in children to add elements to this map:
 	 * child_0 adds key=0, key=1024, key=2048, ...
 	 * child_1 adds key=1, key=1025, key=2049, ...
@@ -1502,6 +1506,12 @@ static void test_map_parallel(void)
 	key = -1;
 	assert(bpf_map_get_next_key(fd, NULL, &key) < 0 && errno == ENOENT);
 	assert(bpf_map_get_next_key(fd, &key, &key) < 0 && errno == ENOENT);
+
+	key = 0;
+	bpf_map_delete_elem(fd, &key);
+	if (j++ < 5)
+		goto again;
+	close(fd);
 }
 
 static void test_map_rdonly(void)
-- 
2.30.2



* [PATCH v3 bpf-next 04/15] samples/bpf: Reduce syscall overhead in map_perf_test.
  2022-08-19 21:42 [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (2 preceding siblings ...)
  2022-08-19 21:42 ` [PATCH v3 bpf-next 03/15] selftests/bpf: Improve test coverage of test_maps Alexei Starovoitov
@ 2022-08-19 21:42 ` Alexei Starovoitov
  2022-08-19 21:42 ` [PATCH v3 bpf-next 05/15] bpf: Relax the requirement to use preallocated hash maps in tracing progs Alexei Starovoitov
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 59+ messages in thread
From: Alexei Starovoitov @ 2022-08-19 21:42 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Make map_perf_test for preallocated and non-preallocated hash map
spend more time inside bpf program to focus performance analysis
on the speed of update/lookup/delete operations performed by bpf program.

It makes 'perf report' for the bpf_mem_alloc case look like:
 11.76%  map_perf_test    [k] _raw_spin_lock_irqsave
 11.26%  map_perf_test    [k] htab_map_update_elem
  9.70%  map_perf_test    [k] _raw_spin_lock
  9.47%  map_perf_test    [k] htab_map_delete_elem
  8.57%  map_perf_test    [k] memcpy_erms
  5.58%  map_perf_test    [k] alloc_htab_elem
  4.09%  map_perf_test    [k] __htab_map_lookup_elem
  3.44%  map_perf_test    [k] syscall_exit_to_user_mode
  3.13%  map_perf_test    [k] lookup_nulls_elem_raw
  3.05%  map_perf_test    [k] migrate_enable
  3.04%  map_perf_test    [k] memcmp
  2.67%  map_perf_test    [k] unit_free
  2.39%  map_perf_test    [k] lookup_elem_raw

Reduce default iteration count as well to make 'map_perf_test' quick enough
even on debug kernels.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 samples/bpf/map_perf_test_kern.c | 44 ++++++++++++++++++++------------
 samples/bpf/map_perf_test_user.c |  2 +-
 2 files changed, 29 insertions(+), 17 deletions(-)

diff --git a/samples/bpf/map_perf_test_kern.c b/samples/bpf/map_perf_test_kern.c
index 8773f22b6a98..7342c5b2f278 100644
--- a/samples/bpf/map_perf_test_kern.c
+++ b/samples/bpf/map_perf_test_kern.c
@@ -108,11 +108,14 @@ int stress_hmap(struct pt_regs *ctx)
 	u32 key = bpf_get_current_pid_tgid();
 	long init_val = 1;
 	long *value;
+	int i;
 
-	bpf_map_update_elem(&hash_map, &key, &init_val, BPF_ANY);
-	value = bpf_map_lookup_elem(&hash_map, &key);
-	if (value)
-		bpf_map_delete_elem(&hash_map, &key);
+	for (i = 0; i < 10; i++) {
+		bpf_map_update_elem(&hash_map, &key, &init_val, BPF_ANY);
+		value = bpf_map_lookup_elem(&hash_map, &key);
+		if (value)
+			bpf_map_delete_elem(&hash_map, &key);
+	}
 
 	return 0;
 }
@@ -123,11 +126,14 @@ int stress_percpu_hmap(struct pt_regs *ctx)
 	u32 key = bpf_get_current_pid_tgid();
 	long init_val = 1;
 	long *value;
+	int i;
 
-	bpf_map_update_elem(&percpu_hash_map, &key, &init_val, BPF_ANY);
-	value = bpf_map_lookup_elem(&percpu_hash_map, &key);
-	if (value)
-		bpf_map_delete_elem(&percpu_hash_map, &key);
+	for (i = 0; i < 10; i++) {
+		bpf_map_update_elem(&percpu_hash_map, &key, &init_val, BPF_ANY);
+		value = bpf_map_lookup_elem(&percpu_hash_map, &key);
+		if (value)
+			bpf_map_delete_elem(&percpu_hash_map, &key);
+	}
 	return 0;
 }
 
@@ -137,11 +143,14 @@ int stress_hmap_alloc(struct pt_regs *ctx)
 	u32 key = bpf_get_current_pid_tgid();
 	long init_val = 1;
 	long *value;
+	int i;
 
-	bpf_map_update_elem(&hash_map_alloc, &key, &init_val, BPF_ANY);
-	value = bpf_map_lookup_elem(&hash_map_alloc, &key);
-	if (value)
-		bpf_map_delete_elem(&hash_map_alloc, &key);
+	for (i = 0; i < 10; i++) {
+		bpf_map_update_elem(&hash_map_alloc, &key, &init_val, BPF_ANY);
+		value = bpf_map_lookup_elem(&hash_map_alloc, &key);
+		if (value)
+			bpf_map_delete_elem(&hash_map_alloc, &key);
+	}
 	return 0;
 }
 
@@ -151,11 +160,14 @@ int stress_percpu_hmap_alloc(struct pt_regs *ctx)
 	u32 key = bpf_get_current_pid_tgid();
 	long init_val = 1;
 	long *value;
+	int i;
 
-	bpf_map_update_elem(&percpu_hash_map_alloc, &key, &init_val, BPF_ANY);
-	value = bpf_map_lookup_elem(&percpu_hash_map_alloc, &key);
-	if (value)
-		bpf_map_delete_elem(&percpu_hash_map_alloc, &key);
+	for (i = 0; i < 10; i++) {
+		bpf_map_update_elem(&percpu_hash_map_alloc, &key, &init_val, BPF_ANY);
+		value = bpf_map_lookup_elem(&percpu_hash_map_alloc, &key);
+		if (value)
+			bpf_map_delete_elem(&percpu_hash_map_alloc, &key);
+	}
 	return 0;
 }
 
diff --git a/samples/bpf/map_perf_test_user.c b/samples/bpf/map_perf_test_user.c
index b6fc174ab1f2..1bb53f4b29e1 100644
--- a/samples/bpf/map_perf_test_user.c
+++ b/samples/bpf/map_perf_test_user.c
@@ -72,7 +72,7 @@ static int test_flags = ~0;
 static uint32_t num_map_entries;
 static uint32_t inner_lru_hash_size;
 static int lru_hash_lookup_test_entries = 32;
-static uint32_t max_cnt = 1000000;
+static uint32_t max_cnt = 10000;
 
 static int check_test_flags(enum test_type t)
 {
-- 
2.30.2



* [PATCH v3 bpf-next 05/15] bpf: Relax the requirement to use preallocated hash maps in tracing progs.
  2022-08-19 21:42 [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (3 preceding siblings ...)
  2022-08-19 21:42 ` [PATCH v3 bpf-next 04/15] samples/bpf: Reduce syscall overhead in map_perf_test Alexei Starovoitov
@ 2022-08-19 21:42 ` Alexei Starovoitov
  2022-08-19 21:42 ` [PATCH v3 bpf-next 06/15] bpf: Optimize element count in non-preallocated hash map Alexei Starovoitov
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 59+ messages in thread
From: Alexei Starovoitov @ 2022-08-19 21:42 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Since the bpf hash map was converted to use bpf_mem_alloc, it is safe to use
from tracing programs and in RT kernels.
But per-cpu hash map is still using dynamic allocation for per-cpu map
values, hence keep the warning for this map type.
In the future alloc_percpu_gfp can be front-end-ed with bpf_mem_cache
and this restriction will be completely lifted.
perf_event (NMI) bpf programs have to use preallocated hash maps,
because free_htab_elem() is using call_rcu which might crash if re-entered.

Sleepable bpf programs have to use preallocated hash maps, because
life time of the map elements is not protected by rcu_read_lock/unlock.
This restriction can be lifted in the future as well.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/verifier.c | 31 ++++++++++++++++++++++---------
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 2c1f8069f7b7..d785f29047d7 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -12605,10 +12605,12 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
 	 * For programs attached to PERF events this is mandatory as the
 	 * perf NMI can hit any arbitrary code sequence.
 	 *
-	 * All other trace types using preallocated hash maps are unsafe as
-	 * well because tracepoint or kprobes can be inside locked regions
-	 * of the memory allocator or at a place where a recursion into the
-	 * memory allocator would see inconsistent state.
+	 * All other trace types using non-preallocated per-cpu hash maps are
+	 * unsafe as well because tracepoint or kprobes can be inside locked
+	 * regions of the per-cpu memory allocator or at a place where a
+	 * recursion into the per-cpu memory allocator would see inconsistent
+	 * state. Non per-cpu hash maps are using bpf_mem_alloc-tor which is
+	 * safe to use from kprobe/fentry and in RT.
 	 *
 	 * On RT enabled kernels run-time allocation of all trace type
 	 * programs is strictly prohibited due to lock type constraints. On
@@ -12618,15 +12620,26 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
 	 */
 	if (is_tracing_prog_type(prog_type) && !is_preallocated_map(map)) {
 		if (prog_type == BPF_PROG_TYPE_PERF_EVENT) {
+			/* perf_event bpf progs have to use preallocated hash maps
+			 * because non-prealloc is still relying on call_rcu to free
+			 * elements.
+			 */
 			verbose(env, "perf_event programs can only use preallocated hash map\n");
 			return -EINVAL;
 		}
-		if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
-			verbose(env, "trace type programs can only use preallocated hash map\n");
-			return -EINVAL;
+		if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
+		    (map->inner_map_meta &&
+		     map->inner_map_meta->map_type == BPF_MAP_TYPE_PERCPU_HASH)) {
+			if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
+				verbose(env,
+					"trace type programs can only use preallocated per-cpu hash map\n");
+				return -EINVAL;
+			}
+			WARN_ONCE(1, "trace type BPF program uses run-time allocation\n");
+			verbose(env,
+				"trace type programs with run-time allocated per-cpu hash maps are unsafe."
+				" Switch to preallocated hash maps.\n");
 		}
-		WARN_ONCE(1, "trace type BPF program uses run-time allocation\n");
-		verbose(env, "trace type programs with run-time allocated hash maps are unsafe. Switch to preallocated hash maps.\n");
 	}
 
 	if (map_value_has_spin_lock(map)) {
-- 
2.30.2



* [PATCH v3 bpf-next 06/15] bpf: Optimize element count in non-preallocated hash map.
  2022-08-19 21:42 [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (4 preceding siblings ...)
  2022-08-19 21:42 ` [PATCH v3 bpf-next 05/15] bpf: Relax the requirement to use preallocated hash maps in tracing progs Alexei Starovoitov
@ 2022-08-19 21:42 ` Alexei Starovoitov
  2022-08-19 21:42 ` [PATCH v3 bpf-next 07/15] bpf: Optimize call_rcu " Alexei Starovoitov
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 59+ messages in thread
From: Alexei Starovoitov @ 2022-08-19 21:42 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

The atomic_inc/dec might cause extreme cache line bouncing when multiple cpus
access the same bpf map. Based on specified max_entries for the hash map
calculate when percpu_counter becomes faster than atomic_t and use it for such
maps. For example samples/bpf/map_perf_test is using hash map with max_entries
1000. On a system with 16 cpus the 'map_perf_test 4' shows 14k events per
second using atomic_t. On a system with 15 cpus it shows 100k events per second
using percpu. map_perf_test is an extreme case where all cpus collide on
atomic_t, which causes extreme cache bouncing. Note that the slow path of
percpu_counter is 5k events per second vs 14k for atomic, so the heuristic is
necessary. See the comment in the code for why the heuristic is based on
num_online_cpus().
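
The heuristic itself is a one-line check in htab_map_alloc(), quoted from the
patch below for quick reference:

	/* with 64 online cpus and PERCPU_COUNTER_BATCH == 32 this picks
	 * percpu_counter for any map with max_entries > 4096
	 */
	if (attr->max_entries / 2 > num_online_cpus() * PERCPU_COUNTER_BATCH)
		htab->use_percpu_counter = true;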

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/hashtab.c | 70 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 62 insertions(+), 8 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index bd23c8830d49..8f68c6e13339 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -101,7 +101,12 @@ struct bpf_htab {
 		struct bpf_lru lru;
 	};
 	struct htab_elem *__percpu *extra_elems;
-	atomic_t count;	/* number of elements in this hashtable */
+	/* number of elements in non-preallocated hashtable are kept
+	 * in either pcount or count
+	 */
+	struct percpu_counter pcount;
+	atomic_t count;
+	bool use_percpu_counter;
 	u32 n_buckets;	/* number of hash buckets */
 	u32 elem_size;	/* size of each element in bytes */
 	u32 hashrnd;
@@ -552,6 +557,29 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 
 	htab_init_buckets(htab);
 
+/* compute_batch_value() computes batch value as num_online_cpus() * 2
+ * and __percpu_counter_compare() needs
+ * htab->max_entries - cur_number_of_elems to be more than batch * num_online_cpus()
+ * for percpu_counter to be faster than atomic_t. In practice the average bpf
+ * hash map size is 10k, which means that a system with 64 cpus will fill
+ * hashmap to 20% of 10k before percpu_counter becomes ineffective. Therefore
+ * define our own batch count as 32 then 10k hash map can be filled up to 80%:
+ * 10k - 8k > 32 _batch_ * 64 _cpus_
+ * and __percpu_counter_compare() will still be fast. At that point hash map
+ * collisions will dominate its performance anyway. Assume that hash map filled
+ * to 50+% isn't going to be O(1) and use the following formula to choose
+ * between percpu_counter and atomic_t.
+ */
+#define PERCPU_COUNTER_BATCH 32
+	if (attr->max_entries / 2 > num_online_cpus() * PERCPU_COUNTER_BATCH)
+		htab->use_percpu_counter = true;
+
+	if (htab->use_percpu_counter) {
+		err = percpu_counter_init(&htab->pcount, 0, GFP_KERNEL);
+		if (err)
+			goto free_map_locked;
+	}
+
 	if (prealloc) {
 		err = prealloc_init(htab);
 		if (err)
@@ -878,6 +906,31 @@ static void htab_put_fd_value(struct bpf_htab *htab, struct htab_elem *l)
 	}
 }
 
+static bool is_map_full(struct bpf_htab *htab)
+{
+	if (htab->use_percpu_counter)
+		return __percpu_counter_compare(&htab->pcount, htab->map.max_entries,
+						PERCPU_COUNTER_BATCH) >= 0;
+	return atomic_read(&htab->count) >= htab->map.max_entries;
+}
+
+static void inc_elem_count(struct bpf_htab *htab)
+{
+	if (htab->use_percpu_counter)
+		percpu_counter_add_batch(&htab->pcount, 1, PERCPU_COUNTER_BATCH);
+	else
+		atomic_inc(&htab->count);
+}
+
+static void dec_elem_count(struct bpf_htab *htab)
+{
+	if (htab->use_percpu_counter)
+		percpu_counter_add_batch(&htab->pcount, -1, PERCPU_COUNTER_BATCH);
+	else
+		atomic_dec(&htab->count);
+}
+
+
 static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
 {
 	htab_put_fd_value(htab, l);
@@ -886,7 +939,7 @@ static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
 		check_and_free_fields(htab, l);
 		__pcpu_freelist_push(&htab->freelist, &l->fnode);
 	} else {
-		atomic_dec(&htab->count);
+		dec_elem_count(htab);
 		l->htab = htab;
 		call_rcu(&l->rcu, htab_elem_free_rcu);
 	}
@@ -970,16 +1023,15 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 			l_new = container_of(l, struct htab_elem, fnode);
 		}
 	} else {
-		if (atomic_inc_return(&htab->count) > htab->map.max_entries)
-			if (!old_elem) {
+		if (is_map_full(htab))
+			if (!old_elem)
 				/* when map is full and update() is replacing
 				 * old element, it's ok to allocate, since
 				 * old element will be freed immediately.
 				 * Otherwise return an error
 				 */
-				l_new = ERR_PTR(-E2BIG);
-				goto dec_count;
-			}
+				return ERR_PTR(-E2BIG);
+		inc_elem_count(htab);
 		l_new = bpf_mem_cache_alloc(&htab->ma);
 		if (!l_new) {
 			l_new = ERR_PTR(-ENOMEM);
@@ -1021,7 +1073,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 	l_new->hash = hash;
 	return l_new;
 dec_count:
-	atomic_dec(&htab->count);
+	dec_elem_count(htab);
 	return l_new;
 }
 
@@ -1495,6 +1547,8 @@ static void htab_map_free(struct bpf_map *map)
 	free_percpu(htab->extra_elems);
 	bpf_map_area_free(htab->buckets);
 	bpf_mem_alloc_destroy(&htab->ma);
+	if (htab->use_percpu_counter)
+		percpu_counter_destroy(&htab->pcount);
 	for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
 		free_percpu(htab->map_locked[i]);
 	lockdep_unregister_key(&htab->lockdep_key);
-- 
2.30.2



* [PATCH v3 bpf-next 07/15] bpf: Optimize call_rcu in non-preallocated hash map.
  2022-08-19 21:42 [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (5 preceding siblings ...)
  2022-08-19 21:42 ` [PATCH v3 bpf-next 06/15] bpf: Optimize element count in non-preallocated hash map Alexei Starovoitov
@ 2022-08-19 21:42 ` Alexei Starovoitov
  2022-08-19 21:42 ` [PATCH v3 bpf-next 08/15] bpf: Adjust low/high watermarks in bpf_mem_cache Alexei Starovoitov
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 59+ messages in thread
From: Alexei Starovoitov @ 2022-08-19 21:42 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Doing call_rcu() a million times a second becomes a bottleneck.
Convert non-preallocated hash map from call_rcu to SLAB_TYPESAFE_BY_RCU.
The rcu critical section is no longer observed for one htab element,
which makes the non-preallocated hash map behave just like the preallocated hash map.
The map elements are released back to kernel memory only after an rcu
grace period has elapsed.
This improves 'map_perf_test 4' performance from 100k events per second
to 250k events per second.

bpf_mem_alloc + percpu_counter + typesafe_by_rcu provide a 10x performance
boost to the non-preallocated hash map and bring it within a few % of the
preallocated map while consuming a fraction of the memory.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/hashtab.c                      |  8 ++++++--
 kernel/bpf/memalloc.c                     |  2 +-
 tools/testing/selftests/bpf/progs/timer.c | 11 -----------
 3 files changed, 7 insertions(+), 14 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 8f68c6e13339..299ab98f9811 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -940,8 +940,12 @@ static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
 		__pcpu_freelist_push(&htab->freelist, &l->fnode);
 	} else {
 		dec_elem_count(htab);
-		l->htab = htab;
-		call_rcu(&l->rcu, htab_elem_free_rcu);
+		if (htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH) {
+			l->htab = htab;
+			call_rcu(&l->rcu, htab_elem_free_rcu);
+		} else {
+			htab_elem_free(htab, l);
+		}
 	}
 }
 
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 293380eaea41..cfa07f539eda 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -276,7 +276,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size)
 			return -ENOMEM;
 		size += LLIST_NODE_SZ; /* room for llist_node */
 		snprintf(buf, sizeof(buf), "bpf-%u", size);
-		kmem_cache = kmem_cache_create(buf, size, 8, 0, NULL);
+		kmem_cache = kmem_cache_create(buf, size, 8, SLAB_TYPESAFE_BY_RCU, NULL);
 		if (!kmem_cache) {
 			free_percpu(pc);
 			return -ENOMEM;
diff --git a/tools/testing/selftests/bpf/progs/timer.c b/tools/testing/selftests/bpf/progs/timer.c
index 5f5309791649..0053c5402173 100644
--- a/tools/testing/selftests/bpf/progs/timer.c
+++ b/tools/testing/selftests/bpf/progs/timer.c
@@ -208,17 +208,6 @@ static int timer_cb2(void *map, int *key, struct hmap_elem *val)
 		 */
 		bpf_map_delete_elem(map, key);
 
-		/* in non-preallocated hashmap both 'key' and 'val' are RCU
-		 * protected and still valid though this element was deleted
-		 * from the map. Arm this timer for ~35 seconds. When callback
-		 * finishes the call_rcu will invoke:
-		 * htab_elem_free_rcu
-		 *   check_and_free_timer
-		 *     bpf_timer_cancel_and_free
-		 * to cancel this 35 second sleep and delete the timer for real.
-		 */
-		if (bpf_timer_start(&val->timer, 1ull << 35, 0) != 0)
-			err |= 256;
 		ok |= 4;
 	}
 	return 0;
-- 
2.30.2



* [PATCH v3 bpf-next 08/15] bpf: Adjust low/high watermarks in bpf_mem_cache
  2022-08-19 21:42 [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (6 preceding siblings ...)
  2022-08-19 21:42 ` [PATCH v3 bpf-next 07/15] bpf: Optimize call_rcu " Alexei Starovoitov
@ 2022-08-19 21:42 ` Alexei Starovoitov
  2022-08-19 21:42 ` [PATCH v3 bpf-next 09/15] bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU Alexei Starovoitov
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 59+ messages in thread
From: Alexei Starovoitov @ 2022-08-19 21:42 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Using the same low/high watermarks for every bucket in bpf_mem_cache consumes
a significant amount of memory. Preallocating 64 elements of 4096 bytes each in
the free list is not efficient. Make the low/high watermarks and batching value
dependent on the element size. This change brings significant memory savings.
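
As a concrete data point, plugging unit_size == 4096 into the formulas added
to prefill_mem_cache() below gives (a worked example, not part of the patch):

	low_watermark  = max(32 * 256 / 4096, 1);			/* == 2 */
	high_watermark = max(96 * 256 / 4096, 3);			/* == 6 */
	batch = max((high_watermark - low_watermark) / 4 * 3, 1);	/* == 3 */

so only a handful of 4k elements sit in each per-cpu free list, instead of
the previous fixed 32..96.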

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/memalloc.c | 50 +++++++++++++++++++++++++++++++------------
 1 file changed, 36 insertions(+), 14 deletions(-)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index cfa07f539eda..22b729914afe 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -99,6 +99,7 @@ struct bpf_mem_cache {
 	int unit_size;
 	/* count of objects in free_llist */
 	int free_cnt;
+	int low_watermark, high_watermark, batch;
 };
 
 struct bpf_mem_caches {
@@ -117,14 +118,6 @@ static struct llist_node notrace *__llist_del_first(struct llist_head *head)
 	return entry;
 }
 
-#define BATCH 48
-#define LOW_WATERMARK 32
-#define HIGH_WATERMARK 96
-/* Assuming the average number of elements per bucket is 64, when all buckets
- * are used the total memory will be: 64*16*32 + 64*32*32 + 64*64*32 + ... +
- * 64*4096*32 ~ 20Mbyte
- */
-
 static void *__alloc(struct bpf_mem_cache *c, int node)
 {
 	/* Allocate, but don't deplete atomic reserves that typical
@@ -215,7 +208,7 @@ static void free_bulk(struct bpf_mem_cache *c)
 		if (IS_ENABLED(CONFIG_PREEMPT_RT))
 			local_irq_restore(flags);
 		free_one(c, llnode);
-	} while (cnt > (HIGH_WATERMARK + LOW_WATERMARK) / 2);
+	} while (cnt > (c->high_watermark + c->low_watermark) / 2);
 
 	/* and drain free_llist_extra */
 	llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra))
@@ -229,12 +222,12 @@ static void bpf_mem_refill(struct irq_work *work)
 
 	/* Racy access to free_cnt. It doesn't need to be 100% accurate */
 	cnt = c->free_cnt;
-	if (cnt < LOW_WATERMARK)
+	if (cnt < c->low_watermark)
 		/* irq_work runs on this cpu and kmalloc will allocate
 		 * from the current numa node which is what we want here.
 		 */
-		alloc_bulk(c, BATCH, NUMA_NO_NODE);
-	else if (cnt > HIGH_WATERMARK)
+		alloc_bulk(c, c->batch, NUMA_NO_NODE);
+	else if (cnt > c->high_watermark)
 		free_bulk(c);
 }
 
@@ -243,9 +236,38 @@ static void notrace irq_work_raise(struct bpf_mem_cache *c)
 	irq_work_queue(&c->refill_work);
 }
 
+/* For typical bpf map case that uses bpf_mem_cache_alloc and single bucket
+ * the freelist cache will be elem_size * 64 (or less) on each cpu.
+ *
+ * For bpf programs that don't have statically known allocation sizes and
+ * assuming (low_mark + high_mark) / 2 as an average number of elements per
+ * bucket and all buckets are used the total amount of memory in freelists
+ * on each cpu will be:
+ * 64*16 + 64*32 + 64*64 + 64*96 + 64*128 + 64*192 + 64*256 + 32*512 + 16*1024 + 8*2048 + 4*4096
+ * == ~ 116 Kbyte using below heuristic.
+ * Initialized, but unused bpf allocator (not bpf map specific one) will
+ * consume ~ 11 Kbyte per cpu.
+ * Typical case will be between 11K and 116K closer to 11K.
+ * bpf progs can and should share bpf_mem_cache when possible.
+ */
+
 static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu)
 {
 	init_irq_work(&c->refill_work, bpf_mem_refill);
+	if (c->unit_size <= 256) {
+		c->low_watermark = 32;
+		c->high_watermark = 96;
+	} else {
+		/* When page_size == 4k, order-0 cache will have low_mark == 2
+		 * and high_mark == 6 with batch alloc of 3 individual pages at
+		 * a time.
+		 * 8k allocs and above low == 1, high == 3, batch == 1.
+		 */
+		c->low_watermark = max(32 * 256 / c->unit_size, 1);
+		c->high_watermark = max(96 * 256 / c->unit_size, 3);
+	}
+	c->batch = max((c->high_watermark - c->low_watermark) / 4 * 3, 1);
+
 	/* To avoid consuming memory assume that 1st run of bpf
 	 * prog won't be doing more than 4 map_update_elem from
 	 * irq disabled region
@@ -387,7 +409,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
 
 	WARN_ON(cnt < 0);
 
-	if (cnt < LOW_WATERMARK)
+	if (cnt < c->low_watermark)
 		irq_work_raise(c);
 	return llnode;
 }
@@ -420,7 +442,7 @@ static void notrace unit_free(struct bpf_mem_cache *c, void *ptr)
 	local_dec(&c->active);
 	local_irq_restore(flags);
 
-	if (cnt > HIGH_WATERMARK)
+	if (cnt > c->high_watermark)
 		/* free few objects from current cpu into global kmalloc pool */
 		irq_work_raise(c);
 }
-- 
2.30.2



* [PATCH v3 bpf-next 09/15] bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
  2022-08-19 21:42 [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (7 preceding siblings ...)
  2022-08-19 21:42 ` [PATCH v3 bpf-next 08/15] bpf: Adjust low/high watermarks in bpf_mem_cache Alexei Starovoitov
@ 2022-08-19 21:42 ` Alexei Starovoitov
  2022-08-24 19:58   ` Kumar Kartikeya Dwivedi
  2022-08-19 21:42 ` [PATCH v3 bpf-next 10/15] bpf: Add percpu allocation support to bpf_mem_alloc Alexei Starovoitov
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 59+ messages in thread
From: Alexei Starovoitov @ 2022-08-19 21:42 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

SLAB_TYPESAFE_BY_RCU makes kmem_caches non-mergeable and slows down
kmem_cache_destroy. All bpf_mem_cache-s are safe to share across different maps
and programs. Convert SLAB_TYPESAFE_BY_RCU to batched call_rcu. This change
solves the memory consumption issue, avoids kmem_cache_destroy latency and
keeps bpf hash map performance the same.
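
Condensed, the batching scheme in this patch looks like the following sketch
(irq_work context and the per-cpu details are omitted):

	/* free_bulk() queues objects locally instead of freeing one by one */
	__llist_add(llnode, &c->free_by_rcu);

	/* once per batch, if no RCU callback is in flight, move everything
	 * to waiting_for_gp and fire a single call_rcu for the whole batch
	 */
	if (!atomic_xchg(&c->call_rcu_in_progress, 1)) {
		llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu))
			__llist_add(llnode, &c->waiting_for_gp);
		call_rcu(&c->rcu, __free_rcu);
	}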

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/memalloc.c | 64 +++++++++++++++++++++++++++++++++++++++++--
 kernel/bpf/syscall.c  |  5 +++-
 2 files changed, 65 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 22b729914afe..d765a5cb24b4 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -100,6 +100,11 @@ struct bpf_mem_cache {
 	/* count of objects in free_llist */
 	int free_cnt;
 	int low_watermark, high_watermark, batch;
+
+	struct rcu_head rcu;
+	struct llist_head free_by_rcu;
+	struct llist_head waiting_for_gp;
+	atomic_t call_rcu_in_progress;
 };
 
 struct bpf_mem_caches {
@@ -188,6 +193,45 @@ static void free_one(struct bpf_mem_cache *c, void *obj)
 		kfree(obj);
 }
 
+static void __free_rcu(struct rcu_head *head)
+{
+	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
+	struct llist_node *llnode = llist_del_all(&c->waiting_for_gp);
+	struct llist_node *pos, *t;
+
+	llist_for_each_safe(pos, t, llnode)
+		free_one(c, pos);
+	atomic_set(&c->call_rcu_in_progress, 0);
+}
+
+static void enque_to_free(struct bpf_mem_cache *c, void *obj)
+{
+	struct llist_node *llnode = obj;
+
+	/* bpf_mem_cache is a per-cpu object. Freeing happens in irq_work.
+	 * Nothing races to add to free_by_rcu list.
+	 */
+	__llist_add(llnode, &c->free_by_rcu);
+}
+
+static void do_call_rcu(struct bpf_mem_cache *c)
+{
+	struct llist_node *llnode, *t;
+
+	if (atomic_xchg(&c->call_rcu_in_progress, 1))
+		return;
+
+	WARN_ON_ONCE(!llist_empty(&c->waiting_for_gp));
+	llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu))
+		/* There is no concurrent __llist_add(waiting_for_gp) access.
+		 * It doesn't race with llist_del_all either.
+		 * But there could be two concurrent llist_del_all(waiting_for_gp):
+		 * from __free_rcu() and from drain_mem_cache().
+		 */
+		__llist_add(llnode, &c->waiting_for_gp);
+	call_rcu(&c->rcu, __free_rcu);
+}
+
 static void free_bulk(struct bpf_mem_cache *c)
 {
 	struct llist_node *llnode, *t;
@@ -207,12 +251,13 @@ static void free_bulk(struct bpf_mem_cache *c)
 		local_dec(&c->active);
 		if (IS_ENABLED(CONFIG_PREEMPT_RT))
 			local_irq_restore(flags);
-		free_one(c, llnode);
+		enque_to_free(c, llnode);
 	} while (cnt > (c->high_watermark + c->low_watermark) / 2);
 
 	/* and drain free_llist_extra */
 	llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra))
-		free_one(c, llnode);
+		enque_to_free(c, llnode);
+	do_call_rcu(c);
 }
 
 static void bpf_mem_refill(struct irq_work *work)
@@ -298,7 +343,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size)
 			return -ENOMEM;
 		size += LLIST_NODE_SZ; /* room for llist_node */
 		snprintf(buf, sizeof(buf), "bpf-%u", size);
-		kmem_cache = kmem_cache_create(buf, size, 8, SLAB_TYPESAFE_BY_RCU, NULL);
+		kmem_cache = kmem_cache_create(buf, size, 8, 0, NULL);
 		if (!kmem_cache) {
 			free_percpu(pc);
 			return -ENOMEM;
@@ -340,6 +385,15 @@ static void drain_mem_cache(struct bpf_mem_cache *c)
 {
 	struct llist_node *llnode, *t;
 
+	/* The caller has done rcu_barrier() and no progs are using this
+	 * bpf_mem_cache, but htab_map_free() called bpf_mem_cache_free() for
+	 * all remaining elements and they can be in free_by_rcu or in
+	 * waiting_for_gp lists, so drain those lists now.
+	 */
+	llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu))
+		free_one(c, llnode);
+	llist_for_each_safe(llnode, t, llist_del_all(&c->waiting_for_gp))
+		free_one(c, llnode);
 	llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist))
 		free_one(c, llnode);
 	llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra))
@@ -361,6 +415,10 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
 		kmem_cache_destroy(c->kmem_cache);
 		if (c->objcg)
 			obj_cgroup_put(c->objcg);
+		/* c->waiting_for_gp list was drained, but __free_rcu might
+		 * still execute. Wait for it now before we free 'c'.
+		 */
+		rcu_barrier();
 		free_percpu(ma->cache);
 		ma->cache = NULL;
 	}
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index a4d40d98428a..850270a72350 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -638,7 +638,10 @@ static void __bpf_map_put(struct bpf_map *map, bool do_idr_lock)
 		bpf_map_free_id(map, do_idr_lock);
 		btf_put(map->btf);
 		INIT_WORK(&map->work, bpf_map_free_deferred);
-		schedule_work(&map->work);
+		/* Avoid spawning kworkers, since they all might contend
+		 * for the same mutex like slab_mutex.
+		 */
+		queue_work(system_unbound_wq, &map->work);
 	}
 }
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 bpf-next 10/15] bpf: Add percpu allocation support to bpf_mem_alloc.
  2022-08-19 21:42 [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (8 preceding siblings ...)
  2022-08-19 21:42 ` [PATCH v3 bpf-next 09/15] bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU Alexei Starovoitov
@ 2022-08-19 21:42 ` Alexei Starovoitov
  2022-08-19 21:42 ` [PATCH v3 bpf-next 11/15] bpf: Convert percpu hash map to per-cpu bpf_mem_alloc Alexei Starovoitov
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 59+ messages in thread
From: Alexei Starovoitov @ 2022-08-19 21:42 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Extend bpf_mem_alloc to cache a free list of fixed-size per-cpu allocations.
Once such a cache is created, bpf_mem_cache_alloc() will return per-cpu
objects. bpf_mem_cache_free() will free them back into the global per-cpu
pool after observing an RCU grace period.
The per-cpu flavor of bpf_mem_alloc is going to be used by per-cpu hash maps.

The free list cache consists of tuples { llist_node, per-cpu pointer }.
Unlike alloc_percpu(), which returns a per-cpu pointer,
bpf_mem_cache_alloc() returns a pointer to a per-cpu pointer and
bpf_mem_cache_free() expects to receive that same pointer back.
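
A minimal sketch of the intended caller pattern (it mirrors the percpu hash
map conversion in the next patch; the wrapper function is made up for
illustration):

	static int percpu_cache_example(u32 value_size)
	{
		struct bpf_mem_alloc ma;
		void __percpu *pptr;
		void *obj;
		int cpu, err;

		/* one cache where every object is value_size bytes per cpu */
		err = bpf_mem_alloc_init(&ma, round_up(value_size, 8), true);
		if (err)
			return err;

		obj = bpf_mem_cache_alloc(&ma);		/* pointer to a per-cpu pointer */
		if (obj) {
			pptr = *(void **)obj;		/* the actual per-cpu area */
			for_each_possible_cpu(cpu)
				memset(per_cpu_ptr(pptr, cpu), 0, value_size);
			bpf_mem_cache_free(&ma, obj);	/* same pointer back, not pptr */
		}
		bpf_mem_alloc_destroy(&ma);
		return 0;
	}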

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf_mem_alloc.h |  2 +-
 kernel/bpf/hashtab.c          |  2 +-
 kernel/bpf/memalloc.c         | 44 +++++++++++++++++++++++++++++++----
 3 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
index 804733070f8d..653ed1584a03 100644
--- a/include/linux/bpf_mem_alloc.h
+++ b/include/linux/bpf_mem_alloc.h
@@ -12,7 +12,7 @@ struct bpf_mem_alloc {
 	struct bpf_mem_cache __percpu *cache;
 };
 
-int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size);
+int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu);
 void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma);
 
 /* kmalloc/kfree equivalent: */
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 299ab98f9811..8daa1132d43c 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -594,7 +594,7 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 				goto free_prealloc;
 		}
 	} else {
-		err = bpf_mem_alloc_init(&htab->ma, htab->elem_size);
+		err = bpf_mem_alloc_init(&htab->ma, htab->elem_size, false);
 		if (err)
 			goto free_map_locked;
 	}
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index d765a5cb24b4..9e5ad7dc4dc7 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -100,6 +100,7 @@ struct bpf_mem_cache {
 	/* count of objects in free_llist */
 	int free_cnt;
 	int low_watermark, high_watermark, batch;
+	bool percpu;
 
 	struct rcu_head rcu;
 	struct llist_head free_by_rcu;
@@ -132,6 +133,19 @@ static void *__alloc(struct bpf_mem_cache *c, int node)
 	 */
 	gfp_t flags = GFP_NOWAIT | __GFP_NOWARN | __GFP_ACCOUNT;
 
+	if (c->percpu) {
+		void **obj = kmem_cache_alloc_node(c->kmem_cache, flags, node);
+		void *pptr = __alloc_percpu_gfp(c->unit_size, 8, flags);
+
+		if (!obj || !pptr) {
+			free_percpu(pptr);
+			kfree(obj);
+			return NULL;
+		}
+		obj[1] = pptr;
+		return obj;
+	}
+
 	if (c->kmem_cache)
 		return kmem_cache_alloc_node(c->kmem_cache, flags, node);
 
@@ -187,6 +201,12 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
 
 static void free_one(struct bpf_mem_cache *c, void *obj)
 {
+	if (c->percpu) {
+		free_percpu(((void **)obj)[1]);
+		kmem_cache_free(c->kmem_cache, obj);
+		return;
+	}
+
 	if (c->kmem_cache)
 		kmem_cache_free(c->kmem_cache, obj);
 	else
@@ -327,21 +347,30 @@ static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu)
  * kmalloc/kfree. Max allocation size is 4096 in this case.
  * This is bpf_dynptr and bpf_kptr use case.
  */
-int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size)
+int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
 {
 	static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
 	struct bpf_mem_caches *cc, __percpu *pcc;
 	struct bpf_mem_cache *c, __percpu *pc;
-	struct kmem_cache *kmem_cache;
+	struct kmem_cache *kmem_cache = NULL;
 	struct obj_cgroup *objcg = NULL;
 	char buf[32];
-	int cpu, i;
+	int cpu, i, unit_size;
 
 	if (size) {
 		pc = __alloc_percpu_gfp(sizeof(*pc), 8, GFP_KERNEL);
 		if (!pc)
 			return -ENOMEM;
-		size += LLIST_NODE_SZ; /* room for llist_node */
+
+		if (percpu) {
+			unit_size = size;
+			/* room for llist_node and per-cpu pointer */
+			size = LLIST_NODE_SZ + sizeof(void *);
+		} else {
+			size += LLIST_NODE_SZ; /* room for llist_node */
+			unit_size = size;
+		}
+
 		snprintf(buf, sizeof(buf), "bpf-%u", size);
 		kmem_cache = kmem_cache_create(buf, size, 8, 0, NULL);
 		if (!kmem_cache) {
@@ -354,14 +383,19 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size)
 		for_each_possible_cpu(cpu) {
 			c = per_cpu_ptr(pc, cpu);
 			c->kmem_cache = kmem_cache;
-			c->unit_size = size;
+			c->unit_size = unit_size;
 			c->objcg = objcg;
+			c->percpu = percpu;
 			prefill_mem_cache(c, cpu);
 		}
 		ma->cache = pc;
 		return 0;
 	}
 
+	/* size == 0 && percpu is an invalid combination */
+	if (WARN_ON_ONCE(percpu))
+		return -EINVAL;
+
 	pcc = __alloc_percpu_gfp(sizeof(*cc), 8, GFP_KERNEL);
 	if (!pcc)
 		return -ENOMEM;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 bpf-next 11/15] bpf: Convert percpu hash map to per-cpu bpf_mem_alloc.
  2022-08-19 21:42 [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (9 preceding siblings ...)
  2022-08-19 21:42 ` [PATCH v3 bpf-next 10/15] bpf: Add percpu allocation support to bpf_mem_alloc Alexei Starovoitov
@ 2022-08-19 21:42 ` Alexei Starovoitov
  2022-08-19 21:42 ` [PATCH v3 bpf-next 12/15] bpf: Remove tracing program restriction on map types Alexei Starovoitov
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 59+ messages in thread
From: Alexei Starovoitov @ 2022-08-19 21:42 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Convert dynamic allocations in the percpu hash map from alloc_percpu() to
bpf_mem_cache_alloc() from a per-cpu bpf_mem_alloc. Since bpf_mem_alloc frees
objects only after an RCU grace period, the call_rcu() in the hash map is
removed. pcpu_init_value() now needs to zero-fill per-cpu allocations, because
dynamically allocated map elements now behave like fully preallocated ones:
alloc_percpu() is no longer called inline and the elements are reused from the
freelist.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/hashtab.c | 45 +++++++++++++++++++-------------------------
 1 file changed, 19 insertions(+), 26 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 8daa1132d43c..89f26cbddef5 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -94,6 +94,7 @@ struct bucket {
 struct bpf_htab {
 	struct bpf_map map;
 	struct bpf_mem_alloc ma;
+	struct bpf_mem_alloc pcpu_ma;
 	struct bucket *buckets;
 	void *elems;
 	union {
@@ -121,14 +122,14 @@ struct htab_elem {
 		struct {
 			void *padding;
 			union {
-				struct bpf_htab *htab;
 				struct pcpu_freelist_node fnode;
 				struct htab_elem *batch_flink;
 			};
 		};
 	};
 	union {
-		struct rcu_head rcu;
+		/* pointer to per-cpu pointer */
+		void *ptr_to_pptr;
 		struct bpf_lru_node lru_node;
 	};
 	u32 hash;
@@ -435,8 +436,6 @@ static int htab_map_alloc_check(union bpf_attr *attr)
 	bool zero_seed = (attr->map_flags & BPF_F_ZERO_SEED);
 	int numa_node = bpf_map_attr_numa_node(attr);
 
-	BUILD_BUG_ON(offsetof(struct htab_elem, htab) !=
-		     offsetof(struct htab_elem, hash_node.pprev));
 	BUILD_BUG_ON(offsetof(struct htab_elem, fnode.next) !=
 		     offsetof(struct htab_elem, hash_node.pprev));
 
@@ -597,6 +596,12 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 		err = bpf_mem_alloc_init(&htab->ma, htab->elem_size, false);
 		if (err)
 			goto free_map_locked;
+		if (percpu) {
+			err = bpf_mem_alloc_init(&htab->pcpu_ma,
+						 round_up(htab->map.value_size, 8), true);
+			if (err)
+				goto free_map_locked;
+		}
 	}
 
 	return &htab->map;
@@ -607,6 +612,7 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 	for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
 		free_percpu(htab->map_locked[i]);
 	bpf_map_area_free(htab->buckets);
+	bpf_mem_alloc_destroy(&htab->pcpu_ma);
 	bpf_mem_alloc_destroy(&htab->ma);
 free_htab:
 	lockdep_unregister_key(&htab->lockdep_key);
@@ -882,19 +888,11 @@ static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 static void htab_elem_free(struct bpf_htab *htab, struct htab_elem *l)
 {
 	if (htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH)
-		free_percpu(htab_elem_get_ptr(l, htab->map.key_size));
+		bpf_mem_cache_free(&htab->pcpu_ma, l->ptr_to_pptr);
 	check_and_free_fields(htab, l);
 	bpf_mem_cache_free(&htab->ma, l);
 }
 
-static void htab_elem_free_rcu(struct rcu_head *head)
-{
-	struct htab_elem *l = container_of(head, struct htab_elem, rcu);
-	struct bpf_htab *htab = l->htab;
-
-	htab_elem_free(htab, l);
-}
-
 static void htab_put_fd_value(struct bpf_htab *htab, struct htab_elem *l)
 {
 	struct bpf_map *map = &htab->map;
@@ -940,12 +938,7 @@ static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
 		__pcpu_freelist_push(&htab->freelist, &l->fnode);
 	} else {
 		dec_elem_count(htab);
-		if (htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH) {
-			l->htab = htab;
-			call_rcu(&l->rcu, htab_elem_free_rcu);
-		} else {
-			htab_elem_free(htab, l);
-		}
+		htab_elem_free(htab, l);
 	}
 }
 
@@ -970,13 +963,12 @@ static void pcpu_copy_value(struct bpf_htab *htab, void __percpu *pptr,
 static void pcpu_init_value(struct bpf_htab *htab, void __percpu *pptr,
 			    void *value, bool onallcpus)
 {
-	/* When using prealloc and not setting the initial value on all cpus,
-	 * zero-fill element values for other cpus (just as what happens when
-	 * not using prealloc). Otherwise, bpf program has no way to ensure
+	/* When not setting the initial value on all cpus, zero-fill element
+	 * values for other cpus. Otherwise, bpf program has no way to ensure
 	 * known initial values for cpus other than current one
 	 * (onallcpus=false always when coming from bpf prog).
 	 */
-	if (htab_is_prealloc(htab) && !onallcpus) {
+	if (!onallcpus) {
 		u32 size = round_up(htab->map.value_size, 8);
 		int current_cpu = raw_smp_processor_id();
 		int cpu;
@@ -1047,18 +1039,18 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 
 	memcpy(l_new->key, key, key_size);
 	if (percpu) {
-		size = round_up(size, 8);
 		if (prealloc) {
 			pptr = htab_elem_get_ptr(l_new, key_size);
 		} else {
 			/* alloc_percpu zero-fills */
-			pptr = bpf_map_alloc_percpu(&htab->map, size, 8,
-						    GFP_NOWAIT | __GFP_NOWARN);
+			pptr = bpf_mem_cache_alloc(&htab->pcpu_ma);
 			if (!pptr) {
 				bpf_mem_cache_free(&htab->ma, l_new);
 				l_new = ERR_PTR(-ENOMEM);
 				goto dec_count;
 			}
+			l_new->ptr_to_pptr = pptr;
+			pptr = *(void **)pptr;
 		}
 
 		pcpu_init_value(htab, pptr, value, onallcpus);
@@ -1550,6 +1542,7 @@ static void htab_map_free(struct bpf_map *map)
 	bpf_map_free_kptr_off_tab(map);
 	free_percpu(htab->extra_elems);
 	bpf_map_area_free(htab->buckets);
+	bpf_mem_alloc_destroy(&htab->pcpu_ma);
 	bpf_mem_alloc_destroy(&htab->ma);
 	if (htab->use_percpu_counter)
 		percpu_counter_destroy(&htab->pcount);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 bpf-next 12/15] bpf: Remove tracing program restriction on map types
  2022-08-19 21:42 [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (10 preceding siblings ...)
  2022-08-19 21:42 ` [PATCH v3 bpf-next 11/15] bpf: Convert percpu hash map to per-cpu bpf_mem_alloc Alexei Starovoitov
@ 2022-08-19 21:42 ` Alexei Starovoitov
  2022-08-19 21:42 ` [PATCH v3 bpf-next 13/15] bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs Alexei Starovoitov
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 59+ messages in thread
From: Alexei Starovoitov @ 2022-08-19 21:42 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

The hash map is now fully converted to bpf_mem_alloc. Its implementation no
longer allocates synchronously and no longer calls call_rcu() directly. It's
now safe to use non-preallocated hash maps in all types of tracing programs,
including BPF_PROG_TYPE_PERF_EVENT, which can run in NMI context.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/verifier.c | 42 ------------------------------------------
 1 file changed, 42 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d785f29047d7..a1ada707c57c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -12599,48 +12599,6 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
 
 {
 	enum bpf_prog_type prog_type = resolve_prog_type(prog);
-	/*
-	 * Validate that trace type programs use preallocated hash maps.
-	 *
-	 * For programs attached to PERF events this is mandatory as the
-	 * perf NMI can hit any arbitrary code sequence.
-	 *
-	 * All other trace types using non-preallocated per-cpu hash maps are
-	 * unsafe as well because tracepoint or kprobes can be inside locked
-	 * regions of the per-cpu memory allocator or at a place where a
-	 * recursion into the per-cpu memory allocator would see inconsistent
-	 * state. Non per-cpu hash maps are using bpf_mem_alloc-tor which is
-	 * safe to use from kprobe/fentry and in RT.
-	 *
-	 * On RT enabled kernels run-time allocation of all trace type
-	 * programs is strictly prohibited due to lock type constraints. On
-	 * !RT kernels it is allowed for backwards compatibility reasons for
-	 * now, but warnings are emitted so developers are made aware of
-	 * the unsafety and can fix their programs before this is enforced.
-	 */
-	if (is_tracing_prog_type(prog_type) && !is_preallocated_map(map)) {
-		if (prog_type == BPF_PROG_TYPE_PERF_EVENT) {
-			/* perf_event bpf progs have to use preallocated hash maps
-			 * because non-prealloc is still relying on call_rcu to free
-			 * elements.
-			 */
-			verbose(env, "perf_event programs can only use preallocated hash map\n");
-			return -EINVAL;
-		}
-		if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
-		    (map->inner_map_meta &&
-		     map->inner_map_meta->map_type == BPF_MAP_TYPE_PERCPU_HASH)) {
-			if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
-				verbose(env,
-					"trace type programs can only use preallocated per-cpu hash map\n");
-				return -EINVAL;
-			}
-			WARN_ONCE(1, "trace type BPF program uses run-time allocation\n");
-			verbose(env,
-				"trace type programs with run-time allocated per-cpu hash maps are unsafe."
-				" Switch to preallocated hash maps.\n");
-		}
-	}
 
 	if (map_value_has_spin_lock(map)) {
 		if (prog_type == BPF_PROG_TYPE_SOCKET_FILTER) {
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 bpf-next 13/15] bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.
  2022-08-19 21:42 [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (11 preceding siblings ...)
  2022-08-19 21:42 ` [PATCH v3 bpf-next 12/15] bpf: Remove tracing program restriction on map types Alexei Starovoitov
@ 2022-08-19 21:42 ` Alexei Starovoitov
  2022-08-19 22:21   ` Kumar Kartikeya Dwivedi
  2022-08-19 21:42 ` [PATCH v3 bpf-next 14/15] bpf: Remove prealloc-only restriction for " Alexei Starovoitov
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 59+ messages in thread
From: Alexei Starovoitov @ 2022-08-19 21:42 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
Then use call_rcu() to wait for normal progs to finish
and finally do free_one() on each element when freeing objects
into global memory pool.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/memalloc.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 9e5ad7dc4dc7..d34383dc12d9 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -224,6 +224,13 @@ static void __free_rcu(struct rcu_head *head)
 	atomic_set(&c->call_rcu_in_progress, 0);
 }
 
+static void __free_rcu_tasks_trace(struct rcu_head *head)
+{
+	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
+
+	call_rcu(&c->rcu, __free_rcu);
+}
+
 static void enque_to_free(struct bpf_mem_cache *c, void *obj)
 {
 	struct llist_node *llnode = obj;
@@ -249,7 +256,11 @@ static void do_call_rcu(struct bpf_mem_cache *c)
 		 * from __free_rcu() and from drain_mem_cache().
 		 */
 		__llist_add(llnode, &c->waiting_for_gp);
-	call_rcu(&c->rcu, __free_rcu);
+	/* Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
+	 * Then use call_rcu() to wait for normal progs to finish
+	 * and finally do free_one() on each element.
+	 */
+	call_rcu_tasks_trace(&c->rcu, __free_rcu_tasks_trace);
 }
 
 static void free_bulk(struct bpf_mem_cache *c)
@@ -452,6 +463,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
 		/* c->waiting_for_gp list was drained, but __free_rcu might
 		 * still execute. Wait for it now before we free 'c'.
 		 */
+		rcu_barrier_tasks_trace();
 		rcu_barrier();
 		free_percpu(ma->cache);
 		ma->cache = NULL;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 bpf-next 14/15] bpf: Remove prealloc-only restriction for sleepable bpf programs.
  2022-08-19 21:42 [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (12 preceding siblings ...)
  2022-08-19 21:42 ` [PATCH v3 bpf-next 13/15] bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs Alexei Starovoitov
@ 2022-08-19 21:42 ` Alexei Starovoitov
  2022-08-19 21:42 ` [PATCH v3 bpf-next 15/15] bpf: Introduce sysctl kernel.bpf_force_dyn_alloc Alexei Starovoitov
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 59+ messages in thread
From: Alexei Starovoitov @ 2022-08-19 21:42 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Since the hash map is now converted to bpf_mem_alloc, and it waits for both
RCU and RCU tasks trace grace periods before freeing elements back into the
global memory slabs, it's safe to use dynamically allocated hash maps in
sleepable bpf programs.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/verifier.c | 23 -----------------------
 1 file changed, 23 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index a1ada707c57c..dcbcf876b886 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -12562,14 +12562,6 @@ static int check_pseudo_btf_id(struct bpf_verifier_env *env,
 	return err;
 }
 
-static int check_map_prealloc(struct bpf_map *map)
-{
-	return (map->map_type != BPF_MAP_TYPE_HASH &&
-		map->map_type != BPF_MAP_TYPE_PERCPU_HASH &&
-		map->map_type != BPF_MAP_TYPE_HASH_OF_MAPS) ||
-		!(map->map_flags & BPF_F_NO_PREALLOC);
-}
-
 static bool is_tracing_prog_type(enum bpf_prog_type type)
 {
 	switch (type) {
@@ -12584,15 +12576,6 @@ static bool is_tracing_prog_type(enum bpf_prog_type type)
 	}
 }
 
-static bool is_preallocated_map(struct bpf_map *map)
-{
-	if (!check_map_prealloc(map))
-		return false;
-	if (map->inner_map_meta && !check_map_prealloc(map->inner_map_meta))
-		return false;
-	return true;
-}
-
 static int check_map_prog_compatibility(struct bpf_verifier_env *env,
 					struct bpf_map *map,
 					struct bpf_prog *prog)
@@ -12645,12 +12628,6 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
 		case BPF_MAP_TYPE_LRU_PERCPU_HASH:
 		case BPF_MAP_TYPE_ARRAY_OF_MAPS:
 		case BPF_MAP_TYPE_HASH_OF_MAPS:
-			if (!is_preallocated_map(map)) {
-				verbose(env,
-					"Sleepable programs can only use preallocated maps\n");
-				return -EINVAL;
-			}
-			break;
 		case BPF_MAP_TYPE_RINGBUF:
 		case BPF_MAP_TYPE_INODE_STORAGE:
 		case BPF_MAP_TYPE_SK_STORAGE:
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 bpf-next 15/15] bpf: Introduce sysctl kernel.bpf_force_dyn_alloc.
  2022-08-19 21:42 [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (13 preceding siblings ...)
  2022-08-19 21:42 ` [PATCH v3 bpf-next 14/15] bpf: Remove prealloc-only restriction for " Alexei Starovoitov
@ 2022-08-19 21:42 ` Alexei Starovoitov
  2022-08-24 20:03 ` [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator Kumar Kartikeya Dwivedi
  2022-08-25  0:56 ` [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular Delyan Kratunov
  16 siblings, 0 replies; 59+ messages in thread
From: Alexei Starovoitov @ 2022-08-19 21:42 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Introduce the sysctl kernel.bpf_force_dyn_alloc to force dynamic allocation in
the bpf hash map. All selftests/bpf should pass with bpf_force_dyn_alloc set
to either 0 or 1, and bpf programs (both sleepable and not) should not see any
functional difference. The only observable effect of the sysctl should be
improved memory usage.
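
For example, running "sysctl -w kernel.bpf_force_dyn_alloc=1" as root (the
knob is mode 0600) makes htab_map_alloc() clear prealloc and set
BPF_F_NO_PREALLOC for every new non-LRU hash map, as the hashtab.c hunk below
shows.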

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/filter.h | 2 ++
 kernel/bpf/core.c      | 2 ++
 kernel/bpf/hashtab.c   | 5 +++++
 kernel/bpf/syscall.c   | 9 +++++++++
 4 files changed, 18 insertions(+)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index a5f21dc3c432..eb4d4a0c0bde 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1009,6 +1009,8 @@ bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk,
 }
 #endif
 
+extern int bpf_force_dyn_alloc;
+
 #ifdef CONFIG_BPF_JIT
 extern int bpf_jit_enable;
 extern int bpf_jit_harden;
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 639437f36928..a13e78ea4b90 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -533,6 +533,8 @@ void bpf_prog_kallsyms_del_all(struct bpf_prog *fp)
 	bpf_prog_kallsyms_del(fp);
 }
 
+int bpf_force_dyn_alloc __read_mostly;
+
 #ifdef CONFIG_BPF_JIT
 /* All BPF JIT sysctl knobs here. */
 int bpf_jit_enable   __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_DEFAULT_ON);
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 89f26cbddef5..f68a3400939e 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -505,6 +505,11 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 
 	bpf_map_init_from_attr(&htab->map, attr);
 
+	if (!lru && bpf_force_dyn_alloc) {
+		prealloc = false;
+		htab->map.map_flags |= BPF_F_NO_PREALLOC;
+	}
+
 	if (percpu_lru) {
 		/* ensure each CPU's lru list has >=1 elements.
 		 * since we are at it, make each lru list has the same
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 850270a72350..c201796f4997 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -5297,6 +5297,15 @@ static struct ctl_table bpf_syscall_table[] = {
 		.mode		= 0644,
 		.proc_handler	= bpf_stats_handler,
 	},
+	{
+		.procname	= "bpf_force_dyn_alloc",
+		.data		= &bpf_force_dyn_alloc,
+		.maxlen		= sizeof(int),
+		.mode		= 0600,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 	{ }
 };
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 13/15] bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.
  2022-08-19 21:42 ` [PATCH v3 bpf-next 13/15] bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs Alexei Starovoitov
@ 2022-08-19 22:21   ` Kumar Kartikeya Dwivedi
  2022-08-19 22:43     ` Alexei Starovoitov
  0 siblings, 1 reply; 59+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-19 22:21 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: davem, daniel, andrii, tj, delyank, linux-mm, bpf, kernel-team

On Fri, 19 Aug 2022 at 23:43, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
> Then use call_rcu() to wait for normal progs to finish
> and finally do free_one() on each element when freeing objects
> into global memory pool.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---

I fear this can make OOM issues very easy to run into, because one
sleepable prog that sleeps for a long period of time can hold the
freeing of elements from another sleepable prog which either does not
sleep often or sleeps for a very short period of time, and has a high
update frequency. I'm mostly worried that unrelated sleepable programs
not even using the same map will begin to affect each other.

Have you considered other options? E.g. we could directly expose
bpf_rcu_read_lock/bpf_rcu_read_unlock to the program and enforce that
access to RCU protected map lookups only happens in such read
sections, and unlock invalidates all RCU protected pointers? Sleepable
helpers can then not be invoked inside the BPF RCU read section. The
program uses RCU read section while accessing such maps, and sleeps
after doing bpf_rcu_read_unlock. They can be kfuncs.
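
Roughly something like this (hypothetical kfunc names, purely to illustrate
the shape of such a program):

	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>
	#include <bpf/bpf_tracing.h>

	char _license[] SEC("license") = "GPL";

	struct {
		__uint(type, BPF_MAP_TYPE_HASH);
		__uint(map_flags, BPF_F_NO_PREALLOC);
		__uint(max_entries, 1024);
		__type(key, u32);
		__type(value, u64);
	} dyn_htab SEC(".maps");

	/* proposed kfuncs, they do not exist yet */
	extern void bpf_rcu_read_lock(void) __ksym;
	extern void bpf_rcu_read_unlock(void) __ksym;

	SEC("lsm.s/bprm_committed_creds")
	int BPF_PROG(on_exec, struct linux_binprm *bprm)
	{
		u32 key = 0;
		u64 *val;

		bpf_rcu_read_lock();	/* RCU protected, non-sleepable section */
		val = bpf_map_lookup_elem(&dyn_htab, &key);
		if (val)
			*val += 1;	/* val only valid inside the section */
		bpf_rcu_read_unlock();	/* all lookup pointers invalidated here */

		/* sleepable helpers like bpf_copy_from_user() allowed again here */
		return 0;
	}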

It might also be useful in general, to access RCU protected data from
sleepable programs (i.e. make some sections of the program RCU
protected and non-sleepable at runtime). It will allow use of elements
from dynamically allocated maps with bpf_mem_alloc while not having to
wait for RCU tasks trace grace period, which can extend into minutes
(or even longer if unlucky).

One difference would be that you can pin a lookup across a sleep cycle
with this approach, but not with preallocated maps or the explicit RCU
section above, but I'm not sure it's worth it. It isn't possible now.

>  kernel/bpf/memalloc.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 9e5ad7dc4dc7..d34383dc12d9 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -224,6 +224,13 @@ static void __free_rcu(struct rcu_head *head)
>         atomic_set(&c->call_rcu_in_progress, 0);
>  }
>
> +static void __free_rcu_tasks_trace(struct rcu_head *head)
> +{
> +       struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
> +
> +       call_rcu(&c->rcu, __free_rcu);
> +}
> +
>  static void enque_to_free(struct bpf_mem_cache *c, void *obj)
>  {
>         struct llist_node *llnode = obj;
> @@ -249,7 +256,11 @@ static void do_call_rcu(struct bpf_mem_cache *c)
>                  * from __free_rcu() and from drain_mem_cache().
>                  */
>                 __llist_add(llnode, &c->waiting_for_gp);
> -       call_rcu(&c->rcu, __free_rcu);
> +       /* Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
> +        * Then use call_rcu() to wait for normal progs to finish
> +        * and finally do free_one() on each element.
> +        */
> +       call_rcu_tasks_trace(&c->rcu, __free_rcu_tasks_trace);
>  }
>
>  static void free_bulk(struct bpf_mem_cache *c)
> @@ -452,6 +463,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
>                 /* c->waiting_for_gp list was drained, but __free_rcu might
>                  * still execute. Wait for it now before we free 'c'.
>                  */
> +               rcu_barrier_tasks_trace();
>                 rcu_barrier();
>                 free_percpu(ma->cache);
>                 ma->cache = NULL;
> --
> 2.30.2
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 13/15] bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.
  2022-08-19 22:21   ` Kumar Kartikeya Dwivedi
@ 2022-08-19 22:43     ` Alexei Starovoitov
  2022-08-19 22:56       ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 59+ messages in thread
From: Alexei Starovoitov @ 2022-08-19 22:43 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: davem, daniel, andrii, tj, delyank, linux-mm, bpf, kernel-team

On Sat, Aug 20, 2022 at 12:21:46AM +0200, Kumar Kartikeya Dwivedi wrote:
> On Fri, 19 Aug 2022 at 23:43, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
> > Then use call_rcu() to wait for normal progs to finish
> > and finally do free_one() on each element when freeing objects
> > into global memory pool.
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
> 
> I fear this can make OOM issues very easy to run into, because one
> sleepable prog that sleeps for a long period of time can hold the
> freeing of elements from another sleepable prog which either does not
> sleep often or sleeps for a very short period of time, and has a high
> update frequency. I'm mostly worried that unrelated sleepable programs
> not even using the same map will begin to affect each other.

'sleep for long time'? sleepable bpf prog doesn't mean that they can sleep.
sleepable progs can copy_from_user, but they're not allowed to waste time.
I don't share OOM concerns at all.
max_entries and memcg limits are still there and enforced.
dynamic map is strictly better and memory efficient than full prealloc.

> Have you considered other options? E.g. we could directly expose
> bpf_rcu_read_lock/bpf_rcu_read_unlock to the program and enforce that
> access to RCU protected map lookups only happens in such read
> sections, and unlock invalidates all RCU protected pointers? Sleepable
> helpers can then not be invoked inside the BPF RCU read section. The
> program uses RCU read section while accessing such maps, and sleeps
> after doing bpf_rcu_read_unlock. They can be kfuncs.

Yes. We can add explicit bpf_rcu_read_lock and teach verifier about RCU CS,
but I don't see the value specifically for sleepable progs.
Current sleepable progs can do map lookup without extra kfuncs.
Explicit CS would force progs to be rewritten which is not great.

> It might also be useful in general, to access RCU protected data from
> sleepable programs (i.e. make some sections of the program RCU
> protected and non-sleepable at runtime). It will allow use of elements

For other cases, sure. We can introduce RCU protected objects and
explicit bpf_rcu_read_lock.

> from dynamically allocated maps with bpf_mem_alloc while not having to
> wait for RCU tasks trace grace period, which can extend into minutes
> (or even longer if unlucky).

sleepable bpf prog that lasts minutes? In what kind of situation?
We don't have bpf_sleep() helper and not going to add one any time soon.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 13/15] bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.
  2022-08-19 22:43     ` Alexei Starovoitov
@ 2022-08-19 22:56       ` Kumar Kartikeya Dwivedi
  2022-08-19 23:01         ` Alexei Starovoitov
  0 siblings, 1 reply; 59+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-19 22:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: davem, daniel, andrii, tj, delyank, linux-mm, bpf, kernel-team

On Sat, 20 Aug 2022 at 00:43, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Sat, Aug 20, 2022 at 12:21:46AM +0200, Kumar Kartikeya Dwivedi wrote:
> > On Fri, 19 Aug 2022 at 23:43, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > From: Alexei Starovoitov <ast@kernel.org>
> > >
> > > Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
> > > Then use call_rcu() to wait for normal progs to finish
> > > and finally do free_one() on each element when freeing objects
> > > into global memory pool.
> > >
> > > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > > ---
> >
> > I fear this can make OOM issues very easy to run into, because one
> > sleepable prog that sleeps for a long period of time can hold the
> > freeing of elements from another sleepable prog which either does not
> > sleep often or sleeps for a very short period of time, and has a high
> > update frequency. I'm mostly worried that unrelated sleepable programs
> > not even using the same map will begin to affect each other.
>
> 'sleep for long time'? sleepable bpf prog doesn't mean that they can sleep.
> sleepable progs can copy_from_user, but they're not allowed to waste time.

It is certainly possible to waste time, but indirectly, not through
the BPF program itself.

If you have userfaultfd enabled (for unpriv users), an unprivileged
user can trap a sleepable BPF prog (say LSM) using bpf_copy_from_user
for as long as it wants. A similar case can be done using FUSE, IIRC.

You can then say it's a problem about unprivileged users being able to
use userfaultfd or FUSE, or we could think about fixing
bpf_copy_from_user to return -EFAULT for this case, but it is totally
possible right now for malicious userspace to extend the tasks trace
gp like this for minutes (or even longer) on a system where sleepable
BPF programs are using e.g. bpf_copy_from_user.

> I don't share OOM concerns at all.
> max_entries and memcg limits are still there and enforced.
> dynamic map is strictly better and memory efficient than full prealloc.
>
> > Have you considered other options? E.g. we could directly expose
> > bpf_rcu_read_lock/bpf_rcu_read_unlock to the program and enforce that
> > access to RCU protected map lookups only happens in such read
> > sections, and unlock invalidates all RCU protected pointers? Sleepable
> > helpers can then not be invoked inside the BPF RCU read section. The
> > program uses RCU read section while accessing such maps, and sleeps
> > after doing bpf_rcu_read_unlock. They can be kfuncs.
>
> Yes. We can add explicit bpf_rcu_read_lock and teach verifier about RCU CS,
> but I don't see the value specifically for sleepable progs.
> Current sleepable progs can do map lookup without extra kfuncs.
> Explicit CS would force progs to be rewritten which is not great.
>
> > It might also be useful in general, to access RCU protected data from
> > sleepable programs (i.e. make some sections of the program RCU
> > protected and non-sleepable at runtime). It will allow use of elements
>
> For other cases, sure. We can introduce RCU protected objects and
> explicit bpf_rcu_read_lock.
>
> > from dynamically allocated maps with bpf_mem_alloc while not having to
> > wait for RCU tasks trace grace period, which can extend into minutes
> > (or even longer if unlucky).
>
> sleepable bpf prog that lasts minutes? In what kind of situation?
> We don't have bpf_sleep() helper and not going to add one any time soon.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 13/15] bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.
  2022-08-19 22:56       ` Kumar Kartikeya Dwivedi
@ 2022-08-19 23:01         ` Alexei Starovoitov
  2022-08-24 19:49           ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 59+ messages in thread
From: Alexei Starovoitov @ 2022-08-19 23:01 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: David S. Miller, Daniel Borkmann, Andrii Nakryiko, Tejun Heo,
	Delyan Kratunov, linux-mm, bpf, Kernel Team

On Fri, Aug 19, 2022 at 3:56 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Sat, 20 Aug 2022 at 00:43, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Sat, Aug 20, 2022 at 12:21:46AM +0200, Kumar Kartikeya Dwivedi wrote:
> > > On Fri, 19 Aug 2022 at 23:43, Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > From: Alexei Starovoitov <ast@kernel.org>
> > > >
> > > > Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
> > > > Then use call_rcu() to wait for normal progs to finish
> > > > and finally do free_one() on each element when freeing objects
> > > > into global memory pool.
> > > >
> > > > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > > > ---
> > >
> > > I fear this can make OOM issues very easy to run into, because one
> > > sleepable prog that sleeps for a long period of time can hold the
> > > freeing of elements from another sleepable prog which either does not
> > > sleep often or sleeps for a very short period of time, and has a high
> > > update frequency. I'm mostly worried that unrelated sleepable programs
> > > not even using the same map will begin to affect each other.
> >
> > 'sleep for long time'? sleepable bpf prog doesn't mean that they can sleep.
> > sleepable progs can copy_from_user, but they're not allowed to waste time.
>
> It is certainly possible to waste time, but indirectly, not through
> the BPF program itself.
>
> If you have userfaultfd enabled (for unpriv users), an unprivileged
> user can trap a sleepable BPF prog (say LSM) using bpf_copy_from_user
> for as long as it wants. A similar case can be done using FUSE, IIRC.
>
> You can then say it's a problem about unprivileged users being able to
> use userfaultfd or FUSE, or we could think about fixing
> bpf_copy_from_user to return -EFAULT for this case, but it is totally
> possible right now for malicious userspace to extend the tasks trace
> gp like this for minutes (or even longer) on a system where sleepable
> BPF programs are using e.g. bpf_copy_from_user.

Well in that sense userfaultfd can keep all sorts of things
in the kernel from making progress.
But nothing to do with OOM.
There is still the max_entries limit.
The amount of objects in waiting_for_gp is guaranteed to be less
than full prealloc.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 13/15] bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.
  2022-08-19 23:01         ` Alexei Starovoitov
@ 2022-08-24 19:49           ` Kumar Kartikeya Dwivedi
  2022-08-25  0:08             ` Alexei Starovoitov
  0 siblings, 1 reply; 59+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-24 19:49 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Daniel Borkmann, Andrii Nakryiko, Tejun Heo,
	Delyan Kratunov, linux-mm, bpf, Kernel Team

On Sat, 20 Aug 2022 at 01:01, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Aug 19, 2022 at 3:56 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Sat, 20 Aug 2022 at 00:43, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Sat, Aug 20, 2022 at 12:21:46AM +0200, Kumar Kartikeya Dwivedi wrote:
> > > > On Fri, 19 Aug 2022 at 23:43, Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > From: Alexei Starovoitov <ast@kernel.org>
> > > > >
> > > > > Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
> > > > > Then use call_rcu() to wait for normal progs to finish
> > > > > and finally do free_one() on each element when freeing objects
> > > > > into global memory pool.
> > > > >
> > > > > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > > > > ---
> > > >
> > > > I fear this can make OOM issues very easy to run into, because one
> > > > sleepable prog that sleeps for a long period of time can hold the
> > > > freeing of elements from another sleepable prog which either does not
> > > > sleep often or sleeps for a very short period of time, and has a high
> > > > update frequency. I'm mostly worried that unrelated sleepable programs
> > > > not even using the same map will begin to affect each other.
> > >
> > > 'sleep for long time'? sleepable bpf prog doesn't mean that they can sleep.
> > > sleepable progs can copy_from_user, but they're not allowed to waste time.
> >
> > It is certainly possible to waste time, but indirectly, not through
> > the BPF program itself.
> >
> > If you have userfaultfd enabled (for unpriv users), an unprivileged
> > user can trap a sleepable BPF prog (say LSM) using bpf_copy_from_user
> > for as long as it wants. A similar case can be done using FUSE, IIRC.
> >
> > You can then say it's a problem about unprivileged users being able to
> > use userfaultfd or FUSE, or we could think about fixing
> > bpf_copy_from_user to return -EFAULT for this case, but it is totally
> > possible right now for malicious userspace to extend the tasks trace
> > gp like this for minutes (or even longer) on a system where sleepable
> > BPF programs are using e.g. bpf_copy_from_user.
>
> Well in that sense userfaultfd can keep all sorts of things
> in the kernel from making progress.
> But nothing to do with OOM.
> There is still the max_entries limit.
> The amount of objects in waiting_for_gp is guaranteed to be less
> than full prealloc.

My thinking was that once you hold the GP using uffd, we can assume
you will eventually hit a case where all such maps on the system have
their max_entries exhausted. So yes, it probably won't OOM, but it
would be bad regardless.

I think this just begs instead that uffd (and even FUSE) should not be
available to untrusted processes on the system by default. Both are
used regularly to widen hard to hit race conditions in the kernel.

But anyway, there's no easy way currently to guarantee the lifetime of
elements for the sleepable case while being as low overhead as trace
RCU, so it makes sense to go ahead with this.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 09/15] bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
  2022-08-19 21:42 ` [PATCH v3 bpf-next 09/15] bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU Alexei Starovoitov
@ 2022-08-24 19:58   ` Kumar Kartikeya Dwivedi
  2022-08-25  0:13     ` Alexei Starovoitov
  0 siblings, 1 reply; 59+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-24 19:58 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: davem, daniel, andrii, tj, delyank, linux-mm, bpf, kernel-team, joel

On Fri, 19 Aug 2022 at 23:43, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> SLAB_TYPESAFE_BY_RCU makes kmem_caches non mergeable and slows down
> kmem_cache_destroy. All bpf_mem_cache are safe to share across different maps
> and programs. Convert SLAB_TYPESAFE_BY_RCU to batched call_rcu. This change
> solves the memory consumption issue, avoids kmem_cache_destroy latency and
> keeps bpf hash map performance the same.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Makes sense, there was a call_rcu_lazy work from Joel (CCed) on doing
this batching using a timer + max batch count instead, I wonder if
that fits our use case and could be useful in the future when it is
merged?

https://lore.kernel.org/rcu/20220713213237.1596225-2-joel@joelfernandes.org

wdyt?

> ---
>  kernel/bpf/memalloc.c | 64 +++++++++++++++++++++++++++++++++++++++++--
>  kernel/bpf/syscall.c  |  5 +++-
>  2 files changed, 65 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 22b729914afe..d765a5cb24b4 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -100,6 +100,11 @@ struct bpf_mem_cache {
>         /* count of objects in free_llist */
>         int free_cnt;
>         int low_watermark, high_watermark, batch;
> +
> +       struct rcu_head rcu;
> +       struct llist_head free_by_rcu;
> +       struct llist_head waiting_for_gp;
> +       atomic_t call_rcu_in_progress;
>  };
>
>  struct bpf_mem_caches {
> @@ -188,6 +193,45 @@ static void free_one(struct bpf_mem_cache *c, void *obj)
>                 kfree(obj);
>  }
>
> +static void __free_rcu(struct rcu_head *head)
> +{
> +       struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
> +       struct llist_node *llnode = llist_del_all(&c->waiting_for_gp);
> +       struct llist_node *pos, *t;
> +
> +       llist_for_each_safe(pos, t, llnode)
> +               free_one(c, pos);
> +       atomic_set(&c->call_rcu_in_progress, 0);
> +}
> +
> +static void enque_to_free(struct bpf_mem_cache *c, void *obj)
> +{
> +       struct llist_node *llnode = obj;
> +
> +       /* bpf_mem_cache is a per-cpu object. Freeing happens in irq_work.
> +        * Nothing races to add to free_by_rcu list.
> +        */
> +       __llist_add(llnode, &c->free_by_rcu);
> +}
> +
> +static void do_call_rcu(struct bpf_mem_cache *c)
> +{
> +       struct llist_node *llnode, *t;
> +
> +       if (atomic_xchg(&c->call_rcu_in_progress, 1))
> +               return;
> +
> +       WARN_ON_ONCE(!llist_empty(&c->waiting_for_gp));
> +       llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu))
> +               /* There is no concurrent __llist_add(waiting_for_gp) access.
> +                * It doesn't race with llist_del_all either.
> +                * But there could be two concurrent llist_del_all(waiting_for_gp):
> +                * from __free_rcu() and from drain_mem_cache().
> +                */
> +               __llist_add(llnode, &c->waiting_for_gp);
> +       call_rcu(&c->rcu, __free_rcu);
> +}
> +
>  static void free_bulk(struct bpf_mem_cache *c)
>  {
>         struct llist_node *llnode, *t;
> @@ -207,12 +251,13 @@ static void free_bulk(struct bpf_mem_cache *c)
>                 local_dec(&c->active);
>                 if (IS_ENABLED(CONFIG_PREEMPT_RT))
>                         local_irq_restore(flags);
> -               free_one(c, llnode);
> +               enque_to_free(c, llnode);
>         } while (cnt > (c->high_watermark + c->low_watermark) / 2);
>
>         /* and drain free_llist_extra */
>         llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra))
> -               free_one(c, llnode);
> +               enque_to_free(c, llnode);
> +       do_call_rcu(c);
>  }
>
>  static void bpf_mem_refill(struct irq_work *work)
> @@ -298,7 +343,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size)
>                         return -ENOMEM;
>                 size += LLIST_NODE_SZ; /* room for llist_node */
>                 snprintf(buf, sizeof(buf), "bpf-%u", size);
> -               kmem_cache = kmem_cache_create(buf, size, 8, SLAB_TYPESAFE_BY_RCU, NULL);
> +               kmem_cache = kmem_cache_create(buf, size, 8, 0, NULL);
>                 if (!kmem_cache) {
>                         free_percpu(pc);
>                         return -ENOMEM;
> @@ -340,6 +385,15 @@ static void drain_mem_cache(struct bpf_mem_cache *c)
>  {
>         struct llist_node *llnode, *t;
>
> +       /* The caller has done rcu_barrier() and no progs are using this
> +        * bpf_mem_cache, but htab_map_free() called bpf_mem_cache_free() for
> +        * all remaining elements and they can be in free_by_rcu or in
> +        * waiting_for_gp lists, so drain those lists now.
> +        */
> +       llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu))
> +               free_one(c, llnode);
> +       llist_for_each_safe(llnode, t, llist_del_all(&c->waiting_for_gp))
> +               free_one(c, llnode);
>         llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist))
>                 free_one(c, llnode);
>         llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra))
> @@ -361,6 +415,10 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
>                 kmem_cache_destroy(c->kmem_cache);
>                 if (c->objcg)
>                         obj_cgroup_put(c->objcg);
> +               /* c->waiting_for_gp list was drained, but __free_rcu might
> +                * still execute. Wait for it now before we free 'c'.
> +                */
> +               rcu_barrier();
>                 free_percpu(ma->cache);
>                 ma->cache = NULL;
>         }
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index a4d40d98428a..850270a72350 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -638,7 +638,10 @@ static void __bpf_map_put(struct bpf_map *map, bool do_idr_lock)
>                 bpf_map_free_id(map, do_idr_lock);
>                 btf_put(map->btf);
>                 INIT_WORK(&map->work, bpf_map_free_deferred);
> -               schedule_work(&map->work);
> +               /* Avoid spawning kworkers, since they all might contend
> +                * for the same mutex like slab_mutex.
> +                */
> +               queue_work(system_unbound_wq, &map->work);
>         }
>  }
>
> --
> 2.30.2
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator.
  2022-08-19 21:42 [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (14 preceding siblings ...)
  2022-08-19 21:42 ` [PATCH v3 bpf-next 15/15] bpf: Introduce sysctl kernel.bpf_force_dyn_alloc Alexei Starovoitov
@ 2022-08-24 20:03 ` Kumar Kartikeya Dwivedi
  2022-08-25  0:16   ` Alexei Starovoitov
  2022-08-25  0:56 ` [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular Delyan Kratunov
  16 siblings, 1 reply; 59+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-24 20:03 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: davem, daniel, andrii, tj, delyank, linux-mm, bpf, kernel-team

On Fri, 19 Aug 2022 at 23:42, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> Introduce any context BPF specific memory allocator.
>
> Tracing BPF programs can attach to kprobe and fentry. Hence they
> run in unknown context where calling plain kmalloc() might not be safe.
> Front-end kmalloc() with per-cpu cache of free elements.
> Refill this cache asynchronously from irq_work.
>
> Major achievements enabled by bpf_mem_alloc:
> - Dynamically allocated hash maps used to be 10 times slower than fully preallocated.
>   With bpf_mem_alloc and subsequent optimizations the speed of dynamic maps is equal to full prealloc.
> - Tracing bpf programs can use dynamically allocated hash maps.
>   Potentially saving lots of memory. Typical hash map is sparsely populated.
> - Sleepable bpf programs can used dynamically allocated hash maps.
>

From my side, for the whole series:
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>


> v2->v3:
> - Rewrote the free_list algorithm based on discussions with Kumar. Patch 1.
> - Allowed sleepable bpf progs use dynamically allocated maps. Patches 13 and 14.
> - Added sysctl to force bpf_mem_alloc in hash map even if pre-alloc is
>   requested to reduce memory consumption. Patch 15.
> - Fix: zero-fill percpu allocation
> - Single rcu_barrier at the end instead of each cpu during bpf_mem_alloc destruction
>
> v2 thread:
> https://lore.kernel.org/bpf/20220817210419.95560-1-alexei.starovoitov@gmail.com/
>
> v1->v2:
> - Moved unsafe direct call_rcu() from hash map into safe place inside bpf_mem_alloc. Patches 7 and 9.
> - Optimized atomic_inc/dec in hash map with percpu_counter. Patch 6.
> - Tuned watermarks per allocation size. Patch 8
> - Adopted this approach to per-cpu allocation. Patch 10.
> - Fully converted hash map to bpf_mem_alloc. Patch 11.
> - Removed tracing prog restriction on map types. Combination of all patches and final patch 12.
>
> v1 thread:
> https://lore.kernel.org/bpf/20220623003230.37497-1-alexei.starovoitov@gmail.com/
>
> LWN article:
> https://lwn.net/Articles/899274/
>
> Future work:
> - expose bpf_mem_alloc as uapi FD to be used in dynptr_alloc, kptr_alloc
> - convert lru map to bpf_mem_alloc
>
> Alexei Starovoitov (15):
>   bpf: Introduce any context BPF specific memory allocator.
>   bpf: Convert hash map to bpf_mem_alloc.
>   selftests/bpf: Improve test coverage of test_maps
>   samples/bpf: Reduce syscall overhead in map_perf_test.
>   bpf: Relax the requirement to use preallocated hash maps in tracing
>     progs.
>   bpf: Optimize element count in non-preallocated hash map.
>   bpf: Optimize call_rcu in non-preallocated hash map.
>   bpf: Adjust low/high watermarks in bpf_mem_cache
>   bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
>   bpf: Add percpu allocation support to bpf_mem_alloc.
>   bpf: Convert percpu hash map to per-cpu bpf_mem_alloc.
>   bpf: Remove tracing program restriction on map types
>   bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.
>   bpf: Remove prealloc-only restriction for sleepable bpf programs.
>   bpf: Introduce sysctl kernel.bpf_force_dyn_alloc.
>
>  include/linux/bpf_mem_alloc.h             |  26 +
>  include/linux/filter.h                    |   2 +
>  kernel/bpf/Makefile                       |   2 +-
>  kernel/bpf/core.c                         |   2 +
>  kernel/bpf/hashtab.c                      | 132 +++--
>  kernel/bpf/memalloc.c                     | 601 ++++++++++++++++++++++
>  kernel/bpf/syscall.c                      |  14 +-
>  kernel/bpf/verifier.c                     |  52 --
>  samples/bpf/map_perf_test_kern.c          |  44 +-
>  samples/bpf/map_perf_test_user.c          |   2 +-
>  tools/testing/selftests/bpf/progs/timer.c |  11 -
>  tools/testing/selftests/bpf/test_maps.c   |  38 +-
>  12 files changed, 795 insertions(+), 131 deletions(-)
>  create mode 100644 include/linux/bpf_mem_alloc.h
>  create mode 100644 kernel/bpf/memalloc.c
>
> --
> 2.30.2
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 13/15] bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.
  2022-08-24 19:49           ` Kumar Kartikeya Dwivedi
@ 2022-08-25  0:08             ` Alexei Starovoitov
  0 siblings, 0 replies; 59+ messages in thread
From: Alexei Starovoitov @ 2022-08-25  0:08 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: David S. Miller, Daniel Borkmann, Andrii Nakryiko, Tejun Heo,
	Delyan Kratunov, linux-mm, bpf, Kernel Team

On Wed, Aug 24, 2022 at 12:50 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Sat, 20 Aug 2022 at 01:01, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Fri, Aug 19, 2022 at 3:56 PM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> > >
> > > On Sat, 20 Aug 2022 at 00:43, Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Sat, Aug 20, 2022 at 12:21:46AM +0200, Kumar Kartikeya Dwivedi wrote:
> > > > > On Fri, 19 Aug 2022 at 23:43, Alexei Starovoitov
> > > > > <alexei.starovoitov@gmail.com> wrote:
> > > > > >
> > > > > > From: Alexei Starovoitov <ast@kernel.org>
> > > > > >
> > > > > > Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
> > > > > > Then use call_rcu() to wait for normal progs to finish
> > > > > > and finally do free_one() on each element when freeing objects
> > > > > > into global memory pool.
> > > > > >
> > > > > > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > > > > > ---
> > > > >
> > > > > I fear this can make OOM issues very easy to run into, because one
> > > > > sleepable prog that sleeps for a long period of time can hold the
> > > > > freeing of elements from another sleepable prog which either does not
> > > > > sleep often or sleeps for a very short period of time, and has a high
> > > > > update frequency. I'm mostly worried that unrelated sleepable programs
> > > > > not even using the same map will begin to affect each other.
> > > >
> > > > 'sleep for long time'? sleepable bpf prog doesn't mean that they can sleep.
> > > > sleepable progs can copy_from_user, but they're not allowed to waste time.
> > >
> > > It is certainly possible to waste time, but indirectly, not through
> > > the BPF program itself.
> > >
> > > If you have userfaultfd enabled (for unpriv users), an unprivileged
> > > user can trap a sleepable BPF prog (say LSM) using bpf_copy_from_user
> > > for as long as it wants. A similar case can be done using FUSE, IIRC.
> > >
> > > You can then say it's a problem about unprivileged users being able to
> > > use userfaultfd or FUSE, or we could think about fixing
> > > bpf_copy_from_user to return -EFAULT for this case, but it is totally
> > > possible right now for malicious userspace to extend the tasks trace
> > > gp like this for minutes (or even longer) on a system where sleepable
> > > BPF programs are using e.g. bpf_copy_from_user.
> >
> > Well in that sense userfaultfd can keep all sorts of things
> > in the kernel from making progress.
> > But nothing to do with OOM.
> > There is still the max_entries limit.
> > The amount of objects in waiting_for_gp is guaranteed to be less
> > than full prealloc.
>
> My thinking was that once you hold the GP using uffd, we can assume
> you will eventually hit a case where all such maps on the system have
> their max_entries exhausted. So yes, it probably won't OOM, but it
> would be bad regardless.
>
> I think this just begs instead that uffd (and even FUSE) should not be
> available to untrusted processes on the system by default. Both are
> used regularly to widen hard to hit race conditions in the kernel.
>
> But anyway, there's no easy way currently to guarantee the lifetime of
> elements for the sleepable case while being as low overhead as trace
> RCU, so it makes sense to go ahead with this.

Right. We evaluated SRCU for sleepable progs and it had too much overhead.
That's the reason rcu_tasks_trace was added, and sleepable bpf progs
are its only user so far.
The point I'm arguing is that call_rcu_tasks_trace in this patch
doesn't add more mm concerns than the existing call_rcu.
There are also CONFIG_PREEMPT_RCU and RT; uffd will cause similar
issues in such configs too.
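
To spell out the ordering being discussed, the chaining is roughly the
following (an illustrative sketch only, not the actual memalloc.c code;
struct obj_batch and free_objects() are placeholders for the real cache
structures):

#include <linux/llist.h>
#include <linux/rcupdate.h>
#include <linux/rcupdate_trace.h>

struct obj_batch {
        struct rcu_head rcu;
        struct llist_node *objs;        /* elements freed by bpf progs */
};

static void free_one_cb(struct rcu_head *head)
{
        struct obj_batch *b = container_of(head, struct obj_batch, rcu);

        /* Both a tasks-trace and a regular RCU grace period have passed:
         * neither sleepable nor non-sleepable progs can still see these
         * elements, so they can finally be freed.
         */
        free_objects(b->objs);          /* placeholder */
}

static void rcu_trace_cb(struct rcu_head *head)
{
        /* Sleepable progs (rcu_read_lock_trace sections) are done;
         * now wait for normal progs (rcu_read_lock sections).
         */
        call_rcu(head, free_one_cb);
}

static void defer_free(struct obj_batch *b)
{
        call_rcu_tasks_trace(&b->rcu, rcu_trace_cb);
}

In this sketch the tasks-trace grace period is simply chained in front of
the existing call_rcu on the same batch, so the amount of memory in flight
stays bounded the same way as before; only the wait gets longer.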

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 09/15] bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
  2022-08-24 19:58   ` Kumar Kartikeya Dwivedi
@ 2022-08-25  0:13     ` Alexei Starovoitov
  2022-08-25  0:35       ` Joel Fernandes
  0 siblings, 1 reply; 59+ messages in thread
From: Alexei Starovoitov @ 2022-08-25  0:13 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: David S. Miller, Daniel Borkmann, Andrii Nakryiko, Tejun Heo,
	Delyan Kratunov, linux-mm, bpf, Kernel Team, Joel Fernandes

On Wed, Aug 24, 2022 at 12:59 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Fri, 19 Aug 2022 at 23:43, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > SLAB_TYPESAFE_BY_RCU makes kmem_caches non mergeable and slows down
> > kmem_cache_destroy. All bpf_mem_cache are safe to share across different maps
> > and programs. Convert SLAB_TYPESAFE_BY_RCU to batched call_rcu. This change
> > solves the memory consumption issue, avoids kmem_cache_destroy latency and
> > keeps bpf hash map performance the same.
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>
> Makes sense, there was a call_rcu_lazy work from Joel (CCed) on doing
> this batching using a timer + max batch count instead, I wonder if
> that fits our use case and could be useful in the future when it is
> merged?
>
> https://lore.kernel.org/rcu/20220713213237.1596225-2-joel@joelfernandes.org

Thanks for the pointer. It looks orthogonal.
Timer-based call_rcu is for power savings.
I'm not sure how it would help here; it probably wouldn't hurt.
But an explicit waiting_for_gp list is necessary here,
because two later patches (sleepable support and per-cpu rcu-safe
freeing) rely on this patch.
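
The shape of that batching is roughly the following (heavily simplified
illustration, not the actual patch; the real code also deals with per-cpu
caches, watermarks and irq_work, and reuse_or_free_batch() is just a
placeholder):

#include <linux/llist.h>
#include <linux/rcupdate.h>

struct cache_sketch {
        struct llist_head free_llist;           /* filled by unit_free() */
        struct llist_node *waiting_for_gp;      /* detached batch waiting for a GP */
        struct rcu_head rcu;
};

static void batch_done_cb(struct rcu_head *head)
{
        struct cache_sketch *c = container_of(head, struct cache_sketch, rcu);

        /* One grace period covers the whole batch. */
        reuse_or_free_batch(c->waiting_for_gp);         /* placeholder */
        c->waiting_for_gp = NULL;
}

/* Assume this runs from a single context, so the rcu_head is not reused
 * while a previous batch is still waiting for its grace period.
 */
static void flush_free_llist(struct cache_sketch *c)
{
        if (c->waiting_for_gp)
                return;
        c->waiting_for_gp = llist_del_all(&c->free_llist);
        if (c->waiting_for_gp)
                call_rcu(&c->rcu, batch_done_cb);
}

One call_rcu per batch instead of one per element, with the elements sitting
on an explicit waiting_for_gp list that the later patches can hook into.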

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator.
  2022-08-24 20:03 ` [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator Kumar Kartikeya Dwivedi
@ 2022-08-25  0:16   ` Alexei Starovoitov
  0 siblings, 0 replies; 59+ messages in thread
From: Alexei Starovoitov @ 2022-08-25  0:16 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: David S. Miller, Daniel Borkmann, Andrii Nakryiko, Tejun Heo,
	Delyan Kratunov, linux-mm, bpf, Kernel Team

On Wed, Aug 24, 2022 at 1:03 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Fri, 19 Aug 2022 at 23:42, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Introduce any context BPF specific memory allocator.
> >
> > Tracing BPF programs can attach to kprobe and fentry. Hence they
> > run in unknown context where calling plain kmalloc() might not be safe.
> > Front-end kmalloc() with per-cpu cache of free elements.
> > Refill this cache asynchronously from irq_work.
> >
> > Major achievements enabled by bpf_mem_alloc:
> > - Dynamically allocated hash maps used to be 10 times slower than fully preallocated.
> >   With bpf_mem_alloc and subsequent optimizations the speed of dynamic maps is equal to full prealloc.
> > - Tracing bpf programs can use dynamically allocated hash maps.
> >   Potentially saving lots of memory. Typical hash map is sparsely populated.
> > - Sleepable bpf programs can used dynamically allocated hash maps.
> >
>
> From my side, for the whole series:
> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

Thanks a bunch for all the suggestions,
especially for the ideas that led to the rewrite of patch 1.
It looks much simpler now.

I missed #include <asm/local.h> in patch 1.
In the respin I'm going to keep your Ack if you don't mind.

Thanks!

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 09/15] bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
  2022-08-25  0:13     ` Alexei Starovoitov
@ 2022-08-25  0:35       ` Joel Fernandes
  2022-08-25  0:49         ` Joel Fernandes
  0 siblings, 1 reply; 59+ messages in thread
From: Joel Fernandes @ 2022-08-25  0:35 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kumar Kartikeya Dwivedi, David S. Miller, Daniel Borkmann,
	Andrii Nakryiko, Tejun Heo, Delyan Kratunov, linux-mm, bpf,
	Kernel Team, Daniel Bristot de Oliveira

On Wed, Aug 24, 2022 at 8:14 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Aug 24, 2022 at 12:59 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Fri, 19 Aug 2022 at 23:43, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > From: Alexei Starovoitov <ast@kernel.org>
> > >
> > > SLAB_TYPESAFE_BY_RCU makes kmem_caches non mergeable and slows down
> > > kmem_cache_destroy. All bpf_mem_cache are safe to share across different maps
> > > and programs. Convert SLAB_TYPESAFE_BY_RCU to batched call_rcu. This change
> > > solves the memory consumption issue, avoids kmem_cache_destroy latency and
> > > keeps bpf hash map performance the same.
> > >
> > > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> >
> > Makes sense, there was a call_rcu_lazy work from Joel (CCed) on doing
> > this batching using a timer + max batch count instead, I wonder if
> > that fits our use case and could be useful in the future when it is
> > merged?
> >
> > https://lore.kernel.org/rcu/20220713213237.1596225-2-joel@joelfernandes.org
>
> Thanks for the pointer. It looks orthogonal.
> timer based call_rcu is for power savings.
> I'm not sure how it would help here. Probably wouldn't hurt.
> But explicit waiting_for_gp list is necessary here,
> because two later patches (sleepable support and per-cpu rcu-safe
> freeing) are relying on this patch.

Hello Kumar and Alexei,

Kumar thanks for the CC. I am seeing this BPF work for the first time
so have not gone over it too much - but in case the waiting is
synchronous by any chance, call_rcu_lazy() could hurt. The idea is to
only queue callbacks that are not all that important to the system
while keeping it quiet (power being the primary reason but Daniel
Bristot would concur it brings down OS noise and helps RT as well).

Have a good evening, Thank you,

 - Joel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 09/15] bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
  2022-08-25  0:35       ` Joel Fernandes
@ 2022-08-25  0:49         ` Joel Fernandes
  0 siblings, 0 replies; 59+ messages in thread
From: Joel Fernandes @ 2022-08-25  0:49 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kumar Kartikeya Dwivedi, David S. Miller, Daniel Borkmann,
	Andrii Nakryiko, Tejun Heo, Delyan Kratunov, linux-mm, bpf,
	Kernel Team, Daniel Bristot de Oliveira

On Wed, Aug 24, 2022 at 8:35 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Wed, Aug 24, 2022 at 8:14 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Wed, Aug 24, 2022 at 12:59 PM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> > >
> > > On Fri, 19 Aug 2022 at 23:43, Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > From: Alexei Starovoitov <ast@kernel.org>
> > > >
> > > > SLAB_TYPESAFE_BY_RCU makes kmem_caches non mergeable and slows down
> > > > kmem_cache_destroy. All bpf_mem_cache are safe to share across different maps
> > > > and programs. Convert SLAB_TYPESAFE_BY_RCU to batched call_rcu. This change
> > > > solves the memory consumption issue, avoids kmem_cache_destroy latency and
> > > > keeps bpf hash map performance the same.
> > > >
> > > > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > >
> > > Makes sense, there was a call_rcu_lazy work from Joel (CCed) on doing
> > > this batching using a timer + max batch count instead, I wonder if
> > > that fits our use case and could be useful in the future when it is
> > > merged?
> > >
> > > https://lore.kernel.org/rcu/20220713213237.1596225-2-joel@joelfernandes.org
> >
> > Thanks for the pointer. It looks orthogonal.
> > timer based call_rcu is for power savings.
> > I'm not sure how it would help here. Probably wouldn't hurt.
> > But explicit waiting_for_gp list is necessary here,
> > because two later patches (sleepable support and per-cpu rcu-safe
> > freeing) are relying on this patch.
>
> Hello Kumar and Alexei,
>
> Kumar thanks for the CC. I am seeing this BPF work for the first time
> so have not gone over it too much - but in case the waiting is
> synchronous by any chance, call_rcu_lazy() could hurt. The idea is to
> only queue callbacks that are not all that important to the system
> while keeping it quiet (power being the primary reason but Daniel
> Bristot would concur it brings down OS noise and helps RT as well).

Just as an FYI, I see rcu_barrier() used in Alexei's patch - that will
flush the lazy CBs to keep rcu_barrier() both correct and performant.
At that point call_rcu_lazy() is equivalent to call_rcu(), as we no
longer keep the callbacks a secret from the rest of the system.

Thanks,

 -  Joel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular
  2022-08-19 21:42 [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (15 preceding siblings ...)
  2022-08-24 20:03 ` [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator Kumar Kartikeya Dwivedi
@ 2022-08-25  0:56 ` Delyan Kratunov
  2022-08-26  4:03   ` Kumar Kartikeya Dwivedi
  16 siblings, 1 reply; 59+ messages in thread
From: Delyan Kratunov @ 2022-08-25  0:56 UTC (permalink / raw)
  To: davem, alexei.starovoitov
  Cc: tj, joannelkoong, andrii, daniel, memxor, Dave Marchevsky,
	Kernel Team, bpf

Alexei and I spent some time today going back and forth on what the uapi to this
allocator should look like in a BPF program. To the surprise of both of us, the problem
space became far more complicated than we anticipated.

There are three primary problems we have to solve:
1) Knowing which allocator an object came from, so we can safely reclaim it when
necessary (e.g., freeing a map).
2) Type confusion between local and kernel types. (I.e., a program allocating kernel
types and passing them to helpers/kfuncs that don't expect them). This is especially
important because the existing kptr mechanism assumes kernel types everywhere.
3) Allocated objects' lifetimes, allocator refcounting, etc. It all gets very hairy
when you allow allocated objects in pinned maps.

This is the proposed design that we landed on:

1. Allocators get their own MAP_TYPE_ALLOCATOR, so you can specify initial capacity
at creation time. Value_size > 0 takes the kmem_cache path. Probably with
btf_value_type_id enforcement for the kmem_cache path.

2. The helper APIs are just bpf_obj_alloc(bpf_map *, bpf_core_type_id_local(struct
foo)) and bpf_obj_free(void *). Note that obj_free() only takes an object pointer.

3. To avoid mixing BTF type domains, a new type tag (provisionally __kptr_local)
annotates fields that can hold values with verifier type `PTR_TO_BTF_ID |
BTF_ID_LOCAL`. obj_alloc only ever returns these local kptrs and only ever resolves
against program-local btf (in the verifier, at runtime it only gets an allocation
size). 
3.1. If eventually we need to pass these objects to kfuncs/helpers, we can introduce
a new bpf_obj_export helper that takes a PTR_TO_LOCAL_BTF_ID and returns the
corresponding PTR_TO_BTF_ID, after verifying against an allowlist of some kind. This
would be the only place these objects can leak out of bpf land. If there's no runtime
aspect (and there likely wouldn't be), we might consider doing this transparently,
still against an allowlist of types.

4. To ensure the allocator stays alive while objects from it are alive, we must be
able to identify which allocator each __kptr_local pointer came from, and we must
keep the refcount up while any such values are alive. One concern here is that doing
the refcount manipulation in kptr_xchg would be too expensive. The proposed solution
is to: 
4.1 Keep a struct bpf_mem_alloc* in the header before the returned object pointer
from bpf_mem_alloc(). This way we never lose track which bpf_mem_alloc to return the
object to and can simplify the bpf_obj_free() call.
4.2. Tracking used_allocators in each bpf_map. When unloading a program, we would
walk all maps that the program has access to (that have kptr_local fields), walk each
value and ensure that any allocators not already in the map's used_allocators are
refcount_inc'd and added to the list. Do note that allocators are also kept alive by
their bpf_map wrapper but after that's gone, used_allocators is the main mechanism.
Once the bpf_map is gone, the allocator cannot be used to allocate new objects, we
can only return objects to it.
4.3. On map free, we walk and obj_free() all the __kptr_local fields, then
refcount_dec all the used_allocators.

Overall, we think this handles all the nasty corners - objects escaping into
kfuncs/helpers when they shouldn't, pinned maps containing pointers to allocations,
programs accessing multiple allocators having deterministic freelist behaviors -
while keeping the API and complexity sane. The used_allocators approach can certainly
be made less conservative (or even fully precise), but for a v1 that's probably overkill.
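
To make the proposed API concrete, a program using it would look roughly
like the sketch below. Everything marked provisional (BPF_MAP_TYPE_ALLOCATOR,
__kptr_local, bpf_obj_alloc(), bpf_obj_free()) is just the naming from this
email and is declared locally with made-up values so the sketch hangs
together; bpf_kptr_xchg(), bpf_map_lookup_elem() and bpf_core_type_id_local()
already exist.

// SPDX-License-Identifier: GPL-2.0
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

/* Provisional pieces from this proposal; nothing below exists yet. */
#define BPF_MAP_TYPE_ALLOCATOR 100      /* made-up map type value */
#define __kptr_local __attribute__((btf_type_tag("kptr_local")))
static void *(*bpf_obj_alloc)(void *alloc_map, __u64 local_type_id) = (void *)998; /* made-up id */
static void (*bpf_obj_free)(void *obj) = (void *)999;                              /* made-up id */

struct foo {
        long data;
};

struct {
        __uint(type, BPF_MAP_TYPE_ALLOCATOR);
        __uint(max_entries, 256);       /* initial capacity */
        __type(value, struct foo);      /* value_size > 0: kmem_cache path */
} allocator SEC(".maps");

struct map_val {
        struct foo __kptr_local *obj;   /* verifier type PTR_TO_BTF_ID | BTF_ID_LOCAL */
};

struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 128);
        __type(key, int);
        __type(value, struct map_val);
} storage SEC(".maps");

SEC("tp_btf/sched_switch")
int use_local_alloc(void *ctx)
{
        struct map_val *val;
        struct foo *obj, *old;
        int key = 0;

        obj = bpf_obj_alloc(&allocator, bpf_core_type_id_local(struct foo));
        if (!obj)
                return 0;
        obj->data = 42;

        val = bpf_map_lookup_elem(&storage, &key);
        if (!val) {
                bpf_obj_free(obj);
                return 0;
        }
        /* Move ownership into the map; whatever was stored before comes back. */
        old = bpf_kptr_xchg(&val->obj, obj);
        if (old)
                bpf_obj_free(old);
        return 0;
}

char LICENSE[] SEC("license") = "GPL";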

Please, feel free to shoot holes in this design! We tried to capture everything but
I'd love confirmation that we didn't miss anything.

--Delyan

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular
  2022-08-25  0:56 ` [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular Delyan Kratunov
@ 2022-08-26  4:03   ` Kumar Kartikeya Dwivedi
  2022-08-29 21:23     ` Delyan Kratunov
  2022-08-29 21:29     ` Delyan Kratunov
  0 siblings, 2 replies; 59+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-26  4:03 UTC (permalink / raw)
  To: Delyan Kratunov
  Cc: davem, alexei.starovoitov, tj, joannelkoong, andrii, daniel,
	Dave Marchevsky, Kernel Team, bpf

On Thu, 25 Aug 2022 at 02:56, Delyan Kratunov <delyank@fb.com> wrote:
>
> Alexei and I spent some time today going back and forth on what the uapi to this
> allocator should look like in a BPF program. To both of our surprise, the problem
> space became far more complicated than we anticipated.
>
> There are three primary problems we have to solve:
> 1) Knowing which allocator an object came from, so we can safely reclaim it when
> necessary (e.g., freeing a map).
> 2) Type confusion between local and kernel types. (I.e., a program allocating kernel
> types and passing them to helpers/kfuncs that don't expect them). This is especially
> important because the existing kptr mechanism assumes kernel types everywhere.

Why is the btf_is_kernel(reg->btf) check not enough to distinguish
local vs kernel kptr?
We add that wherever kfunc/helpers verify the PTR_TO_BTF_ID right now.

Fun fact: I added a similar check on purpose in map_kptr_match_type,
since Alexei mentioned back then he was working on a local type
allocator, so forgetting to add it later would have been a problem.

> 3) Allocated objects lifetimes, allocator refcounting, etc. It all gets very hairy
> when you allow allocated objects in pinned maps.
>
> This is the proposed design that we landed on:
>
> 1. Allocators get their own MAP_TYPE_ALLOCATOR, so you can specify initial capacity
> at creation time. Value_size > 0 takes the kmem_cache path. Probably with
> btf_value_type_id enforcement for the kmem_cache path.
>
> 2. The helper APIs are just bpf_obj_alloc(bpf_map *, bpf_core_type_id_local(struct
> foo)) and bpf_obj_free(void *). Note that obj_free() only takes an object pointer.
>
> 3. To avoid mixing BTF type domains, a new type tag (provisionally __kptr_local)
> annotates fields that can hold values with verifier type `PTR_TO_BTF_ID |
> BTF_ID_LOCAL`. obj_alloc only ever returns these local kptrs and only ever resolves
> against program-local btf (in the verifier, at runtime it only gets an allocation
> size).

This is ok too, but I think just gating everywhere with btf_is_kernel
would be fine as well.

> 3.1. If eventually we need to pass these objects to kfuncs/helpers, we can introduce
> a new bpf_obj_export helper that takes a PTR_TO_LOCAL_BTF_ID and returns the
> corresponding PTR_TO_BTF_ID, after verifying against an allowlist of some kind. This

It would be fine to allow passing if it is just plain data (e.g. what
scalar_struct check does for kfuncs).
There we had the issue where it can take PTR_TO_MEM, PTR_TO_BTF_ID,
etc. so it was necessary to restrict the kind of type to LCD.

But we don't have to do it from day 1, just listing what should be ok.

> would be the only place these objects can leak out of bpf land. If there's no runtime
> aspect (and there likely wouldn't be), we might consider doing this transparently,
> still against an allowlist of types.
>
> 4. To ensure the allocator stays alive while objects from it are alive, we must be
> able to identify which allocator each __kptr_local pointer came from, and we must
> keep the refcount up while any such values are alive. One concern here is that doing
> the refcount manipulation in kptr_xchg would be too expensive. The proposed solution
> is to:
> 4.1 Keep a struct bpf_mem_alloc* in the header before the returned object pointer
> from bpf_mem_alloc(). This way we never lose track which bpf_mem_alloc to return the
> object to and can simplify the bpf_obj_free() call.
> 4.2. Tracking used_allocators in each bpf_map. When unloading a program, we would
> walk all maps that the program has access to (that have kptr_local fields), walk each
> value and ensure that any allocators not already in the map's used_allocators are
> refcount_inc'd and added to the list. Do note that allocators are also kept alive by
> their bpf_map wrapper but after that's gone, used_allocators is the main mechanism.
> Once the bpf_map is gone, the allocator cannot be used to allocate new objects, we
> can only return objects to it.
> 4.3. On map free, we walk and obj_free() all the __kptr_local fields, then
> refcount_dec all the used_allocators.
>

So to summarize your approach:
Each allocation has a bpf_mem_alloc pointer before it to track its
owner allocator.
We know used_maps of each prog, so during unload of program, walk all
local kptrs in each used_maps map values, and that map takes a
reference to the allocator stashing it in used_allocators list,
because prog is going to relinquish its ref to allocator_map (which if
it were the last one would release allocator reference as well for
local kptrs held by those maps).
Once prog is gone, the allocator is kept alive by other maps holding
objects allocated from it. References to the allocator are taken
lazily when required.
Did I get it right?

I see two problems: the first is concurrency. When walking each value,
it is going to be hard to ensure the kptr field remains stable while
you load and take ref to its allocator. Some other programs may also
have access to the map value and may concurrently change the kptr
field (xchg and even release it). How do we safely do a refcount_inc
of its allocator?

For the second problem, consider this:
obj = bpf_obj_alloc(&alloc_map, ...);
inner_map = bpf_map_lookup_elem(&map_in_map, ...);
map_val = bpf_map_lookup_elem(inner_map, ...);
kptr_xchg(&map_val->kptr, obj);

Now delete the entry holding that inner_map, but keep its fd open.
Unload the program; since it is map-in-map, there is no way to fill used_allocators.
alloc_map is freed, which releases the reference on the allocator, and the allocator is freed.
Now close(inner_map_fd) and inner_map is freed: either a bad unit_free or a memory leak.
Is there a way to prevent this in your scheme?

--

I had another idea, but it's not _completely_ 0 overhead. Heavy
prototyping so I might be missing corner cases.
It is to take a reference on each allocation and deallocation. Yes,
naive and slow if using atomics, but we can use percpu_ref instead
of an atomic refcount for the allocator: percpu_ref_get/put on
each unit_alloc/unit_free.
The problem though is that once initial reference is killed, it
downgrades to atomic, which will kill performance. So we need to be
smart about how that initial reference is managed.
My idea is that the initial ref is taken and killed by the allocator
bpf_map pinning the allocator. Once that bpf_map is gone, you cannot
do any more allocations anyway (since you need to pass the map pointer
to bpf_obj_alloc), so once it downgrades to atomics at that point we
will only be releasing the references after freeing its allocated
objects. Yes, then the free path becomes a bit costly after the
allocator map is gone.
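
In code, the shape I have in mind is roughly this (prototype-level sketch
only; bpf_mem_alloc has no such field today, and real_unit_alloc(),
real_unit_free() and destroy_caches() are placeholders for the existing
cache logic):

#include <linux/gfp.h>
#include <linux/kernel.h>
#include <linux/percpu-refcount.h>

struct sketch_alloc {
        struct percpu_ref ref;          /* initial ref owned by the allocator bpf_map */
        /* ... per-cpu caches ... */
};

static void sketch_release(struct percpu_ref *ref)
{
        struct sketch_alloc *a = container_of(ref, struct sketch_alloc, ref);

        /* Last outstanding object was freed after the allocator map died. */
        destroy_caches(a);              /* placeholder */
}

static int sketch_init(struct sketch_alloc *a)
{
        /* Percpu fast path until percpu_ref_kill() is called. */
        return percpu_ref_init(&a->ref, sketch_release, 0, GFP_KERNEL);
}

static void *sketch_unit_alloc(struct sketch_alloc *a)
{
        void *obj = real_unit_alloc(a); /* placeholder */

        if (obj)
                percpu_ref_get(&a->ref);        /* this-cpu increment, no atomics */
        return obj;
}

static void sketch_unit_free(struct sketch_alloc *a, void *obj)
{
        real_unit_free(a, obj);         /* placeholder */
        percpu_ref_put(&a->ref);        /* atomic only after the kill below */
}

/* Called when the allocator bpf_map goes away: no new allocations are
 * possible after this, so the downgrade to atomic only affects the
 * remaining frees of outstanding objects.
 */
static void sketch_map_free(struct sketch_alloc *a)
{
        percpu_ref_kill(&a->ref);
}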

We might be able to remove the cost on the free path as well using the
used_allocators scheme from above (to delay percpu_ref_kill), but it
is not clear how to safely increment the ref of the allocator from a map
value...

wdyt?

> Overall, we think this handles all the nasty corners - objects escaping into
> kfuncs/helpers when they shouldn't, pinned maps containing pointers to allocations,
> programs accessing multiple allocators having deterministic freelist behaviors -
> while keeping the API and complexity sane. The used_allocators approach can certainly
> be less conservative (or can be even precise) but for a v1 that's probably overkill.
>
> Please, feel free to shoot holes in this design! We tried to capture everything but
> I'd love confirmation that we didn't miss anything.
>
> --Delyan

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular
  2022-08-26  4:03   ` Kumar Kartikeya Dwivedi
@ 2022-08-29 21:23     ` Delyan Kratunov
  2022-08-29 21:29     ` Delyan Kratunov
  1 sibling, 0 replies; 59+ messages in thread
From: Delyan Kratunov @ 2022-08-29 21:23 UTC (permalink / raw)
  To: memxor
  Cc: tj, andrii, joannelkoong, davem, daniel, Dave Marchevsky,
	alexei.starovoitov, Kernel Team, bpf

Thanks for taking a look, Kumar!

On Fri, 2022-08-26 at 06:03 +0200, Kumar Kartikeya Dwivedi wrote:
> 
> On Thu, 25 Aug 2022 at 02:56, Delyan Kratunov <delyank@fb.com> wrote:
> > 
> > Alexei and I spent some time today going back and forth on what the uapi to this
> > allocator should look like in a BPF program. To both of our surprise, the problem
> > space became far more complicated than we anticipated.
> > 
> > There are three primary problems we have to solve:
> > 1) Knowing which allocator an object came from, so we can safely reclaim it when
> > necessary (e.g., freeing a map).
> > 2) Type confusion between local and kernel types. (I.e., a program allocating kernel
> > types and passing them to helpers/kfuncs that don't expect them). This is especially
> > important because the existing kptr mechanism assumes kernel types everywhere.
> 
> Why is the btf_is_kernel(reg->btf) check not enough to distinguish
> local vs kernel kptr?

Answered below.

> We add that wherever kfunc/helpers verify the PTR_TO_BTF_ID right now.
> 
> Fun fact: I added a similar check on purpose in map_kptr_match_type,
> since Alexei mentioned back then he was working on a local type
> allocator, so forgetting to add it later would have been a problem.
> 
> > 3) Allocated objects lifetimes, allocator refcounting, etc. It all gets very hairy
> > when you allow allocated objects in pinned maps.
> > 
> > This is the proposed design that we landed on:
> > 
> > 1. Allocators get their own MAP_TYPE_ALLOCATOR, so you can specify initial capacity
> > at creation time. Value_size > 0 takes the kmem_cache path. Probably with
> > btf_value_type_id enforcement for the kmem_cache path.
> > 
> > 2. The helper APIs are just bpf_obj_alloc(bpf_map *, bpf_core_type_id_local(struct
> > foo)) and bpf_obj_free(void *). Note that obj_free() only takes an object pointer.
> > 
> > 3. To avoid mixing BTF type domains, a new type tag (provisionally __kptr_local)
> > annotates fields that can hold values with verifier type `PTR_TO_BTF_ID |
> > BTF_ID_LOCAL`. obj_alloc only ever returns these local kptrs and only ever resolves
> > against program-local btf (in the verifier, at runtime it only gets an allocation
> > size).
> 
> This is ok too, but I think just gating everywhere with btf_is_kernel
> would be fine as well.


Yeah, I can get behind not using BTF_LOCAL_ID as a type flag and just encoding that
in the btf field of the register/stack slot/kptr/helper proto. That said, we still
need the new type tag to tell the map btf parsing code to use the local btf in the
kptr descriptor.

> 
> > 3.1. If eventually we need to pass these objects to kfuncs/helpers, we can introduce
> > a new bpf_obj_export helper that takes a PTR_TO_LOCAL_BTF_ID and returns the
> > corresponding PTR_TO_BTF_ID, after verifying against an allowlist of some kind. This
> 
> It would be fine to allow passing if it is just plain data (e.g. what
> scalar_struct check does for kfuncs).
> There we had the issue where it can take PTR_TO_MEM, PTR_TO_BTF_ID,
> etc. so it was necessary to restrict the kind of type to LCD.
> 
> But we don't have to do it from day 1, just listing what should be ok.

That's a good call, I'll add it to the initial can-transition-to-kernel-kptr logic.

> 
> > would be the only place these objects can leak out of bpf land. If there's no runtime
> > aspect (and there likely wouldn't be), we might consider doing this transparently,
> > still against an allowlist of types.
> > 
> > 4. To ensure the allocator stays alive while objects from it are alive, we must be
> > able to identify which allocator each __kptr_local pointer came from, and we must
> > keep the refcount up while any such values are alive. One concern here is that doing
> > the refcount manipulation in kptr_xchg would be too expensive. The proposed solution
> > is to:
> > 4.1 Keep a struct bpf_mem_alloc* in the header before the returned object pointer
> > from bpf_mem_alloc(). This way we never lose track which bpf_mem_alloc to return the
> > object to and can simplify the bpf_obj_free() call.
> > 4.2. Tracking used_allocators in each bpf_map. When unloading a program, we would
> > walk all maps that the program has access to (that have kptr_local fields), walk each
> > value and ensure that any allocators not already in the map's used_allocators are
> > refcount_inc'd and added to the list. Do note that allocators are also kept alive by
> > their bpf_map wrapper but after that's gone, used_allocators is the main mechanism.
> > Once the bpf_map is gone, the allocator cannot be used to allocate new objects, we
> > can only return objects to it.
> > 4.3. On map free, we walk and obj_free() all the __kptr_local fields, then
> > refcount_dec all the used_allocators.
> > 
> 
> So to summarize your approach:
> Each allocation has a bpf_mem_alloc pointer before it to track its
> owner allocator.
> We know used_maps of each prog, so during unload of program, walk all
> local kptrs in each used_maps map values, and that map takes a
> reference to the allocator stashing it in used_allocators list,
> because prog is going to relinquish its ref to allocator_map (which if
> it were the last one would release allocator reference as well for
> local kptrs held by those maps).
> Once prog is gone, the allocator is kept alive by other maps holding
> objects allocated from it. References to the allocator are taken
> lazily when required.
> Did I get it right?

That's correct!

> 
> I see two problems: the first is concurrency. When walking each value,
> it is going to be hard to ensure the kptr field remains stable while
> you load and take ref to its allocator. Some other programs may also
> have access to the map value and may concurrently change the kptr
> field (xchg and even release it). How do we safely do a refcount_inc
> of its allocator?

Fair question. You can think of that pointer as immutable for the entire time that
the allocator is able to interact with the object. Once the object makes it onto a
freelist, it won't be released until an rcu gp has elapsed. Therefore, the first time
that value can change - when we return the object to the global kmalloc pool - it has
provably no bpf-side concurrent observers.

Alexei, please correct me if I misunderstood how the design is supposed to work.

> 
> For the second problem, consider this:
> obj = bpf_obj_alloc(&alloc_map, ...);
> inner_map = bpf_map_lookup_elem(&map_in_map, ...);
> map_val = bpf_map_lookup_elem(inner_map, ...);
> kptr_xchg(&map_val->kptr, obj);
> 
> Now delete the entry having that inner_map, but keep its fd open.
> Unload the program, since it is map-in-map, no way to fill used_allocators.
> alloc_map is freed, releases reference on allocator, allocator is freed.
> Now close(inner_map_fd), inner_map is free. Either bad unit_free or memory leak.
> Is there a way to prevent this in your scheme?

This is fair, inner maps not being tracked in used_maps is a wrench in that plan. 

> -
> 
> I had another idea, but it's not _completely_ 0 overhead. Heavy
> prototyping so I might be missing corner cases.
> It is to take reference on each allocation and deallocation. Yes,
> naive and slow if using atomics, but instead we can use percpu_ref
> instead of atomic refcount for the allocator. percpu_ref_get/put on
> each unit_alloc/unit_free.
> The problem though is that once initial reference is killed, it
> downgrades to atomic, which will kill performance. So we need to be
> smart about how that initial reference is managed.
> My idea is that the initial ref is taken and killed by the allocator
> bpf_map pinning the allocator. Once that bpf_map is gone, you cannot
> do any more allocations anyway (since you need to pass the map pointer
> to bpf_obj_alloc), so once it downgrades to atomics at that point we
> will only be releasing the references after freeing its allocated
> objects. Yes, then the free path becomes a bit costly after the
> allocator map is gone.
> 
> We might be able to remove the cost on free path as well using the
> used_allocators scheme from above (to delay percpu_ref_kill), but it
> is not clear how to safely increment the ref of the allocator from map
> value...

As explained above, the values are already rcu-protected, so we can use that to
coordinate refcounting of the allocator. That said, percpu_ref could work (I was
considering something similar within the allocator itself) but I'm not convinced
about the cost. My concern is that once it becomes atomic_t, it erases the benefits
of all the work in the allocator that maintains percpu data structures.

I wonder if the allocator should maintain percpu live counts (with underflow for
unbalanced alloc-free pairs on a cpu) in its percpu structures. Then, we can have
explicit "sum up all the counts to discover if you should be destroyed" calls. If we
keep the used_allocators scheme, these calls can be inserted at program unload for
maps in used_maps and at map free time for maps that escape that mechanism - the map goes
over all its used_allocators and has them confirm the liveness count is > 0.
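
Roughly what I have in mind (illustrative sketch only, none of this exists
in bpf_mem_alloc today):

#include <linux/cpumask.h>
#include <linux/percpu.h>

/* Per-cpu live counts. Any single cpu's count may go negative when an
 * object is freed on a different cpu than it was allocated on; only the
 * sum across all cpus is meaningful.
 */
struct cache_counts {
        long live;
        /* ... freelists ... */
};

struct alloc_sketch {
        struct cache_counts __percpu *caches;
};

static void count_alloc(struct alloc_sketch *a)
{
        this_cpu_inc(a->caches->live);
}

static void count_free(struct alloc_sketch *a)
{
        this_cpu_dec(a->caches->live);
}

/* The explicit "sum up all the counts" call, e.g. at prog unload for maps
 * in used_maps, or at map free time for maps that escaped that mechanism.
 */
static bool alloc_sketch_is_idle(struct alloc_sketch *a)
{
        long sum = 0;
        int cpu;

        for_each_possible_cpu(cpu)
                sum += per_cpu_ptr(a->caches, cpu)->live;
        return sum == 0;
}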

I think doing it this way we cover the hole with map-in-map without regressing any
path. 

Thoughts?

> 
> wdyt?
> 
> > Overall, we think this handles all the nasty corners - objects escaping into
> > kfuncs/helpers when they shouldn't, pinned maps containing pointers to allocations,
> > programs accessing multiple allocators having deterministic freelist behaviors -
> > while keeping the API and complexity sane. The used_allocators approach can certainly
> > be less conservative (or can be even precise) but for a v1 that's probably overkill.
> > 
> > Please, feel free to shoot holes in this design! We tried to capture everything but
> > I'd love confirmation that we didn't miss anything.
> > 
> > --Delyan


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular
  2022-08-26  4:03   ` Kumar Kartikeya Dwivedi
  2022-08-29 21:23     ` Delyan Kratunov
@ 2022-08-29 21:29     ` Delyan Kratunov
  2022-08-29 22:07       ` Kumar Kartikeya Dwivedi
  1 sibling, 1 reply; 59+ messages in thread
From: Delyan Kratunov @ 2022-08-29 21:29 UTC (permalink / raw)
  To: memxor
  Cc: tj, andrii, joannelkoong, davem, daniel, Dave Marchevsky,
	alexei.starovoitov, Kernel Team, bpf

Thanks for taking a look, Kumar!

On Fri, 2022-08-26 at 06:03 +0200, Kumar Kartikeya Dwivedi wrote:
> > 
> > On Thu, 25 Aug 2022 at 02:56, Delyan Kratunov <delyank@fb.com> wrote:
> > > > 
> > > > Alexei and I spent some time today going back and forth on what the uapi to this
> > > > allocator should look like in a BPF program. To both of our surprise, the problem
> > > > space became far more complicated than we anticipated.
> > > > 
> > > > There are three primary problems we have to solve:
> > > > 1) Knowing which allocator an object came from, so we can safely reclaim it when
> > > > necessary (e.g., freeing a map).
> > > > 2) Type confusion between local and kernel types. (I.e., a program allocating kernel
> > > > types and passing them to helpers/kfuncs that don't expect them). This is especially
> > > > important because the existing kptr mechanism assumes kernel types everywhere.
> > 
> > Why is the btf_is_kernel(reg->btf) check not enough to distinguish
> > local vs kernel kptr?

Answered below.

> > We add that wherever kfunc/helpers verify the PTR_TO_BTF_ID right now.
> > 
> > Fun fact: I added a similar check on purpose in map_kptr_match_type,
> > since Alexei mentioned back then he was working on a local type
> > allocator, so forgetting to add it later would have been a problem.
> > 
> > > > 3) Allocated objects lifetimes, allocator refcounting, etc. It all gets very hairy
> > > > when you allow allocated objects in pinned maps.
> > > > 
> > > > This is the proposed design that we landed on:
> > > > 
> > > > 1. Allocators get their own MAP_TYPE_ALLOCATOR, so you can specify initial capacity
> > > > at creation time. Value_size > 0 takes the kmem_cache path. Probably with
> > > > btf_value_type_id enforcement for the kmem_cache path.
> > > > 
> > > > 2. The helper APIs are just bpf_obj_alloc(bpf_map *, bpf_core_type_id_local(struct
> > > > foo)) and bpf_obj_free(void *). Note that obj_free() only takes an object pointer.
> > > > 
> > > > 3. To avoid mixing BTF type domains, a new type tag (provisionally __kptr_local)
> > > > annotates fields that can hold values with verifier type `PTR_TO_BTF_ID |
> > > > BTF_ID_LOCAL`. obj_alloc only ever returns these local kptrs and only ever resolves
> > > > against program-local btf (in the verifier, at runtime it only gets an allocation
> > > > size).
> > 
> > This is ok too, but I think just gating everywhere with btf_is_kernel
> > would be fine as well.


Yeah, I can get behind not using BTF_LOCAL_ID as a type flag and just encoding that
in the btf field of the register/stack slot/kptr/helper proto. That said, we still
need the new type tag to tell the map btf parsing code to use the local btf in the
kptr descriptor.

> > 
> > > > 3.1. If eventually we need to pass these objects to kfuncs/helpers, we can introduce
> > > > a new bpf_obj_export helper that takes a PTR_TO_LOCAL_BTF_ID and returns the
> > > > corresponding PTR_TO_BTF_ID, after verifying against an allowlist of some kind. This
> > 
> > It would be fine to allow passing if it is just plain data (e.g. what
> > scalar_struct check does for kfuncs).
> > There we had the issue where it can take PTR_TO_MEM, PTR_TO_BTF_ID,
> > etc. so it was necessary to restrict the kind of type to LCD.
> > 
> > But we don't have to do it from day 1, just listing what should be ok.

That's a good call, I'll add it to the initial can-transition-to-kernel-kptr logic.

> > 
> > > > would be the only place these objects can leak out of bpf land. If there's no runtime
> > > > aspect (and there likely wouldn't be), we might consider doing this transparently,
> > > > still against an allowlist of types.
> > > > 
> > > > 4. To ensure the allocator stays alive while objects from it are alive, we must be
> > > > able to identify which allocator each __kptr_local pointer came from, and we must
> > > > keep the refcount up while any such values are alive. One concern here is that doing
> > > > the refcount manipulation in kptr_xchg would be too expensive. The proposed solution
> > > > is to:
> > > > 4.1 Keep a struct bpf_mem_alloc* in the header before the returned object pointer
> > > > from bpf_mem_alloc(). This way we never lose track which bpf_mem_alloc to return the
> > > > object to and can simplify the bpf_obj_free() call.
> > > > 4.2. Tracking used_allocators in each bpf_map. When unloading a program, we would
> > > > walk all maps that the program has access to (that have kptr_local fields), walk each
> > > > value and ensure that any allocators not already in the map's used_allocators are
> > > > refcount_inc'd and added to the list. Do note that allocators are also kept alive by
> > > > their bpf_map wrapper but after that's gone, used_allocators is the main mechanism.
> > > > Once the bpf_map is gone, the allocator cannot be used to allocate new objects, we
> > > > can only return objects to it.
> > > > 4.3. On map free, we walk and obj_free() all the __kptr_local fields, then
> > > > refcount_dec all the used_allocators.
> > > > 
> > 
> > So to summarize your approach:
> > Each allocation has a bpf_mem_alloc pointer before it to track its
> > owner allocator.
> > We know used_maps of each prog, so during unload of program, walk all
> > local kptrs in each used_maps map values, and that map takes a
> > reference to the allocator stashing it in used_allocators list,
> > because prog is going to relinquish its ref to allocator_map (which if
> > it were the last one would release allocator reference as well for
> > local kptrs held by those maps).
> > Once prog is gone, the allocator is kept alive by other maps holding
> > objects allocated from it. References to the allocator are taken
> > lazily when required.
> > Did I get it right?

That's correct!

> > 
> > I see two problems: the first is concurrency. When walking each value,
> > it is going to be hard to ensure the kptr field remains stable while
> > you load and take ref to its allocator. Some other programs may also
> > have access to the map value and may concurrently change the kptr
> > field (xchg and even release it). How do we safely do a refcount_inc
> > of its allocator?

Fair question. You can think of that pointer as immutable for the entire time that
the allocator is able to interact with the object. Once the object makes it on a
freelist, it won't be released until an rcu gp has elapsed. Therefore, the first time
that value can change - when we return the object to the global kmalloc pool - it has
provably no bpf-side concurrent observers.

Alexei, please correct me if I misunderstood how the design is supposed to work.

> > 
> > For the second problem, consider this:
> > obj = bpf_obj_alloc(&alloc_map, ...);
> > inner_map = bpf_map_lookup_elem(&map_in_map, ...);
> > map_val = bpf_map_lookup_elem(inner_map, ...);
> > kptr_xchg(&map_val->kptr, obj);
> > 
> > Now delete the entry having that inner_map, but keep its fd open.
> > Unload the program, since it is map-in-map, no way to fill used_allocators.
> > alloc_map is freed, releases reference on allocator, allocator is freed.
> > Now close(inner_map_fd), inner_map is free. Either bad unit_free or memory leak.
> > Is there a way to prevent this in your scheme?

This is fair, inner maps not being tracked in used_maps is a wrench in that plan.
However, we can have the parent map propagate its used_allocators on inner map
removal.

> > -
> > 
> > I had another idea, but it's not _completely_ 0 overhead. Heavy
> > prototyping so I might be missing corner cases.
> > It is to take reference on each allocation and deallocation. Yes,
> > naive and slow if using atomics, but instead we can use percpu_ref
> > instead of atomic refcount for the allocator. percpu_ref_get/put on
> > each unit_alloc/unit_free.
> > The problem though is that once initial reference is killed, it
> > downgrades to atomic, which will kill performance. So we need to be
> > smart about how that initial reference is managed.
> > My idea is that the initial ref is taken and killed by the allocator
> > bpf_map pinning the allocator. Once that bpf_map is gone, you cannot
> > do any more allocations anyway (since you need to pass the map pointer
> > to bpf_obj_alloc), so once it downgrades to atomics at that point we
> > will only be releasing the references after freeing its allocated
> > objects. Yes, then the free path becomes a bit costly after the
> > allocator map is gone.
> > 
> > We might be able to remove the cost on free path as well using the
> > used_allocators scheme from above (to delay percpu_ref_kill), but it
> > is not clear how to safely increment the ref of the allocator from map
> > value...

As explained above, the values are already rcu-protected, so we can use that to
coordinate refcounting of the allocator. That said, percpu_ref could work (I was
considering something similar within the allocator itself) but I'm not convinced the
cost is right. My main concern would be that once it becomes atomic_t, it erases the
benefits of all the work in the allocator that maintains percpu data structures.

If we want to go down this path, the allocator can maintain percpu live counts (with
underflow for unbalanced alloc-free pairs on a cpu) in its percpu structures. Then,
we can have explicit "sum up all the counts to discover if you should be destroyed"
calls. If we keep the used_allocators scheme, these calls can be inserted at program
unload for maps in used_maps and at map free time for maps that escape that
mechanism.

Or, we just extend the map-in-map mechanism to propagate used_allocators as needed.
There are nice debug properties of the allocator knowing the liveness counts but we
don't have to put them on the path to correctness.

> > 
> > wdyt?
> > 
> > > > Overall, we think this handles all the nasty corners - objects escaping into
> > > > kfuncs/helpers when they shouldn't, pinned maps containing pointers to allocations,
> > > > programs accessing multiple allocators having deterministic freelist behaviors -
> > > > while keeping the API and complexity sane. The used_allocators approach can certainly
> > > > be less conservative (or can be even precise) but for a v1 that's probably overkill.
> > > > 
> > > > Please, feel free to shoot holes in this design! We tried to capture everything but
> > > > I'd love confirmation that we didn't miss anything.
> > > > 
> > > > --Delyan



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular
  2022-08-29 21:29     ` Delyan Kratunov
@ 2022-08-29 22:07       ` Kumar Kartikeya Dwivedi
  2022-08-29 23:18         ` Delyan Kratunov
  0 siblings, 1 reply; 59+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-29 22:07 UTC (permalink / raw)
  To: Delyan Kratunov
  Cc: tj, andrii, joannelkoong, davem, daniel, Dave Marchevsky,
	alexei.starovoitov, Kernel Team, bpf

On Mon, 29 Aug 2022 at 23:29, Delyan Kratunov <delyank@fb.com> wrote:
>
> Thanks for taking a look, Kumar!
>
> On Fri, 2022-08-26 at 06:03 +0200, Kumar Kartikeya Dwivedi wrote:
> > >
> > > On Thu, 25 Aug 2022 at 02:56, Delyan Kratunov <delyank@fb.com> wrote:
> > > > >
> > > > > Alexei and I spent some time today going back and forth on what the uapi to this
> > > > > allocator should look like in a BPF program. To both of our surprise, the problem
> > > > > space became far more complicated than we anticipated.
> > > > >
> > > > > There are three primary problems we have to solve:
> > > > > 1) Knowing which allocator an object came from, so we can safely reclaim it when
> > > > > necessary (e.g., freeing a map).
> > > > > 2) Type confusion between local and kernel types. (I.e., a program allocating kernel
> > > > > types and passing them to helpers/kfuncs that don't expect them). This is especially
> > > > > important because the existing kptr mechanism assumes kernel types everywhere.
> > >
> > > Why is the btf_is_kernel(reg->btf) check not enough to distinguish
> > > local vs kernel kptr?
>
> Answered below.
>
> > > We add that wherever kfunc/helpers verify the PTR_TO_BTF_ID right now.
> > >
> > > Fun fact: I added a similar check on purpose in map_kptr_match_type,
> > > since Alexei mentioned back then he was working on a local type
> > > allocator, so forgetting to add it later would have been a problem.
> > >
> > > > > 3) Allocated objects lifetimes, allocator refcounting, etc. It all gets very hairy
> > > > > when you allow allocated objects in pinned maps.
> > > > >
> > > > > This is the proposed design that we landed on:
> > > > >
> > > > > 1. Allocators get their own MAP_TYPE_ALLOCATOR, so you can specify initial capacity
> > > > > at creation time. Value_size > 0 takes the kmem_cache path. Probably with
> > > > > btf_value_type_id enforcement for the kmem_cache path.
> > > > >
> > > > > 2. The helper APIs are just bpf_obj_alloc(bpf_map *, bpf_core_type_id_local(struct
> > > > > foo)) and bpf_obj_free(void *). Note that obj_free() only takes an object pointer.
> > > > >
> > > > > 3. To avoid mixing BTF type domains, a new type tag (provisionally __kptr_local)
> > > > > annotates fields that can hold values with verifier type `PTR_TO_BTF_ID |
> > > > > BTF_ID_LOCAL`. obj_alloc only ever returns these local kptrs and only ever resolves
> > > > > against program-local btf (in the verifier, at runtime it only gets an allocation
> > > > > size).
> > >
> > > This is ok too, but I think just gating everywhere with btf_is_kernel
> > > would be fine as well.
>
>
> Yeah, I can get behind not using BTF_LOCAL_ID as a type flag and just encoding that
> in the btf field of the register/stack slot/kptr/helper proto. That said, we still
> need the new type tag to tell the map btf parsing code to use the local btf in the
> kptr descriptor.
>

Agreed, the new __local type tag looks necessary to make it search in
map BTF instead.

> > >
> > > > > 3.1. If eventually we need to pass these objects to kfuncs/helpers, we can introduce
> > > > > a new bpf_obj_export helper that takes a PTR_TO_LOCAL_BTF_ID and returns the
> > > > > corresponding PTR_TO_BTF_ID, after verifying against an allowlist of some kind. This
> > >
> > > It would be fine to allow passing if it is just plain data (e.g. what
> > > scalar_struct check does for kfuncs).
> > > There we had the issue where it can take PTR_TO_MEM, PTR_TO_BTF_ID,
> > > etc. so it was necessary to restrict the kind of type to LCD.
> > >
> > > But we don't have to do it from day 1, just listing what should be ok.
>
> That's a good call, I'll add it to the initial can-transition-to-kernel-kptr logic.
>
> > >
> > > > > would be the only place these objects can leak out of bpf land. If there's no runtime
> > > > > aspect (and there likely wouldn't be), we might consider doing this transparently,
> > > > > still against an allowlist of types.
> > > > >
> > > > > 4. To ensure the allocator stays alive while objects from it are alive, we must be
> > > > > able to identify which allocator each __kptr_local pointer came from, and we must
> > > > > keep the refcount up while any such values are alive. One concern here is that doing
> > > > > the refcount manipulation in kptr_xchg would be too expensive. The proposed solution
> > > > > is to:
> > > > > 4.1 Keep a struct bpf_mem_alloc* in the header before the returned object pointer
> > > > > from bpf_mem_alloc(). This way we never lose track which bpf_mem_alloc to return the
> > > > > object to and can simplify the bpf_obj_free() call.
> > > > > 4.2. Tracking used_allocators in each bpf_map. When unloading a program, we would
> > > > > walk all maps that the program has access to (that have kptr_local fields), walk each
> > > > > value and ensure that any allocators not already in the map's used_allocators are
> > > > > refcount_inc'd and added to the list. Do note that allocators are also kept alive by
> > > > > their bpf_map wrapper but after that's gone, used_allocators is the main mechanism.
> > > > > Once the bpf_map is gone, the allocator cannot be used to allocate new objects, we
> > > > > can only return objects to it.
> > > > > 4.3. On map free, we walk and obj_free() all the __kptr_local fields, then
> > > > > refcount_dec all the used_allocators.
> > > > >
> > >
> > > So to summarize your approach:
> > > Each allocation has a bpf_mem_alloc pointer before it to track its
> > > owner allocator.
> > > We know used_maps of each prog, so during unload of program, walk all
> > > local kptrs in each used_maps map values, and that map takes a
> > > reference to the allocator stashing it in used_allocators list,
> > > because prog is going to relinquish its ref to allocator_map (which if
> > > it were the last one would release allocator reference as well for
> > > local kptrs held by those maps).
> > > Once prog is gone, the allocator is kept alive by other maps holding
> > > objects allocated from it. References to the allocator are taken
> > > lazily when required.
> > > Did I get it right?
>
> That's correct!
>
> > >
> > > I see two problems: the first is concurrency. When walking each value,
> > > it is going to be hard to ensure the kptr field remains stable while
> > > you load and take ref to its allocator. Some other programs may also
> > > have access to the map value and may concurrently change the kptr
> > > field (xchg and even release it). How do we safely do a refcount_inc
> > > of its allocator?
>
> Fair question. You can think of that pointer as immutable for the entire time that
> the allocator is able to interact with the object. Once the object makes it on a
> freelist, it won't be released until an rcu gp has elapsed. Therefore, the first time
> that value can change - when we return the object to the global kmalloc pool - it has
> provably no bpf-side concurrent observers.
>

I don't think that assumption will hold. Requiring RCU protection for
all local kptrs means allocator cache reuse becomes harder, as
elements are stuck in the freelist until the next grace period. It
necessitates the use of larger caches.
For some use cases where they can tolerate reuse, it might not be
optimal. IMO the allocator should be independent of how the lifetime
of elements is managed.

That said, even if you assume RCU protection, that still doesn't
address the real problem. Yes, you can access the value without
worrying about it moving to another map, but consider this case:
During prog unloading,
populate_used_allocators -> the map walks its map_values and tries to take a
reference to a local kptr whose backing allocator is A.
It loads the kptr; meanwhile some other prog sharing access to the map value
exchanges (kptr_xchg) another pointer into that field.
Now you take a reference to allocator A in used_allocators, while the actual
value in the map belongs to allocator B.

So you either have to cmpxchg and then retry if it fails (which is not
a wait-free operation, and honestly not great imo), or, if you don't
handle this:
someone moved an allocated local kptr backed by B into your map, but
you don't hold a reference to it. That other program may release the
allocator map -> allocator, and then when this map is destroyed,
unit_free will be a use-after-free of the bpf_mem_alloc *.

I don't see an easy way around these kinds of problems. And this is
just one specific scenario.

> Alexei, please correct me if I misunderstood how the design is supposed to work.
>
> > >
> > > For the second problem, consider this:
> > > obj = bpf_obj_alloc(&alloc_map, ...);
> > > inner_map = bpf_map_lookup_elem(&map_in_map, ...);
> > > map_val = bpf_map_lookup_elem(inner_map, ...);
> > > kptr_xchg(&map_val->kptr, obj);
> > >
> > > Now delete the entry having that inner_map, but keep its fd open.
> > > Unload the program, since it is map-in-map, no way to fill used_allocators.
> > > alloc_map is freed, releases reference on allocator, allocator is freed.
> > > Now close(inner_map_fd), inner_map is free. Either bad unit_free or memory leak.
> > > Is there a way to prevent this in your scheme?
>
> This is fair, inner maps not being tracked in used_maps is a wrench in that plan.
> However, we can have the parent map propagate its used_allocators on inner map
> removal.
>

But used_allocators are not populated until unload? inner_map removal
can happen while the program is loaded/attached.
Or will you populate it before unloading, every time an inner_map is
removed? Then that would be very costly for a bpf_map_delete_elem...

> > > -
> > >
> > > I had another idea, but it's not _completely_ 0 overhead. Heavy
> > > prototyping so I might be missing corner cases.
> > > It is to take reference on each allocation and deallocation. Yes,
> > > naive and slow if using atomics, but instead we can use percpu_ref
> > > instead of atomic refcount for the allocator. percpu_ref_get/put on
> > > each unit_alloc/unit_free.
> > > The problem though is that once initial reference is killed, it
> > > downgrades to atomic, which will kill performance. So we need to be
> > > smart about how that initial reference is managed.
> > > My idea is that the initial ref is taken and killed by the allocator
> > > bpf_map pinning the allocator. Once that bpf_map is gone, you cannot
> > > do any more allocations anyway (since you need to pass the map pointer
> > > to bpf_obj_alloc), so once it downgrades to atomics at that point we
> > > will only be releasing the references after freeing its allocated
> > > objects. Yes, then the free path becomes a bit costly after the
> > > allocator map is gone.
> > >
> > > We might be able to remove the cost on free path as well using the
> > > used_allocators scheme from above (to delay percpu_ref_kill), but it
> > > is not clear how to safely increment the ref of the allocator from map
> > > value...
>
> As explained above, the values are already rcu-protected, so we can use that to
> coordinate refcounting of the allocator. That said, percpu_ref could work (I was
> considering something similar within the allocator itself) but I'm not convinced the
> cost is right. My main concern would be that once it becomes atomic_t, it erases the
> benefits of all the work in the allocator that maintains percpu data structures.
>
> If we want to go down this path, the allocator can maintain percpu live counts (with
> underflow for unbalanced alloc-free pairs on a cpu) in its percpu structures. Then,
> we can have explicit "sum up all the counts to discover if you should be destroyed"
> calls. If we keep the used_allocators scheme, these calls can be inserted at program
> unload for maps in used_maps and at map free time for maps that escape that
> mechanism.

Yes, it would be a good idea to put the percpu refcount for this case
inside the already percpu bpf_mem_cache struct, since that will be
much better than allocating it separately. The increment will then be
a 100% cache hit.
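
Roughly something like this (only a sketch: live_cnt and the idle
check are hypothetical, and I'm assuming bpf_mem_alloc keeps its
per-cpu bpf_mem_cache at ma->cache as in the patchset):

struct bpf_mem_cache {
	/* ... existing fields from the patchset ... */
	long live_cnt;	/* hypothetical: allocs minus frees on this cpu,
			 * may go negative when an object is freed on a
			 * different cpu than it was allocated on
			 */
};

static void inc_live(struct bpf_mem_cache *c)	/* from unit_alloc() */
{
	c->live_cnt++;
}

static void dec_live(struct bpf_mem_cache *c)	/* from unit_free() */
{
	c->live_cnt--;
}

/* "Sum up all the counts". A simplistic sum; real code would have to
 * sequence this against concurrent frees, e.g. from an RCU callback,
 * once no allocator map holds the initial reference any more.
 */
static bool bpf_mem_alloc_idle(struct bpf_mem_alloc *ma)
{
	long sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		sum += per_cpu_ptr(ma->cache, cpu)->live_cnt;
	return sum == 0;
}

Only the sum across cpus is meaningful, the per-cpu values are not.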

The main question is how this "sum up all the counts" operation needs
to be done. Once that initial reference from the bpf_map is gone, you
need to track the final owner who will be responsible for releasing
the allocator. You will need to do something similar to percpu_ref's
atomic count upgrade, unless I'm missing something.

Once you establish that used_allocators cannot be safely populated on
unload (which you can correct me on), the only utility I see for it is
delaying the atomic upgrade for this idea.

So another approach (though I don't like it too much):
One way to delay the upgrade is that, when a program has allocator
maps and other normal maps in used_maps, on unload every map that has
local kptrs speculatively takes a ref on the allocators pinned by this
prog's allocator maps, assuming allocations from those allocator maps
are likely to end up in these maps.

The same goes for other progs that have different allocator maps but
share these maps.

It is not very precise, but until those maps are gone it delays
release of the allocator (we can empty all percpu caches to save
memory once bpf_map pinning the allocator is gone, because allocations
are not going to be served). But it allows unit_free to be relatively
less costly as long as those 'candidate' maps are around.



>
> Or, we just extend the map-in-map mechanism to propagate used_allocators as needed.
> There are nice debug properties of the allocator knowing the liveness counts but we
> don't have to put them on the path to correctness.
>
> > >
> > > wdyt?
> > >
> > > > > Overall, we think this handles all the nasty corners - objects escaping into
> > > > > kfuncs/helpers when they shouldn't, pinned maps containing pointers to allocations,
> > > > > programs accessing multiple allocators having deterministic freelist behaviors -
> > > > > while keeping the API and complexity sane. The used_allocators approach can certainly
> > > > > be less conservative (or can be even precise) but for a v1 that's probably overkill.
> > > > >
> > > > > Please, feel free to shoot holes in this design! We tried to capture everything but
> > > > > I'd love confirmation that we didn't miss anything.
> > > > >
> > > > > --Delyan
>
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular
  2022-08-29 22:07       ` Kumar Kartikeya Dwivedi
@ 2022-08-29 23:18         ` Delyan Kratunov
  2022-08-29 23:45           ` Alexei Starovoitov
  2022-08-30  0:17           ` Kumar Kartikeya Dwivedi
  0 siblings, 2 replies; 59+ messages in thread
From: Delyan Kratunov @ 2022-08-29 23:18 UTC (permalink / raw)
  To: memxor
  Cc: davem, joannelkoong, andrii, tj, daniel, Dave Marchevsky,
	alexei.starovoitov, Kernel Team, bpf

On Tue, 2022-08-30 at 00:07 +0200, Kumar Kartikeya Dwivedi wrote:
> 
> On Mon, 29 Aug 2022 at 23:29, Delyan Kratunov <delyank@fb.com> wrote:
> > 
> > Thanks for taking a look, Kumar!
> > 
> > On Fri, 2022-08-26 at 06:03 +0200, Kumar Kartikeya Dwivedi wrote:
> > > > 
> > > > On Thu, 25 Aug 2022 at 02:56, Delyan Kratunov <delyank@fb.com> wrote:
> > > > > > 
> > > > > > Alexei and I spent some time today going back and forth on what the uapi to this
> > > > > > allocator should look like in a BPF program. To both of our surprise, the problem
> > > > > > space became far more complicated than we anticipated.
> > > > > > 
> > > > > > There are three primary problems we have to solve:
> > > > > > 1) Knowing which allocator an object came from, so we can safely reclaim it when
> > > > > > necessary (e.g., freeing a map).
> > > > > > 2) Type confusion between local and kernel types. (I.e., a program allocating kernel
> > > > > > types and passing them to helpers/kfuncs that don't expect them). This is especially
> > > > > > important because the existing kptr mechanism assumes kernel types everywhere.
> > > > 
> > > > Why is the btf_is_kernel(reg->btf) check not enough to distinguish
> > > > local vs kernel kptr?
> > 
> > Answered below.
> > 
> > > > We add that wherever kfunc/helpers verify the PTR_TO_BTF_ID right now.
> > > > 
> > > > Fun fact: I added a similar check on purpose in map_kptr_match_type,
> > > > since Alexei mentioned back then he was working on a local type
> > > > allocator, so forgetting to add it later would have been a problem.
> > > > 
> > > > > > 3) Allocated objects lifetimes, allocator refcounting, etc. It all gets very hairy
> > > > > > when you allow allocated objects in pinned maps.
> > > > > > 
> > > > > > This is the proposed design that we landed on:
> > > > > > 
> > > > > > 1. Allocators get their own MAP_TYPE_ALLOCATOR, so you can specify initial capacity
> > > > > > at creation time. Value_size > 0 takes the kmem_cache path. Probably with
> > > > > > btf_value_type_id enforcement for the kmem_cache path.
> > > > > > 
> > > > > > 2. The helper APIs are just bpf_obj_alloc(bpf_map *, bpf_core_type_id_local(struct
> > > > > > foo)) and bpf_obj_free(void *). Note that obj_free() only takes an object pointer.
> > > > > > 
> > > > > > 3. To avoid mixing BTF type domains, a new type tag (provisionally __kptr_local)
> > > > > > annotates fields that can hold values with verifier type `PTR_TO_BTF_ID |
> > > > > > BTF_ID_LOCAL`. obj_alloc only ever returns these local kptrs and only ever resolves
> > > > > > against program-local btf (in the verifier, at runtime it only gets an allocation
> > > > > > size).
> > > > 
> > > > This is ok too, but I think just gating everywhere with btf_is_kernel
> > > > would be fine as well.
> > 
> > 
> > Yeah, I can get behind not using BTF_LOCAL_ID as a type flag and just encoding that
> > in the btf field of the register/stack slot/kptr/helper proto. That said, we still
> > need the new type tag to tell the map btf parsing code to use the local btf in the
> > kptr descriptor.
> > 
> 
> Agreed, the new __local type tag looks necessary to make it search in
> map BTF instead.
> 
> > > > 
> > > > > > 3.1. If eventually we need to pass these objects to kfuncs/helpers, we can introduce
> > > > > > a new bpf_obj_export helper that takes a PTR_TO_LOCAL_BTF_ID and returns the
> > > > > > corresponding PTR_TO_BTF_ID, after verifying against an allowlist of some kind. This
> > > > 
> > > > It would be fine to allow passing if it is just plain data (e.g. what
> > > > scalar_struct check does for kfuncs).
> > > > There we had the issue where it can take PTR_TO_MEM, PTR_TO_BTF_ID,
> > > > etc. so it was necessary to restrict the kind of type to LCD.
> > > > 
> > > > But we don't have to do it from day 1, just listing what should be ok.
> > 
> > That's a good call, I'll add it to the initial can-transition-to-kernel-kptr logic.
> > 
> > > > 
> > > > > > would be the only place these objects can leak out of bpf land. If there's no runtime
> > > > > > aspect (and there likely wouldn't be), we might consider doing this transparently,
> > > > > > still against an allowlist of types.
> > > > > > 
> > > > > > 4. To ensure the allocator stays alive while objects from it are alive, we must be
> > > > > > able to identify which allocator each __kptr_local pointer came from, and we must
> > > > > > keep the refcount up while any such values are alive. One concern here is that doing
> > > > > > the refcount manipulation in kptr_xchg would be too expensive. The proposed solution
> > > > > > is to:
> > > > > > 4.1 Keep a struct bpf_mem_alloc* in the header before the returned object pointer
> > > > > > from bpf_mem_alloc(). This way we never lose track which bpf_mem_alloc to return the
> > > > > > object to and can simplify the bpf_obj_free() call.
> > > > > > 4.2. Tracking used_allocators in each bpf_map. When unloading a program, we would
> > > > > > walk all maps that the program has access to (that have kptr_local fields), walk each
> > > > > > value and ensure that any allocators not already in the map's used_allocators are
> > > > > > refcount_inc'd and added to the list. Do note that allocators are also kept alive by
> > > > > > their bpf_map wrapper but after that's gone, used_allocators is the main mechanism.
> > > > > > Once the bpf_map is gone, the allocator cannot be used to allocate new objects, we
> > > > > > can only return objects to it.
> > > > > > 4.3. On map free, we walk and obj_free() all the __kptr_local fields, then
> > > > > > refcount_dec all the used_allocators.
> > > > > > 
> > > > 
> > > > So to summarize your approach:
> > > > Each allocation has a bpf_mem_alloc pointer before it to track its
> > > > owner allocator.
> > > > We know used_maps of each prog, so during unload of program, walk all
> > > > local kptrs in each used_maps map values, and that map takes a
> > > > reference to the allocator stashing it in used_allocators list,
> > > > because prog is going to relinquish its ref to allocator_map (which if
> > > > it were the last one would release allocator reference as well for
> > > > local kptrs held by those maps).
> > > > Once prog is gone, the allocator is kept alive by other maps holding
> > > > objects allocated from it. References to the allocator are taken
> > > > lazily when required.
> > > > Did I get it right?
> > 
> > That's correct!
> > 
> > > > 
> > > > I see two problems: the first is concurrency. When walking each value,
> > > > it is going to be hard to ensure the kptr field remains stable while
> > > > you load and take ref to its allocator. Some other programs may also
> > > > have access to the map value and may concurrently change the kptr
> > > > field (xchg and even release it). How do we safely do a refcount_inc
> > > > of its allocator?
> > 
> > Fair question. You can think of that pointer as immutable for the entire time that
> > the allocator is able to interact with the object. Once the object makes it on a
> > freelist, it won't be released until an rcu gp has elapsed. Therefore, the first time
> > that value can change - when we return the object to the global kmalloc pool - it has
> > provably no bpf-side concurrent observers.
> > 
> 
> I don't think that assumption will hold. Requiring RCU protection for
> all local kptrs means allocator cache reuse becomes harder, as
> elements are stuck in freelist until the next grace period. It
> necessitates use of larger caches.
> For some use cases where they can tolerate reuse, it might not be
> optimal. IMO the allocator should be independent of how the lifetime
> of elements is managed.

All maps and allocations are already rcu-protected, we're not adding anything new
here. We're only relying on this rcu protection (c.f. the chain of
call_rcu_tasks_trace and call_rcu in the patchset) to prove that no program can
observe an allocator pointer that is corrupted or stale.

> 
> That said, even if you assume RCU protection, that still doesn't
> address the real problem. Yes, you can access the value without
> worrying about it moving to another map, but consider this case:
> Prog unloading,
> populate_used_allocators -> map walks map_values, tries to take
> reference to local kptr whose backing allocator is A.
> Loads kptr, meanwhile some other prog sharing access to the map value
> exchanges (kptr_xchg) another pointer into that field.
> Now you take reference to allocator A in used_allocators, while actual
> value in map is for allocator B.

This is fine. The algorithm is conservative; it may keep allocators around longer
than they're needed. Eventually there will come a time when this map is no longer
accessible to any program, at which point both allocators A and B will be released.

It is possible to make a more precise algorithm but given that this behavior is only
really a problem with (likely pinned) long-lived maps, it's imo not worth it for v1.

> 
> So you either have to cmpxchg and then retry if it fails (which is not
> a wait-free operation, and honestly not great imo), or if you don't
> handle this:
> Someone moved an allocated local kptr backed by B into your map, but
> you don't hold reference to it. 

You don't need a reference while something else is holding the allocator alive. The
references exist to extend the lifetime past the lifetimes of programs that can
directly reference the allocator.

> That other program may release
> allocator map -> allocator, 

The allocator map cannot be removed while it's in used_maps of the program. On
program unload, we'll add B to the used_allocators list, as a value from B is in the
map. Only then will the allocator map be releasable.

> and then when this map is destroyed, on
> unit_free it will be use-after-free of bpf_mem_alloc *.
> 
> I don't see an easy way around these kinds of problems. And this is
> just one specific scenario.
> 
> > Alexei, please correct me if I misunderstood how the design is supposed to work.
> > 
> > > > 
> > > > For the second problem, consider this:
> > > > obj = bpf_obj_alloc(&alloc_map, ...);
> > > > inner_map = bpf_map_lookup_elem(&map_in_map, ...);
> > > > map_val = bpf_map_lookup_elem(inner_map, ...);
> > > > kptr_xchg(&map_val->kptr, obj);
> > > > 
> > > > Now delete the entry having that inner_map, but keep its fd open.
> > > > Unload the program, since it is map-in-map, no way to fill used_allocators.
> > > > alloc_map is freed, releases reference on allocator, allocator is freed.
> > > > Now close(inner_map_fd), inner_map is free. Either bad unit_free or memory leak.
> > > > Is there a way to prevent this in your scheme?
> > 
> > This is fair, inner maps not being tracked in used_maps is a wrench in that plan.
> > However, we can have the parent map propagate its used_allocators on inner map
> > removal.
> > 
> 
> But used_allocators are not populated until unload? inner_map removal
> can happen while the program is loaded/attached.
> Or will you populate it before unloading, everytime during inner_map
> removal? Then that would be very costly for a bpf_map_delete_elem...

It's not free, granted, but it only kicks in if the map-of-maps has already acquired
a used_allocators list.

I'd be okay with handling map-of-map via liveness counts too.

> 
> > > > -
> > > > 
> > > > I had another idea, but it's not _completely_ 0 overhead. Heavy
> > > > prototyping so I might be missing corner cases.
> > > > It is to take reference on each allocation and deallocation. Yes,
> > > > naive and slow if using atomics, but instead we can use percpu_ref
> > > > instead of atomic refcount for the allocator. percpu_ref_get/put on
> > > > each unit_alloc/unit_free.
> > > > The problem though is that once initial reference is killed, it
> > > > downgrades to atomic, which will kill performance. So we need to be
> > > > smart about how that initial reference is managed.
> > > > My idea is that the initial ref is taken and killed by the allocator
> > > > bpf_map pinning the allocator. Once that bpf_map is gone, you cannot
> > > > do any more allocations anyway (since you need to pass the map pointer
> > > > to bpf_obj_alloc), so once it downgrades to atomics at that point we
> > > > will only be releasing the references after freeing its allocated
> > > > objects. Yes, then the free path becomes a bit costly after the
> > > > allocator map is gone.
> > > > 
> > > > We might be able to remove the cost on free path as well using the
> > > > used_allocators scheme from above (to delay percpu_ref_kill), but it
> > > > is not clear how to safely increment the ref of the allocator from map
> > > > value...
> > 
> > As explained above, the values are already rcu-protected, so we can use that to
> > coordinate refcounting of the allocator. That said, percpu_ref could work (I was
> > considering something similar within the allocator itself) but I'm not convinced the
> > cost is right. My main concern would be that once it becomes atomic_t, it erases the
> > benefits of all the work in the allocator that maintains percpu data structures.
> > 
> > If we want to go down this path, the allocator can maintain percpu live counts (with
> > underflow for unbalanced alloc-free pairs on a cpu) in its percpu structures. Then,
> > we can have explicit "sum up all the counts to discover if you should be destroyed"
> > calls. If we keep the used_allocators scheme, these calls can be inserted at program
> > unload for maps in used_maps and at map free time for maps that escape that
> > mechanism.
> 
> Yes, it would be a good idea to put the percpu refcount for this case
> inside the already percpu bpf_mem_cache struct, since that will be
> much better than allocating it separately. The increment will then be
> a 100% cache hit.
> 
> The main question is how this "sum up all the count" operation needs
> to be done. Once that initial reference of bpf_map is gone, you need
> to track the final owner who will be responsible to release the
> allocator. You will need to do something similar to percpu_ref's
> atomic count upgrade unless I'm missing something.

It's actually easier since we know we can limit the checks to only the there's-no-
reference-from-an-allocator-map case. We can also postpone the work to the rcu gp to
make it even easier to sequence with the individual elements' free()s.

> 
> Once you establish that used_allocators cannot be safely populated on
> unload (which you can correct me on), the only utility I see for it is
> delaying the atomic upgrade for this idea.
> 
> So another approach (though I don't like this too much):
> One solution to delay the upgrade can be that when you have allocator
> maps and other normal maps in used_maps, it always incs ref on the
> allocator pinned by allocator map on unload for maps that have local
> kptr, so each map having local kptrs speculatively takes ref on
> allocator maps of this prog, assuming allocations from that allocator
> map are more likely to be in these maps.
> 
> Same with other progs having different allocator map but sharing these maps.
> 
> It is not very precise, but until those maps are gone it delays
> release of the allocator (we can empty all percpu caches to save
> memory once bpf_map pinning the allocator is gone, because allocations
> are not going to be served). But it allows unit_free to be relatively
> less costly as long as those 'candidate' maps are around.

Yes, we considered this but it's much easier to get to pathological behaviors, by
just loading and unloading programs that can access an allocator in a loop. The
freelists being empty help but it's still quite easy to hold a lot of memory for
nothing. 

The pointer walk was proposed to prune most such pathological cases while still being
conservative enough to be easy to implement. Only races with the pointer walk can
extend the lifetime unnecessarily.

> 
> 
> 
> > 
> > Or, we just extend the map-in-map mechanism to propagate used_allocators as needed.
> > There are nice debug properties of the allocator knowing the liveness counts but we
> > don't have to put them on the path to correctness.
> > 
> > > > 
> > > > wdyt?
> > > > 
> > > > > > Overall, we think this handles all the nasty corners - objects escaping into
> > > > > > kfuncs/helpers when they shouldn't, pinned maps containing pointers to allocations,
> > > > > > programs accessing multiple allocators having deterministic freelist behaviors -
> > > > > > while keeping the API and complexity sane. The used_allocators approach can certainly
> > > > > > be less conservative (or can be even precise) but for a v1 that's probably overkill.
> > > > > > 
> > > > > > Please, feel free to shoot holes in this design! We tried to capture everything but
> > > > > > I'd love confirmation that we didn't miss anything.
> > > > > > 
> > > > > > --Delyan
> > 
> > 



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular
  2022-08-29 23:18         ` Delyan Kratunov
@ 2022-08-29 23:45           ` Alexei Starovoitov
  2022-08-30  0:20             ` Kumar Kartikeya Dwivedi
  2022-08-30  0:17           ` Kumar Kartikeya Dwivedi
  1 sibling, 1 reply; 59+ messages in thread
From: Alexei Starovoitov @ 2022-08-29 23:45 UTC (permalink / raw)
  To: Delyan Kratunov
  Cc: memxor, davem, joannelkoong, andrii, tj, daniel, Dave Marchevsky,
	Kernel Team, bpf

On Mon, Aug 29, 2022 at 4:18 PM Delyan Kratunov <delyank@fb.com> wrote:
>
> >
> > It is not very precise, but until those maps are gone it delays
> > release of the allocator (we can empty all percpu caches to save
> > memory once bpf_map pinning the allocator is gone, because allocations
> > are not going to be served). But it allows unit_free to be relatively
> > less costly as long as those 'candidate' maps are around.
>
> Yes, we considered this but it's much easier to get to pathological behaviors, by
> just loading and unloading programs that can access an allocator in a loop. The
> freelists being empty help but it's still quite easy to hold a lot of memory for
> nothing.
>
> The pointer walk was proposed to prune most such pathological cases while still being
> conservative enough to be easy to implement. Only races with the pointer walk can
> extend the lifetime unnecessarily.

I'm getting lost in this thread.

Here is my understanding so far:
We don't free kernel kptrs from map in release_uref,
but we should for local kptrs, since such objs are
not much different from timers.
So release_uref will xchg all such kptrs and free them
into the allocator without touching allocator's refcnt.
So there is no concurrency issue that Kumar was concerned about.
We might need two arrays though.
prog->used_allocators[] and map->used_allocators[]
The verifier would populate both at load time.
At prog unload dec refcnt in one array.
At map free dec refcnt in the other array.
Map-in-map insert/delete of new map would copy allocators[] from
outer map.
As the general suggestion to solve this problem I think
we really need to avoid run-time refcnt changes at alloc/free
even when they're per-cpu 'fast'.
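
Roughly like this (a sketch only, none of these helpers or fields
exist, the names are made up for illustration):

/* One such list would hang off bpf_prog_aux, another off bpf_map;
 * both filled by the verifier at load time.
 */
struct used_allocators {
	struct bpf_mem_alloc **allocs;
	u32 cnt;
};

static int used_allocators_add(struct used_allocators *ua,
			       struct bpf_mem_alloc *ma)
{
	struct bpf_mem_alloc **tmp;

	tmp = krealloc_array(ua->allocs, ua->cnt + 1, sizeof(*tmp),
			     GFP_KERNEL);
	if (!tmp)
		return -ENOMEM;
	ua->allocs = tmp;
	bpf_mem_alloc_get(ma);		/* hypothetical refcount helper */
	ua->allocs[ua->cnt++] = ma;
	return 0;
}

/* Called from prog unload for prog->used_allocators and from map free
 * for map->used_allocators. Map-in-map insert would copy the outer
 * map's list into the inner map, taking a ref on each entry.
 */
static void used_allocators_release(struct used_allocators *ua)
{
	u32 i;

	for (i = 0; i < ua->cnt; i++)
		bpf_mem_alloc_put(ua->allocs[i]);	/* hypothetical */
	kfree(ua->allocs);
	ua->allocs = NULL;
	ua->cnt = 0;
}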

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular
  2022-08-29 23:18         ` Delyan Kratunov
  2022-08-29 23:45           ` Alexei Starovoitov
@ 2022-08-30  0:17           ` Kumar Kartikeya Dwivedi
  1 sibling, 0 replies; 59+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-30  0:17 UTC (permalink / raw)
  To: Delyan Kratunov
  Cc: davem, joannelkoong, andrii, tj, daniel, Dave Marchevsky,
	alexei.starovoitov, Kernel Team, bpf

On Tue, 30 Aug 2022 at 01:18, Delyan Kratunov <delyank@fb.com> wrote:
>
> > [...]
> >
> > I don't think that assumption will hold. Requiring RCU protection for
> > all local kptrs means allocator cache reuse becomes harder, as
> > elements are stuck in freelist until the next grace period. It
> > necessitates use of larger caches.
> > For some use cases where they can tolerate reuse, it might not be
> > optimal. IMO the allocator should be independent of how the lifetime
> > of elements is managed.
>
> All maps and allocations are already rcu-protected, we're not adding new here. We're
> only relying on this rcu protection (c.f. the chain of call_rcu_task_trace and
> call_rcu in the patchset) to prove that no program can observe an allocator pointer
> that is corrupted or stale.
>
> >
> > That said, even if you assume RCU protection, that still doesn't
> > address the real problem. Yes, you can access the value without
> > worrying about it moving to another map, but consider this case:
> > Prog unloading,
> > populate_used_allocators -> map walks map_values, tries to take
> > reference to local kptr whose backing allocator is A.
> > Loads kptr, meanwhile some other prog sharing access to the map value
> > exchanges (kptr_xchg) another pointer into that field.
> > Now you take reference to allocator A in used_allocators, while actual
> > value in map is for allocator B.
>
> This is fine. The algorithm is conservative, it may keep allocators around longer
> than they're needed. Eventually there will exist a time that this map won't be
> accessible to any program, at which point both allocator A and B will be released.
>
> It is possible to make a more precise algorithm but given that this behavior is only
> really a problem with (likely pinned) long-lived maps, it's imo not worth it for v1.
>
> >
> > So you either have to cmpxchg and then retry if it fails (which is not
> > a wait-free operation, and honestly not great imo), or if you don't
> > handle this:
> > Someone moved an allocated local kptr backed by B into your map, but
> > you don't hold reference to it.
>
> You don't need a reference while something else is holding the allocator alive. The
> references exist to extend the lifetime past the lifetimes of programs that can
> directly reference the allocator.
>
> > That other program may release
> > allocator map -> allocator,
>
> The allocator map cannot be removed while it's in used_maps of the program. On
> program unload, we'll add B to the used_allocators list, as a value from B is in the
> map. Only then will the allocator map be releasable.
>

Ahh, so prog1 and prog2 both share map M, but not allocator maps A and
B, so if prog2 swaps a pointer from B into M while prog1 is unloading,
prog1 will take an unneeded ref to A (if it sees the kptr from A just
before the swap). But when prog2 unloads, it will then see that the
ptr value is from B, so it will also take a ref to B in M's
used_allocators. Then A stays alive a little while longer, but only
when this race happens, and there won't be any correctness problem as
both A and B are kept alive.

Makes sense, but IIUC this only takes care of this specific case. The
walk can also see NULL and miss taking a reference to A.

prog1 uses A, M
prog2 uses B, M
A and B are allocator maps, M is normal hashmap, M is shared between both.

prog1 is unloading:
M holds a kptr from A.
During the walk, just before the field is loaded, prog2 xchgs it to
NULL, so M does not take a ref to A. // 1
Right after the walk is done processing this field, prog2 xchgs the
pointer back in, so M now holds an object from A but did not take a
ref to A. // 2
prog1 unloads. M's used_allocators list is empty.

M is still kept alive by prog2.
prog2 has already moved the result of its kptr_xchg back into the map in step 2.
Hence, prog2's execution can terminate, which ends its RCU section.
Allocator A only waits for prior RCU readers, so eventually it can be freed.
Now prog2 runs again, starts a new RCU section, kptr_xchgs this ptr
out of M, and tries to free it. Allocator A is already gone.

If this is also taken care of, I'll only be poking at the code next
when you post it, so that I don't waste any more of your time. But
IIUC this can still cause issues, right?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular
  2022-08-29 23:45           ` Alexei Starovoitov
@ 2022-08-30  0:20             ` Kumar Kartikeya Dwivedi
  2022-08-30  0:26               ` Alexei Starovoitov
  0 siblings, 1 reply; 59+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-30  0:20 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Delyan Kratunov, davem, joannelkoong, andrii, tj, daniel,
	Dave Marchevsky, Kernel Team, bpf

On Tue, 30 Aug 2022 at 01:45, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Aug 29, 2022 at 4:18 PM Delyan Kratunov <delyank@fb.com> wrote:
> >
> > >
> > > It is not very precise, but until those maps are gone it delays
> > > release of the allocator (we can empty all percpu caches to save
> > > memory once bpf_map pinning the allocator is gone, because allocations
> > > are not going to be served). But it allows unit_free to be relatively
> > > less costly as long as those 'candidate' maps are around.
> >
> > Yes, we considered this but it's much easier to get to pathological behaviors, by
> > just loading and unloading programs that can access an allocator in a loop. The
> > freelists being empty help but it's still quite easy to hold a lot of memory for
> > nothing.
> >
> > The pointer walk was proposed to prune most such pathological cases while still being
> > conservative enough to be easy to implement. Only races with the pointer walk can
> > extend the lifetime unnecessarily.
>
> I'm getting lost in this thread.
>
> Here is my understanding so far:
> We don't free kernel kptrs from map in release_uref,
> but we should for local kptrs, since such objs are
> not much different from timers.
> So release_uref will xchg all such kptrs and free them
> into the allocator without touching allocator's refcnt.
> So there is no concurrency issue that Kumar was concerned about.

Haven't really thought through whether this will fix the concurrent
kptr swap problem, but then with this I think you need:
- New helper bpf_local_kptr_xchg(map, map_value, kptr)
- Associating map_uid of map, map_value
- Always doing atomic_inc_not_zero(map->usercnt) for each call to
local_kptr_xchg
1 and 2 because of inner_maps, 3 because of release_uref.
But maybe not a deal breaker?
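
A rough sketch of what 1 and 3 could look like kernel-side. Entirely
hypothetical: the existing bpf_kptr_xchg() takes no map argument, and
the final usercnt drop is glossed over here.

/* Refuse the exchange once user space has dropped its last reference,
 * so a release_uref-time flush of local kptrs cannot race with new
 * stores (same motivation as the map->usercnt check used for timers).
 */
BPF_CALL_3(bpf_local_kptr_xchg, struct bpf_map *, map, void *, map_value,
	   void *, ptr)
{
	unsigned long old;

	if (!atomic64_inc_not_zero(&map->usercnt))
		return 0;	/* uref already released, return NULL */

	old = xchg((unsigned long *)map_value, (unsigned long)ptr);

	/* Drop the temporary pin. A real implementation would have to
	 * route a final drop through the map's uref release path;
	 * simplified here.
	 */
	atomic64_dec(&map->usercnt);
	return old;
}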

> We might need two arrays though.
> prog->used_allocators[] and map->used_allocators[]
> The verifier would populate both at load time.
> At prog unload dec refcnt in one array.
> At map free dec refcnt in the other array.
> Map-in-map insert/delete of new map would copy allocators[] from
> outer map.
> As the general suggestion to solve this problem I think
> we really need to avoid run-time refcnt changes at alloc/free
> even when they're per-cpu 'fast'.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular
  2022-08-30  0:20             ` Kumar Kartikeya Dwivedi
@ 2022-08-30  0:26               ` Alexei Starovoitov
  2022-08-30  0:44                 ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 59+ messages in thread
From: Alexei Starovoitov @ 2022-08-30  0:26 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Delyan Kratunov, davem, joannelkoong, andrii, tj, daniel,
	Dave Marchevsky, Kernel Team, bpf

On Mon, Aug 29, 2022 at 5:20 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Tue, 30 Aug 2022 at 01:45, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, Aug 29, 2022 at 4:18 PM Delyan Kratunov <delyank@fb.com> wrote:
> > >
> > > >
> > > > It is not very precise, but until those maps are gone it delays
> > > > release of the allocator (we can empty all percpu caches to save
> > > > memory once bpf_map pinning the allocator is gone, because allocations
> > > > are not going to be served). But it allows unit_free to be relatively
> > > > less costly as long as those 'candidate' maps are around.
> > >
> > > Yes, we considered this but it's much easier to get to pathological behaviors, by
> > > just loading and unloading programs that can access an allocator in a loop. The
> > > freelists being empty help but it's still quite easy to hold a lot of memory for
> > > nothing.
> > >
> > > The pointer walk was proposed to prune most such pathological cases while still being
> > > conservative enough to be easy to implement. Only races with the pointer walk can
> > > extend the lifetime unnecessarily.
> >
> > I'm getting lost in this thread.
> >
> > Here is my understanding so far:
> > We don't free kernel kptrs from map in release_uref,
> > but we should for local kptrs, since such objs are
> > not much different from timers.
> > So release_uref will xchg all such kptrs and free them
> > into the allocator without touching allocator's refcnt.
> > So there is no concurrency issue that Kumar was concerned about.
>
> Haven't really thought through whether this will fix the concurrent
> kptr swap problem, but then with this I think you need:
> - New helper bpf_local_kptr_xchg(map, map_value, kptr)

no. why?
current bpf_kptr_xchg(void *map_value, void *ptr) should work.
The verifier knows map ptr from map_value.

> - Associating map_uid of map, map_value
> - Always doing atomic_inc_not_zero(map->usercnt) for each call to
> local_kptr_xchg
> 1 and 2 because of inner_maps, 3 because of release_uref.
> But maybe not a deal breaker?

No run-time refcnts.
All possible allocators will be added to map->used_allocators
at prog load time and allocator's refcnt incremented.
At run-time bpf_kptr_xchg(map_value, ptr) will be happening
with an allocator A which was added to that map->used_allocators
already.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular
  2022-08-30  0:26               ` Alexei Starovoitov
@ 2022-08-30  0:44                 ` Kumar Kartikeya Dwivedi
  2022-08-30  1:05                   ` Alexei Starovoitov
  0 siblings, 1 reply; 59+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-30  0:44 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Delyan Kratunov, davem, joannelkoong, andrii, tj, daniel,
	Dave Marchevsky, Kernel Team, bpf

On Tue, 30 Aug 2022 at 02:26, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Aug 29, 2022 at 5:20 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Tue, 30 Aug 2022 at 01:45, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Mon, Aug 29, 2022 at 4:18 PM Delyan Kratunov <delyank@fb.com> wrote:
> > > >
> > > > >
> > > > > It is not very precise, but until those maps are gone it delays
> > > > > release of the allocator (we can empty all percpu caches to save
> > > > > memory once bpf_map pinning the allocator is gone, because allocations
> > > > > are not going to be served). But it allows unit_free to be relatively
> > > > > less costly as long as those 'candidate' maps are around.
> > > >
> > > > Yes, we considered this but it's much easier to get to pathological behaviors, by
> > > > just loading and unloading programs that can access an allocator in a loop. The
> > > > freelists being empty help but it's still quite easy to hold a lot of memory for
> > > > nothing.
> > > >
> > > > The pointer walk was proposed to prune most such pathological cases while still being
> > > > conservative enough to be easy to implement. Only races with the pointer walk can
> > > > extend the lifetime unnecessarily.
> > >
> > > I'm getting lost in this thread.
> > >
> > > Here is my understanding so far:
> > > We don't free kernel kptrs from map in release_uref,
> > > but we should for local kptrs, since such objs are
> > > not much different from timers.
> > > So release_uref will xchg all such kptrs and free them
> > > into the allocator without touching allocator's refcnt.
> > > So there is no concurrency issue that Kumar was concerned about.
> >
> > Haven't really thought through whether this will fix the concurrent
> > kptr swap problem, but then with this I think you need:
> > - New helper bpf_local_kptr_xchg(map, map_value, kptr)
>
> no. why?
> current bpf_kptr_xchg(void *map_value, void *ptr) should work.
> The verifier knows map ptr from map_value.
>
> > - Associating map_uid of map, map_value
> > - Always doing atomic_inc_not_zero(map->usercnt) for each call to
> > local_kptr_xchg
> > 1 and 2 because of inner_maps, 3 because of release_uref.
> > But maybe not a deal breaker?
>
> No run-time refcnts.

How is a future kptr_xchg prevented for the map after its usercnt drops to 0?
If we don't check it at runtime, we can xchg in a non-NULL kptr after
the release_uref callback.
For timers you take the timer spinlock and read map->usercnt in
timer_set_callback.

Or do you mean this case can never happen with your approach?

> All possible allocators will be added to map->used_allocators
> at prog load time and allocator's refcnt incremented.
> At run-time bpf_kptr_xchg(map_value, ptr) will be happening
> with an allocator A which was added to that map->used_allocators
> already.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular
  2022-08-30  0:44                 ` Kumar Kartikeya Dwivedi
@ 2022-08-30  1:05                   ` Alexei Starovoitov
  2022-08-30  1:40                     ` Delyan Kratunov
  0 siblings, 1 reply; 59+ messages in thread
From: Alexei Starovoitov @ 2022-08-30  1:05 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Delyan Kratunov, davem, joannelkoong, andrii, tj, daniel,
	Dave Marchevsky, Kernel Team, bpf

On Mon, Aug 29, 2022 at 5:45 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Tue, 30 Aug 2022 at 02:26, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, Aug 29, 2022 at 5:20 PM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> > >
> > > On Tue, 30 Aug 2022 at 01:45, Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Mon, Aug 29, 2022 at 4:18 PM Delyan Kratunov <delyank@fb.com> wrote:
> > > > >
> > > > > >
> > > > > > It is not very precise, but until those maps are gone it delays
> > > > > > release of the allocator (we can empty all percpu caches to save
> > > > > > memory once bpf_map pinning the allocator is gone, because allocations
> > > > > > are not going to be served). But it allows unit_free to be relatively
> > > > > > less costly as long as those 'candidate' maps are around.
> > > > >
> > > > > Yes, we considered this but it's much easier to get to pathological behaviors, by
> > > > > just loading and unloading programs that can access an allocator in a loop. The
> > > > > freelists being empty help but it's still quite easy to hold a lot of memory for
> > > > > nothing.
> > > > >
> > > > > The pointer walk was proposed to prune most such pathological cases while still being
> > > > > conservative enough to be easy to implement. Only races with the pointer walk can
> > > > > extend the lifetime unnecessarily.
> > > >
> > > > I'm getting lost in this thread.
> > > >
> > > > Here is my understanding so far:
> > > > We don't free kernel kptrs from map in release_uref,
> > > > but we should for local kptrs, since such objs are
> > > > not much different from timers.
> > > > So release_uref will xchg all such kptrs and free them
> > > > into the allocator without touching allocator's refcnt.
> > > > So there is no concurrency issue that Kumar was concerned about.
> > >
> > > Haven't really thought through whether this will fix the concurrent
> > > kptr swap problem, but then with this I think you need:
> > > - New helper bpf_local_kptr_xchg(map, map_value, kptr)
> >
> > no. why?
> > current bpf_kptr_xchg(void *map_value, void *ptr) should work.
> > The verifier knows map ptr from map_value.
> >
> > > - Associating map_uid of map, map_value
> > > - Always doing atomic_inc_not_zero(map->usercnt) for each call to
> > > local_kptr_xchg
> > > 1 and 2 because of inner_maps, 3 because of release_uref.
> > > But maybe not a deal breaker?
> >
> > No run-time refcnts.
>
> How is future kptr_xchg prevented for the map after its usercnt drops to 0?
> If we don't check it at runtime we can xchg in non-NULL kptr after
> release_uref callback.
> For timer you are taking timer spinlock and reading map->usercnt in
> timer_set_callback.

Sorry I confused myself and others with release_uref.
I meant map_poke_untrack-like call.
When we drop refs from used maps in __bpf_free_used_maps
we walk all elements.
Similar idea here.
When prog is unloaded it cleans up all objects it allocated
and stored into maps before dropping refcnt-s
in prog->used_allocators.
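
Roughly (a sketch only; the helpers are hypothetical and just mirror
the shape of the existing used_maps walk at prog free time):

static void bpf_prog_free_local_kptrs(struct bpf_prog_aux *aux)
{
	int i;

	for (i = 0; i < aux->used_map_cnt; i++) {
		struct bpf_map *map = aux->used_maps[i];

		if (!bpf_map_has_local_kptr(map))	/* hypothetical */
			continue;
		/* xchg out and free every local kptr this prog could
		 * have stored; hypothetical walk + unit_free helper
		 */
		bpf_map_free_local_kptrs(map);
	}

	for (i = 0; i < aux->used_allocator_cnt; i++)	/* hypothetical */
		bpf_mem_alloc_put(aux->used_allocators[i]);
}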

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular
  2022-08-30  1:05                   ` Alexei Starovoitov
@ 2022-08-30  1:40                     ` Delyan Kratunov
  2022-08-30  3:34                       ` Alexei Starovoitov
  0 siblings, 1 reply; 59+ messages in thread
From: Delyan Kratunov @ 2022-08-30  1:40 UTC (permalink / raw)
  To: memxor, alexei.starovoitov
  Cc: tj, joannelkoong, andrii, davem, daniel, Dave Marchevsky,
	Kernel Team, bpf

On Mon, 2022-08-29 at 18:05 -0700, Alexei Starovoitov wrote:
> 
> On Mon, Aug 29, 2022 at 5:45 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> > 
> > On Tue, 30 Aug 2022 at 02:26, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > > 
> > > On Mon, Aug 29, 2022 at 5:20 PM Kumar Kartikeya Dwivedi
> > > <memxor@gmail.com> wrote:
> > > > 
> > > > On Tue, 30 Aug 2022 at 01:45, Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > > 
> > > > > On Mon, Aug 29, 2022 at 4:18 PM Delyan Kratunov <delyank@fb.com> wrote:
> > > > > > 
> > > > > > > 
> > > > > > > It is not very precise, but until those maps are gone it delays
> > > > > > > release of the allocator (we can empty all percpu caches to save
> > > > > > > memory once bpf_map pinning the allocator is gone, because allocations
> > > > > > > are not going to be served). But it allows unit_free to be relatively
> > > > > > > less costly as long as those 'candidate' maps are around.
> > > > > > 
> > > > > > Yes, we considered this but it's much easier to get to pathological behaviors, by
> > > > > > just loading and unloading programs that can access an allocator in a loop. The
> > > > > > freelists being empty help but it's still quite easy to hold a lot of memory for
> > > > > > nothing.
> > > > > > 
> > > > > > The pointer walk was proposed to prune most such pathological cases while still being
> > > > > > conservative enough to be easy to implement. Only races with the pointer walk can
> > > > > > extend the lifetime unnecessarily.
> > > > > 
> > > > > I'm getting lost in this thread.
> > > > > 
> > > > > Here is my understanding so far:
> > > > > We don't free kernel kptrs from map in release_uref,
> > > > > but we should for local kptrs, since such objs are
> > > > > not much different from timers.
> > > > > So release_uref will xchg all such kptrs and free them
> > > > > into the allocator without touching allocator's refcnt.
> > > > > So there is no concurrency issue that Kumar was concerned about.
Yes, if we populate used_allocators on load and copy them to inner maps, this might
work. It requires the most conservative approach where loading and unloading a
program in a loop would add its allocators to the visible pinned maps, accumulating
allocators we can't release until the map is gone.

However, I thought you wanted to walk the values instead to prevent this abuse. At
least that's the understanding I was operating under.

If a program has the max number of possible allocators and we just load/unload it in
a loop, with a visible pinned map, the used_allocators list of that map can easily
skyrocket. This is a problem in itself, in that it needs to grow dynamically (program
load/unload is a good context to do that from but inner map insert/delete can also
grow the list and that's in a bad context potentially).

> > > > 
> > > > Haven't really thought through whether this will fix the concurrent
> > > > kptr swap problem, but then with this I think you need:
> > > > - New helper bpf_local_kptr_xchg(map, map_value, kptr)
> > > 
> > > no. why?
> > > current bpf_kptr_xchg(void *map_value, void *ptr) should work.
> > > The verifier knows map ptr from map_value.
> > > 
> > > > - Associating map_uid of map, map_value
> > > > - Always doing atomic_inc_not_zero(map->usercnt) for each call to
> > > > local_kptr_xchg
> > > > 1 and 2 because of inner_maps, 3 because of release_uref.
> > > > But maybe not a deal breaker?
> > > 
> > > No run-time refcnts.
> > 
> > How is future kptr_xchg prevented for the map after its usercnt drops to 0?
> > If we don't check it at runtime we can xchg in non-NULL kptr after
> > release_uref callback.
> > For timer you are taking timer spinlock and reading map->usercnt in
> > timer_set_callback.
> 
> Sorry I confused myself and others with release_uref.
> I meant map_poke_untrack-like call.
> When we drop refs from used maps in __bpf_free_used_maps
> we walk all elements.
> Similar idea here.
> When prog is unloaded it cleans up all objects it allocated
> and stored into maps before dropping refcnt-s
> in prog->used_allocators.

For an allocator that's visible from multiple programs (say, it's in a map-of-maps),
how would we know which values were allocated by which program? Do we forbid shared
allocators outright? I can imagine container agent-like software wanting to reuse its
allocator caches from a previous run.

Besides, cleaning up the values is the easy part - so long as the map is extending
the allocator's lifetime, this is safe, we can even use the kptr destructor mechanism
(though I'd rather not).

There are three main cases of lifetime extension:
1) Directly visible maps - normal used_allocators with or without a walk works here.
2) Inner maps. I plan to straight up disallow their insertion if they have a
kptr_local field. If someone needs this, we can solve it then, it's too hard to do
the union of lists logic cleanly for a v1.
3) Stack of another program. I'm not sure how to require that programs with
kptr_local do not use the same btf structs but if that's possible, this can naturally
be disallowed via the btf comparisons failing (it's another case of btf domain
crossover). Maybe we tag the btf struct as containing local kptr type tags and
disallow its use for more than one program?

Separately, I think I just won't allow allocators as inner maps, that's for another
day too (though it may work just fine).

Perfect, enemy of the good, something-something.

-- Delyan

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular
  2022-08-30  1:40                     ` Delyan Kratunov
@ 2022-08-30  3:34                       ` Alexei Starovoitov
  2022-08-30  5:02                         ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 59+ messages in thread
From: Alexei Starovoitov @ 2022-08-30  3:34 UTC (permalink / raw)
  To: Delyan Kratunov
  Cc: memxor, tj, joannelkoong, andrii, davem, daniel, Dave Marchevsky,
	Kernel Team, bpf

On Mon, Aug 29, 2022 at 6:40 PM Delyan Kratunov <delyank@fb.com> wrote:
>
> Yes, if we populate used_allocators on load and copy them to inner maps, this might
> work. It requires the most conservative approach where loading and unloading a
> program in a loop would add its allocators to the visible pinned maps, accumulating
> allocators we can't release until the map is gone.
>
> However, I thought you wanted to walk the values instead to prevent this abuse. At
> least that's the understanding I was operating under.
>
> If a program has the max number of possible allocators and we just load/unload it in
> a loop, with a visible pinned map, the used_allocators list of that map can easily
> skyrocket.

Right. That will be an issue if we don't trim the list
at prog unload.

> > Sorry I confused myself and others with release_uref.
> > I meant map_poke_untrack-like call.
> > When we drop refs from used maps in __bpf_free_used_maps
> > we walk all elements.
> > Similar idea here.
> > When prog is unloaded it cleans up all objects it allocated
> > and stored into maps before dropping refcnt-s
> > in prog->used_allocators.
>
> For an allocator that's visible from multiple programs (say, it's in a map-of-maps),
> how would we know which values were allocated by which program? Do we forbid shared
> allocators outright?

Hopefully we don't need to forbid shared allocators and can
allow map-in-map to contain local kptrs.

> Separately, I think I just won't allow allocators as inner maps, that's for another
> day too (though it may work just fine).
>
> Perfect, enemy of the good, something-something.

Right, but if we can allow that with something simple
it would be nice.

After a lot of head scratching I realized that the
walk-all-elems-and-kptr_xchg approach doesn't work,
because prog_A and prog_B might share a map, and while
prog_A is unloaded and trying to do kptr_xchg on all elems,
prog_B might kptr_xchg as well, so the walk_all loop
will miss kptrs.
The prog->used_allocators[] approach is broken too,
since prog_B (in the example above) might see
objects that were allocated from prog_A's allocators.
Populating prog->used_allocators at load time is incorrect.

To prevent all of these issues, how about we
restrict each local kptr field to contain pointers
from only one allocator.
When parsing the map's BTF, if there is only one local kptr
in the map value, the equivalent of map->used_allocators[]
is guaranteed to contain only one allocator.
Two local kptrs in the map value -> potentially two allocators.

So here is the new proposal:

At load time the verifier walks all kptr_xchg(map_value, obj)
and adds obj's allocator to
map->used_allocators <- {kptr_offset, allocator};
If kptr_offset already exists -> failure to load.
Allocator can probably be a part of struct bpf_map_value_off_desc.

In other words the pairs of {kptr_offset, allocator}
say 'there could be an object from that allocator in
that kptr in some map values'.

Do nothing at prog unload.

At map free time walk all elements and free kptrs.
Finally drop allocator refcnts.
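
To make the {kptr_offset, allocator} pairing concrete, a sketch (the
alloc field and the helpers below are hypothetical):

struct bpf_map_value_off_desc {
	u32 offset;
	/* ... existing kptr fields ... */
	struct bpf_mem_alloc *alloc;	/* hypothetical: bound at load time */
};

/* Called by the verifier when it sees bpf_kptr_xchg(map_value + offset, obj)
 * and knows which allocator map obj was allocated from.
 */
static int bind_kptr_allocator(struct bpf_map_value_off_desc *off_desc,
			       struct bpf_mem_alloc *ma)
{
	if (off_desc->alloc)
		return off_desc->alloc == ma ? 0 : -EEXIST; /* one allocator per offset */
	bpf_mem_alloc_get(ma);	/* hypothetical; dropped at map free time */
	off_desc->alloc = ma;
	return 0;
}

/* At map free time: walk all elements, free each local kptr, then drop
 * off_desc->alloc for every bound offset.
 */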

This approach allows sharing of allocators.
Local kptrs in map-in-map should also be fine.
If not, we have a problem with bpf_map_value_off_desc
and map-in-map anyway.

The prog doesn't need a special used_allocators list,
since if a bpf prog doesn't do kptr_xchg, all allocated
objects will be freed during prog execution.
Instead, since the allocator is a different type of map, it
should go into the existing used_maps[] to make sure
we don't free the allocator while the prog is executing.

Maybe with this approach we won't even need to
hide the allocator pointer in the 8 bytes in front of the object.
For every pointer returned from kptr_xchg the verifier
will know which allocator is supposed to be used for freeing.

Thoughts?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular
  2022-08-30  3:34                       ` Alexei Starovoitov
@ 2022-08-30  5:02                         ` Kumar Kartikeya Dwivedi
  2022-08-30  6:03                           ` Alexei Starovoitov
  0 siblings, 1 reply; 59+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-30  5:02 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Delyan Kratunov, tj, joannelkoong, andrii, davem, daniel,
	Dave Marchevsky, Kernel Team, bpf

On Tue, 30 Aug 2022 at 05:35, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Aug 29, 2022 at 6:40 PM Delyan Kratunov <delyank@fb.com> wrote:
> >
> > Yes, if we populate used_allocators on load and copy them to inner maps, this might
> > work. It requires the most conservative approach where loading and unloading a
> > program in a loop would add its allocators to the visible pinned maps, accumulating
> > allocators we can't release until the map is gone.
> >
> > However, I thought you wanted to walk the values instead to prevent this abuse. At
> > least that's the understanding I was operating under.
> >
> > If a program has the max number of possible allocators and we just load/unload it in
> > a loop, with a visible pinned map, the used_allocators list of that map can easily
> > skyrocket.
>
> Right. That will be an issue if we don't trim the list
> at prog unload.
>
> > > Sorry I confused myself and others with release_uref.
> > > I meant map_poke_untrack-like call.
> > > When we drop refs from used maps in __bpf_free_used_maps
> > > we walk all elements.
> > > Similar idea here.
> > > When prog is unloaded it cleans up all objects it allocated
> > > and stored into maps before dropping refcnt-s
> > > in prog->used_allocators.
> >
> > For an allocator that's visible from multiple programs (say, it's in a map-of-maps),
> > how would we know which values were allocated by which program? Do we forbid shared
> > allocators outright?
>
> Hopefully we don't need to forbid shared allocators and
> allow map-in-map to contain kptr local.
>
> > Separately, I think I just won't allow allocators as inner maps, that's for another
> > day too (though it may work just fine).
> >
> > Perfect, enemy of the good, something-something.
>
> Right, but if we can allow that with something simple
> it would be nice.
>
> After a lot of head scratching realized that
> walk-all-elems-and-kptr_xchg approach doesn't work,
> because prog_A and prog_B might share a map and when
> prog_A is unloaded and trying to do kptr_xchg on all elems
> the prog_B might kptr_xchg as well and walk_all loop
> will miss kptrs.

Agreed, I can't see it working either.

> prog->used_allocators[] approach is broken too.
> Since the prog_B (in the example above) might see
> objects that were allocated from prog_A's allocators.
> prog->used_allocators at load time is incorrect.
>
> To prevent all of these issues how about we
> restrict kptr local to contain a pointer only
> from one allocator.
> When parsing map's BTF if there is only one kptr local
> in the map value the equivalent of map->used_allocators[]
> will guarantee to contain only one allocator.
> Two kptr locals in the map value -> potentially two allocators.
>
> So here is new proposal:
>

Thanks for the proposal, Alexei. I think we're getting close to a
solution, but still some comments below.

> At load time the verifier walks all kptr_xchg(map_value, obj)
> and adds obj's allocator to
> map->used_allocators <- {kptr_offset, allocator};
> If kptr_offset already exists -> failure to load.
> Allocator can probably be a part of struct bpf_map_value_off_desc.
>
> In other words the pairs of {kptr_offset, allocator}
> say 'there could be an object from that allocator in
> that kptr in some map values'.
>
> Do nothing at prog unload.
>
> At map free time walk all elements and free kptrs.
> Finally drop allocator refcnts.
>

Yes, this should be possible.
It's quite easy to capture the map_ptr for the allocated local kptr.
Limiting each local kptr to one allocator is probably fine, at least for a v1.

One problem I see is how it works when the allocator map is an inner map.
Then, it is not possible to find the backing allocator instance at
verification time, hence not possible to take the reference to it in
map->used_allocators.
But let's just assume that is disallowed for now.

The other problem I see is that when a program just does
kptr_xchg(map_value, NULL), we may not have the allocator info for
that kptr_offset at that moment. The allocating prog which fills
used_allocators may be verified later. We _can_ reject this, but it
makes everything fragile (dependent on which order you load programs
in), which won't be great. This lost info can then be used to make a
kptr's lifetime disjoint from its allocator's lifetime.

Let me explain through an example.

Consider this order to set up the programs:
One allocator map A.
Two hashmaps M1, M2.
Three programs P1, P2, P3.

P1 uses M1, M2.
P2 uses A, M1.
P3 uses M2.

Sequence:
map_create A, M1, M2.

Load P1, uses M1, M2. What this P1 does is:
p = kptr_xchg(M1.value, NULL);
kptr_xchg(M2.value, p);

So it moves the kptr in M1 into M2. The problem is at this point
kptr_offset is not populated, so we cannot fill used_allocators of M2
as we cannot track which allocator is used to fill M1.value. We saw
nothing filling it yet.

Next, load P3. It does:
p = kptr_xchg(M2.value, NULL);
unit_free(p); // let's assume p has bpf_mem_alloc ptr behind itself so
this is ok if allocator is alive.

Again, M2.used_allocators is empty. Nothing is filled into it.

Next, load P2.
p = alloc(&A, ...);
kptr_xchg(M1.value, p);

Now, M1.used_allocators is filled with allocator ref and kptr_offset.
But M2.used_allocators will remain unfilled.

Now, run programs in sequence of P2, then P1. This will allocate from
A, and move the ref to M1, then to M2. But only P1 and P2 have
references to M1 so it keeps the allocator alive. However, now unload
both P1 and P2.
P1, P2, A, allocator of A, M1 all can be freed after RCU gp wait. M2
is still held by loaded P3.

Now, M2.used_allocators is empty. P3 is using it, and it is holding
allocation from allocator A. Both M1 and A are freed.
When P3 runs now, it can kptr_xchg and try to free it, and the same
UAF happens again.
If not that, there is a UAF when M2 is freed and it does unit_free on the still-live local kptr.
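
Spelled out as rough BPF C, the three programs are just the following
(this is only a sketch of the sequence above; unit_free() stands in for
whatever the free helper ends up being, bpf_mem_alloc()/type_id_local()
follow the proposed API, and the lookups/NULL checks are abbreviated):

struct foo { int var; };
struct map_val { struct foo *ptr __kptr __local; };

/* P2: allocate from A and publish into M1 */
v1 = bpf_map_lookup_elem(&M1, &key);
p = bpf_mem_alloc(&A, type_id_local(struct foo));
old = bpf_kptr_xchg(&v1->ptr, p);       /* old, if any, would be freed here */

/* P1: move the object from M1 into M2, never referencing A itself */
v1 = bpf_map_lookup_elem(&M1, &key);
v2 = bpf_map_lookup_elem(&M2, &key);
p = bpf_kptr_xchg(&v1->ptr, NULL);
old = bpf_kptr_xchg(&v2->ptr, p);       /* old would need freeing too */

/* P3: only references M2; once A and M1 are freed this is the UAF */
v2 = bpf_map_lookup_elem(&M2, &key);
p = bpf_kptr_xchg(&v2->ptr, NULL);
if (p)
        unit_free(p);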

--

Will this case be covered by your approach? Did I miss something?

The main issue is that this allocator info can be lost depending on
how you verify a set of programs. It would not be lost if we verified
in order P2, P1, P3 instead of the current P1, P3, P2.

So we might have to teach the verifier to identify kptr_xchg edges
between maps, and propagate any used_allocators to the other map? But
it's becoming too complicated.

You _can_ reject loads of programs when you don't find kptr_offset
populated on seeing kptr_xchg(..., NULL), but I don't think this is
practical either. It makes things sensitive to program
verification order, which would be confusing for users.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular
  2022-08-30  5:02                         ` Kumar Kartikeya Dwivedi
@ 2022-08-30  6:03                           ` Alexei Starovoitov
  2022-08-30 20:31                             ` Delyan Kratunov
  0 siblings, 1 reply; 59+ messages in thread
From: Alexei Starovoitov @ 2022-08-30  6:03 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Delyan Kratunov, tj, joannelkoong, andrii, davem, daniel,
	Dave Marchevsky, Kernel Team, bpf

On Mon, Aug 29, 2022 at 10:03 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
> >
> > So here is new proposal:
> >
>
> Thanks for the proposal, Alexei. I think we're getting close to a
> solution, but still some comments below.
>
> > At load time the verifier walks all kptr_xchg(map_value, obj)
> > and adds obj's allocator to
> > map->used_allocators <- {kptr_offset, allocator};
> > If kptr_offset already exists -> failure to load.
> > Allocator can probably be a part of struct bpf_map_value_off_desc.
> >
> > In other words the pairs of {kptr_offset, allocator}
> > say 'there could be an object from that allocator in
> > that kptr in some map values'.
> >
> > Do nothing at prog unload.
> >
> > At map free time walk all elements and free kptrs.
> > Finally drop allocator refcnts.
> >
>
> Yes, this should be possible.
> It's quite easy to capture the map_ptr for the allocated local kptr.
> Limiting each local kptr to one allocator is probably fine, at least for a v1.
>
> One problem I see is how it works when the allocator map is an inner map.
> Then, it is not possible to find the backing allocator instance at
> verification time, hence not possible to take the reference to it in
> map->used_allocators.
> But let's just assume that is disallowed for now.
>
> The other problem I see is that when the program just does
> kptr_xchg(map_value, NULL), we may not have allocator info from
> kptr_offset at that moment. Allocating prog which fills
> used_allocators may be verified later. We _can_ reject this, but it
> makes everything fragile (dependent on which order you load programs
> in), which won't be great. You can then use this lost info to make
> kptr disjoint from allocator lifetime.
>
> Let me explain through an example.
>
> Consider this order to set up the programs:
> One allocator map A.
> Two hashmaps M1, M2.
> Three programs P1, P2, P3.
>
> P1 uses M1, M2.
> P2 uses A, M1.
> P3 uses M2.
>
> Sequence:
> map_create A, M1, M2.
>
> Load P1, uses M1, M2. What this P1 does is:
> p = kptr_xchg(M1.value, NULL);
> kptr_xchg(M2.value, p);
>
> So it moves the kptr in M1 into M2. The problem is at this point
> kptr_offset is not populated, so we cannot fill used_allocators of M2
> as we cannot track which allocator is used to fill M1.value. We saw
> nothing filling it yet.
>
> Next, load P3. It does:
> p = kptr_xchg(M2.value, NULL);
> unit_free(p); // let's assume p has bpf_mem_alloc ptr behind itself so
> this is ok if allocator is alive.
>
> Again, M2.used_allocators is empty. Nothing is filled into it.
>
> Next, load P2.
> p = alloc(&A, ...);
> kptr_xchg(M1.value, p);
>
> Now, M1.used_allocators is filled with allocator ref and kptr_offset.
> But M2.used_allocators will remain unfilled.
>
> Now, run programs in sequence of P2, then P1. This will allocate from
> A, and move the ref to M1, then to M2. But only P1 and P2 have
> references to M1 so it keeps the allocator alive. However, now unload
> both P1 and P2.
> P1, P2, A, allocator of A, M1 all can be freed after RCU gp wait. M2
> is still held by loaded P3.
>
> Now, M2.used_allocators is empty. P3 is using it, and it is holding
> allocation from allocator A. Both M1 and A are freed.
> When P3 runs now, it can kptr_xchg and try to free it, and the same
> uaf happens again.
> If not that, uaf when M2 is freed and it does unit_free on the alive local kptr.
>
> --
>
> Will this case be covered by your approach? Did I miss something?
>
> The main issue is that this allocator info can be lost depending on
> how you verify a set of programs. It would not be lost if we verified
> in order P2, P1, P3 instead of the current P1, P3, P2.
>
> So we might have to teach the verifier to identify kptr_xchg edges
> between maps, and propagate any used_allocators to the other map? But
> it's becoming too complicated.
>
> You _can_ reject loads of programs when you don't find kptr_offset
> populated on seeing kptr_xchg(..., NULL), but I don't think this is
> practical either. It makes the things sensitive to program
> verification order, which would be confusing for users.

Right. Thanks for brainstorming and coming up with the case
where it breaks.

Let me explain the thought process behind the proposal.
The way the progs will be written will be something like:

struct foo {
  int var;
};

struct {
  __uint(type, BPF_MAP_TYPE_ALLOCATOR);
  __type(value, struct foo);
} ma SEC(".maps");

struct map_val {
  struct foo * ptr __kptr __local;
};

struct {
  __uint(type, BPF_MAP_TYPE_HASH);
  __uint(max_entries, 123);
  __type(key, __u32);
  __type(value, struct map_val);
} hash SEC(".maps");

struct foo * p = bpf_mem_alloc(&ma, type_id_local(struct foo));
bpf_kptr_xchg(&map_val->ptr, p);

Even if a prog doesn't allocate and only does kptr_xchg, like
your P1 and P3 do, the C code has to have the full
definition of 'struct foo' to compile P1 and P3.
P1 and P3 don't need to see the definition of 'ma' to be compiled,
but 'struct foo' has to be seen.
BTF reference will be taken by both 'ma' and by 'hash'.
The btf_id will come from that BTF.

The type is tied to BTF and tied to kptr in map value.
The type is also tied to the allocator.
The type creates a dependency chain between allocator and map.
So the restriction of one allocator per kptr feels
reasonable and doesn't feel restrictive at all.
That dependency chain is there in the C code of the program.
Hence the proposal to discover this dependency in the verifier
through tracking of allocator from mem_alloc into kptr_xchg.
But you're correct that it's not working for P1 and P3.

I can imagine a few ways to solve it.
1. Ask users to annotate kptr local with the allocator
that will be used.
It's easy for progs P1 and P3. All definitions are likely available.
It's only an extra tag of some form.

2. move 'used_allocator' from map further into BTF,
  since BTF is the root of this dependency chain.
When the verifier sees bpf_mem_alloc in P2 it will add
{allocator, btf_id} pair to BTF.

If P1 loads first and the verifier sees:
p = kptr_xchg(M1.value, NULL);
it will add {unique_allocator_placeholder, btf_id} to BTF.
Then
kptr_xchg(M2.value, p); does nothing.
The verifier checks that M1's BTF == M2's BTF and id-s are same.

Then P3 loads with:
p = kptr_xchg(M2.value, NULL);
unit_free(p);
since unique_allocator_placeholder is already there for that btf_id
the verifier does nothing.

Eventually it will see bpf_mem_alloc for that btf_id and will
replace the placeholder with the actual allocator.
We can even allow P1 and P3 to be runnable after load right away.
Since nothing can allocate into that kptr local those
kptr_xchg() in P1 and P3 will be returning NULL.
If P2 with bpf_mem_alloc never loads it's fine. Not a safety issue.

Ideally for unit_free(p); in P3 the verifier would add a hidden
'ma' argument, so that the allocator doesn't need to be stored dynamically.
We can either patch the insns of P3 after P2 was verified, or
pass a pointer to a place in the BTF->used_allocator list of pairs
where the actual allocator pointer will be written later.
Then no patching is needed.
If P2 never loads, the unit_free(*addr /* here it will load the
value of unique_allocator_placeholder */, ...) would see the placeholder,
but that's fine, since unit_free() will never execute with a valid obj to be freed.

"At map free time walk all elements and free kptrs" step stays the same.
but we decrement allocator refcnt only when BTF is freed
which should be after map free time.
So btf_put(map->btf); would need to move after ops->map_free.
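
In code the per-BTF list could be as small as something like this
(a sketch only, all names made up):

struct btf_used_allocator {
        u32 btf_id;                  /* local kptr type */
        struct bpf_mem_alloc *ma;    /* real allocator once some prog
                                      * bpf_mem_alloc()-s this btf_id,
                                      * unique_allocator_placeholder
                                      * until then */
};

/* struct btf would hold an array/list of these; the allocator refcnts
 * they hold are dropped when the BTF is freed, after ops->map_free.
 */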

Maybe unique_allocator_placeholder approach can be used
without moving the list into BTF. Need to think more about it tomorrow.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular
  2022-08-30  6:03                           ` Alexei Starovoitov
@ 2022-08-30 20:31                             ` Delyan Kratunov
  2022-08-31  1:52                               ` Alexei Starovoitov
  0 siblings, 1 reply; 59+ messages in thread
From: Delyan Kratunov @ 2022-08-30 20:31 UTC (permalink / raw)
  To: memxor, alexei.starovoitov
  Cc: davem, joannelkoong, andrii, tj, daniel, Dave Marchevsky,
	Kernel Team, bpf

On Mon, 2022-08-29 at 23:03 -0700, Alexei Starovoitov wrote:
> 
> On Mon, Aug 29, 2022 at 10:03 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> > > 
> > > So here is new proposal:
> > > 
> > 
> > Thanks for the proposal, Alexei. I think we're getting close to a
> > solution, but still some comments below.
> > 
> > > At load time the verifier walks all kptr_xchg(map_value, obj)
> > > and adds obj's allocator to
> > > map->used_allocators <- {kptr_offset, allocator};
> > > If kptr_offset already exists -> failure to load.
> > > Allocator can probably be a part of struct bpf_map_value_off_desc.
> > > 
> > > In other words the pairs of {kptr_offset, allocator}
> > > say 'there could be an object from that allocator in
> > > that kptr in some map values'.
> > > 
> > > Do nothing at prog unload.
> > > 
> > > At map free time walk all elements and free kptrs.
> > > Finally drop allocator refcnts.
> > > 
> > 
> > Yes, this should be possible.
> > It's quite easy to capture the map_ptr for the allocated local kptr.
> > Limiting each local kptr to one allocator is probably fine, at least for a v1.
> > 
> > One problem I see is how it works when the allocator map is an inner map.
> > Then, it is not possible to find the backing allocator instance at
> > verification time, hence not possible to take the reference to it in
> > map->used_allocators.
> > But let's just assume that is disallowed for now.
> > 
> > The other problem I see is that when the program just does
> > kptr_xchg(map_value, NULL), we may not have allocator info from
> > kptr_offset at that moment. Allocating prog which fills
> > used_allocators may be verified later. We _can_ reject this, but it
> > makes everything fragile (dependent on which order you load programs
> > in), which won't be great. You can then use this lost info to make
> > kptr disjoint from allocator lifetime.
> > 
> > Let me explain through an example.
> > 
> > Consider this order to set up the programs:
> > One allocator map A.
> > Two hashmaps M1, M2.
> > Three programs P1, P2, P3.
> > 
> > P1 uses M1, M2.
> > P2 uses A, M1.
> > P3 uses M2.
> > 
> > Sequence:
> > map_create A, M1, M2.
> > 
> > Load P1, uses M1, M2. What this P1 does is:
> > p = kptr_xchg(M1.value, NULL);
> > kptr_xchg(M2.value, p);
> > 
> > So it moves the kptr in M1 into M2. The problem is at this point
> > kptr_offset is not populated, so we cannot fill used_allocators of M2
> > as we cannot track which allocator is used to fill M1.value. We saw
> > nothing filling it yet.
> > 
> > Next, load P3. It does:
> > p = kptr_xchg(M2.value, NULL);
> > unit_free(p); // let's assume p has bpf_mem_alloc ptr behind itself so
> > this is ok if allocator is alive.
> > 
> > Again, M2.used_allocators is empty. Nothing is filled into it.
> > 
> > Next, load P2.
> > p = alloc(&A, ...);
> > kptr_xchg(M1.value, p);
> > 
> > Now, M1.used_allocators is filled with allocator ref and kptr_offset.
> > But M2.used_allocators will remain unfilled.
> > 
> > Now, run programs in sequence of P2, then P1. This will allocate from
> > A, and move the ref to M1, then to M2. But only P1 and P2 have
> > references to M1 so it keeps the allocator alive. However, now unload
> > both P1 and P2.
> > P1, P2, A, allocator of A, M1 all can be freed after RCU gp wait. M2
> > is still held by loaded P3.
> > 
> > Now, M2.used_allocators is empty. P3 is using it, and it is holding
> > allocation from allocator A. Both M1 and A are freed.
> > When P3 runs now, it can kptr_xchg and try to free it, and the same
> > uaf happens again.
> > If not that, uaf when M2 is freed and it does unit_free on the alive local kptr.
> > 
> > --
> > 
> > Will this case be covered by your approach? Did I miss something?
> > 
> > The main issue is that this allocator info can be lost depending on
> > how you verify a set of programs. It would not be lost if we verified
> > in order P2, P1, P3 instead of the current P1, P3, P2.
> > 
> > So we might have to teach the verifier to identify kptr_xchg edges
> > between maps, and propagate any used_allocators to the other map? But
> > it's becoming too complicated.
> > 
> > You _can_ reject loads of programs when you don't find kptr_offset
> > populated on seeing kptr_xchg(..., NULL), but I don't think this is
> > practical either. It makes the things sensitive to program
> > verification order, which would be confusing for users.
> 
> Right. Thanks for brainstorming and coming up with the case
> where it breaks.
> 
> Let me explain the thought process behind the proposal.
> The way the progs will be written will be something like:
> 
> struct foo {
>   int var;
> };
> 
> struct {
>   __uint(type, BPF_MAP_TYPE_ALLOCATOR);
>   __type(value, struct foo);
> } ma SEC(".maps");
> 
> struct map_val {
>   struct foo * ptr __kptr __local;
> };
> 
> struct {
>   __uint(type, BPF_MAP_TYPE_HASH);
>   __uint(max_entries, 123);
>   __type(key, __u32);
>   __type(value, struct map_val);
> } hash SEC(".maps");
> 
> struct foo * p = bpf_mem_alloc(&ma, type_id_local(struct foo));
> bpf_kptr_xchg(&map_val->ptr, p);
> 
> Even if prog doesn't allocate and only does kptr_xchg like
> your P1 and P3 do the C code has to have a full
> definition 'struct foo' to compile P1 and P3.
> P1 and P3 don't need to see definition of 'ma' to be compiled,
> but 'struct foo' has to be seen.
> BTF reference will be taken by both 'ma' and by 'hash'.
> The btf_id will come from that BTF.
> 
> The type is tied to BTF and tied to kptr in map value.
> The type is also tied to the allocator.
> The type creates a dependency chain between allocator and map.
> So the restriction of one allocator per kptr feels
> reasonable and doesn't feel restrictive at all.
> That dependency chain is there in the C code of the program.
> Hence the proposal to discover this dependency in the verifier
> through tracking of allocator from mem_alloc into kptr_xchg.
> But you're correct that it's not working for P1 and P3.

Encoding the allocator in the runtime type system is fine but it comes with its own
set of tradeoffs.

> 
> I can imagine a few ways to solve it.
> 1. Ask users to annotate kptr local with the allocator
> that will be used.
> It's easy for progs P1 and P3. All definitions are likely available.
> It's only an extra tag of some form.

This would introduce maps referring to other maps and would thus require substantial
work in libbpf. In order to encode the link specific to field instead of the whole
map object, we'd have to resort to something like map names as type tags, which is
not a great design (arbitrary strings etc). 

> 2. move 'used_allocator' from map further into BTF,
>   since BTF is the root of this dependency chain.

This would _maybe_ work. The hole I can see in this plan is that once a slot is
claimed, it cannot be unclaimed and thus maps can remain in a state that leaves the
local kptr fields useless (i.e. the allocator is around but no allocator map for it
exists and thus no program can use these fields again, they can only be free()).

The reason it can't be unclaimed is that programs were verified with a specific
allocator value in mind and we can't change that after they're loaded. To be able to
unclaim a slot, we'd need to remove everything using that system (i.e. btf object)
and load it again, which is not great.

> When the verifier sees bpf_mem_alloc in P2 it will add
> {allocator, btf_id} pair to BTF.
> 
> If P1 loads first and the verifier see:
> p = kptr_xchg(M1.value, NULL);
> it will add {unique_allocator_placeholder, btf_id} to BTF.
> Then
> kptr_xchg(M2.value, p); does nothing.
> The verifier checks that M1's BTF == M2's BTF and id-s are same.

Note to self: don't allow setting base_btf from userspace without auditing all these
checks.

> 
> Then P3 loads with:
> p = kptr_xchg(M2.value, NULL);
> unit_free(p);
> since unique_allocator_placholder is already there for that btf_id
> the verifier does nothing.
> 
> Eventually it will see bpf_mem_alloc for that btf_id and will
> replace the placeholder with the actual allocator.
> We can even allow P1 and P3 to be runnable after load right away.
> Since nothing can allocate into that kptr local those
> kptr_xchg() in P1 and P3 will be returning NULL.
> If P2 with bpf_mem_alloc never loads it's fine. Not a safety issue.
> 
> Ideally for unit_free(p); in P3 the verifier would add a hidden
> 'ma' argument, so that allocator doesn't need to be stored dynamically.
> We can either insns of P3 after P2 was verified or
> pass a pointer to a place in BTF->used_allocator list of pairs
> where actual allocator pointer will be written later.
> Then no patching is needed.
> If P2 never loads the unit_free(*addr /* here it will load the
> value of unique_allocator_placeholder */, ...)
> but since unit_free() will never execute with valid obj to be freed.
> 
> "At map free time walk all elements and free kptrs" step stays the same.
> but we decrement allocator refcnt only when BTF is freed
> which should be after map free time.
> So btf_put(map->btf); would need to move after ops->map_free.
> 
> Maybe unique_allocator_placeholder approach can be used
> without moving the list into BTF. Need to think more about it tomorrow.

I don't think we have to resort to storing the type-allocator mappings in the BTF at
all.

In fact, we can encode them where you wanted to encode them with tags - on the fields
themselves. We can put the mem_alloc reference in the kptr field descriptors and have
it transition to a specific allocator irreversibly. We would need to record any
equivalences between allocators that the currently loaded programs require, but it's
not impossible.

Making the transition reversible is the hard part of course.
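
(Concretely, I'm imagining the field descriptor growing something roughly
like this - a sketch only, ignoring the exact layout bpf_map_value_off_desc
has today:)

struct bpf_map_value_off_desc {
        u32 offset;
        /* ... existing kptr descriptor fields ... */
        struct bpf_mem_alloc *ma;    /* allocator this field is bound to;
                                      * a placeholder until the first
                                      * allocating program is verified */
};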

Taking a step back, we're trying to convert a runtime property (this object came from
this allocator) into a static property. The _cleanest_ way to do this would be to
store the dependencies explicitly and propagate/dissolve them on program load/unload.
Note that only programs introduce new dependency edges, maps on their own do not
mandate dependencies but stored values can extend the lifetime of a dependency chain.

There are only a couple of types of dependencies:
Program X stores allocation from Y into map Z, field A
Program X requires allocator for Y.A to equal allocator for Z.A (+ a version for
inner maps)
Program X uses field Z.A (therefore Z.A values may live on stack, so you can't walk
the map yet)

On program load, we materialize these in a table. They can have placeholder values or
take values from existing maps.

When a program loads that makes a placeholder a concrete allocator instance, we
propagate that to all related dependencies (it's kinda like constant propagation).
Calls to bpf_obj_free can receive a bpf_mem_alloc** to a field in that program's
dependency in this table.

On program unload, we can tag the relationships the program introduced as ready to
discard. Once the entire chain is ready to discard, no programs require this
relationship, so we can potentially walk the map and if all the values are NULL,
dissolve the relationship. The map field can now be used with a different allocator.

If there's a non-NULL value in the map, we can't do much - we need a program to load
that uses this map and on that program's unload, we can check again. On map free, we
can free the values, of course, but we can't remove the dependency edges from the
table, since values may have propagated to other tables (this depends on the concrete
implementation - we might be able to have the map remove all edges that reference
it).

Once all relationships for allocator A are gone, we can destroy it safely.

This scheme allows for reuse of maps so long as the values get cleared before
attempts to store in that field from a second allocator. It does not allow mixing
allocators for the same field.

I may have missed relationships but they too can follow this pattern - you store them
explicitly, then reason about them at program load/unload or map free.
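
(To make "store the dependencies explicitly" concrete, the table entries I have in
mind are roughly of this shape - all names hypothetical:)

enum dep_kind {
        DEP_STORES_INTO,        /* prog stores allocs from allocator A into Z.A */
        DEP_REQUIRES_EQUAL,     /* prog requires allocator of Y.A == allocator of Z.A */
        DEP_USES_FIELD,         /* prog uses Z.A, values may be live on its stack */
};

struct alloc_dep {
        enum dep_kind kind;
        struct bpf_prog *prog;          /* who introduced the edge */
        struct bpf_mem_alloc *alloc;    /* concrete allocator or a placeholder */
        struct bpf_map *map;
        u32 field_offset;
        bool discard;                   /* set when the prog unloads */
};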

Anything less explicit is doing this anyway, either less precisely or less clearly.

All this machinery _does_ create the closest approximation of the runtime property
(per field instead of per value) but it's also approximating something that the
allocator can easily keep track of itself, precisely, at runtime.

I don't think it's worth the complexity, explicit or not.

-- Delyan

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular
  2022-08-30 20:31                             ` Delyan Kratunov
@ 2022-08-31  1:52                               ` Alexei Starovoitov
  2022-08-31 17:38                                 ` Delyan Kratunov
  0 siblings, 1 reply; 59+ messages in thread
From: Alexei Starovoitov @ 2022-08-31  1:52 UTC (permalink / raw)
  To: Delyan Kratunov
  Cc: memxor, davem, joannelkoong, andrii, tj, daniel, Dave Marchevsky,
	Kernel Team, bpf

On Tue, Aug 30, 2022 at 08:31:55PM +0000, Delyan Kratunov wrote:
> On Mon, 2022-08-29 at 23:03 -0700, Alexei Starovoitov wrote:
> > 
> > On Mon, Aug 29, 2022 at 10:03 PM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> > > > 
> > > > So here is new proposal:
> > > > 
> > > 
> > > Thanks for the proposal, Alexei. I think we're getting close to a
> > > solution, but still some comments below.
> > > 
> > > > At load time the verifier walks all kptr_xchg(map_value, obj)
> > > > and adds obj's allocator to
> > > > map->used_allocators <- {kptr_offset, allocator};
> > > > If kptr_offset already exists -> failure to load.
> > > > Allocator can probably be a part of struct bpf_map_value_off_desc.
> > > > 
> > > > In other words the pairs of {kptr_offset, allocator}
> > > > say 'there could be an object from that allocator in
> > > > that kptr in some map values'.
> > > > 
> > > > Do nothing at prog unload.
> > > > 
> > > > At map free time walk all elements and free kptrs.
> > > > Finally drop allocator refcnts.
> > > > 
> > > 
> > > Yes, this should be possible.
> > > It's quite easy to capture the map_ptr for the allocated local kptr.
> > > Limiting each local kptr to one allocator is probably fine, at least for a v1.
> > > 
> > > One problem I see is how it works when the allocator map is an inner map.
> > > Then, it is not possible to find the backing allocator instance at
> > > verification time, hence not possible to take the reference to it in
> > > map->used_allocators.
> > > But let's just assume that is disallowed for now.
> > > 
> > > The other problem I see is that when the program just does
> > > kptr_xchg(map_value, NULL), we may not have allocator info from
> > > kptr_offset at that moment. Allocating prog which fills
> > > used_allocators may be verified later. We _can_ reject this, but it
> > > makes everything fragile (dependent on which order you load programs
> > > in), which won't be great. You can then use this lost info to make
> > > kptr disjoint from allocator lifetime.
> > > 
> > > Let me explain through an example.
> > > 
> > > Consider this order to set up the programs:
> > > One allocator map A.
> > > Two hashmaps M1, M2.
> > > Three programs P1, P2, P3.
> > > 
> > > P1 uses M1, M2.
> > > P2 uses A, M1.
> > > P3 uses M2.
> > > 
> > > Sequence:
> > > map_create A, M1, M2.
> > > 
> > > Load P1, uses M1, M2. What this P1 does is:
> > > p = kptr_xchg(M1.value, NULL);
> > > kptr_xchg(M2.value, p);
> > > 
> > > So it moves the kptr in M1 into M2. The problem is at this point
> > > kptr_offset is not populated, so we cannot fill used_allocators of M2
> > > as we cannot track which allocator is used to fill M1.value. We saw
> > > nothing filling it yet.
> > > 
> > > Next, load P3. It does:
> > > p = kptr_xchg(M2.value, NULL);
> > > unit_free(p); // let's assume p has bpf_mem_alloc ptr behind itself so
> > > this is ok if allocator is alive.
> > > 
> > > Again, M2.used_allocators is empty. Nothing is filled into it.
> > > 
> > > Next, load P2.
> > > p = alloc(&A, ...);
> > > kptr_xchg(M1.value, p);
> > > 
> > > Now, M1.used_allocators is filled with allocator ref and kptr_offset.
> > > But M2.used_allocators will remain unfilled.
> > > 
> > > Now, run programs in sequence of P2, then P1. This will allocate from
> > > A, and move the ref to M1, then to M2. But only P1 and P2 have
> > > references to M1 so it keeps the allocator alive. However, now unload
> > > both P1 and P2.
> > > P1, P2, A, allocator of A, M1 all can be freed after RCU gp wait. M2
> > > is still held by loaded P3.
> > > 
> > > Now, M2.used_allocators is empty. P3 is using it, and it is holding
> > > allocation from allocator A. Both M1 and A are freed.
> > > When P3 runs now, it can kptr_xchg and try to free it, and the same
> > > uaf happens again.
> > > If not that, uaf when M2 is freed and it does unit_free on the alive local kptr.
> > > 
> > > --
> > > 
> > > Will this case be covered by your approach? Did I miss something?
> > > 
> > > The main issue is that this allocator info can be lost depending on
> > > how you verify a set of programs. It would not be lost if we verified
> > > in order P2, P1, P3 instead of the current P1, P3, P2.
> > > 
> > > So we might have to teach the verifier to identify kptr_xchg edges
> > > between maps, and propagate any used_allocators to the other map? But
> > > it's becoming too complicated.
> > > 
> > > You _can_ reject loads of programs when you don't find kptr_offset
> > > populated on seeing kptr_xchg(..., NULL), but I don't think this is
> > > practical either. It makes the things sensitive to program
> > > verification order, which would be confusing for users.
> > 
> > Right. Thanks for brainstorming and coming up with the case
> > where it breaks.
> > 
> > Let me explain the thought process behind the proposal.
> > The way the progs will be written will be something like:
> > 
> > struct foo {
> >   int var;
> > };
> > 
> > struct {
> >   __uint(type, BPF_MAP_TYPE_ALLOCATOR);
> >   __type(value, struct foo);
> > } ma SEC(".maps");
> > 
> > struct map_val {
> >   struct foo * ptr __kptr __local;
> > };
> > 
> > struct {
> >   __uint(type, BPF_MAP_TYPE_HASH);
> >   __uint(max_entries, 123);
> >   __type(key, __u32);
> >   __type(value, struct map_val);
> > } hash SEC(".maps");
> > 
> > struct foo * p = bpf_mem_alloc(&ma, type_id_local(struct foo));
> > bpf_kptr_xchg(&map_val->ptr, p);
> > 
> > Even if prog doesn't allocate and only does kptr_xchg like
> > your P1 and P3 do the C code has to have a full
> > definition 'struct foo' to compile P1 and P3.
> > P1 and P3 don't need to see definition of 'ma' to be compiled,
> > but 'struct foo' has to be seen.
> > BTF reference will be taken by both 'ma' and by 'hash'.
> > The btf_id will come from that BTF.
> > 
> > The type is tied to BTF and tied to kptr in map value.
> > The type is also tied to the allocator.
> > The type creates a dependency chain between allocator and map.
> > So the restriction of one allocator per kptr feels
> > reasonable and doesn't feel restrictive at all.
> > That dependency chain is there in the C code of the program.
> > Hence the proposal to discover this dependency in the verifier
> > through tracking of allocator from mem_alloc into kptr_xchg.
> > But you're correct that it's not working for P1 and P3.
> 
> Encoding the allocator in the runtime type system is fine but it comes with its own
> set of tradeoffs.
> 
> > 
> > I can imagine a few ways to solve it.
> > 1. Ask users to annotate kptr local with the allocator
> > that will be used.
> > It's easy for progs P1 and P3. All definitions are likely available.
> > It's only an extra tag of some form.
> 
> This would introduce maps referring to other maps and would thus require substantial
> work in libbpf. In order to encode the link specific to field instead of the whole
> map object, we'd have to resort to something like map names as type tags, which is
> not a great design (arbitrary strings etc). 
> 
> > 2. move 'used_allocator' from map further into BTF,
> >   since BTF is the root of this dependency chain.
> 
> This would _maybe_ work. The hole I can see in this plan is that once a slot is
> claimed, it cannot be unclaimed and thus maps can remain in a state that leaves the
> local kptr fields useless (i.e. the allocator is around but no allocator map for it
> exists and thus no program can use these fields again, they can only be free()).

That's correct if we think of allocators as property of programs, but see below.

> The reason it can't be unclaimed is that programs were verified with a specific
> allocator value in mind and we can't change that after they're loaded. To be able to
> unclaim a slot, we'd need to remove everything using that system (i.e. btf object)
> and load it again, which is not great.

That's also correct.

> 
> > When the verifier sees bpf_mem_alloc in P2 it will add
> > {allocator, btf_id} pair to BTF.
> > 
> > If P1 loads first and the verifier see:
> > p = kptr_xchg(M1.value, NULL);
> > it will add {unique_allocator_placeholder, btf_id} to BTF.
> > Then
> > kptr_xchg(M2.value, p); does nothing.
> > The verifier checks that M1's BTF == M2's BTF and id-s are same.
> 
> Note to self: don't allow setting base_btf from userspace without auditing all these
> checks.
> 
> > 
> > Then P3 loads with:
> > p = kptr_xchg(M2.value, NULL);
> > unit_free(p);
> > since unique_allocator_placholder is already there for that btf_id
> > the verifier does nothing.
> > 
> > Eventually it will see bpf_mem_alloc for that btf_id and will
> > replace the placeholder with the actual allocator.
> > We can even allow P1 and P3 to be runnable after load right away.
> > Since nothing can allocate into that kptr local those
> > kptr_xchg() in P1 and P3 will be returning NULL.
> > If P2 with bpf_mem_alloc never loads it's fine. Not a safety issue.
> > 
> > Ideally for unit_free(p); in P3 the verifier would add a hidden
> > 'ma' argument, so that allocator doesn't need to be stored dynamically.
> > We can either insns of P3 after P2 was verified or
> > pass a pointer to a place in BTF->used_allocator list of pairs
> > where actual allocator pointer will be written later.
> > Then no patching is needed.
> > If P2 never loads the unit_free(*addr /* here it will load the
> > value of unique_allocator_placeholder */, ...)
> > but since unit_free() will never execute with valid obj to be freed.
> > 
> > "At map free time walk all elements and free kptrs" step stays the same.
> > but we decrement allocator refcnt only when BTF is freed
> > which should be after map free time.
> > So btf_put(map->btf); would need to move after ops->map_free.
> > 
> > Maybe unique_allocator_placeholder approach can be used
> > without moving the list into BTF. Need to think more about it tomorrow.
> 
> I don't think we have to resort to storing the type-allocator mappings in the BTF at
> all.
> 
> In fact, we can encode them where you wanted to encode them with tags - on the fields
> themselves. We can put the mem_alloc reference in the kptr field descriptors and have
> it transition to a specific allocator irreversibly. We would need to record where any
> equivalences between allocators we need for the currently loaded programs but it's
> not impossible.
> 
> Making the transition reversible is the hard part of course.
> 
> Taking a step back, we're trying to convert a runtime property (this object came from
> this allocator) into a static property. The _cleanest_ way to do this would be to
> store the dependencies explicitly and propagate/dissolve them on program load/unload.

Agree with above.

> Note that only programs introduce new dependency edges, maps on their own do not
> mandate dependencies but stored values can extend the lifetime of a dependency chain.

I think we're getting to the bottom of the issues with this api.
The api is designed around the concept of an allocator being another map type.
That felt correct in the beginning, but see below.

> There are only a couple of types of dependencies:
> Program X stores allocation from Y into map Z, field A
> Program X requires allocator for Y.A to equal allocator for Z.A (+ a version for
> inner maps)

What does it mean for two allocators to be equivalent?
Like mergeable slabs? If so, the kernel gave the answer to that question long ago:
merge them whenever possible.
In bpf-speak, if two allocators have the same type they should be mergeable (== equivalent).
Now let's come up with a use case where the bpf core would need to recognize two
equivalent allocators.
Take a networking service, like katran. It has a bunch of progs that are using
pinned maps. The maps are pinned because progs can be reloaded due to
live-upgrade or due to user space restart. The map contents are preserved, but
the progs are reloaded. That's a common scenario.
If we design the allocator API such that an allocator is another map, the prog
reload case has two options:
- create new allocator map and use it in reloaded progs
- pin allocator map along with other maps at the very beginning and
  make reloaded progs use it

In the latter case there is no new allocator. No need to do equivalence checks.
In the former case there is a new allocator, and the bpf core needs to recognize
equivalence, otherwise pinned maps with kptrs won't be usable out of reloaded
progs. But what are the benefits of creating new allocators on reload?
I don't see any. Old allocator had warm caches populated with cache friendly
objects. New allocator has none of it, but it's going to be used for exactly
the same objects. What's worse, the new allocator will likely bring a new BTF object,
making equivalence work even harder.

I think it all points to the design issue where allocator is another map and
dependency between maps is a run-time property. As you pointed out earlier the
cleanest way is to make this dependency static. It turns out we have such a static
dependency already. Existing bpf maps depend on the kernel slab allocator. After
bpf_mem_alloc patches each hash map will have a hidden dependency on that
allocator.

What we've designed so far is:

struct foo {
  int var;
};

struct {
  __uint(type, BPF_MAP_TYPE_ALLOCATOR);
  __type(value, struct foo);
} ma SEC(".maps");

struct map_val {
  struct foo * ptr __kptr __local;
};

struct {
  __uint(type, BPF_MAP_TYPE_HASH);
  __uint(max_entries, 123);
  __type(key, __u32);
  __type(value, struct map_val);
} hash SEC(".maps");

struct foo * p = bpf_mem_alloc(&ma, type_id_local(struct foo));
bpf_kptr_xchg(&map_val->ptr, p);

/* this is a copy-paste of my earlier example. no changes */

but what wasn't obvious is that we have two allocators here.
One explicit 'ma' and another hidden allocator in 'hash'.
Both are based on 'struct bpf_mem_alloc'.
One will be allocating 'struct foo'. Another 'struct map_val'.
But we missed the opportunity to make them mergeable, and
the bpf prog cannot customize them.

I think the way to fix the api is to recognize:
- every map has an allocator
- expose that allocator to progs
- allow sharing of allocators between maps

so an allocator is a mandatory part of the map.
If it's not specified an implicit one will be used.
Thinking along these lines we probably need map-in-map-like
specification of explicit allocator(s) for maps:

struct {
  __uint(type, BPF_MAP_TYPE_HASH);
  __uint(max_entries, 123);
  __type(key, __u32);
  __type(value, struct map_val);
  __array(elem_allocator, struct {
      __uint(type, BPF_MAP_TYPE_ALLOCATOR);
  });
  __array(kptr_allocator, struct {
      __uint(type, BPF_MAP_TYPE_ALLOCATOR);
  });
} hash SEC(".maps");

That would be an explicit and static declaration of the allocators that
the hash map should use.
The benefits:
- the prog writers can share the same allocator across multiple
hash maps by specifying the same 'elem_allocator'.
(struct bpf_mem_alloc already supports more than one size)
The most memory conscious authors can use the same allocator
across all of their maps achieving the best memory savings.
Not talking about kptr yet. This is just plain hash maps.

- the program can influence the hash map allocator.
Once we have a bpf_mem_prefill() helper the bpf program can add
reserved elements to a hash map and guarantee that
bpf_map_update_elem() will succeed later.

- the kptr allocator is specified statically. At map creation time
the map will take a reference on the allocator and will keep it until
destruction. The verifier will make sure that kptr_xchg-ed objects
come only from that allocator.
So all of the refcnt issues discussed in this thread are gone.
The prog will look like:

struct {
  __uint(type, BPF_MAP_TYPE_ALLOCATOR);
} ma SEC(".maps");

struct {
  __uint(type, BPF_MAP_TYPE_HASH);
  __type(value, struct map_val);
  __array(kptr_allocator, struct {
      __uint(type, BPF_MAP_TYPE_ALLOCATOR);
  });
} hash SEC(".maps") = {
        .kptr_allocator = { (void *)&ma },
};
struct foo * p = bpf_mem_alloc(&ma, type_id_local(struct foo));

If allocated kptr-s don't need to be in multiple maps the prog can be:

struct {
  __uint(type, BPF_MAP_TYPE_HASH);
  __type(value, struct map_val);
  __array(kptr_allocator, struct {
      __uint(type, BPF_MAP_TYPE_ALLOCATOR);
  });
} hash SEC(".maps");

struct foo * p = bpf_mem_alloc(&hash->kptr_allocator, type_id_local(struct foo));

The same hash->kptr_allocator can be used to allocate different types.
We can also allow the same allocator to be specified in hash->elem_allocator
and hash->kptr_allocator.

We can allow progs to skip declaring explicit BPF_MAP_TYPE_ALLOCATOR.
Like:
struct {
  __uint(type, BPF_MAP_TYPE_HASH);
  __type(value, struct map_val);
} hash SEC(".maps");

struct foo * p = bpf_mem_alloc(&hash->elem_allocator, type_id_local(struct foo));

In this case both kptr objects and map elements come from the same allocator
that was implicitly created at hash map creation time.
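
To spell out the sharing benefit from above, two hash maps reusing one explicit
allocator for their elements would look something like this (same proposed
declaration syntax as the kptr_allocator example; bpf_mem_prefill() is the
not-yet-existing helper mentioned earlier, so its signature is only a guess):

struct {
  __uint(type, BPF_MAP_TYPE_ALLOCATOR);
} shared_ma SEC(".maps");

struct {
  __uint(type, BPF_MAP_TYPE_HASH);
  __type(key, __u32);
  __type(value, struct map_val);
  __array(elem_allocator, struct {
      __uint(type, BPF_MAP_TYPE_ALLOCATOR);
  });
} hash1 SEC(".maps") = {
        .elem_allocator = { (void *)&shared_ma },
};

struct {
  __uint(type, BPF_MAP_TYPE_HASH);
  __type(key, __u32);
  __type(value, struct map_val);
  __array(elem_allocator, struct {
      __uint(type, BPF_MAP_TYPE_ALLOCATOR);
  });
} hash2 SEC(".maps") = {
        .elem_allocator = { (void *)&shared_ma },
};

/* and a prog could reserve elements up front, e.g.:
 * bpf_mem_prefill(&shared_ma, type_id_local(struct map_val), 128);
 */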

The prog reload use case is naturally solved.
By pinning the maps, user space pins the allocators as well, and prog reload
will reuse the same maps with the same allocators.

If this design revision makes sense the bpf_mem_alloc needs a bit of work
to support the above cleanly. It should be straightforward.

> If there's a non-NULL value in the map, we can't do much - we need a program to load
> that uses this map and on that program's unload, we can check again. On map free, we
> can free the values, of course, but we can't remove the dependency edges from the
> table, since values may have propagated to other tables (this depends on the concrete
> implementation - we might be able to have the map remove all edges that reference
> it).
...
> I don't think it's worth the complexity, explicit or not.

The edge tracking dependency graph sounds quite complex and I agree that it's not worth it.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular
  2022-08-31  1:52                               ` Alexei Starovoitov
@ 2022-08-31 17:38                                 ` Delyan Kratunov
  2022-08-31 18:57                                   ` Alexei Starovoitov
  0 siblings, 1 reply; 59+ messages in thread
From: Delyan Kratunov @ 2022-08-31 17:38 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: tj, joannelkoong, andrii, davem, daniel, Dave Marchevsky, memxor,
	Kernel Team, bpf

On Tue, 2022-08-30 at 18:52 -0700, Alexei Starovoitov wrote:
> > [SNIP]
> > This would _maybe_ work. The hole I can see in this plan is that once a slot is
> > claimed, it cannot be unclaimed and thus maps can remain in a state that leaves the
> > local kptr fields useless (i.e. the allocator is around but no allocator map for it
> > exists and thus no program can use these fields again, they can only be free()).
> 
> That's correct if we think of allocators as property of programs, but see below.

Allocators are not properties of programs; allocator _access_ is. Allocators are
properties of the objects allocated from them; programs and maps only participate in
the lifecycle of the objects and thus extend the lifetime of the allocator.

> 
> > The reason it can't be unclaimed is that programs were verified with a specific
> > allocator value in mind and we can't change that after they're loaded. To be able to
> > unclaim a slot, we'd need to remove everything using that system (i.e. btf object)
> > and load it again, which is not great.
> 
> That's also correct.
> 
> > 
> > > When the verifier sees bpf_mem_alloc in P2 it will add
> > > {allocator, btf_id} pair to BTF.
> > > 
> > > If P1 loads first and the verifier see:
> > > p = kptr_xchg(M1.value, NULL);
> > > it will add {unique_allocator_placeholder, btf_id} to BTF.
> > > Then
> > > kptr_xchg(M2.value, p); does nothing.
> > > The verifier checks that M1's BTF == M2's BTF and id-s are same.
> > 
> > Note to self: don't allow setting base_btf from userspace without auditing all these
> > checks.
> > 
> > > 
> > > Then P3 loads with:
> > > p = kptr_xchg(M2.value, NULL);
> > > unit_free(p);
> > > since unique_allocator_placholder is already there for that btf_id
> > > the verifier does nothing.
> > > 
> > > Eventually it will see bpf_mem_alloc for that btf_id and will
> > > replace the placeholder with the actual allocator.
> > > We can even allow P1 and P3 to be runnable after load right away.
> > > Since nothing can allocate into that kptr local those
> > > kptr_xchg() in P1 and P3 will be returning NULL.
> > > If P2 with bpf_mem_alloc never loads it's fine. Not a safety issue.
> > > 
> > > Ideally for unit_free(p); in P3 the verifier would add a hidden
> > > 'ma' argument, so that allocator doesn't need to be stored dynamically.
> > > We can either insns of P3 after P2 was verified or
> > > pass a pointer to a place in BTF->used_allocator list of pairs
> > > where actual allocator pointer will be written later.
> > > Then no patching is needed.
> > > If P2 never loads the unit_free(*addr /* here it will load the
> > > value of unique_allocator_placeholder */, ...)
> > > but since unit_free() will never execute with valid obj to be freed.
> > > 
> > > "At map free time walk all elements and free kptrs" step stays the same.
> > > but we decrement allocator refcnt only when BTF is freed
> > > which should be after map free time.
> > > So btf_put(map->btf); would need to move after ops->map_free.
> > > 
> > > Maybe unique_allocator_placeholder approach can be used
> > > without moving the list into BTF. Need to think more about it tomorrow.
> > 
> > I don't think we have to resort to storing the type-allocator mappings in the BTF at
> > all.
> > 
> > In fact, we can encode them where you wanted to encode them with tags - on the fields
> > themselves. We can put the mem_alloc reference in the kptr field descriptors and have
> > it transition to a specific allocator irreversibly. We would need to record where any
> > equivalences between allocators we need for the currently loaded programs but it's
> > not impossible.
> > 
> > Making the transition reversible is the hard part of course.
> > 
> > Taking a step back, we're trying to convert a runtime property (this object came from
> > this allocator) into a static property. The _cleanest_ way to do this would be to
> > store the dependencies explicitly and propagate/dissolve them on program load/unload.
> 
> Agree with above.
> 
> > Note that only programs introduce new dependency edges, maps on their own do not
> > mandate dependencies but stored values can extend the lifetime of a dependency chain.
> 
> I think we're getting to the bottom of the issues with this api.
> The api is designed around the concept of an allocator being another map type.
> That felt correct in the beginning, but see below.
> 
> > There are only a couple of types of dependencies:
> > Program X stores allocation from Y into map Z, field A
> > Program X requires allocator for Y.A to equal allocator for Z.A (+ a version for
> > inner maps)
> 
> What does it mean for two allocators to be equivalent?
> Like mergeable slabs? If so the kernel gave the answer to that question long ago:
> merge them whenever possible.
> bpf-speak if two allocators have the same type they should be mergeable (== equivalent).

The requirement to track (allocator, type) tuples came from the desire to manage
allocator lifetime statically; it's not inherent in the problem. It allows us to
narrow down the problem space while retaining some flexibility in that multiple
allocators can participate in the same map.

> Now let's come up with a use case when bpf core would need to recognize two
> equivalent allocators.
> Take networking service, like katran. It has a bunch of progs that are using
> pinned maps. The maps are pinned because progs can be reloaded due to
> live-upgrade or due to user space restart. The map contents are preserved, but
> the progs are reloaded. That's a common scenario.
> If we design allocator api in a way that allocator is another map. The prog
> reload case has two options:
> - create new allocator map and use it in reloaded progs
> - pin allocator map along with other maps at the very beginning and
>   make reloaded progs use it
> 
> In the later case there is no new allocator. No need to do equivalence checks.
> In the former case there is a new allocator and bpf core needs to recognize
> equivalence otherwise pinned maps with kptrs won't be usable out of reloaded
> progs. But what are the benefits of creating new allocators on reload?
> I don't see any. Old allocator had warm caches populated with cache friendly
> objects. New allocator has none of it, but it's going to be used for exactly
> the same objects. What's worse. New allocator will likely bring a new BTF object
> making equivalence work even harder.

From an optimal behavior point of view, I agree that starting a new allocator is not
ideal. However, if a design allows suboptimal behavior, it _will_ be used by users,
so it must be correct.

> 
> I think it all points to the design issue where allocator is another map and
> dependency between maps is a run-time property. As you pointed out earlier the
> cleanest way is to make this dependency static. Turned out we have such static
> dependency already. Existing bpf maps depend on kernel slab allocator. After
> bpf_mem_alloc patches each hash map will have a hidden dependency on that
> allocator.
> 
> What we've designed so far is:
> 
> struct foo {
>   int var;
> };
> 
> struct {
>   __uint(type, BPF_MAP_TYPE_ALLOCATOR);
>   __type(value, struct foo);
> } ma SEC(".maps");
> 
> struct map_val {
>   struct foo * ptr __kptr __local;
> };
> 
> struct {
>   __uint(type, BPF_MAP_TYPE_HASH);
>   __uint(max_entries, 123);
>   __type(key, __u32);
>   __type(value, struct map_val);
> } hash SEC(".maps");
> 
> struct foo * p = bpf_mem_alloc(&ma, type_id_local(struct foo));
> bpf_kptr_xchg(&map_val->ptr, p);
> 
> /* this is a copy-paste of my earlier example. no changes */
> 
> but what wasn't obvious that we have two allocators here.
> One explicit 'ma' and another hidden allocator in 'hash'.
> Both are based on 'struct bpf_mem_alloc'.
> One will be allocating 'struct foo'. Another 'struct map_val'.
> But we missed opportunity to make them mergeable and
> bpf prog cannot customize them.
> 
> I think the way to fix the api is to recognize:
> - every map has an allocator

Array maps don't or at least not in a way that matters. But sure.

> - expose that allocator to progs

Good idea on its own.

> - allow sharing of allocators between maps

Hidden assumption here (see below).

> 
> so an allocator is a mandatory part of the map.
> If it's not specified an implicit one will be used.
> Thinking along these lines we probably need map-in-map-like
> specification of explicit allocator(s) for maps:
> 
> struct {
>   __uint(type, BPF_MAP_TYPE_HASH);
>   __uint(max_entries, 123);
>   __type(key, __u32);
>   __type(value, struct map_val);
>   __array(elem_allocator, struct {
>       __uint(type, BPF_MAP_TYPE_ALLOCATOR);
>   });
>   __array(kptr_allocator, struct {
>       __uint(type, BPF_MAP_TYPE_ALLOCATOR);
>   });
> } hash SEC(".maps");
> 
> That would be explicit and static declartion of allocators that
> hash map should use.

Adding a kptr_allocator customization mechanism to all maps is fine, though pretty
heavy-handed in terms of abstraction leakage. elem_allocator doesn't make sense for
all maps.

> The benefits:
> - the prog writers can share the same allocator across multiple
> hash maps by specifying the same 'elem_allocator'.
> (struct bpf_mem_alloc already supports more than one size)
> The most memory concious authors can use the same allocator
> across all of their maps achieving the best memory savings.
> Not talking about kptr yet. This is just plain hash maps.
> 
> - the program can influence hash map allocator.
> Once we have bpf_mem_prefill() helper the bpf program can add
> reserved elements to a hash map and guarantee that
> bpf_map_update_elem() will succeed later.

I like these extensions, they make sense.

> 
> - kptr allocator is specified statically. At map creation time
> the map will take reference of allocator and will keep it until
> destruction. The verifier will make sure that kptr_xchg-ed objects
> come only from that allocator.
> So all of the refcnt issues discussed in this thread are gone.
> The prog will look like:
> 
> struct {
>   __uint(type, BPF_MAP_TYPE_ALLOCATOR);
> } ma SEC(".maps");
> 
> struct {
>   __uint(type, BPF_MAP_TYPE_HASH);
>   __type(value, struct map_val);
>   __array(kptr_allocator, struct {
>       __uint(type, BPF_MAP_TYPE_ALLOCATOR);
>   });
> } hash SEC(".maps") = {
>         .kptr_allocator = { (void *)&ma },
> };

This part is going to be painful to implement but plausible. I also have to verify
that this will emit relocations and not global initializers (I don't recall the exact
rules).

> struct foo * p = bpf_mem_alloc(&ma, type_id_local(struct foo));
> 
> If allocated kptr-s don't need to be in multiple maps the prog can be:
> 
> struct {
>   __uint(type, BPF_MAP_TYPE_HASH);
>   __type(value, struct map_val);
>   __array(kptr_allocator, struct {
>       __uint(type, BPF_MAP_TYPE_ALLOCATOR);
>   });
> } hash SEC(".maps");
> 
> struct foo * p = bpf_mem_alloc(&hash->kptr_allocator, type_id_local(struct foo));
> 
> The same hash->kptr_allocator can be used to allocate different types.
> We can also allow the same allocator to be specified in hash->elem_allocator
> and hash->kptr_allocator.

The need to extract the allocator into its own map in order to have kptrs of the same
type in more than one map is definitely a convenience downgrade. 

We're now at a point where we're not even putting the onus of dependency tracking on
the verifier but entirely on the developer, while making the dependency tracking less
granular. On the spectrum that runs from the allocator tracking its lifetime at
runtime (so allocators can be easily mixed in the same map), through verifier smarts
with _some_ restrictions (per field), to this design, this feels like the most
restrictive version from a developer experience pov.

I personally also don't like APIs where easy things are easy but the moment you want
to do something slightly out of the blessed path, you hit a cliff and have to learn
about a bunch of things you don't care about. "I just want to move a pointer from
here to over there, why do I have to refactor all my maps?"

> 
> We can allow progs to skip declaring explicit BPF_MAP_TYPE_ALLOCATOR.
> Like:
> struct {
>   __uint(type, BPF_MAP_TYPE_HASH);
>   __type(value, struct map_val);
> } hash SEC(".maps");
> 
> struct foo * p = bpf_mem_alloc(&hash->elem_allocator, type_id_local(struct foo));
> 
> In this case both kptr objects and map elements come from the same allocator
> that was implicitly created at hash map creation time.

Overall, this design (or maybe the way it's presented here) conflates a few ideas.

1) The extensions to expose and customize map's internal element allocator are fine
independently of even this patchset.

2) The idea that kptrs in a map need to have a statically identifiable allocator is
taken as an axiom, and then expanded to its extreme (single allocator per map as
opposed to the smarter verifier schemes). I still contest that this is not the case
and the runtime overhead it avoids is paid back in bad developer experience multiple
times over.

3) The idea that allocators can be merged between elements and kptrs is independent
of the static requirements. If the map's internal allocator is exposed per 1), we can
still use it to allocate kptrs but not require that all kptrs in a map are from the
same allocator.

Going this coarse in the API is easy for us but fundamentally more limiting for
developers. It's not hard to imagine situations where the verifier dependency
tracking or runtime lifetime tracking would allow for pinned maps to be retained but
this scheme would require new maps entirely. (Any situation where you just refactored
the implicit allocator out to share it, for example)

I also don't think that simplicity for us (a one time implementation cost +
continuous maintenance cost) trumps over long term developer experience (a much
bigger implementation cost over a much bigger time span).

> 
> The prog reload use case is naturally solved.
> By pinning map the user space pinned allocators and prog reload will reuse
> the same maps with the same allocators.
> 
> If this design revision makes sense the bpf_mem_alloc needs a bit of work
> to support the above cleanly. It should be straightforward.
> 
> > If there's a non-NULL value in the map, we can't do much - we need a program to load
> > that uses this map and on that program's unload, we can check again. On map free, we
> > can free the values, of course, but we can't remove the dependency edges from the
> > table, since values may have propagated to other tables (this depends on the concrete
> > implementation - we might be able to have the map remove all edges that reference
> > it).
> ...
> > I don't think it's worth the complexity, explicit or not.
> 
> The edge tracking dependency graph sounds quite complex and I agree that it's not worth it.

So far, my ranked choice vote is:

1) maximum flexibility and runtime live object counts (with exposed allocators, I
like the merging)
2) medium flexibility with per-field allocator tracking in the verifier and the
ability to lose the association once programs are unloaded and values are gone. This
also works better with exposed allocators since they are implicitly pinned and would
be usable to store values in another map.
3) minimum flexibility with static whole-map kptr allocators

-- Delyan


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular
  2022-08-31 17:38                                 ` Delyan Kratunov
@ 2022-08-31 18:57                                   ` Alexei Starovoitov
  2022-08-31 20:12                                     ` Kumar Kartikeya Dwivedi
  2022-08-31 21:02                                     ` Delyan Kratunov
  0 siblings, 2 replies; 59+ messages in thread
From: Alexei Starovoitov @ 2022-08-31 18:57 UTC (permalink / raw)
  To: Delyan Kratunov
  Cc: tj, joannelkoong, andrii, davem, daniel, Dave Marchevsky, memxor,
	Kernel Team, bpf

On Wed, Aug 31, 2022 at 05:38:15PM +0000, Delyan Kratunov wrote:
> 
> Overall, this design (or maybe the way it's presented here) conflates a few ideas.
> 
> 1) The extensions to expose and customize map's internal element allocator are fine
> independently of even this patchset.
> 
> 2) The idea that kptrs in a map need to have a statically identifiable allocator is
> taken as an axiom, and then expanded to its extreme (single allocator per map as
> opposed to the smarter verifier schemes). I still contest that this is not the case
> and the runtime overhead it avoids is paid back in bad developer experience multiple
> times over.
> 
> 3) The idea that allocators can be merged between elements and kptrs is independent
> of the static requirements. If the map's internal allocator is exposed per 1), we can
> still use it to allocate kptrs but not require that all kptrs in a map are from the
> same allocator.
> 
> Going this coarse in the API is easy for us but fundamentally more limiting for
> developers. It's not hard to imagine situations where the verifier dependency
> tracking or runtime lifetime tracking would allow for pinned maps to be retained but
> this scheme would require new maps entirely. (Any situation where you just refactored
> the implicit allocator out to share it, for example)
> 
> I also don't think that simplicity for us (a one time implementation cost +
> continuous maintenance cost) trumps over long term developer experience (a much
> bigger implementation cost over a much bigger time span).

It feels we're thinking about scope and use cases for the allocator quite
differently and what you're seeing as 'limiting developer choices' to me looks
like 'not a limiting issue at all'. To me the allocator is one
jemalloc/tcmalloc instance. One user space application with multiple threads,
lots of maps and code is using exactly one such allocator. The allocator
manages all the memory of the user space process. In bpf land we don't have a bpf
process. We don't have a bpf namespace either. A loose analogy would be a set
of programs and maps managed by one user space agent. The bpf allocator would
manage all the memory of these maps and programs and provide a "memory namespace"
for this set of programs. Another user space agent with its own programs
would never want to share the same allocator. In user space a chunk of memory
could be mmap-ed between different processes to share data, but you would never
put a tcmalloc over such memory to be an allocator for different processes.

More below.

> So far, my ranked choice vote is:
> 
> 1) maximum flexibility and runtime live object counts (with exposed allocators, I
> like the merging)
> 2) medium flexibility with per-field allocator tracking in the verifier and the
> ability to lose the association once programs are unloaded and values are gone. This
> also works better with exposed allocators since they are implicitly pinned and would
> be usable to store values in another map.
> 3) minimum flexibility with static whole-map kptr allocators

The option 1 flexibility is necessary when the allocator is seen as a private pool
of objects of a given size, like the kernel's kmem_cache instance.
I don't think we're quite there yet.
There is a need to "preallocate this object from sleepable context,
so the prog has a guaranteed chunk of memory to use in restricted context",
but, arguably, that's not the job of the bpf allocator. A bpf prog can allocate an
object, stash it into a kptr, and use it later (sketched below).
So option 3 doesn't feel less flexible to me. imo the whole-map-allocator is
more than we need. Ideally it would be easy to specify one single
allocator for all maps and progs in a set of .c files. Sort-of a bpf package.
In other words one bpf allocator per bpf "namespace" is more than enough.
Program authors shouldn't be creating allocators left and right. All these
free lists will waste memory.
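
To make that allocate-and-stash pattern concrete, here is a rough sketch.
bpf_kptr_xchg() and bpf_map_lookup_elem() are existing helpers; bpf_mem_alloc(),
bpf_mem_free(), type_id_local() and BPF_MAP_TYPE_ALLOCATOR are the proposed names
used earlier in this thread, and the exact kptr field tag is whatever the final
kptr support settles on:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

struct foo {
	int payload;
};

struct map_val {
	struct foo __kptr *obj;	/* kptr field that holds the stashed object */
};

struct {
	__uint(type, BPF_MAP_TYPE_ALLOCATOR);	/* proposed map type from this thread */
} ma SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, int);
	__type(value, struct map_val);
} stash SEC(".maps");

/* called from a sleepable / unrestricted program: allocate and park the object */
static int prepare(void)
{
	int key = 0;
	struct map_val *v = bpf_map_lookup_elem(&stash, &key);
	struct foo *p, *old;

	if (!v)
		return 0;
	p = bpf_mem_alloc(&ma, type_id_local(struct foo));	/* proposed API */
	if (!p)
		return 0;
	old = bpf_kptr_xchg(&v->obj, p);			/* existing helper */
	if (old)
		bpf_mem_free(&ma, old);				/* proposed counterpart */
	return 0;
}

/* later, in a restricted context, the prog takes the object back out with
 * bpf_kptr_xchg(&v->obj, NULL) instead of allocating inline.
 */
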
btw I've added an extra patch to bpf_mem_alloc series:
https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?h=memalloc&id=6a586327a270272780bdad7446259bbe62574db1
that removes kmem_cache usage.
Turned out (hindsight 20/20) kmem_cache for each bpf map was a bad idea.
When free_lists are not shared they will similarly waste memory.
In user space the C code just does malloc() and the memory is isolated per process.
Ideally in bpf world the programs would just do:
bpf_mem_alloc(btf_type_id_local(struct foo));
without specifying an allocator, but that would require one global allocator
for all bpf programs in the kernel, which is probably not a direction we should go?
So the programs have to specify an allocator to use in bpf_mem_alloc(),
but it should be one for all progs and maps in a bpf-package/set/namespace.
If it's easy for programs to specify a bunch of allocators, like one for each program,
or one for each btf_type_id, the bpf kernel infra would be required to merge
these allocators from day one. (The proliferation of kmem_cache-s in the past
forced merging of them.) By restricting bpf program choices with allocator-per-map
(this option 3) we're not only making the kernel side do less work
(no run-time ref counts, no merging is required today), we're also pushing
bpf progs to make memory conscious choices.
Having said all that maybe one global allocator is not such a bad idea.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular
  2022-08-31 18:57                                   ` Alexei Starovoitov
@ 2022-08-31 20:12                                     ` Kumar Kartikeya Dwivedi
  2022-08-31 20:38                                       ` Alexei Starovoitov
  2022-08-31 21:02                                     ` Delyan Kratunov
  1 sibling, 1 reply; 59+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-31 20:12 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Delyan Kratunov, tj, joannelkoong, andrii, davem, daniel,
	Dave Marchevsky, Kernel Team, bpf

On Wed, 31 Aug 2022 at 20:57, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Aug 31, 2022 at 05:38:15PM +0000, Delyan Kratunov wrote:
> >
> > Overall, this design (or maybe the way it's presented here) conflates a few ideas.
> >
> > 1) The extensions to expose and customize map's internal element allocator are fine
> > independently of even this patchset.
> >
> > 2) The idea that kptrs in a map need to have a statically identifiable allocator is
> > taken as an axiom, and then expanded to its extreme (single allocator per map as
> > opposed to the smarter verifier schemes). I still contest that this is not the case
> > and the runtime overhead it avoids is paid back in bad developer experience multiple
> > times over.
> >
> > 3) The idea that allocators can be merged between elements and kptrs is independent
> > of the static requirements. If the map's internal allocator is exposed per 1), we can
> > still use it to allocate kptrs but not require that all kptrs in a map are from the
> > same allocator.
> >
> > Going this coarse in the API is easy for us but fundamentally more limiting for
> > developers. It's not hard to imagine situations where the verifier dependency
> > tracking or runtime lifetime tracking would allow for pinned maps to be retained but
> > this scheme would require new maps entirely. (Any situation where you just refactored
> > the implicit allocator out to share it, for example)
> >
> > I also don't think that simplicity for us (a one time implementation cost +
> > continuous maintenance cost) trumps over long term developer experience (a much
> > bigger implementation cost over a much bigger time span).
>
> It feels we're thinking about scope and use cases for the allocator quite
> differently and what you're seeing as 'limiting developer choices' to me looks
> like 'not a limiting issue at all'. To me the allocator is one

I went over the proposal multiple times, just to make sure I
understood it properly, but I still can't see this working ideally for
the inner map case, even if we set everything else aside for a moment.
Or maybe you would prefer to just forbid them there? Please correct me
if I'm wrong.

You won't be able to know the allocator statically for inner maps (and
hence won't be able to enforce the requirement that a kptr_xchg-ed
object comes from the same allocator as the map). To know it, we would
have to force all inner_maps to use the same allocator, either the one
specified for the inner map fd, or the one in the map-in-map
definition, or elsewhere. But to be able to know it statically, the
information will have to come from the map-in-map definition somehow.

That seems like a very weird limitation just to use local kptrs, and
doesn't even make sense for map-in-map use cases IMO.
And unless I'm missing something there isn't an easy way to
accommodate it in the 'statically known allocator' proposal, because
such inner_map allocators (and inner_maps) are themselves not static.

> jemalloc/tcmalloc instance. One user space application with multiple threads,
> lots of maps and code is using exactly one such allocator. The allocator
> manages all the memory of user space process. In bpf land we don't have a bpf
> process. We don't have a bpf name space either.  A loose analogy would be a set
> of programs and maps managed by one user space agent. The bpf allocator would
> manage all the memory of these maps and programs and provide a "memory namespace"
> for this set of programs. Another user space agent with its own programs
> would never want to share the same allocator. In user space a chunk of memory
> could be mmap-ed between different process to share the data, but you would never
> put a tcmalloc over such memory to be an allocator for different processes.
>

But just saying "would never" or "should never" doesn't work, right?
Hyrum's law and all.

libbpf style "bpf package" deployments may not be the only consumers
of this infra in the future. So designing around this specific idea,
that people will never (or shouldn't) dynamically share allocator-backed
objects between maps which don't share the same allocator, seems
destined to only serve us in the short run IMO.

People may come up with cases where they pass ownership of objects
between such bpf packages, and after coming up with multiple examples
earlier in this thread, it doesn't seem likely that static dependencies
will be able to capture such dynamic runtime relationships; e.g. static
dependencies don't even work in the inner_map case without more
restrictions.

> More below.
>
> > So far, my ranked choice vote is:
> >
> > 1) maximum flexibility and runtime live object counts (with exposed allocators, I
> > like the merging)
> > 2) medium flexibility with per-field allocator tracking in the verifier and the
> > ability to lose the association once programs are unloaded and values are gone. This
> > also works better with exposed allocators since they are implicitly pinned and would
> > be usable to store values in another map.
> > 3) minimum flexibility with static whole-map kptr allocators
>
> The option 1 flexibility is necessary when allocator is seen as a private pool
> of objects of given size. Like kernel's kmem_cache instance.
> I don't think we quite there yet.
> There is a need to "preallocate this object from sleepable context,
> so the prog has a guaranteed chunk of memory to use in restricted context",
> but, arguably, it's not a job of bpf allocator. bpf prog can allocate an object,
> stash it into kptr, and use it later.

Even if we don't add support for all of it now, I think this kind of
flexibility in option 1 needs to be given some more consideration, as
in whether this proposal to encode things statically would be able to
accommodate such cases in the future. To me it seems pretty hard (and
unless I missed something, it already won't work for the inner_maps
case without requiring all of them to use the same allocator).

We might actually be able to do a hybrid of the options by utilizing
the statically known allocator info to acquire references and runtime
object counts, which may help eliminate/delay the actual cost we pay
for it - the atomic upgrade when the initial reference goes away.

So I think I lean towards option 1 as well, and then the same order as
Delyan. It seems to cover all kinds of corner cases (allocator known
vs unknown, normal vs inner maps, etc.) we've been going over in this
thread, and would also be future proof in terms of permitting
unforeseen patterns of usage.

> So option 3 doesn't feel less flexible to me. imo the whole-map-allocator is
> more than we need. Ideally it would be easy to specifiy one single
> allocator for all maps and progs in a set of .c files. Sort-of a bpf package.
> In other words one bpf allocator per bpf "namespace" is more than enough.
> Program authors shouldn't be creating allocators left and right. All these
> free lists will waste memory.
> btw I've added an extra patch to bpf_mem_alloc series:
> https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?h=memalloc&id=6a586327a270272780bdad7446259bbe62574db1
> that removes kmem_cache usage.
> Turned out (hindsight 20/20) kmem_cache for each bpf map was a bad idea.
> When free_lists are not shared they will similarly waste memory.
> In user space the C code just does malloc() and the memory is isolated per process.
> Ideally in bpf world the programs would just do:
> bpf_mem_alloc(btf_type_id_local(struct foo));
> without specifying an allocator, but that would require one global allocator
> for all bpf programs in the kernel which is probably not a direction we should go ?
> So the programs have to specify an allocator to use in bpf_mem_alloc(),
> but it should be one for all progs, maps in a bpf-package/set/namespace.|

But "all progs" doesn't mean all of them are running in the same
context and having the same kinds of memory allocation requirements,
right? While having too many allocators is also bad, having just one
single one per package would also be limiting. A bpf object/package is
very different from a userspace process. So the analogy doesn't
exactly fit IMO.



> If it's easy for programs to specify a bunch of allocators, like one for each program,
> or one for each btf_type_id the bpf kernel infra would be required to merge
> these allocators from day one. (The profileration of kmem_cache-s in the past
> forced merging of them). By restricting bpf program choices with allocator-per-map
> (this option 3) we're not only making the kernel side to do less work
> (no run-time ref counts, no merging is required today), we're also pushing
> bpf progs to use memory concious choices.
> Having said all that maybe one global allocator is not such a bad idea.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular
  2022-08-31 20:12                                     ` Kumar Kartikeya Dwivedi
@ 2022-08-31 20:38                                       ` Alexei Starovoitov
  0 siblings, 0 replies; 59+ messages in thread
From: Alexei Starovoitov @ 2022-08-31 20:38 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Delyan Kratunov, tj, joannelkoong, andrii, davem, daniel,
	Dave Marchevsky, Kernel Team, bpf

On Wed, Aug 31, 2022 at 10:12:28PM +0200, Kumar Kartikeya Dwivedi wrote:
> On Wed, 31 Aug 2022 at 20:57, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Wed, Aug 31, 2022 at 05:38:15PM +0000, Delyan Kratunov wrote:
> > >
> > > Overall, this design (or maybe the way it's presented here) conflates a few ideas.
> > >
> > > 1) The extensions to expose and customize map's internal element allocator are fine
> > > independently of even this patchset.
> > >
> > > 2) The idea that kptrs in a map need to have a statically identifiable allocator is
> > > taken as an axiom, and then expanded to its extreme (single allocator per map as
> > > opposed to the smarter verifier schemes). I still contest that this is not the case
> > > and the runtime overhead it avoids is paid back in bad developer experience multiple
> > > times over.
> > >
> > > 3) The idea that allocators can be merged between elements and kptrs is independent
> > > of the static requirements. If the map's internal allocator is exposed per 1), we can
> > > still use it to allocate kptrs but not require that all kptrs in a map are from the
> > > same allocator.
> > >
> > > Going this coarse in the API is easy for us but fundamentally more limiting for
> > > developers. It's not hard to imagine situations where the verifier dependency
> > > tracking or runtime lifetime tracking would allow for pinned maps to be retained but
> > > this scheme would require new maps entirely. (Any situation where you just refactored
> > > the implicit allocator out to share it, for example)
> > >
> > > I also don't think that simplicity for us (a one time implementation cost +
> > > continuous maintenance cost) trumps over long term developer experience (a much
> > > bigger implementation cost over a much bigger time span).
> >
> > It feels we're thinking about scope and use cases for the allocator quite
> > differently and what you're seeing as 'limiting developer choices' to me looks
> > like 'not a limiting issue at all'. To me the allocator is one
> 
> I went over the proposal multiple times, just to make sure I
> understood it properly, but I still can't see this working ideally for
> the inner map case, even if we ignore the rest of the things for a
> moment.
> But maybe you prefer to just forbid them there? Please correct me if I'm wrong.
> 
> You won't be able to know the allocator statically for inner maps (and
> hence not be able to enforce the kptr_xchg requirement to be from the
> same allocator as map). To know it, we will have to force all
> inner_maps to use the same allocator, 

Of course. That's the idea. I don't see a practical use case to use
different allocators in different inner maps.

> either the one specified for
> inner map fd, or the one in map-in-map definition, or elsewhere. But
> to be able to know it statically the information will have to come
> from map-in-map somehow.
> 
> That seems like a very weird limitation just to use local kptrs, and
> doesn't even make sense for map-in-map use cases IMO.
> And unless I'm missing something there isn't an easy way to
> accommodate it in the 'statically known allocator' proposal, because
> such inner_map allocators (and inner_maps) are themselves not static.

It doesn't look difficult. The inner map template has to be defined in the outer map.
All inner maps must fit this template. Currently it requires key/value
to be exactly the same. We used to enforce max_entries too, but that was relaxed later.
The same allocator for all inner maps would become part of the requirement
(see the sketch below). Easy to enforce and easy for progs to comply with.
It doesn't look limiting or weird to me.
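
For reference, a minimal sketch of how the inner map template is expressed with
libbpf today; everything below is existing syntax except the comment, which shows
where the shared-allocator requirement would hypothetically live:

/* template that every inner map fd inserted at runtime must match */
struct inner_map {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, __u32);
	__type(value, __u64);
	/* hypothetically, the one allowed allocator would be declared here as well,
	 * e.g. something like:
	 * __array(elem_allocator, struct { __uint(type, BPF_MAP_TYPE_ALLOCATOR); });
	 */
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
	__uint(max_entries, 16);
	__type(key, __u32);
	__array(values, struct inner_map);
} outer SEC(".maps");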

> > jemalloc/tcmalloc instance. One user space application with multiple threads,
> > lots of maps and code is using exactly one such allocator. The allocator
> > manages all the memory of user space process. In bpf land we don't have a bpf
> > process. We don't have a bpf name space either.  A loose analogy would be a set
> > of programs and maps managed by one user space agent. The bpf allocator would
> > manage all the memory of these maps and programs and provide a "memory namespace"
> > for this set of programs. Another user space agent with its own programs
> > would never want to share the same allocator. In user space a chunk of memory
> > could be mmap-ed between different process to share the data, but you would never
> > put a tcmalloc over such memory to be an allocator for different processes.
> >
> 
> But just saying "would never" or "should never" doesn't work right?
> Hyrum's law and all.

Agree to disagree. Vanilla C allows null pointer dereferences too.
BPF C doesn't.

> libbpf style "bpf package" deployments may not be the only consumers
> of this infra in the future. So designing around this specific idea
> that people will never or shouldn't dynamically share their allocator
> objects between maps which don't share the same allocator seems
> destined to only serve us in the short run IMO.
> 
> People may come up with cases where they are passing ownership of
> objects between such bpf packages, and after coming up with multiple
> examples before it doesn't seem likely static dependencies will be
> able to capture such dynamic runtime relationships, e.g. static
> dependencies don't even work in the inner_map case without more
> restrictions.

The maps are such shared objects.
They are shared between bpf programs and between progs and user space.
Not proposing anything new here.
An allocator connected to a map preserves this sharing ability.

> 
> > More below.
> >
> > > So far, my ranked choice vote is:
> > >
> > > 1) maximum flexibility and runtime live object counts (with exposed allocators, I
> > > like the merging)
> > > 2) medium flexibility with per-field allocator tracking in the verifier and the
> > > ability to lose the association once programs are unloaded and values are gone. This
> > > also works better with exposed allocators since they are implicitly pinned and would
> > > be usable to store values in another map.
> > > 3) minimum flexibility with static whole-map kptr allocators
> >
> > The option 1 flexibility is necessary when allocator is seen as a private pool
> > of objects of given size. Like kernel's kmem_cache instance.
> > I don't think we quite there yet.
> > There is a need to "preallocate this object from sleepable context,
> > so the prog has a guaranteed chunk of memory to use in restricted context",
> > but, arguably, it's not a job of bpf allocator. bpf prog can allocate an object,
> > stash it into kptr, and use it later.
> 
> At least if not adding support for it all now, I think this kind of
> flexibility in option 1 needs to be given some more consideration, as
> in whether this proposal to encode things statically would be able to
> accommodate such cases in the future. To me it seems pretty hard (and
> unless I missed something, it already won't work for inner_maps case
> without requiring all to use the same allocator).

What use case are we talking about?
So far I hear 'oh, it will be limiting', but nothing concrete.

> 
> We might actually be able to do a hybrid of the options by utilizing
> the statically known allocator info to acquire references and runtime
> object counts, which may help eliminate/delay the actual cost we pay
> for it - the atomic upgrade, when initial reference goes away.
> 
> So I think I lean towards option 1 as well, and then the same order as
> Delyan. It seems to cover all kinds of corner cases (allocator known
> vs unknown, normal vs inner maps, etc.) we've been going over in this
> thread, and would also be future proof in terms of permitting
> unforeseen patterns of usage.

These "unforeseen patterns" sounds as "lets overdesign now because we
cannot predict the future".

> > So option 3 doesn't feel less flexible to me. imo the whole-map-allocator is
> > more than we need. Ideally it would be easy to specifiy one single
> > allocator for all maps and progs in a set of .c files. Sort-of a bpf package.
> > In other words one bpf allocator per bpf "namespace" is more than enough.
> > Program authors shouldn't be creating allocators left and right. All these
> > free lists will waste memory.
> > btw I've added an extra patch to bpf_mem_alloc series:
> > https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?h=memalloc&id=6a586327a270272780bdad7446259bbe62574db1
> > that removes kmem_cache usage.
> > Turned out (hindsight 20/20) kmem_cache for each bpf map was a bad idea.
> > When free_lists are not shared they will similarly waste memory.
> > In user space the C code just does malloc() and the memory is isolated per process.
> > Ideally in bpf world the programs would just do:
> > bpf_mem_alloc(btf_type_id_local(struct foo));
> > without specifying an allocator, but that would require one global allocator
> > for all bpf programs in the kernel which is probably not a direction we should go ?
> > So the programs have to specify an allocator to use in bpf_mem_alloc(),
> > but it should be one for all progs, maps in a bpf-package/set/namespace.|
> 
> But "all progs" doesn't mean all of them are running in the same
> context and having the same kinds of memory allocation requirements,
> right? While having too many allocators is also bad, having just one
> single one per package would also be limiting. A bpf object/package is
> very different from a userspace process. So the analogy doesn't
> exactly fit IMO.

It's not an exact fit, for sure.
bpf progs are more analogous to kernel modules.
The modules just do kmalloc.
The more we discuss the more I'm leaning towards the same model as well:
Just one global allocator for all bpf progs.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular
  2022-08-31 18:57                                   ` Alexei Starovoitov
  2022-08-31 20:12                                     ` Kumar Kartikeya Dwivedi
@ 2022-08-31 21:02                                     ` Delyan Kratunov
  2022-08-31 22:32                                       ` Kumar Kartikeya Dwivedi
  2022-09-01  3:55                                       ` Alexei Starovoitov
  1 sibling, 2 replies; 59+ messages in thread
From: Delyan Kratunov @ 2022-08-31 21:02 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: davem, joannelkoong, andrii, tj, daniel, Dave Marchevsky, memxor,
	Kernel Team, bpf

On Wed, 2022-08-31 at 11:57 -0700, Alexei Starovoitov wrote:
> 
> On Wed, Aug 31, 2022 at 05:38:15PM +0000, Delyan Kratunov wrote:
> > 
> > Overall, this design (or maybe the way it's presented here) conflates a few ideas.
> > 
> > 1) The extensions to expose and customize map's internal element allocator are fine
> > independently of even this patchset.
> > 
> > 2) The idea that kptrs in a map need to have a statically identifiable allocator is
> > taken as an axiom, and then expanded to its extreme (single allocator per map as
> > opposed to the smarter verifier schemes). I still contest that this is not the case
> > and the runtime overhead it avoids is paid back in bad developer experience multiple
> > times over.
> > 
> > 3) The idea that allocators can be merged between elements and kptrs is independent
> > of the static requirements. If the map's internal allocator is exposed per 1), we can
> > still use it to allocate kptrs but not require that all kptrs in a map are from the
> > same allocator.
> > 
> > Going this coarse in the API is easy for us but fundamentally more limiting for
> > developers. It's not hard to imagine situations where the verifier dependency
> > tracking or runtime lifetime tracking would allow for pinned maps to be retained but
> > this scheme would require new maps entirely. (Any situation where you just refactored
> > the implicit allocator out to share it, for example)
> > 
> > I also don't think that simplicity for us (a one time implementation cost +
> > continuous maintenance cost) trumps over long term developer experience (a much
> > bigger implementation cost over a much bigger time span).
> 
> It feels we're thinking about scope and use cases for the allocator quite
> differently and what you're seeing as 'limiting developer choices' to me looks
> like 'not a limiting issue at all'. To me the allocator is one
> jemalloc/tcmalloc instance. One user space application with multiple threads,
> lots of maps and code is using exactly one such allocator. The allocator
> manages all the memory of user space process. In bpf land we don't have a bpf
> process. We don't have a bpf name space either.  A loose analogy would be a set
> of programs and maps managed by one user space agent. The bpf allocator would
> manage all the memory of these maps and programs and provide a "memory namespace"
> for this set of programs. Another user space agent with its own programs
> would never want to share the same allocator. In user space a chunk of memory
> could be mmap-ed between different process to share the data, but you would never
> put a tcmalloc over such memory to be an allocator for different processes.
> 
> More below.
> 
> > So far, my ranked choice vote is:
> > 
> > 1) maximum flexibility and runtime live object counts (with exposed allocators, I
> > like the merging)
> > 2) medium flexibility with per-field allocator tracking in the verifier and the
> > ability to lose the association once programs are unloaded and values are gone. This
> > also works better with exposed allocators since they are implicitly pinned and would
> > be usable to store values in another map.
> > 3) minimum flexibility with static whole-map kptr allocators
> 
> The option 1 flexibility is necessary when allocator is seen as a private pool
> of objects of given size. Like kernel's kmem_cache instance.
> I don't think we quite there yet.

If we're not there, we should aim to get there :) 

> There is a need to "preallocate this object from sleepable context,
> so the prog has a guaranteed chunk of memory to use in restricted context",
> but, arguably, it's not a job of bpf allocator. 

Leaving it to the programs is worse for memory usage (discussed below).

> bpf prog can allocate an object, stash it into kptr, and use it later.

Given that tracing programs can't really maintain their own freelists safely (I think
they're missing the building blocks - you can't cmpxchg kptrs), I do feel like
isolated allocators are a requirement here. Without them, allocations can fail and
there's no way to write a reliable program.

*If* we ensure that you can build a usable freelist out of allocator-backed memory
for (a set of) nmi programs, then I can maybe get behind this (but there's other
reasons not to do this).

> So option 3 doesn't feel less flexible to me. imo the whole-map-allocator is
> more than we need. Ideally it would be easy to specifiy one single
> allocator for all maps and progs in a set of .c files. Sort-of a bpf package.
> In other words one bpf allocator per bpf "namespace" is more than enough.

_Potentially_. Programs need to know that when they reserve X objects, they'll have
them available at a later time, and any sharing with other programs can remove this
property. A _set_ of programs can in theory determine the right prefill levels, but
this is certainly easier to reason about on a per-program basis, given that programs
will run at different rates.

> Program authors shouldn't be creating allocators left and right. All these
> free lists will waste memory.
> btw I've added an extra patch to bpf_mem_alloc series:
> https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?h=memalloc&id=6a586327a270272780bdad7446259bbe62574db1
> that removes kmem_cache usage.
> Turned out (hindsight 20/20) kmem_cache for each bpf map was a bad idea.
> When free_lists are not shared they will similarly waste memory.
> In user space the C code just does malloc() and the memory is isolated per process.
> Ideally in bpf world the programs would just do:
> bpf_mem_alloc(btf_type_id_local(struct foo));
> without specifying an allocator, but that would require one global allocator
> for all bpf programs in the kernel which is probably not a direction we should go ?

Why does it require a global allocator? For example, you can have each program have
its own internal allocator and with runtime live counts, this API is very achievable.
Once the program unloads, you can drain the freelists, so most allocator memory does
not have to live as long as the longest-lived object from that allocator. In
addition, all allocators can share a global freelist too, so chunks released after
the program unloads get a chance to be reused.

> So the programs have to specify an allocator to use in bpf_mem_alloc(),
> but it should be one for all progs, maps in a bpf-package/set/namespace.
> If it's easy for programs to specify a bunch of allocators, like one for each program,
> or one for each btf_type_id the bpf kernel infra would be required to merge
> these allocators from day one. 

How is having one allocator per program different from having one allocator per set
of programs, with per-program bpf-side freelists? The requirement that some (most?)
programs need deterministic access to their freelists is still there, no matter the
number of allocators. If we fear that the default freelist behavior will waste
memory, then the defaults need to be aggressively conservative, with programs being
able to adjust them.

Besides, if we punt the freelists to bpf, then we get absolutely no control over the
memory usage, which is strictly worse for us (and worse developer experience on top).

> (The profileration of kmem_cache-s in the past
> forced merging of them). By restricting bpf program choices with allocator-per-map
> (this option 3) we're not only making the kernel side to do less work
> (no run-time ref counts, no merging is required today), we're also pushing
> bpf progs to use memory concious choices.

This is conflating "there needs to be a limit on memory stuck in freelists" with "you
can only store kptrs from one allocator in each map." The former practically
advocates for freelists to _not_ be hand-rolled inside bpf progs. I still disagree
with the latter - it's coming strictly from the desire to have static mappings
between object storage and allocators; it's not coming from a memory usage need, it
only avoids runtime live object counts.

> Having said all that maybe one global allocator is not such a bad idea.

It _is_ a bad idea because it doesn't have freelist usage determinism. I do, however,
think there is value in having precise and conservative freelist policies, along with
a global freelist for overflow and draining after program unload. The latter would
allow us to share memory between allocators without sacrificing per-allocator
freelist determinism, especially if paired with very static (but configurable)
freelist thresholds.

-- Delyan


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular
  2022-08-31 21:02                                     ` Delyan Kratunov
@ 2022-08-31 22:32                                       ` Kumar Kartikeya Dwivedi
  2022-09-01  0:41                                         ` Alexei Starovoitov
  2022-09-01  3:55                                       ` Alexei Starovoitov
  1 sibling, 1 reply; 59+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-31 22:32 UTC (permalink / raw)
  To: Delyan Kratunov
  Cc: alexei.starovoitov, davem, joannelkoong, andrii, tj, daniel,
	Dave Marchevsky, Kernel Team, bpf

On Wed, 31 Aug 2022 at 23:02, Delyan Kratunov <delyank@fb.com> wrote:
>
> On Wed, 2022-08-31 at 11:57 -0700, Alexei Starovoitov wrote:
> >
> > On Wed, Aug 31, 2022 at 05:38:15PM +0000, Delyan Kratunov wrote:
> > >
> > > Overall, this design (or maybe the way it's presented here) conflates a few ideas.
> > >
> > > 1) The extensions to expose and customize map's internal element allocator are fine
> > > independently of even this patchset.
> > >
> > > 2) The idea that kptrs in a map need to have a statically identifiable allocator is
> > > taken as an axiom, and then expanded to its extreme (single allocator per map as
> > > opposed to the smarter verifier schemes). I still contest that this is not the case
> > > and the runtime overhead it avoids is paid back in bad developer experience multiple
> > > times over.
> > >
> > > 3) The idea that allocators can be merged between elements and kptrs is independent
> > > of the static requirements. If the map's internal allocator is exposed per 1), we can
> > > still use it to allocate kptrs but not require that all kptrs in a map are from the
> > > same allocator.
> > >
> > > Going this coarse in the API is easy for us but fundamentally more limiting for
> > > developers. It's not hard to imagine situations where the verifier dependency
> > > tracking or runtime lifetime tracking would allow for pinned maps to be retained but
> > > this scheme would require new maps entirely. (Any situation where you just refactored
> > > the implicit allocator out to share it, for example)
> > >
> > > I also don't think that simplicity for us (a one time implementation cost +
> > > continuous maintenance cost) trumps over long term developer experience (a much
> > > bigger implementation cost over a much bigger time span).
> >
> > It feels we're thinking about scope and use cases for the allocator quite
> > differently and what you're seeing as 'limiting developer choices' to me looks
> > like 'not a limiting issue at all'. To me the allocator is one
> > jemalloc/tcmalloc instance. One user space application with multiple threads,
> > lots of maps and code is using exactly one such allocator. The allocator
> > manages all the memory of user space process. In bpf land we don't have a bpf
> > process. We don't have a bpf name space either.  A loose analogy would be a set
> > of programs and maps managed by one user space agent. The bpf allocator would
> > manage all the memory of these maps and programs and provide a "memory namespace"
> > for this set of programs. Another user space agent with its own programs
> > would never want to share the same allocator. In user space a chunk of memory
> > could be mmap-ed between different process to share the data, but you would never
> > put a tcmalloc over such memory to be an allocator for different processes.
> >
> > More below.
> >
> > > So far, my ranked choice vote is:
> > >
> > > 1) maximum flexibility and runtime live object counts (with exposed allocators, I
> > > like the merging)
> > > 2) medium flexibility with per-field allocator tracking in the verifier and the
> > > ability to lose the association once programs are unloaded and values are gone. This
> > > also works better with exposed allocators since they are implicitly pinned and would
> > > be usable to store values in another map.
> > > 3) minimum flexibility with static whole-map kptr allocators
> >
> > The option 1 flexibility is necessary when allocator is seen as a private pool
> > of objects of given size. Like kernel's kmem_cache instance.
> > I don't think we quite there yet.
>
> If we're not there, we should aim to get there :)
>
> > There is a need to "preallocate this object from sleepable context,
> > so the prog has a guaranteed chunk of memory to use in restricted context",
> > but, arguably, it's not a job of bpf allocator.
>
> Leaving it to the programs is worse for memory usage (discussed below).
>
> > bpf prog can allocate an object, stash it into kptr, and use it later.
>
> Given that tracing programs can't really maintain their own freelists safely (I think
> they're missing the building blocks - you can't cmpxchg kptrs), I do feel like
> isolated allocators are a requirement here. Without them, allocations can fail and
> there's no way to write a reliable program.
>
> *If* we ensure that you can build a usable freelist out of allocator-backed memory
> for (a set of) nmi programs, then I can maybe get behind this (but there's other
> reasons not to do this).
>
> > So option 3 doesn't feel less flexible to me. imo the whole-map-allocator is
> > more than we need. Ideally it would be easy to specifiy one single
> > allocator for all maps and progs in a set of .c files. Sort-of a bpf package.
> > In other words one bpf allocator per bpf "namespace" is more than enough.
>
> _Potentially_. Programs need to know that when they reserved X objects, they'll have
> them available at a later time and any sharing with other programs can remove this
> property. A _set_ of programs can in theory determine the right prefill levels, but
> this is certainly easier to reason about on a per-program basis, given that programs
> will run at different rates.
>
> > Program authors shouldn't be creating allocators left and right. All these
> > free lists will waste memory.
> > btw I've added an extra patch to bpf_mem_alloc series:
> > https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?h=memalloc&id=6a586327a270272780bdad7446259bbe62574db1
> > that removes kmem_cache usage.
> > Turned out (hindsight 20/20) kmem_cache for each bpf map was a bad idea.
> > When free_lists are not shared they will similarly waste memory.
> > In user space the C code just does malloc() and the memory is isolated per process.
> > Ideally in bpf world the programs would just do:
> > bpf_mem_alloc(btf_type_id_local(struct foo));
> > without specifying an allocator, but that would require one global allocator
> > for all bpf programs in the kernel which is probably not a direction we should go ?
>
> Why does it require a global allocator? For example, you can have each program have
> its own internal allocator and with runtime live counts, this API is very achievable.
> Once the program unloads, you can drain the freelists, so most allocator memory does
> not have to live as long as the longest-lived object from that allocator. In
> addition, all allocators can share a global freelist too, so chunks released after
> the program unloads get a chance to be reused.
>
> > So the programs have to specify an allocator to use in bpf_mem_alloc(),
> > but it should be one for all progs, maps in a bpf-package/set/namespace.
> > If it's easy for programs to specify a bunch of allocators, like one for each program,
> > or one for each btf_type_id the bpf kernel infra would be required to merge
> > these allocators from day one.
>
> How is having one allocator per program different from having one allocator per set
> of programs, with per-program bpf-side freelists? The requirement that some (most?)
> programs need deterministic access to their freelists is still there, no matter the
> number of allocators. If we fear that the default freelist behavior will waste
> memory, then the defaults need to be aggressively conservative, with programs being
> able to adjust them.
>
> Besides, if we punt the freelists to bpf, then we get absolutely no control over the
> memory usage, which is strictly worse for us (and worse developer experience on top).
>
> > (The profileration of kmem_cache-s in the past
> > forced merging of them). By restricting bpf program choices with allocator-per-map
> > (this option 3) we're not only making the kernel side to do less work
> > (no run-time ref counts, no merging is required today), we're also pushing
> > bpf progs to use memory concious choices.
>
> This is conflating "there needs to be a limit on memory stuck in freelists" with "you
> can only store kptrs from one allocator in each map." The former practically
> advocates for freelists to _not_ be hand-rolled inside bpf progs. I still disagree
> with the latter - it's coming strictly from the desire to have static mappings
> between object storage and allocators; it's not coming from a memory usage need, it
> only avoids runtime live object counts.
>
> > Having said all that maybe one global allocator is not such a bad idea.
>
> It _is_ a bad idea because it doesn't have freelist usage determinism. I do, however,
> think there is value in having precise and conservative freelist policies, along with
> a global freelist for overflow and draining after program unload. The latter would
> allow us to share memory between allocators without sacrificing per-allocator
> freelist determinism, especially if paired with very static (but configurable)
> freelist thresholds.
>

These are all good points. Sharing an allocator between all programs
means a bpf_mem_prefill request cannot really guarantee much; it does
hurt determinism. The prefilled items can be drained by some other
program with an inconsistent allocation pattern.

But going back to what Alexei replied in the other thread:
> bpf progs are more analogous to kernel modules.
> The modules just do kmalloc.
> The more we discuss the more I'm leaning towards the same model as well:
> Just one global allocator for all bpf progs.

There does seem to be one big benefit in having a global allocator
(not per program, but actually global in the kernel, basically a
percpu freelist cache fronting kmalloc) usable safely in any context.
We don't have to do any allocator lifetime tracking at all; that case
reduces to basically how we handle kernel kptrs currently.

I am wondering if we can go with an approach like this: by default, the
global allocator in the kernel serves bpf_mem_alloc requests, which
allows freelist sharing between all programs; it is basically kmalloc
but safe in NMI and with reentrancy protection. When determinism is
needed, use the percpu refcount approach with option 1 from Delyan for
the custom allocator case.

Now by default you have conservative global freelist sharing (percpu),
and when required a program can use a custom allocator and prefill it to
keep the cache ready to serve requests (that kind of control will be
very useful for progs in NMI/hardirq context, where depletion of the
cache means NULL from unit_alloc), and its own allocator freelist
will be unaffected by other allocations. A rough sketch of the two
paths is below.
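
Purely to illustrate the shape of that hybrid; nothing here is settled API, the call
forms just mirror the examples used earlier in this thread, and bpf_mem_prefill()
and the allocator map type are the proposals under discussion:

/* common case: served by the kernel-wide allocator, nothing to declare or track */
struct foo *p = bpf_mem_alloc(btf_type_id_local(struct foo));

/* determinism needed: a program-owned allocator whose freelist is not shared */
struct {
	__uint(type, BPF_MAP_TYPE_ALLOCATOR);	/* proposed map type */
} ma SEC(".maps");

/* from a sleepable context, reserve objects up front (argument order illustrative) */
bpf_mem_prefill(&ma, 20, btf_type_id_local(struct foo));

/* later, possibly in NMI/hardirq context: served from the prefilled per-cpu cache */
struct foo *q = bpf_mem_alloc(&ma, btf_type_id_local(struct foo));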

Any kptr from the bpf_mem_alloc allocator can go into any map, no problem at all.
The only extra cost is maintaining the percpu live counts for
non-global allocators; it is basically free for the global case.
And it would probably also be possible to choose and share allocators
between maps, as proposed by Alexei before. That has no effect on the
kptrs being stored in them, as most commonly they would come from the
global allocator.

Thoughts on this?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular
  2022-08-31 22:32                                       ` Kumar Kartikeya Dwivedi
@ 2022-09-01  0:41                                         ` Alexei Starovoitov
  0 siblings, 0 replies; 59+ messages in thread
From: Alexei Starovoitov @ 2022-09-01  0:41 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Delyan Kratunov, davem, joannelkoong, andrii, tj, daniel,
	Dave Marchevsky, Kernel Team, bpf

On Thu, Sep 01, 2022 at 12:32:33AM +0200, Kumar Kartikeya Dwivedi wrote:
> > bpf progs are more analogous to kernel modules.
> > The modules just do kmalloc.
> > The more we discuss the more I'm leaning towards the same model as well:
> > Just one global allocator for all bpf progs.
> 
> There does seem to be one big benefit in having a global allocator
> (not per program, but actually globally in the kernel, basically a
> percpu freelist cache fronting kmalloc) usable safely in any context.
> We don't have to do any allocator lifetime tracking at all, that case
> reduces to basically how we handle kernel kptrs currently.
> 
> I am wondering if we can go with such an approach: by default, the
> global allocator in the kernel serves bpf_mem_alloc requests, which
> allows freelist sharing between all programs, it is basically kmalloc
> but safe in NMI and with reentrancy protection. 

Right. That's what I was proposing.

> When determinism is
> needed, use the percpu refcount approach with option 1 from Delyan for
> the custom allocator case.

I wasn't rejecting that part. I was suggesting to table that discussion.
The best way to achieve guaranteed allocation is still an open question.
So far we've only talked about a new map type with "allocator" type...
Is this really the best design?

> Now by default you have conservative global freelist sharing (percpu),
> and when required program can use a custom allocator and prefill to
> keep the cache ready to serve requests (where that kind of control
> will be very useful for progs in NMI/hardirq context, where depletion
> of cache means NULL from unit_alloc), where its own allocator freelist
> will be unaffected by other allocations.

The custom allocator is not necessarily the right answer.
It could be. Or maybe it should be an open coded free list of preallocated
items that the bpf prog takes from the global allocator and pushes onto a list
(rough sketch below)? We'll have locks and native link lists in bpf soon.
So why should the "allocator" concept do the double job of allocating
and keeping a link list for prefill reasons?
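
A hypothetical sketch of that open coded direction. The list names below are
invented for illustration (native bpf lists don't exist yet at this point);
bpf_spin_lock()/bpf_spin_unlock() are existing helpers, and bpf_mem_alloc() is the
proposed global-allocator call:

struct foo {
	struct bpf_list_node node;	/* hypothetical list linkage */
	int payload;
};

struct prog_stash {
	struct bpf_spin_lock lock;	/* existing bpf_spin_lock */
	struct bpf_list_head head;	/* hypothetical list head */
};

/* sleepable context: pull n items out of the global allocator and park them */
static void refill(struct prog_stash *s, int n)
{
	for (int i = 0; i < n; i++) {
		struct foo *p = bpf_mem_alloc(btf_type_id_local(struct foo));

		if (!p)
			break;
		bpf_spin_lock(&s->lock);
		bpf_list_push(&s->head, &p->node);	/* hypothetical kfunc */
		bpf_spin_unlock(&s->lock);
	}
}

/* restricted context: pop from the stash instead of allocating inline, so the
 * prog never sees an inline allocation failure.
 */
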

> Any kptr from the bpf_mem_alloc allocator can go to any map, no problem at all.
> The only extra cost is maintaining the percpu live counts for
> non-global allocators, it is basically free for the global case.
> And it would also be allowed to probably choose and share allocators
> between maps, as proposed by Alexei before. That has no effect on
> kptrs being stored in them, as most commonly they would be from the
> global allocator.

It still feels to me that doing only the global allocator for now will be good
enough. The prefill use case for one element can already be solved without
any extra work (just kptr_xchg in and out).
Prefill of multiple objects might get nicely solved with native link lists too.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular
  2022-08-31 21:02                                     ` Delyan Kratunov
  2022-08-31 22:32                                       ` Kumar Kartikeya Dwivedi
@ 2022-09-01  3:55                                       ` Alexei Starovoitov
  2022-09-01 22:46                                         ` Delyan Kratunov
  1 sibling, 1 reply; 59+ messages in thread
From: Alexei Starovoitov @ 2022-09-01  3:55 UTC (permalink / raw)
  To: Delyan Kratunov
  Cc: davem, joannelkoong, andrii, tj, daniel, Dave Marchevsky, memxor,
	Kernel Team, bpf

On Wed, Aug 31, 2022 at 2:02 PM Delyan Kratunov <delyank@fb.com> wrote:
> Given that tracing programs can't really maintain their own freelists safely (I think
> they're missing the building blocks - you can't cmpxchg kptrs),

Today? yes, but soon we will have link lists supported natively.

> I do feel like
> isolated allocators are a requirement here. Without them, allocations can fail and
> there's no way to write a reliable program.

Completely agree that there should be a way for programs
to guarantee availability of the element.
Inline allocation can fail regardless of whether the allocation pool
is shared by multiple programs or a single program owns an allocator.
In that sense, allowing multiple programs to create an instance
of an allocator doesn't solve this problem.
The short free list inside bpf_mem_cache is an implementation detail.
"prefill to guarantee successful alloc" is a bit out of scope
for an allocator.
"allocate a set and stash it" should be a separate building block
available to bpf progs: the "allocate" step can fail, and an
efficient "stash it" can probably be done on top of the link list.

> *If* we ensure that you can build a usable freelist out of allocator-backed memory
> for (a set of) nmi programs, then I can maybe get behind this (but there's other
> reasons not to do this).

Agree that nmi adds another quirk to the "stash it" step.
If the native link list is not going to work, then something
else would have to be designed.

> > So option 3 doesn't feel less flexible to me. imo the whole-map-allocator is
> > more than we need. Ideally it would be easy to specifiy one single
> > allocator for all maps and progs in a set of .c files. Sort-of a bpf package.
> > In other words one bpf allocator per bpf "namespace" is more than enough.
>
> _Potentially_. Programs need to know that when they reserved X objects, they'll have
> them available at a later time and any sharing with other programs can remove this
> property.

Agree.

> A _set_ of programs can in theory determine the right prefill levels, but
> this is certainly easier to reason about on a per-program basis, given that programs
> will run at different rates.

Agree as well.

> Why does it require a global allocator? For example, you can have each program have
> its own internal allocator and with runtime live counts, this API is very achievable.
> Once the program unloads, you can drain the freelists, so most allocator memory does
> not have to live as long as the longest-lived object from that allocator. In
> addition, all allocators can share a global freelist too, so chunks released after
> the program unloads get a chance to be reused.

All makes sense to me except that the kernel can provide that
global allocator and per-program "allocators" can hopefully be
implemented as native bpf code.

> How is having one allocator per program different from having one allocator per set
> of programs, with per-program bpf-side freelists? The requirement that some (most?)
> programs need deterministic access to their freelists is still there, no matter the
> number of allocators. If we fear that the default freelist behavior will waste
> memory, then the defaults need to be aggressively conservative, with programs being
> able to adjust them.

I think the disagreement here is that a per-prog allocator based
on bpf_mem_alloc isn't going to be any more deterministic than
one global bpf_mem_alloc for all progs.
It's a per-prog short free list of ~64 elements vs a
global free list of ~64 elements.
In both cases these lists will have to do irq_work and refill
out of global slabs.

> Besides, if we punt the freelists to bpf, then we get absolutely no control over the
> memory usage, which is strictly worse for us (and worse developer experience on top).

I don't understand this point.
All allocations are still coming out of bpf_mem_alloc.
We can have debug mode with memleak support and other debug
mechanisms.

> > (The profileration of kmem_cache-s in the past
> > forced merging of them). By restricting bpf program choices with allocator-per-map
> > (this option 3) we're not only making the kernel side to do less work
> > (no run-time ref counts, no merging is required today), we're also pushing
> > bpf progs to use memory concious choices.
>
> This is conflating "there needs to be a limit on memory stuck in freelists" with "you
> can only store kptrs from one allocator in each map." The former practically
> advocates for freelists to _not_ be hand-rolled inside bpf progs. I still disagree
> with the latter - it's coming strictly from the desire to have static mappings
> between object storage and allocators; it's not coming from a memory usage need, it
> only avoids runtime live object counts.
>
> > Having said all that maybe one global allocator is not such a bad idea.
>
> It _is_ a bad idea because it doesn't have freelist usage determinism. I do, however,
> think there is value in having precise and conservative freelist policies, along with
> a global freelist for overflow and draining after program unload. The latter would
> allow us to share memory between allocators without sacrificing per-allocator
> freelist determinism, especially if paired with very static (but configurable)
> freelist thresholds.

What is 'freelist determinism' ?
Are you talking about some other freelist on top of bpf_mem_alloc's
free lists ?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular
  2022-09-01  3:55                                       ` Alexei Starovoitov
@ 2022-09-01 22:46                                         ` Delyan Kratunov
  2022-09-02  0:12                                           ` Alexei Starovoitov
  0 siblings, 1 reply; 59+ messages in thread
From: Delyan Kratunov @ 2022-09-01 22:46 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: tj, joannelkoong, andrii, davem, daniel, Dave Marchevsky, memxor,
	Kernel Team, bpf

On Wed, 2022-08-31 at 20:55 -0700, Alexei Starovoitov wrote:
> On Wed, Aug 31, 2022 at 2:02 PM Delyan Kratunov <delyank@fb.com> wrote:
> > Given that tracing programs can't really maintain their own freelists safely (I think
> > they're missing the building blocks - you can't cmpxchg kptrs),
> 
> Today? yes, but soon we will have link lists supported natively.
> 
> > I do feel like
> > isolated allocators are a requirement here. Without them, allocations can fail and
> > there's no way to write a reliable program.
> 
> Completely agree that there should be a way for programs
> to guarantee availability of the element.
> Inline allocation can fail regardless whether allocation pool
> is shared by multiple programs or a single program owns an allocator.

I'm not sure I understand this point. 
If I call bpf_mem_prefill(20, type_id(struct foo)), I would expect the next 20 allocs
for struct foo to succeed. In what situations can this fail if I'm the only program
using it _and_ I've calculated the prefill amount correctly?

Unless you're saying that the prefill wouldn't adjust the freelist limits, in which
case, I think that's a mistake - prefill should effectively _set_ the freelist
limits.

> In that sense, allowing multiple programs to create an instance
> of an allocator doesn't solve this problem.
> Short free list inside bpf_mem_cache is an implementation detail.
> "prefill to guarantee successful alloc" is a bit out of scope
> of an allocator.

I disagree that it's out of scope. This is the only access to dynamic memory from a
bpf program, it makes sense that it covers the requirements of bpf programs,
including prefill/freelist behavior, so all programs can safely use it.

> "allocate a set and stash it" should be a separate building block
> available to bpf progs when step "allocate" can fail and
> efficient "stash it" can probably be done on top of the link list.

Do you imagine a BPF object that every user has to link into their programs (yuck),
or a different set of helpers? If it's helpers/kfuncs, I'm all for splitting things
this way.

If it's distributed separately, I think that's an unexpected burden on developers
(I'm thinking especially of tools not writing programs in C or using libbpf/bpftool
skels). There are no other bpf features that require a userspace support library like
this. (USDT doesn't count, uprobes are the underlying bpf feature and that is useful
without a library)
> 
> > *If* we ensure that you can build a usable freelist out of allocator-backed memory
> > for (a set of) nmi programs, then I can maybe get behind this (but there's other
> > reasons not to do this).
> 
> Agree that nmi adds another quirk to "stash it" step.
> If native link list is not going to work then something
> else would have to be designed.

Given that I'll be making the deferred work mechanism on top of this allocator, we
definitely need to cover nmi usage cleanly. It would be almost unusable to have your
deferred work fail to submit randomly.

> 
> > > So option 3 doesn't feel less flexible to me. imo the whole-map-allocator is
> > > more than we need. Ideally it would be easy to specify one single
> > > allocator for all maps and progs in a set of .c files. Sort-of a bpf package.
> > > In other words one bpf allocator per bpf "namespace" is more than enough.
> > 
> > _Potentially_. Programs need to know that when they reserved X objects, they'll have
> > them available at a later time and any sharing with other programs can remove this
> > property.
> 
> Agree.
> 
> > A _set_ of programs can in theory determine the right prefill levels, but
> > this is certainly easier to reason about on a per-program basis, given that programs
> > will run at different rates.
> 
> Agree as well.
> 
> > Why does it require a global allocator? For example, you can have each program have
> > its own internal allocator and with runtime live counts, this API is very achievable.
> > Once the program unloads, you can drain the freelists, so most allocator memory does
> > not have to live as long as the longest-lived object from that allocator. In
> > addition, all allocators can share a global freelist too, so chunks released after
> > the program unloads get a chance to be reused.
> 
> All makes sense to me except that the kernel can provide that
> global allocator and per-program "allocators" can hopefully be
> implemented as native bpf code.
> 
> > How is having one allocator per program different from having one allocator per set
> > of programs, with per-program bpf-side freelists? The requirement that some (most?)
> > programs need deterministic access to their freelists is still there, no matter the
> > number of allocators. If we fear that the default freelist behavior will waste
> > memory, then the defaults need to be aggressively conservative, with programs being
> > able to adjust them.
> 
> I think the disagreement here is that per-prog allocator based
> on bpf_mem_alloc isn't going to be any more deterministic than
> one global bpf_mem_alloc for all progs.
> Per-prog short free list of ~64 elements vs
> global free list of ~64 elements.

Right, I think I had a hidden assumption here that we've exposed. 
Namely, I imagined that after a mem_prefill(1000, struct foo) call, there would be
1000 struct foos on the freelist and the freelist thresholds would be adjusted
accordingly (i.e., you can free all of them and allocate them again, all from the
freelist).

Ultimately, that's what nmi programs actually need but I see why that's not an
obvious behavior.

> In both cases these lists will have to do irq_work and refill
> out of global slabs.

If a tracing program needs irq_work to refill, then hasn't the API already failed the
program writer? I'd have to remind myself how irq_work actually works but given that
it's a soft/hardirq, an nmi program can trivially exhaust the entire allocator before
irq_work has a chance to refill it. I don't see how you'd write reliable programs
this way.

> 
> > Besides, if we punt the freelists to bpf, then we get absolutely no control over the
> > memory usage, which is strictly worse for us (and worse developer experience on top).
> 
> I don't understand this point.
> All allocations are still coming out of bpf_mem_alloc.
> We can have debug mode with memleak support and other debug
> mechanisms.

I mostly mean accounting here. If we segment the allocated objects by finer-grained
allocators, we can attribute them to individual programs better. But, I agree, this
can be implemented in other ways, it can just be counts/tables on bpf_prog.

> 
> > > (The proliferation of kmem_cache-s in the past
> > > forced merging of them). By restricting bpf program choices with allocator-per-map
> > > (this option 3) we're not only making the kernel side do less work
> > > (no run-time ref counts, no merging is required today), we're also pushing
> > > bpf progs to use memory-conscious choices.
> > 
> > This is conflating "there needs to be a limit on memory stuck in freelists" with "you
> > can only store kptrs from one allocator in each map." The former practically
> > advocates for freelists to _not_ be hand-rolled inside bpf progs. I still disagree
> > with the latter - it's coming strictly from the desire to have static mappings
> > between object storage and allocators; it's not coming from a memory usage need, it
> > only avoids runtime live object counts.
> > 
> > > Having said all that maybe one global allocator is not such a bad idea.
> > 
> > It _is_ a bad idea because it doesn't have freelist usage determinism. I do, however,
> > think there is value in having precise and conservative freelist policies, along with
> > a global freelist for overflow and draining after program unload. The latter would
> > allow us to share memory between allocators without sacrificing per-allocator
> > freelist determinism, especially if paired with very static (but configurable)
> > freelist thresholds.
> 
> What is 'freelist determinism' ?

The property that prefill keeps all objects on the freelist, so the following
sequence doesn't observe allocation failures:

bpf_mem_prefill(1000, struct foo);
run_1000_times { alloc(struct foo); }
run_1000_times { free(struct foo); }
run_1000_times { alloc(struct foo); }
alloc(struct foo) // this can observe a failure

> Are you talking about some other freelist on top of bpf_mem_alloc's
> free lists ?

Well, that's the question, isn't it? I think it should be part of the bpf kernel
ecosystem (helper/kfunc) but it doesn't have to be bpf_mem_alloc itself. And, if it's
instantiated per-program, that's easiest to reason about.

-- Delyan

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular
  2022-09-01 22:46                                         ` Delyan Kratunov
@ 2022-09-02  0:12                                           ` Alexei Starovoitov
  2022-09-02  1:40                                             ` Delyan Kratunov
  0 siblings, 1 reply; 59+ messages in thread
From: Alexei Starovoitov @ 2022-09-02  0:12 UTC (permalink / raw)
  To: Delyan Kratunov
  Cc: tj, joannelkoong, andrii, davem, daniel, Dave Marchevsky, memxor,
	Kernel Team, bpf

On Thu, Sep 01, 2022 at 10:46:09PM +0000, Delyan Kratunov wrote:
> On Wed, 2022-08-31 at 20:55 -0700, Alexei Starovoitov wrote:
> > On Wed, Aug 31, 2022 at 2:02 PM Delyan Kratunov <delyank@fb.com> wrote:
> > > Given that tracing programs can't really maintain their own freelists safely (I think
> > > they're missing the building blocks - you can't cmpxchg kptrs),
> > 
> > Today? yes, but soon we will have link lists supported natively.
> > 
> > > I do feel like
> > > isolated allocators are a requirement here. Without them, allocations can fail and
> > > there's no way to write a reliable program.
> > 
> > Completely agree that there should be a way for programs
> > to guarantee availability of the element.
> > Inline allocation can fail regardless whether allocation pool
> > is shared by multiple programs or a single program owns an allocator.
> 
> I'm not sure I understand this point. 
> If I call bpf_mem_prefill(20, type_id(struct foo)), I would expect the next 20 allocs
> for struct foo to succeed. In what situations can this fail if I'm the only program
> using it _and_ I've calculated the prefill amount correctly?
> 
> Unless you're saying that the prefill wouldn't adjust the freelist limits, in which
> case, I think that's a mistake - prefill should effectively _set_ the freelist
> limits.

There is no prefill implementation today, so we're just guessing, but let's try.
prefill would probably have to adjust high-watermark limit.
That makes sense, but for how long? Should the watermark go back after time
or after N objects were consumed?
What prefill is going to do? Prefill on current cpu only ?
but it doesn't help the prog to be deterministic in consuming them.
Prefill on all cpu-s? That's possible,
but for_each_possible_cpu() {irq_work_queue_on(cpu);}
looks to be a significant memory and run-time overhead.
When freelist is managed by the program it may contain just N elements
that progs needs.
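
To make the overhead concrete, an "on all cpus" prefill would look roughly
like this (a sketch only: bpf_mem_prefill_all_cpus() and the prefill_target
field are made up; for_each_possible_cpu(), per_cpu_ptr() and
irq_work_queue_on() are real kernel APIs, anything bpf_mem_cache-internal is
illustrative):

/* Hypothetical sketch: queue one irq_work per possible cpu to refill
 * that cpu's cache up to a target. This is exactly the memory and
 * run-time cost in question.
 */
static void bpf_mem_prefill_all_cpus(struct bpf_mem_alloc *ma, int cnt)
{
        int cpu;

        for_each_possible_cpu(cpu) {
                struct bpf_mem_cache *c = per_cpu_ptr(ma->cache, cpu);

                c->prefill_target = cnt;                 /* made-up field */
                irq_work_queue_on(&c->refill_work, cpu); /* async refill on that cpu */
        }
}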

> > In that sense, allowing multiple programs to create an instance
> > of an allocator doesn't solve this problem.
> > Short free list inside bpf_mem_cache is an implementation detail.
> > "prefill to guarantee successful alloc" is a bit out of scope
> > of an allocator.
> 
> I disagree that it's out of scope. This is the only access to dynamic memory from a
> bpf program, it makes sense that it covers the requirements of bpf programs,
> including prefill/freelist behavior, so all programs can safely use it.
> 
> > "allocate a set and stash it" should be a separate building block
> > available to bpf progs when step "allocate" can fail and
> > efficient "stash it" can probably be done on top of the link list.
> 
> Do you imagine a BPF object that every user has to link into their programs (yuck),
> or a different set of helpers? If it's helpers/kfuncs, I'm all for splitting things
> this way.

I'm assuming Kumar's proposed list api:
struct bpf_list_head head;
struct bpf_list_node node;
bpf_list_insert(&node, &head);

will work out.
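
Then "allocate a set and stash it" in a prog could look roughly like this
(a sketch on top of that proposed, not-yet-existing list API; struct foo and
the bpf_mem_alloc_obj() call are made up for illustration):

struct foo {
        struct bpf_list_node node;
        __u64 payload;
};

struct bpf_list_head free_foos;          /* per-prog stash of reserved objects */

static int reserve_one(void)
{
        struct foo *f = bpf_mem_alloc_obj(sizeof(*f));   /* allocation can fail... */

        if (!f)
                return -ENOMEM;
        bpf_list_insert(&f->node, &free_foos);           /* ...but stashing cannot */
        return 0;
}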

> If it's distributed separately, I think that's an unexpected burden on developers
> (I'm thinking especially of tools not writing programs in C or using libbpf/bpftool
> skels). There are no other bpf features that require a userspace support library like
> this. (USDT doesn't count, uprobes are the underlying bpf feature and that is useful
> without a library)

bpf progs must not pay for what they don't use. Hence all building blocks should be
small. We will have libc-like bpf libraries with bigger blocks eventually. 

> > I think the disagreement here is that per-prog allocator based
> > on bpf_mem_alloc isn't going to be any more deterministic than
> > one global bpf_mem_alloc for all progs.
> > Per-prog short free list of ~64 elements vs
> > global free list of ~64 elements.
> 
> Right, I think I had a hidden assumption here that we've exposed. 
> Namely, I imagined that after a mem_prefill(1000, struct foo) call, there would be
> 1000 struct foos on the freelist and the freelist thresholds would be adjusted
> accordingly (i.e., you can free all of them and allocate them again, all from the
> freelist).
> 
> Ultimately, that's what nmi programs actually need but I see why that's not an
> obvious behavior.

How prefill is going to work is still to-be-designed.
In addition to current-cpu vs on-all-cpu question above...
Will prefill() helper just do irq_work ?
If so then it doesn't help nmi and irq-disabled progs at all.
prefill helper working asynchronously doesn't guarantee availability of objects
later to the program.
prefill() becomes a hint and probably useless as such.
So it should probably be synchronous and fail when in-nmi or in-irq?
But bpf prog cannot know its context, so only safe synchronous prefill()
would be out of sleepable progs.

> > In both cases these lists will have to do irq_work and refill
> > out of global slabs.
> 
> If a tracing program needs irq_work to refill, then hasn't the API already failed the
> program writer? I'd have to remind myself how irq_work actually works but given that
> it's a soft/hardirq, an nmi program can trivially exhaust the entire allocator before
> irq_work has a chance to refill it. I don't see how you'd write reliable programs
> this way.

The only way an nmi-prog can guarantee availability is if it allocates and reserves
objects in a different non-nmi program.
If everything runs in nmi there is nothing that can be done.

> > 
> > > Besides, if we punt the freelists to bpf, then we get absolutely no control over the
> > > memory usage, which is strictly worse for us (and worse developer experience on top).
> > 
> > I don't understand this point.
> > All allocations are still coming out of bpf_mem_alloc.
> > We can have debug mode with memleak support and other debug
> > mechanisms.
> 
> I mostly mean accounting here. If we segment the allocated objects by finer-grained
> allocators, we can attribute them to individual programs better. But, I agree, this
> can be implemented in other ways, it can just be counts/tables on bpf_prog.

mem accounting is a whole different, huge and largely unsolved topic.
The thread about memcg and bpf is still ongoing.
The fine-grained bpf_mem_alloc isn't going to magically solve it.

> > 
> > What is 'freelist determinism' ?
> 
> The property that prefill keeps all objects on the freelist, so the following
> sequence doesn't observe allocation failures:
> 
> bpf_mem_prefill(1000, struct foo);
> run_1000_times { alloc(struct foo); }
> run_1000_times { free(struct foo); }
> run_1000_times { alloc(struct foo); }
> alloc(struct foo) // this can observe a failure

we cannot evaluate the above until we answer the current-cpu vs on-all-cpus question
and whether bpf_mem_prefill is sync or async.

I still think designing prefill and guaranteed availability is out of scope
of allocator.

> > Are you talking about some other freelist on top of bpf_mem_alloc's
> > free lists ?
> 
> Well, that's the question, isn't it? I think it should be part of the bpf kernel
> ecosystem (helper/kfunc) but it doesn't have to be bpf_mem_alloc itself. And, if it's
> instantiated per-program, that's easiest to reason about.

There should be a way. For sure. Helper/kfunc or trivial stuff on top of bpf_link_list
is still a question. Bundling this feature together with an allocator feels artificial.
In user space C you wouldn't combine tcmalloc with custom free list.
During early days of bpf bundling would totally make sense, but right now we're able to
create much smaller building blocks than in the past. I don't think we'll be adding
any more new map types. We probably won't be adding any new program types either.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular
  2022-09-02  0:12                                           ` Alexei Starovoitov
@ 2022-09-02  1:40                                             ` Delyan Kratunov
  2022-09-02  3:29                                               ` Alexei Starovoitov
  0 siblings, 1 reply; 59+ messages in thread
From: Delyan Kratunov @ 2022-09-02  1:40 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: davem, joannelkoong, andrii, tj, daniel, Dave Marchevsky, memxor,
	Kernel Team, bpf

On Thu, 2022-09-01 at 17:12 -0700, Alexei Starovoitov wrote:
> On Thu, Sep 01, 2022 at 10:46:09PM +0000, Delyan Kratunov wrote:
> > On Wed, 2022-08-31 at 20:55 -0700, Alexei Starovoitov wrote:
> > > On Wed, Aug 31, 2022 at 2:02 PM Delyan Kratunov <delyank@fb.com> wrote:
> > > > Given that tracing programs can't really maintain their own freelists safely (I think
> > > > they're missing the building blocks - you can't cmpxchg kptrs),
> > > 
> > > Today? yes, but soon we will have link lists supported natively.
> > > 
> > > > I do feel like
> > > > isolated allocators are a requirement here. Without them, allocations can fail and
> > > > there's no way to write a reliable program.
> > > 
> > > Completely agree that there should be a way for programs
> > > to guarantee availability of the element.
> > > Inline allocation can fail regardless whether allocation pool
> > > is shared by multiple programs or a single program owns an allocator.
> > 
> > I'm not sure I understand this point. 
> > If I call bpf_mem_prefill(20, type_id(struct foo)), I would expect the next 20 allocs
> > for struct foo to succeed. In what situations can this fail if I'm the only program
> > using it _and_ I've calculated the prefill amount correctly?
> > 
> > Unless you're saying that the prefill wouldn't adjust the freelist limits, in which
> > case, I think that's a mistake - prefill should effectively _set_ the freelist
> > limits.
> 
> There is no prefill implementation today, so we're just guessing, but let's try.

Well, initial capacity was going to be part of the map API, so I always considered it
part of the same work.

> prefill would probably have to adjust high-watermark limit.
> That makes sense, but for how long? Should the watermark go back after time
> or after N objects were consumed?

Neither, if you want your pool of objects to not vanish from under you.

> What prefill is going to do? Prefill on current cpu only ?
> but it doesn't help the prog to be deterministic in consuming them.
> Prefill on all cpu-s? That's possible,
> but for_each_possible_cpu() {irq_work_queue_on(cpu);}
> looks to be a significant memory and run-time overhead.

No, that's overkill imo, especially on 100+ core systems.
I was imagining the allocator consuming the current cpu freelist first, then stealing
from other cpus, and only if they are empty, giving up and scheduling irq_work. 

A little complex to implement but it's possible. It does require atomics everywhere
though.
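
In pseudo-C the fallback order would be something like this (all helper
names here are made up; only the ordering matters):

static void *alloc_with_stealing(struct bpf_mem_cache *c)
{
        void *obj = pop_local_freelist(c);       /* fast path: current cpu */

        if (!obj)
                obj = steal_from_other_cpus(c);  /* slow path: needs atomics */
        if (!obj)
                irq_work_queue(&c->refill_work); /* give up, refill asynchronously */
        return obj;                              /* may still be NULL this time */
}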

> When freelist is managed by the program it may contain just N elements
> that progs needs.
> 
> > > In that sense, allowing multiple programs to create an instance
> > > of an allocator doesn't solve this problem.
> > > Short free list inside bpf_mem_cache is an implementation detail.
> > > "prefill to guarantee successful alloc" is a bit out of scope
> > > of an allocator.
> > 
> > I disagree that it's out of scope. This is the only access to dynamic memory from a
> > bpf program, it makes sense that it covers the requirements of bpf programs,
> > including prefill/freelist behavior, so all programs can safely use it.
> > 
> > > "allocate a set and stash it" should be a separate building block
> > > available to bpf progs when step "allocate" can fail and
> > > efficient "stash it" can probably be done on top of the link list.
> > 
> > Do you imagine a BPF object that every user has to link into their programs (yuck),
> > or a different set of helpers? If it's helpers/kfuncs, I'm all for splitting things
> > this way.
> 
> I'm assuming Kumar's proposed list api:
> struct bpf_list_head head;
> struct bpf_list_node node;
> bpf_list_insert(&node, &head);
> 
> will work out.

Given the assumed locking in that design, I don't see how it would help nmi programs
tbh. This is list_head, we need llist_head, relatively speaking.

> 
> > If it's distributed separately, I think that's an unexpected burden on developers
> > (I'm thinking especially of tools not writing programs in C or using libbpf/bpftool
> > skels). There are no other bpf features that require a userspace support library like
> > this. (USDT doesn't count, uprobes are the underlying bpf feature and that is useful
> > without a library)
> 
> bpf progs must not pay for what they don't use. Hence all building blocks should be
> small. We will have libc-like bpf libraries with bigger blocks eventually. 

I'm not sure I understand how having the mechanism in helpers and managed by the
kernel is paying for something they don't use?

> 
> > > I think the disagreement here is that per-prog allocator based
> > > on bpf_mem_alloc isn't going to be any more deterministic than
> > > one global bpf_mem_alloc for all progs.
> > > Per-prog short free list of ~64 elements vs
> > > global free list of ~64 elements.
> > 
> > Right, I think I had a hidden assumption here that we've exposed. 
> > Namely, I imagined that after a mem_prefill(1000, struct foo) call, there would be
> > 1000 struct foos on the freelist and the freelist thresholds would be adjusted
> > accordingly (i.e., you can free all of them and allocate them again, all from the
> > freelist).
> > 
> > Ultimately, that's what nmi programs actually need but I see why that's not an
> > obvious behavior.
> 
> How prefill is going to work is still to-be-designed.

That's the new part for me, though - the maps design had a mechanism to specify
initial capacity, and it worked for nmi programs. That's why I'm pulling on this
thread, it's the hardest thing to get right _and_ it needs to exist before deferred
work can be useful.

> In addition to current-cpu vs on-all-cpu question above...
> Will prefill() helper just do irq_work ?
> If so then it doesn't help nmi and irq-disabled progs at all.
> prefill helper working asynchronously doesn't guarantee availability of objects
> later to the program.
> prefill() becomes a hint and probably useless as such.

Agreed.

> So it should probably be synchronous and fail when in-nmi or in-irq?
> But bpf prog cannot know its context, so only safe synchronous prefill()
> would be out of sleepable progs.

Initial maps capacity would've come from the syscall, so the program itself would not
contain a prefill() call. 

We covered this in our initial discussions - I also think that requiring every
perf_event program to setup a uprobe or syscall program to fill the object pool
(internal or external) is also a bad design.

If we're going for a global allocator, I suppose we could encode these requirements
in BTF and satisfy them on program load? .alloc map with some predefined names or
something?

> 
> > > In both cases these lists will have to do irq_work and refill
> > > out of global slabs.
> > 
> > If a tracing program needs irq_work to refill, then hasn't the API already failed the
> > program writer? I'd have to remind myself how irq_work actually works but given that
> > it's a soft/hardirq, an nmi program can trivially exhaust the entire allocator before
> > irq_work has a chance to refill it. I don't see how you'd write reliable programs
> > this way.
> 
> The only way an nmi-prog can guarantee availability is if it allocates and reserves
> objects in a different non-nmi program.
> If everything runs in nmi there is nothing that can be done.

See above, we were using the map load syscall to satisfy this before. We could
probably do the same here but it's just documenting requirements as opposed to also
introducing ownership/lifetime problems.

> 
> > > 
> > > > Besides, if we punt the freelists to bpf, then we get absolutely no control over the
> > > > memory usage, which is strictly worse for us (and worse developer experience on top).
> > > 
> > > I don't understand this point.
> > > All allocations are still coming out of bpf_mem_alloc.
> > > We can have debug mode with memleak support and other debug
> > > mechanisms.
> > 
> > I mostly mean accounting here. If we segment the allocated objects by finer-grained
> > allocators, we can attribute them to individual programs better. But, I agree, this
> > can be implemented in other ways, it can just be counts/tables on bpf_prog.
> 
> mem accounting is a whole different, huge and largely unsolved topic.
> The thread about memcg and bpf is still ongoing.
> The fine-grained bpf_mem_alloc isn't going to magically solve it.
> 
> > > 
> > > What is 'freelist determinism' ?
> > 
> > The property that prefill keeps all objects on the freelist, so the following
> > sequence doesn't observe allocation failures:
> > 
> > bpf_mem_prefill(1000, struct foo);
> > run_1000_times { alloc(struct foo); }
> > run_1000_times { free(struct foo); }
> > run_1000_times { alloc(struct foo); }
> > alloc(struct foo) // this can observe a failure
> 
> > we cannot evaluate the above until we answer the current-cpu vs on-all-cpus question
> and whether bpf_mem_prefill is sync or async.
> 
> I still think designing prefill and guaranteed availability is out of scope
> of allocator.

It was in the maps design on purpose though, I need it for deferred work to be useful
(remember that build id EFAULT thread? only way to fix it for good requires that work
submission never fails, which needs allocations from nmi to never fail).

> 
> > > Are you talking about some other freelist on top of bpf_mem_alloc's
> > > free lists ?
> > 
> > Well, that's the question, isn't it? I think it should be part of the bpf kernel
> > ecosystem (helper/kfunc) but it doesn't have to be bpf_mem_alloc itself. And, if it's
> > instantiated per-program, that's easiest to reason about.
> 
> There should be a way. For sure. Helper/kfunc or trivial stuff on top of bpf_link_list
> is still a question. Bundling this feature together with an allocator feels artificial.
> In user space C you wouldn't combine tcmalloc with custom free list.

Userspace doesn't have nmi or need allocators that work from signal handlers, for a
more appropriate analogy. We actually need this to work reliably from nmi, so we can
shift work _away_ from nmi. If we didn't have this use case, I would've folded on the
entire issue and kicked the can down the road (plenty of helpers don't work in nmi).

-- Delyan

> During early days of bpf bundling would totally make sense, but right now we're able to
> create much smaller building blocks than in the past. I don't think we'll be adding
> any more new map types. We probably won't be adding any new program types either.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular
  2022-09-02  1:40                                             ` Delyan Kratunov
@ 2022-09-02  3:29                                               ` Alexei Starovoitov
  2022-09-04 22:28                                                 ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 59+ messages in thread
From: Alexei Starovoitov @ 2022-09-02  3:29 UTC (permalink / raw)
  To: Delyan Kratunov
  Cc: davem, joannelkoong, andrii, tj, daniel, Dave Marchevsky, memxor,
	Kernel Team, bpf

On Fri, Sep 02, 2022 at 01:40:29AM +0000, Delyan Kratunov wrote:
> On Thu, 2022-09-01 at 17:12 -0700, Alexei Starovoitov wrote:
> > On Thu, Sep 01, 2022 at 10:46:09PM +0000, Delyan Kratunov wrote:
> > > On Wed, 2022-08-31 at 20:55 -0700, Alexei Starovoitov wrote:
> > > > On Wed, Aug 31, 2022 at 2:02 PM Delyan Kratunov <delyank@fb.com> wrote:
> > > > > Given that tracing programs can't really maintain their own freelists safely (I think
> > > > > they're missing the building blocks - you can't cmpxchg kptrs),
> > > > 
> > > > Today? yes, but soon we will have link lists supported natively.
> > > > 
> > > > > I do feel like
> > > > > isolated allocators are a requirement here. Without them, allocations can fail and
> > > > > there's no way to write a reliable program.
> > > > 
> > > > Completely agree that there should be a way for programs
> > > > to guarantee availability of the element.
> > > > Inline allocation can fail regardless whether allocation pool
> > > > is shared by multiple programs or a single program owns an allocator.
> > > 
> > > I'm not sure I understand this point. 
> > > If I call bpf_mem_prefill(20, type_id(struct foo)), I would expect the next 20 allocs
> > > for struct foo to succeed. In what situations can this fail if I'm the only program
> > > using it _and_ I've calculated the prefill amount correctly?
> > > 
> > > Unless you're saying that the prefill wouldn't adjust the freelist limits, in which
> > > case, I think that's a mistake - prefill should effectively _set_ the freelist
> > > limits.
> > 
> > There is no prefill implementation today, so we're just guessing, but let's try.
> 
> Well, initial capacity was going to be part of the map API, so I always considered it
> part of the same work.
> 
> > prefill would probably have to adjust high-watermark limit.
> > That makes sense, but for how long? Should the watermark go back after time
> > or after N objects were consumed?
> 
> Neither, if you want your pool of objects to not vanish from under you.
> 
> > What prefill is going to do? Prefill on current cpu only ?
> > but it doesn't help the prog to be deterministic in consuming them.
> > Prefill on all cpu-s? That's possible,
> > but for_each_possible_cpu() {irq_work_queue_on(cpu);}
> > looks to be a significant memory and run-time overhead.
> 
> No, that's overkill imo, especially on 100+ core systems.
> I was imagining the allocator consuming the current cpu freelist first, then stealing
> from other cpus, and only if they are empty, giving up and scheduling irq_work. 

stealing from other cpus?!
That's certainly out of scope for bpf_mem_alloc as it's implemented.
Stealing from other cpus would require a redesign.

> A little complex to implement but it's possible. It does require atomics everywhere
> though.

atomics everywhere and many more weeks of thinking and debugging.
kernel/bpf/percpu_freelist.c does stealing from other cpus and it wasn't
trivial to do.

> 
> > When freelist is managed by the program it may contain just N elements
> > that progs needs.
> > 
> > > > In that sense, allowing multiple programs to create an instance
> > > > of an allocator doesn't solve this problem.
> > > > Short free list inside bpf_mem_cache is an implementation detail.
> > > > "prefill to guarantee successful alloc" is a bit out of scope
> > > > of an allocator.
> > > 
> > > I disagree that it's out of scope. This is the only access to dynamic memory from a
> > > bpf program, it makes sense that it covers the requirements of bpf programs,
> > > including prefill/freelist behavior, so all programs can safely use it.
> > > 
> > > > "allocate a set and stash it" should be a separate building block
> > > > available to bpf progs when step "allocate" can fail and
> > > > efficient "stash it" can probably be done on top of the link list.
> > > 
> > > Do you imagine a BPF object that every user has to link into their programs (yuck),
> > > or a different set of helpers? If it's helpers/kfuncs, I'm all for splitting things
> > > this way.
> > 
> > I'm assuming Kumar's proposed list api:
> > struct bpf_list_head head;
> > struct bpf_list_node node;
> > bpf_list_insert(&node, &head);
> > 
> > will work out.
> 
> Given the assumed locking in that design, I don't see how it would help nmi programs
> tbh. This is list_head, we need llist_head, relatively speaking.

Of course. bpf-native link list could be per-cpu and based on llist.
bpf_list vs bpf_llist. SMOP :)
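
For reference, the reason llist fits nmi is that the push side is a
lock-free cmpxchg and the pop side only needs a single consumer, which a
per-cpu list gives you naturally. A kernel-side sketch (not a bpf-facing
API):

#include <linux/llist.h>
#include <linux/percpu.h>

struct stash_node {
        struct llist_node llnode;
};

static DEFINE_PER_CPU(struct llist_head, stash);

static void stash_push(struct stash_node *n)            /* safe from nmi */
{
        llist_add(&n->llnode, this_cpu_ptr(&stash));
}

static struct stash_node *stash_pop(void)               /* one consumer per cpu */
{
        struct llist_node *ll = llist_del_first(this_cpu_ptr(&stash));

        return ll ? llist_entry(ll, struct stash_node, llnode) : NULL;
}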

> 
> > 
> > > If it's distributed separately, I think that's an unexpected burden on developers
> > > (I'm thinking especially of tools not writing programs in C or using libbpf/bpftool
> > > skels). There are no other bpf features that require a userspace support library like
> > > this. (USDT doesn't count, uprobes are the underlying bpf feature and that is useful
> > > without a library)
> > 
> > bpf progs must not pay for what they don't use. Hence all building blocks should be
> > small. We will have libc-like bpf libraries with bigger blocks eventually. 
> 
> I'm not sure I understand how having the mechanism in helpers and managed by the
> kernel is paying for something they don't use?

every feature adds up.. like stealing from cpus.

> > 
> > > > I think the disagreement here is that per-prog allocator based
> > > > on bpf_mem_alloc isn't going to be any more deterministic than
> > > > one global bpf_mem_alloc for all progs.
> > > > Per-prog short free list of ~64 elements vs
> > > > global free list of ~64 elements.
> > > 
> > > Right, I think I had a hidden assumption here that we've exposed. 
> > > Namely, I imagined that after a mem_prefill(1000, struct foo) call, there would be
> > > 1000 struct foos on the freelist and the freelist thresholds would be adjusted
> > > accordingly (i.e., you can free all of them and allocate them again, all from the
> > > freelist).
> > > 
> > > Ultimately, that's what nmi programs actually need but I see why that's not an
> > > obvious behavior.
> > 
> > How prefill is going to work is still to-be-designed.
> 
> That's the new part for me, though - the maps design had a mechanism to specify
> initial capacity, and it worked for nmi programs. That's why I'm pulling on this
> thread, it's the hardest thing to get right _and_ it needs to exist before deferred
> work can be useful.

Specifying initial capacity sounds great in theory, but what does it mean in practice?
N elements on each cpu or evenly distributed across all?

> 
> > In addition to current-cpu vs on-all-cpu question above...
> > Will prefill() helper just do irq_work ?
> > If so then it doesn't help nmi and irq-disabled progs at all.
> > prefill helper working asynchronously doesn't guarantee availability of objects
> > later to the program.
> > prefill() becomes a hint and probably useless as such.
> 
> Agreed.
> 
> > So it should probably be synchronous and fail when in-nmi or in-irq?
> > But bpf prog cannot know its context, so only safe synchronous prefill()
> > would be out of sleepable progs.
> 
> Initial maps capacity would've come from the syscall, so the program itself would not
> contain a prefill() call. 
> 
> We covered this in our initial discussions - I also think that requiring every
> perf_event program to setup a uprobe or syscall program to fill the object pool
> (internal or external) is also a bad design.

right. we did. prefill from user space makes sense.

> If we're going for a global allocator, I suppose we could encode these requirements
> in BTF and satisfy them on program load? .alloc map with some predefined names or
> something?

ohh. When I was saying 'global allocator' I meant an allocator that is not exposed
to bpf progs at all. It's just there for all programs. It has hidden watermarks
and prefill for it doesn't make sense. Pretty much kmalloc equivalent.
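
Concretely, such a global allocator would just be a single kernel-side
instance initialized at boot, roughly (a sketch assuming the
bpf_mem_alloc_init() signature from this series; bpf_global_ma is a made-up
name):

#include <linux/init.h>
#include <linux/bpf_mem_alloc.h>

static struct bpf_mem_alloc bpf_global_ma;

static int __init bpf_global_ma_init(void)
{
        /* size == 0: serve any size, kmalloc-style; no percpu variant */
        return bpf_mem_alloc_init(&bpf_global_ma, 0, false);
}
late_initcall(bpf_global_ma_init);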

> > 
> > > > In both cases these lists will have to do irq_work and refill
> > > > out of global slabs.
> > > 
> > > If a tracing program needs irq_work to refill, then hasn't the API already failed the
> > > program writer? I'd have to remind myself how irq_work actually works but given that
> > > it's a soft/hardirq, an nmi program can trivially exhaust the entire allocator before
> > > irq_work has a chance to refill it. I don't see how you'd write reliable programs
> > > this way.
> > 
> > The only way an nmi-prog can guarantee availability is if it allocates and reserves
> > objects in a different non-nmi program.
> > If everything runs in nmi there is nothing that can be done.
> 
> See above, we were using the map load syscall to satisfy this before. We could
> probably do the same here but it's just documenting requirements as opposed to also
> introducing ownership/lifetime problems.
> 
> > 
> > > > 
> > > > > Besides, if we punt the freelists to bpf, then we get absolutely no control over the
> > > > > memory usage, which is strictly worse for us (and worse developer experience on top).
> > > > 
> > > > I don't understand this point.
> > > > All allocations are still coming out of bpf_mem_alloc.
> > > > We can have debug mode with memleak support and other debug
> > > > mechanisms.
> > > 
> > > I mostly mean accounting here. If we segment the allocated objects by finer-grained
> > > allocators, we can attribute them to individual programs better. But, I agree, this
> > > can be implemented in other ways, it can just be counts/tables on bpf_prog.
> > 
> > mem accounting is a whole different, huge and largely unsolved topic.
> > The thread about memcg and bpf is still ongoing.
> > The fine-grained bpf_mem_alloc isn't going to magically solve it.
> > 
> > > > 
> > > > What is 'freelist determinism' ?
> > > 
> > > The property that prefill keeps all objects on the freelist, so the following
> > > sequence doesn't observe allocation failures:
> > > 
> > > bpf_mem_prefill(1000, struct foo);
> > > run_1000_times { alloc(struct foo); }
> > > run_1000_times { free(struct foo); }
> > > run_1000_times { alloc(struct foo); }
> > > alloc(struct foo) // this can observe a failure
> > 
> > we cannot evaluate the above until we answer the current-cpu vs on-all-cpus question
> > and whether bpf_mem_prefill is sync or async.
> > 
> > I still think designing prefill and guaranteed availability is out of scope
> > of allocator.
> 
> It was in the maps design on purpose though, I need it for deferred work to be useful
> (remember that build id EFAULT thread? only way to fix it for good requires that work
> submission never fails, which needs allocations from nmi to never fail).

iirc in the build_id EFAULT-ing thread the main issue was:
moving build_id collection from nmi into exit_to_user context so that the build_id logic
can do copy_from_user.
In that context it can allocate with GFP_KERNEL too.
We've discussed combining the kernel stack with later user+build_id, ringbufs, etc.
Lots of things.

> > 
> > > > Are you talking about some other freelist on top of bpf_mem_alloc's
> > > > free lists ?
> > > 
> > > Well, that's the question, isn't it? I think it should be part of the bpf kernel
> > > ecosystem (helper/kfunc) but it doesn't have to be bpf_mem_alloc itself. And, if it's
> > > instantiated per-program, that's easiest to reason about.
> > 
> > There should be a way. For sure. Helper/kfunc or trivial stuff on top of bpf_link_list
> > is still a question. Bundling this feature together with an allocator feels artificial.
> > In user space C you wouldn't combine tcmalloc with custom free list.
> 
> Userspace doesn't have nmi or need allocators that work from signal handlers, for a
> more appropriate analogy. We actually need this to work reliably from nmi, so we can
> shift work _away_ from nmi. If we didn't have this use case, I would've folded on the
> entire issue and kicked the can down the road (plenty of helpers don't work in nmi).

Sure.
I think all the arguments against global mem_alloc come from the assumption that
run-time percpu_ref_get/put in bpf_mem_alloc/free will work.
Kumar mentioned that we have to carefully think when to do percpu_ref_exit()
since it will convert percpu_ref to atomic and performance will suffer.

Also there could be yet another solution to refcounting that will enable
per-program custom bpf_mem_alloc.
For example:
- bpf_mem_alloc is a new map type. It's added to prog->used_maps[]
- no run-time refcnt-ing
- don't store mem_alloc into hidden 8 bytes
- since __kptr __local enforces type and size we can allow:
  obj = bpf_mem_alloc(alloc_A, btf_type_id_local(struct foo));
  kptr_xchg(val, obj);
  ..
  // on different cpu in a different prog
  obj = kptr_xchg(val, NULL);
  bpf_mem_free(alloc_B, obj);
The verifier will need to make sure that alloc_A and alloc_B can service the same type.
If allocators can service any type size, no checks are necessary.

- during hash map free we do:
  obj = xchg(val)
  bpf_mem_free(global_alloc, obj);
Where global_alloc is the global allocator I was talking about. It's always there.
Cannot get any simpler.

My main point is let's postpone the discussions about features that will happen ten steps
from now. Let's start with global allocator first and don't expose bpf_mem_alloc
as a map just yet. It will enable plenty of new use cases and unblock other work.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular
  2022-09-02  3:29                                               ` Alexei Starovoitov
@ 2022-09-04 22:28                                                 ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 59+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 22:28 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Delyan Kratunov, davem, joannelkoong, andrii, tj, daniel,
	Dave Marchevsky, Kernel Team, bpf

On Fri, 2 Sept 2022 at 05:29, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Sep 02, 2022 at 01:40:29AM +0000, Delyan Kratunov wrote:
> > On Thu, 2022-09-01 at 17:12 -0700, Alexei Starovoitov wrote:
> > > On Thu, Sep 01, 2022 at 10:46:09PM +0000, Delyan Kratunov wrote:
> > > > On Wed, 2022-08-31 at 20:55 -0700, Alexei Starovoitov wrote:
> > > > > On Wed, Aug 31, 2022 at 2:02 PM Delyan Kratunov <delyank@fb.com> wrote:
> > > > > > Given that tracing programs can't really maintain their own freelists safely (I think
> > > > > > they're missing the building blocks - you can't cmpxchg kptrs),
> > > > >
> > > > > Today? yes, but soon we will have link lists supported natively.
> > > > >
> > > > > > I do feel like
> > > > > > isolated allocators are a requirement here. Without them, allocations can fail and
> > > > > > there's no way to write a reliable program.
> > > > >
> > > > > Completely agree that there should be a way for programs
> > > > > to guarantee availability of the element.
> > > > > Inline allocation can fail regardless whether allocation pool
> > > > > is shared by multiple programs or a single program owns an allocator.
> > > >
> > > > I'm not sure I understand this point.
> > > > If I call bpf_mem_prefill(20, type_id(struct foo)), I would expect the next 20 allocs
> > > > for struct foo to succeed. In what situations can this fail if I'm the only program
> > > > using it _and_ I've calculated the prefill amount correctly?
> > > >
> > > > Unless you're saying that the prefill wouldn't adjust the freelist limits, in which
> > > > case, I think that's a mistake - prefill should effectively _set_ the freelist
> > > > limits.
> > >
> > > There is no prefill implementation today, so we're just guessing, but let's try.
> >
> > Well, initial capacity was going to be part of the map API, so I always considered it
> > part of the same work.
> >
> > > prefill would probably have to adjust high-watermark limit.
> > > That makes sense, but for how long? Should the watermark go back after time
> > > or after N objects were consumed?
> >
> > Neither, if you want your pool of objects to not vanish from under you.
> >
> > > What prefill is going to do? Prefill on current cpu only ?
> > > but it doesn't help the prog to be deterministic in consuming them.
> > > Prefill on all cpu-s? That's possible,
> > > but for_each_possible_cpu() {irq_work_queue_on(cpu);}
> > > looks to be a significant memory and run-time overhead.
> >
> > No, that's overkill imo, especially on 100+ core systems.
> > I was imagining the allocator consuming the current cpu freelist first, then stealing
> > from other cpus, and only if they are empty, giving up and scheduling irq_work.
>
> stealing from other cpus?!
> That's certainly out of scope for bpf_mem_alloc as it's implemented.
> Stealing from other cpus would require a redesign.
>

Yes, stealing would most likely force us to use a spinlock; concurrent
llist_del_first doesn't work, so that is the only option that comes to
mind, unless you have something fancy in mind (and I would be genuinely
interested to know how :)).

It will then take some more verifier-side work if you want to make it
work in tracing and perf_event progs. Essentially, we would need to
teach the verifier to treat the bpf_in_nmi() branch specially and force it to use
spin_trylock; otherwise spin_lock can be used (since bpf's is an
irqsave variant, so lower contexts are exclusive).
Even then there might be some corner cases I don't remember right now.
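
The shape the verifier would have to accept is roughly this (sketch only:
bpf_in_nmi() and bpf_spin_trylock() do not exist, and the list kfuncs are
from the linked-list proposal being discussed):

struct elem {
        struct bpf_spin_lock lock;
        struct bpf_list_head head;
};

static int push_locked(struct elem *e, struct bpf_list_node *n)
{
        if (bpf_in_nmi()) {
                if (!bpf_spin_trylock(&e->lock))
                        return -EBUSY;          /* contended: bail out, don't spin */
        } else {
                bpf_spin_lock(&e->lock);        /* irqsave variant */
        }
        bpf_list_insert(n, &e->head);
        bpf_spin_unlock(&e->lock);
        return 0;
}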

> > A little complex to implement but it's possible. It does require atomics everywhere
> > though.
>
> > atomics everywhere and many more weeks of thinking and debugging.
> kernel/bpf/percpu_freelist.c does stealing from other cpus and it wasn't
> trivial to do.
>
> >
> > > When freelist is managed by the program it may contain just N elements
> > > that progs needs.
> > >
> > > > > In that sense, allowing multiple programs to create an instance
> > > > > of an allocator doesn't solve this problem.
> > > > > Short free list inside bpf_mem_cache is an implementation detail.
> > > > > "prefill to guarantee successful alloc" is a bit out of scope
> > > > > of an allocator.
> > > >
> > > > I disagree that it's out of scope. This is the only access to dynamic memory from a
> > > > bpf program, it makes sense that it covers the requirements of bpf programs,
> > > > including prefill/freelist behavior, so all programs can safely use it.
> > > >
> > > > > "allocate a set and stash it" should be a separate building block
> > > > > available to bpf progs when step "allocate" can fail and
> > > > > efficient "stash it" can probably be done on top of the link list.
> > > >
> > > > Do you imagine a BPF object that every user has to link into their programs (yuck),
> > > > or a different set of helpers? If it's helpers/kfuncs, I'm all for splitting things
> > > > this way.
> > >
> > > I'm assuming Kumar's proposed list api:
> > > struct bpf_list_head head;
> > > struct bpf_list_node node;
> > > bpf_list_insert(&node, &head);
> > >
> > > will work out.
> >
> > Given the assumed locking in that design, I don't see how it would help nmi programs
> > tbh. This is list_head, we need llist_head, relatively speaking.
>
> Of course. bpf-native link list could be per-cpu and based on llist.
> bpf_list vs bpf_llist. SMOP :)

+1. A percpu NMI-safe list using local_t style protection should work
out well. It will hook into the same infra for locked linked lists,
but use a local_t lock for protection. percpu maps only have local_t,
non-percpu maps have bpf_spin_lock. We also need to limit remote percpu access
(using bpf_lookup_elem_percpu).
The only labor needed is doing the trylock part for it (since it can
fail, inc_return != 1), so only the branch with the checked result of
trylock holds the lock. The lock section is already limited to the current
bpf_func_state, and unlocking always happens in the same frame. Other
than that, it is trivial to support with most of the basic infra already
there in [0].

[0]: https://lore.kernel.org/bpf/20220904204145.3089-1-memxor@gmail.com/
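
The local_t trylock itself is just the inc_return != 1 check, roughly:

#include <asm/local.h>

static bool active_trylock(local_t *active)
{
        if (local_inc_return(active) != 1) {    /* already held on this cpu */
                local_dec(active);
                return false;
        }
        return true;
}

static void active_release(local_t *active)
{
        local_dec(active);
}

(names are illustrative; local_t ops are cheap and safe against interruption
on the local cpu, which is what makes this usable from nmi progs)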

> >
> > >
> > > > If it's distributed separately, I think that's an unexpected burden on developers
> > > > (I'm thinking especially of tools not writing programs in C or using libbpf/bpftool
> > > > skels). There are no other bpf features that require a userspace support library like
> > > > this. (USDT doesn't count, uprobes are the underlying bpf feature and that is useful
> > > > without a library)
> > >
> > > bpf progs must not pay for what they don't use. Hence all building blocks should be
> > > small. We will have libc-like bpf libraries with bigger blocks eventually.
> >
> > I'm not sure I understand how having the mechanism in helpers and managed by the
> > kernel is paying for something they don't use?
>
> every feature adds up.. like stealing from cpus.
>
> > >
> > > > > I think the disagreement here is that per-prog allocator based
> > > > > on bpf_mem_alloc isn't going to be any more deterministic than
> > > > > one global bpf_mem_alloc for all progs.
> > > > > Per-prog short free list of ~64 elements vs
> > > > > global free list of ~64 elements.
> > > >
> > > > Right, I think I had a hidden assumption here that we've exposed.
> > > > Namely, I imagined that after a mem_prefill(1000, struct foo) call, there would be
> > > > 1000 struct foos on the freelist and the freelist thresholds would be adjusted
> > > > accordingly (i.e., you can free all of them and allocate them again, all from the
> > > > freelist).
> > > >
> > > > Ultimately, that's what nmi programs actually need but I see why that's not an
> > > > obvious behavior.
> > >
> > > How prefill is going to work is still to-be-designed.
> >
> > That's the new part for me, though - the maps design had a mechanism to specify
> > initial capacity, and it worked for nmi programs. That's why I'm pulling on this
> > thread, it's the hardest thing to get right _and_ it needs to exist before deferred
> > work can be useful.
>
> Specifying initial capacity sounds great in theory, but what does it mean in practice?
> N elements on each cpu or evenly distributed across all?
>
> >
> > > In addition to current-cpu vs on-all-cpu question above...
> > > Will prefill() helper just do irq_work ?
> > > If so then it doesn't help nmi and irq-disabled progs at all.
> > > prefill helper working asynchronously doesn't guarantee availability of objects
> > > later to the program.
> > > prefill() becomes a hint and probably useless as such.
> >
> > Agreed.
> >
> > > So it should probably be synchronous and fail when in-nmi or in-irq?
> > > But bpf prog cannot know its context, so only safe synchronous prefill()
> > > would be out of sleepable progs.
> >
> > Initial maps capacity would've come from the syscall, so the program itself would not
> > contain a prefill() call.
> >
> > We covered this in our initial discussions - I also think that requiring every
> > perf_event program to setup a uprobe or syscall program to fill the object pool
> > (internal or external) is also a bad design.
>
> right. we did. prefill from user space makes sense.
>
> > If we're going for a global allocator, I suppose we could encode these requirements
> > in BTF and satisfy them on program load? .alloc map with some predefined names or
> > something?
>
> ohh. When I was saying 'global allocator' I meant an allocator that is not exposed
> to bpf progs at all. It's just there for all programs. It has hidden watermarks
> and prefill for it doesn't make sense. Pretty much kmalloc equivalent.
>
> >
> > [...]
> >
> > Userspace doesn't have nmi or need allocators that work from signal handlers, for a
> > more appropriate analogy. We actually need this to work reliably from nmi, so we can
> > shift work _away_ from nmi. If we didn't have this use case, I would've folded on the
> > entire issue and kicked the can down the road (plenty of helpers don't work in nmi).
>
> Sure.
> I think all the arguments against global mem_alloc come from the assumption that
> run-time percpu_ref_get/put in bpf_mem_alloc/free will work.
> Kumar mentioned that we have to carefully think when to do percpu_ref_exit()
> since it will convert percpu_ref to atomic and performance will suffer.
>
> Also there could be yet another solution to refcounting that will enable
> per-program custom bpf_mem_alloc.
> For example:
> - bpf_mem_alloc is a new map type. It's added to prog->used_maps[]
> - no run-time refcnt-ing
> - don't store mem_alloc into hidden 8 bytes
> - since __kptr __local enforces type and size we can allow:
>   obj = bpf_mem_alloc(alloc_A, btf_type_id_local(struct foo));
>   kptr_xchg(val, obj);
>   ..
>   // on different cpu in a different prog
>   obj = kptr_xchg(val, NULL);
>   bpf_mem_free(alloc_B, obj);
> The verifier will need to make sure that alloc_A and alloc_B can service the same type.
> If allocators can service any type size, no checks are necessary.
>
> - during hash map free we do:
>   obj = xchg(val)
>   bpf_mem_free(global_alloc, obj);
> Where global_alloc is the global allocator I was talking about. It's always there.
> Cannot get any simpler.

Neat idea. The size of the kptr is always known (it's already in the map value type).

Also realized a fun tidbit: we can technically do a sized delete [1]
as well. C, C++, and Rust all support this; the first one manually
(well, it will in C23 or whenever implementers get around to adopting
it), and the latter two statically, using the type's size to transform
the delete call into a sized delete. There would be no need to do the
size-class lookup via bpf_mem_cache_idx in bpf_mem_free; the verifier
can always pass a hidden argument. So probably a minor perf improvement
on the free path that comes for free.

[1]: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2699.htm
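
In other words, the free path could take the size from the verifier instead
of looking it up, roughly (bpf_mem_free_sized() is made up; bpf_mem_cache_idx()
and unit_free() are the internals from this series as I read them):

void bpf_mem_free_sized(struct bpf_mem_alloc *ma, void *ptr, size_t size)
{
        /* idx can be computed at verification time for a known BTF type */
        int idx = bpf_mem_cache_idx(size);

        unit_free(this_cpu_ptr(ma->caches)->cache + idx, ptr);
}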

>
> My main point is let's postpone the discussions about features that will happen ten steps
> from now. Let's start with global allocator first and don't expose bpf_mem_alloc
> as a map just yet. It will enable plenty of new use cases and unblock other work.

+1, I think we should go with global bpf_mem_alloc, even if it only
serves as a fallback in the future. It makes things much simpler and
pretty much follows the existing kmalloc idea with an extra cache.

Also, I'm confident the BPF percpu linked list case is going to be
pretty much equivalent, or at least very close, to the bpf_mem_alloc
freelist.
But famous last words and all, we'll see.

^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2022-09-04 22:28 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-19 21:42 [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
2022-08-19 21:42 ` [PATCH v3 bpf-next 01/15] bpf: Introduce any context " Alexei Starovoitov
2022-08-19 21:42 ` [PATCH v3 bpf-next 02/15] bpf: Convert hash map to bpf_mem_alloc Alexei Starovoitov
2022-08-19 21:42 ` [PATCH v3 bpf-next 03/15] selftests/bpf: Improve test coverage of test_maps Alexei Starovoitov
2022-08-19 21:42 ` [PATCH v3 bpf-next 04/15] samples/bpf: Reduce syscall overhead in map_perf_test Alexei Starovoitov
2022-08-19 21:42 ` [PATCH v3 bpf-next 05/15] bpf: Relax the requirement to use preallocated hash maps in tracing progs Alexei Starovoitov
2022-08-19 21:42 ` [PATCH v3 bpf-next 06/15] bpf: Optimize element count in non-preallocated hash map Alexei Starovoitov
2022-08-19 21:42 ` [PATCH v3 bpf-next 07/15] bpf: Optimize call_rcu " Alexei Starovoitov
2022-08-19 21:42 ` [PATCH v3 bpf-next 08/15] bpf: Adjust low/high watermarks in bpf_mem_cache Alexei Starovoitov
2022-08-19 21:42 ` [PATCH v3 bpf-next 09/15] bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU Alexei Starovoitov
2022-08-24 19:58   ` Kumar Kartikeya Dwivedi
2022-08-25  0:13     ` Alexei Starovoitov
2022-08-25  0:35       ` Joel Fernandes
2022-08-25  0:49         ` Joel Fernandes
2022-08-19 21:42 ` [PATCH v3 bpf-next 10/15] bpf: Add percpu allocation support to bpf_mem_alloc Alexei Starovoitov
2022-08-19 21:42 ` [PATCH v3 bpf-next 11/15] bpf: Convert percpu hash map to per-cpu bpf_mem_alloc Alexei Starovoitov
2022-08-19 21:42 ` [PATCH v3 bpf-next 12/15] bpf: Remove tracing program restriction on map types Alexei Starovoitov
2022-08-19 21:42 ` [PATCH v3 bpf-next 13/15] bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs Alexei Starovoitov
2022-08-19 22:21   ` Kumar Kartikeya Dwivedi
2022-08-19 22:43     ` Alexei Starovoitov
2022-08-19 22:56       ` Kumar Kartikeya Dwivedi
2022-08-19 23:01         ` Alexei Starovoitov
2022-08-24 19:49           ` Kumar Kartikeya Dwivedi
2022-08-25  0:08             ` Alexei Starovoitov
2022-08-19 21:42 ` [PATCH v3 bpf-next 14/15] bpf: Remove prealloc-only restriction for " Alexei Starovoitov
2022-08-19 21:42 ` [PATCH v3 bpf-next 15/15] bpf: Introduce sysctl kernel.bpf_force_dyn_alloc Alexei Starovoitov
2022-08-24 20:03 ` [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator Kumar Kartikeya Dwivedi
2022-08-25  0:16   ` Alexei Starovoitov
2022-08-25  0:56 ` [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular Delyan Kratunov
2022-08-26  4:03   ` Kumar Kartikeya Dwivedi
2022-08-29 21:23     ` Delyan Kratunov
2022-08-29 21:29     ` Delyan Kratunov
2022-08-29 22:07       ` Kumar Kartikeya Dwivedi
2022-08-29 23:18         ` Delyan Kratunov
2022-08-29 23:45           ` Alexei Starovoitov
2022-08-30  0:20             ` Kumar Kartikeya Dwivedi
2022-08-30  0:26               ` Alexei Starovoitov
2022-08-30  0:44                 ` Kumar Kartikeya Dwivedi
2022-08-30  1:05                   ` Alexei Starovoitov
2022-08-30  1:40                     ` Delyan Kratunov
2022-08-30  3:34                       ` Alexei Starovoitov
2022-08-30  5:02                         ` Kumar Kartikeya Dwivedi
2022-08-30  6:03                           ` Alexei Starovoitov
2022-08-30 20:31                             ` Delyan Kratunov
2022-08-31  1:52                               ` Alexei Starovoitov
2022-08-31 17:38                                 ` Delyan Kratunov
2022-08-31 18:57                                   ` Alexei Starovoitov
2022-08-31 20:12                                     ` Kumar Kartikeya Dwivedi
2022-08-31 20:38                                       ` Alexei Starovoitov
2022-08-31 21:02                                     ` Delyan Kratunov
2022-08-31 22:32                                       ` Kumar Kartikeya Dwivedi
2022-09-01  0:41                                         ` Alexei Starovoitov
2022-09-01  3:55                                       ` Alexei Starovoitov
2022-09-01 22:46                                         ` Delyan Kratunov
2022-09-02  0:12                                           ` Alexei Starovoitov
2022-09-02  1:40                                             ` Delyan Kratunov
2022-09-02  3:29                                               ` Alexei Starovoitov
2022-09-04 22:28                                                 ` Kumar Kartikeya Dwivedi
2022-08-30  0:17           ` Kumar Kartikeya Dwivedi
