* [PATCH v4 bpf-next 00/15] bpf: BPF specific memory allocator.
@ 2022-08-26  2:44 Alexei Starovoitov
  2022-08-26  2:44 ` [PATCH v4 bpf-next 01/15] bpf: Introduce any context " Alexei Starovoitov
                   ` (15 more replies)
  0 siblings, 16 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2022-08-26  2:44 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Introduce any context BPF specific memory allocator.

Tracing BPF programs can attach to kprobe and fentry. Hence they
run in unknown context where calling plain kmalloc() might not be safe.
Front-end kmalloc() with per-cpu cache of free elements.
Refill this cache asynchronously from irq_work.
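
Roughly, the intended usage from a map implementation looks like this
(a minimal sketch based on the API added in patch 1; 'elem_size' is a
placeholder and error handling is trimmed):

	struct bpf_mem_alloc ma;
	void *elem;

	/* size != 0: kmem_cache plus per-cpu cache sized for the element */
	if (bpf_mem_alloc_init(&ma, elem_size))
		return -ENOMEM;

	/* usable from any context a bpf program runs in, including NMI */
	elem = bpf_mem_cache_alloc(&ma);
	if (elem)
		bpf_mem_cache_free(&ma, elem);

	/* on map destruction */
	bpf_mem_alloc_destroy(&ma);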

Major achievements enabled by bpf_mem_alloc:
- Dynamically allocated hash maps used to be 10 times slower than fully preallocated.
  With bpf_mem_alloc and subsequent optimizations the speed of dynamic maps is equal to full prealloc.
- Tracing bpf programs can use dynamically allocated hash maps.
  Potentially saving lots of memory, since a typical hash map is sparsely populated.
- Sleepable bpf programs can use dynamically allocated hash maps.

v3->v4:
- fix build issue due to missing local.h on 32-bit arch
- add Kumar's ack
- proposal for next steps from Delyan:
https://lore.kernel.org/bpf/d3f76b27f4e55ec9e400ae8dcaecbb702a4932e8.camel@fb.com/

v2->v3:
- Rewrote the free_list algorithm based on discussions with Kumar. Patch 1.
- Allowed sleepable bpf progs to use dynamically allocated maps. Patches 13 and 14.
- Added sysctl to force bpf_mem_alloc in hash map even if pre-alloc is
  requested to reduce memory consumption. Patch 15.
- Fix: zero-fill percpu allocation
- Single rcu_barrier at the end instead of one per cpu during bpf_mem_alloc destruction

v2 thread:
https://lore.kernel.org/bpf/20220817210419.95560-1-alexei.starovoitov@gmail.com/

v1->v2:
- Moved unsafe direct call_rcu() from hash map into safe place inside bpf_mem_alloc. Patches 7 and 9.
- Optimized atomic_inc/dec in hash map with percpu_counter. Patch 6.
- Tuned watermarks per allocation size. Patch 8.
- Applied this approach to per-cpu allocation. Patch 10.
- Fully converted hash map to bpf_mem_alloc. Patch 11.
- Removed tracing prog restriction on map types. Combination of all patches and final patch 12.

v1 thread:
https://lore.kernel.org/bpf/20220623003230.37497-1-alexei.starovoitov@gmail.com/

LWN article:
https://lwn.net/Articles/899274/

Future work:
- expose bpf_mem_alloc as uapi FD to be used in dynptr_alloc, kptr_alloc
- convert lru map to bpf_mem_alloc

Alexei Starovoitov (15):
  bpf: Introduce any context BPF specific memory allocator.
  bpf: Convert hash map to bpf_mem_alloc.
  selftests/bpf: Improve test coverage of test_maps
  samples/bpf: Reduce syscall overhead in map_perf_test.
  bpf: Relax the requirement to use preallocated hash maps in tracing
    progs.
  bpf: Optimize element count in non-preallocated hash map.
  bpf: Optimize call_rcu in non-preallocated hash map.
  bpf: Adjust low/high watermarks in bpf_mem_cache
  bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
  bpf: Add percpu allocation support to bpf_mem_alloc.
  bpf: Convert percpu hash map to per-cpu bpf_mem_alloc.
  bpf: Remove tracing program restriction on map types
  bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.
  bpf: Remove prealloc-only restriction for sleepable bpf programs.
  bpf: Introduce sysctl kernel.bpf_force_dyn_alloc.

 include/linux/bpf_mem_alloc.h             |  26 +
 include/linux/filter.h                    |   2 +
 kernel/bpf/Makefile                       |   2 +-
 kernel/bpf/core.c                         |   2 +
 kernel/bpf/hashtab.c                      | 132 +++--
 kernel/bpf/memalloc.c                     | 602 ++++++++++++++++++++++
 kernel/bpf/syscall.c                      |  14 +-
 kernel/bpf/verifier.c                     |  52 --
 samples/bpf/map_perf_test_kern.c          |  44 +-
 samples/bpf/map_perf_test_user.c          |   2 +-
 tools/testing/selftests/bpf/progs/timer.c |  11 -
 tools/testing/selftests/bpf/test_maps.c   |  38 +-
 12 files changed, 796 insertions(+), 131 deletions(-)
 create mode 100644 include/linux/bpf_mem_alloc.h
 create mode 100644 kernel/bpf/memalloc.c

-- 
2.30.2



* [PATCH v4 bpf-next 01/15] bpf: Introduce any context BPF specific memory allocator.
  2022-08-26  2:44 [PATCH v4 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
@ 2022-08-26  2:44 ` Alexei Starovoitov
  2022-08-29 21:30   ` Daniel Borkmann
                     ` (2 more replies)
  2022-08-26  2:44 ` [PATCH v4 bpf-next 02/15] bpf: Convert hash map to bpf_mem_alloc Alexei Starovoitov
                   ` (14 subsequent siblings)
  15 siblings, 3 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2022-08-26  2:44 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Tracing BPF programs can attach to kprobe and fentry. Hence they
run in unknown context where calling plain kmalloc() might not be safe.

Front-end kmalloc() with minimal per-cpu cache of free elements.
Refill this cache asynchronously from irq_work.

BPF programs always run with migration disabled.
It's safe to allocate from cache of the current cpu with irqs disabled.
Free-ing is always done into bucket of the current cpu as well.
irq_work trims extra free elements from buckets with kfree
and refills them with kmalloc, so global kmalloc logic takes care
of freeing objects allocated by one cpu and freed on another.

struct bpf_mem_alloc supports two modes:
- When size != 0 create kmem_cache and bpf_mem_cache for each cpu.
  This is typical bpf hash map use case when all elements have equal size.
- When size == 0 allocate 11 bpf_mem_cache-s for each cpu, then rely on
  kmalloc/kfree. Max allocation size is 4096 in this case.
  This is bpf_dynptr and bpf_kptr use case.

bpf_mem_alloc/bpf_mem_free are bpf specific 'wrappers' of kmalloc/kfree.
bpf_mem_cache_alloc/bpf_mem_cache_free are 'wrappers' of kmem_cache_alloc/kmem_cache_free.
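
A rough sketch of the size == 0 mode (the 64-byte request below is an
arbitrary example; error handling is trimmed):

	struct bpf_mem_alloc ma;
	void *p;

	if (bpf_mem_alloc_init(&ma, 0))	/* 11 bpf_mem_cache-s per cpu */
		return -ENOMEM;

	p = bpf_mem_alloc(&ma, 64);	/* 64 + 8 byte llist_node -> 96-byte cache */
	if (p)
		bpf_mem_free(&ma, p);

	bpf_mem_alloc_destroy(&ma);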

The allocators are NMI-safe from bpf programs only. They are not NMI-safe in general.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf_mem_alloc.h |  26 ++
 kernel/bpf/Makefile           |   2 +-
 kernel/bpf/memalloc.c         | 476 ++++++++++++++++++++++++++++++++++
 3 files changed, 503 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/bpf_mem_alloc.h
 create mode 100644 kernel/bpf/memalloc.c

diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
new file mode 100644
index 000000000000..804733070f8d
--- /dev/null
+++ b/include/linux/bpf_mem_alloc.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+#ifndef _BPF_MEM_ALLOC_H
+#define _BPF_MEM_ALLOC_H
+#include <linux/compiler_types.h>
+
+struct bpf_mem_cache;
+struct bpf_mem_caches;
+
+struct bpf_mem_alloc {
+	struct bpf_mem_caches __percpu *caches;
+	struct bpf_mem_cache __percpu *cache;
+};
+
+int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size);
+void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma);
+
+/* kmalloc/kfree equivalent: */
+void *bpf_mem_alloc(struct bpf_mem_alloc *ma, size_t size);
+void bpf_mem_free(struct bpf_mem_alloc *ma, void *ptr);
+
+/* kmem_cache_alloc/free equivalent: */
+void *bpf_mem_cache_alloc(struct bpf_mem_alloc *ma);
+void bpf_mem_cache_free(struct bpf_mem_alloc *ma, void *ptr);
+
+#endif /* _BPF_MEM_ALLOC_H */
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 00e05b69a4df..341c94f208f4 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -13,7 +13,7 @@ obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
 obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
 obj-$(CONFIG_BPF_JIT) += trampoline.o
-obj-$(CONFIG_BPF_SYSCALL) += btf.o
+obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o
 obj-$(CONFIG_BPF_JIT) += dispatcher.o
 ifeq ($(CONFIG_NET),y)
 obj-$(CONFIG_BPF_SYSCALL) += devmap.o
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
new file mode 100644
index 000000000000..29f340016e9e
--- /dev/null
+++ b/kernel/bpf/memalloc.c
@@ -0,0 +1,476 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+#include <linux/mm.h>
+#include <linux/llist.h>
+#include <linux/bpf.h>
+#include <linux/irq_work.h>
+#include <linux/bpf_mem_alloc.h>
+#include <linux/memcontrol.h>
+#include <asm/local.h>
+
+/* Any context (including NMI) BPF specific memory allocator.
+ *
+ * Tracing BPF programs can attach to kprobe and fentry. Hence they
+ * run in unknown context where calling plain kmalloc() might not be safe.
+ *
+ * Front-end kmalloc() with per-cpu per-bucket cache of free elements.
+ * Refill this cache asynchronously from irq_work.
+ *
+ * CPU_0 buckets
+ * 16 32 64 96 128 196 256 512 1024 2048 4096
+ * ...
+ * CPU_N buckets
+ * 16 32 64 96 128 196 256 512 1024 2048 4096
+ *
+ * The buckets are prefilled at the start.
+ * BPF programs always run with migration disabled.
+ * It's safe to allocate from cache of the current cpu with irqs disabled.
+ * Free-ing is always done into bucket of the current cpu as well.
+ * irq_work trims extra free elements from buckets with kfree
+ * and refills them with kmalloc, so global kmalloc logic takes care
+ * of freeing objects allocated by one cpu and freed on another.
+ *
+ * Every allocated object is padded with extra 8 bytes that contains
+ * struct llist_node.
+ */
+#define LLIST_NODE_SZ sizeof(struct llist_node)
+
+/* similar to kmalloc, but sizeof == 8 bucket is gone */
+static u8 size_index[24] __ro_after_init = {
+	3,	/* 8 */
+	3,	/* 16 */
+	4,	/* 24 */
+	4,	/* 32 */
+	5,	/* 40 */
+	5,	/* 48 */
+	5,	/* 56 */
+	5,	/* 64 */
+	1,	/* 72 */
+	1,	/* 80 */
+	1,	/* 88 */
+	1,	/* 96 */
+	6,	/* 104 */
+	6,	/* 112 */
+	6,	/* 120 */
+	6,	/* 128 */
+	2,	/* 136 */
+	2,	/* 144 */
+	2,	/* 152 */
+	2,	/* 160 */
+	2,	/* 168 */
+	2,	/* 176 */
+	2,	/* 184 */
+	2	/* 192 */
+};
+
+static int bpf_mem_cache_idx(size_t size)
+{
+	if (!size || size > 4096)
+		return -1;
+
+	if (size <= 192)
+		return size_index[(size - 1) / 8] - 1;
+
+	return fls(size - 1) - 1;
+}
+
+#define NUM_CACHES 11
+
+struct bpf_mem_cache {
+	/* per-cpu list of free objects of size 'unit_size'.
+	 * All accesses are done with interrupts disabled and 'active' counter
+	 * protection with __llist_add() and __llist_del_first().
+	 */
+	struct llist_head free_llist;
+	local_t active;
+
+	/* Operations on the free_list from unit_alloc/unit_free/bpf_mem_refill
+	 * are sequenced by per-cpu 'active' counter. But unit_free() cannot
+	 * fail. When 'active' is busy the unit_free() will add an object to
+	 * free_llist_extra.
+	 */
+	struct llist_head free_llist_extra;
+
+	/* kmem_cache != NULL when bpf_mem_alloc was created for specific
+	 * element size.
+	 */
+	struct kmem_cache *kmem_cache;
+	struct irq_work refill_work;
+	struct obj_cgroup *objcg;
+	int unit_size;
+	/* count of objects in free_llist */
+	int free_cnt;
+};
+
+struct bpf_mem_caches {
+	struct bpf_mem_cache cache[NUM_CACHES];
+};
+
+static struct llist_node notrace *__llist_del_first(struct llist_head *head)
+{
+	struct llist_node *entry, *next;
+
+	entry = head->first;
+	if (!entry)
+		return NULL;
+	next = entry->next;
+	head->first = next;
+	return entry;
+}
+
+#define BATCH 48
+#define LOW_WATERMARK 32
+#define HIGH_WATERMARK 96
+/* Assuming the average number of elements per bucket is 64, when all buckets
+ * are used the total memory will be: 64*16*32 + 64*32*32 + 64*64*32 + ... +
+ * 64*4096*32 ~ 20Mbyte
+ */
+
+static void *__alloc(struct bpf_mem_cache *c, int node)
+{
+	/* Allocate, but don't deplete atomic reserves that typical
+	 * GFP_ATOMIC would do. irq_work runs on this cpu and kmalloc
+	 * will allocate from the current numa node which is what we
+	 * want here.
+	 */
+	gfp_t flags = GFP_NOWAIT | __GFP_NOWARN | __GFP_ACCOUNT;
+
+	if (c->kmem_cache)
+		return kmem_cache_alloc_node(c->kmem_cache, flags, node);
+
+	return kmalloc_node(c->unit_size, flags, node);
+}
+
+static struct mem_cgroup *get_memcg(const struct bpf_mem_cache *c)
+{
+#ifdef CONFIG_MEMCG_KMEM
+	if (c->objcg)
+		return get_mem_cgroup_from_objcg(c->objcg);
+#endif
+
+#ifdef CONFIG_MEMCG
+	return root_mem_cgroup;
+#else
+	return NULL;
+#endif
+}
+
+/* Mostly runs from irq_work except __init phase. */
+static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
+{
+	struct mem_cgroup *memcg = NULL, *old_memcg;
+	unsigned long flags;
+	void *obj;
+	int i;
+
+	memcg = get_memcg(c);
+	old_memcg = set_active_memcg(memcg);
+	for (i = 0; i < cnt; i++) {
+		obj = __alloc(c, node);
+		if (!obj)
+			break;
+		if (IS_ENABLED(CONFIG_PREEMPT_RT))
+			/* In RT irq_work runs in per-cpu kthread, so disable
+			 * interrupts to avoid preemption and interrupts and
+			 * reduce the chance of bpf prog executing on this cpu
+			 * when active counter is busy.
+			 */
+			local_irq_save(flags);
+		if (local_inc_return(&c->active) == 1) {
+			__llist_add(obj, &c->free_llist);
+			c->free_cnt++;
+		}
+		local_dec(&c->active);
+		if (IS_ENABLED(CONFIG_PREEMPT_RT))
+			local_irq_restore(flags);
+	}
+	set_active_memcg(old_memcg);
+	mem_cgroup_put(memcg);
+}
+
+static void free_one(struct bpf_mem_cache *c, void *obj)
+{
+	if (c->kmem_cache)
+		kmem_cache_free(c->kmem_cache, obj);
+	else
+		kfree(obj);
+}
+
+static void free_bulk(struct bpf_mem_cache *c)
+{
+	struct llist_node *llnode, *t;
+	unsigned long flags;
+	int cnt;
+
+	do {
+		if (IS_ENABLED(CONFIG_PREEMPT_RT))
+			local_irq_save(flags);
+		if (local_inc_return(&c->active) == 1) {
+			llnode = __llist_del_first(&c->free_llist);
+			if (llnode)
+				cnt = --c->free_cnt;
+			else
+				cnt = 0;
+		}
+		local_dec(&c->active);
+		if (IS_ENABLED(CONFIG_PREEMPT_RT))
+			local_irq_restore(flags);
+		free_one(c, llnode);
+	} while (cnt > (HIGH_WATERMARK + LOW_WATERMARK) / 2);
+
+	/* and drain free_llist_extra */
+	llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra))
+		free_one(c, llnode);
+}
+
+static void bpf_mem_refill(struct irq_work *work)
+{
+	struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache, refill_work);
+	int cnt;
+
+	/* Racy access to free_cnt. It doesn't need to be 100% accurate */
+	cnt = c->free_cnt;
+	if (cnt < LOW_WATERMARK)
+		/* irq_work runs on this cpu and kmalloc will allocate
+		 * from the current numa node which is what we want here.
+		 */
+		alloc_bulk(c, BATCH, NUMA_NO_NODE);
+	else if (cnt > HIGH_WATERMARK)
+		free_bulk(c);
+}
+
+static void notrace irq_work_raise(struct bpf_mem_cache *c)
+{
+	irq_work_queue(&c->refill_work);
+}
+
+static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu)
+{
+	init_irq_work(&c->refill_work, bpf_mem_refill);
+	/* To avoid consuming memory assume that 1st run of bpf
+	 * prog won't be doing more than 4 map_update_elem from
+	 * irq disabled region
+	 */
+	alloc_bulk(c, c->unit_size <= 256 ? 4 : 1, cpu_to_node(cpu));
+}
+
+/* When size != 0 create kmem_cache and bpf_mem_cache for each cpu.
+ * This is typical bpf hash map use case when all elements have equal size.
+ *
+ * When size == 0 allocate 11 bpf_mem_cache-s for each cpu, then rely on
+ * kmalloc/kfree. Max allocation size is 4096 in this case.
+ * This is bpf_dynptr and bpf_kptr use case.
+ */
+int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size)
+{
+	static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
+	struct bpf_mem_caches *cc, __percpu *pcc;
+	struct bpf_mem_cache *c, __percpu *pc;
+	struct kmem_cache *kmem_cache;
+	struct obj_cgroup *objcg = NULL;
+	char buf[32];
+	int cpu, i;
+
+	if (size) {
+		pc = __alloc_percpu_gfp(sizeof(*pc), 8, GFP_KERNEL);
+		if (!pc)
+			return -ENOMEM;
+		size += LLIST_NODE_SZ; /* room for llist_node */
+		snprintf(buf, sizeof(buf), "bpf-%u", size);
+		kmem_cache = kmem_cache_create(buf, size, 8, 0, NULL);
+		if (!kmem_cache) {
+			free_percpu(pc);
+			return -ENOMEM;
+		}
+#ifdef CONFIG_MEMCG_KMEM
+		objcg = get_obj_cgroup_from_current();
+#endif
+		for_each_possible_cpu(cpu) {
+			c = per_cpu_ptr(pc, cpu);
+			c->kmem_cache = kmem_cache;
+			c->unit_size = size;
+			c->objcg = objcg;
+			prefill_mem_cache(c, cpu);
+		}
+		ma->cache = pc;
+		return 0;
+	}
+
+	pcc = __alloc_percpu_gfp(sizeof(*cc), 8, GFP_KERNEL);
+	if (!pcc)
+		return -ENOMEM;
+#ifdef CONFIG_MEMCG_KMEM
+	objcg = get_obj_cgroup_from_current();
+#endif
+	for_each_possible_cpu(cpu) {
+		cc = per_cpu_ptr(pcc, cpu);
+		for (i = 0; i < NUM_CACHES; i++) {
+			c = &cc->cache[i];
+			c->unit_size = sizes[i];
+			c->objcg = objcg;
+			prefill_mem_cache(c, cpu);
+		}
+	}
+	ma->caches = pcc;
+	return 0;
+}
+
+static void drain_mem_cache(struct bpf_mem_cache *c)
+{
+	struct llist_node *llnode, *t;
+
+	llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist))
+		free_one(c, llnode);
+	llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra))
+		free_one(c, llnode);
+}
+
+void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
+{
+	struct bpf_mem_caches *cc;
+	struct bpf_mem_cache *c;
+	int cpu, i;
+
+	if (ma->cache) {
+		for_each_possible_cpu(cpu) {
+			c = per_cpu_ptr(ma->cache, cpu);
+			drain_mem_cache(c);
+		}
+		/* kmem_cache and memcg are the same across cpus */
+		kmem_cache_destroy(c->kmem_cache);
+		if (c->objcg)
+			obj_cgroup_put(c->objcg);
+		free_percpu(ma->cache);
+		ma->cache = NULL;
+	}
+	if (ma->caches) {
+		for_each_possible_cpu(cpu) {
+			cc = per_cpu_ptr(ma->caches, cpu);
+			for (i = 0; i < NUM_CACHES; i++) {
+				c = &cc->cache[i];
+				drain_mem_cache(c);
+			}
+		}
+		if (c->objcg)
+			obj_cgroup_put(c->objcg);
+		free_percpu(ma->caches);
+		ma->caches = NULL;
+	}
+}
+
+/* notrace is necessary here and in other functions to make sure
+ * bpf programs cannot attach to them and cause llist corruptions.
+ */
+static void notrace *unit_alloc(struct bpf_mem_cache *c)
+{
+	struct llist_node *llnode = NULL;
+	unsigned long flags;
+	int cnt = 0;
+
+	/* Disable irqs to prevent the following race for majority of prog types:
+	 * prog_A
+	 *   bpf_mem_alloc
+	 *      preemption or irq -> prog_B
+	 *        bpf_mem_alloc
+	 *
+	 * but prog_B could be a perf_event NMI prog.
+	 * Use per-cpu 'active' counter to order free_list access between
+	 * unit_alloc/unit_free/bpf_mem_refill.
+	 */
+	local_irq_save(flags);
+	if (local_inc_return(&c->active) == 1) {
+		llnode = __llist_del_first(&c->free_llist);
+		if (llnode)
+			cnt = --c->free_cnt;
+	}
+	local_dec(&c->active);
+	local_irq_restore(flags);
+
+	WARN_ON(cnt < 0);
+
+	if (cnt < LOW_WATERMARK)
+		irq_work_raise(c);
+	return llnode;
+}
+
+/* Though 'ptr' object could have been allocated on a different cpu
+ * add it to the free_llist of the current cpu.
+ * Let kfree() logic deal with it when it's later called from irq_work.
+ */
+static void notrace unit_free(struct bpf_mem_cache *c, void *ptr)
+{
+	struct llist_node *llnode = ptr - LLIST_NODE_SZ;
+	unsigned long flags;
+	int cnt = 0;
+
+	BUILD_BUG_ON(LLIST_NODE_SZ > 8);
+
+	local_irq_save(flags);
+	if (local_inc_return(&c->active) == 1) {
+		__llist_add(llnode, &c->free_llist);
+		cnt = ++c->free_cnt;
+	} else {
+		/* unit_free() cannot fail. Therefore add an object to atomic
+		 * llist. free_bulk() will drain it. Though free_llist_extra is
+		 * a per-cpu list we have to use atomic llist_add here, since
+		 * it also can be interrupted by bpf nmi prog that does another
+		 * unit_free() into the same free_llist_extra.
+		 */
+		llist_add(llnode, &c->free_llist_extra);
+	}
+	local_dec(&c->active);
+	local_irq_restore(flags);
+
+	if (cnt > HIGH_WATERMARK)
+		/* free few objects from current cpu into global kmalloc pool */
+		irq_work_raise(c);
+}
+
+/* Called from BPF program or from sys_bpf syscall.
+ * In both cases migration is disabled.
+ */
+void notrace *bpf_mem_alloc(struct bpf_mem_alloc *ma, size_t size)
+{
+	int idx;
+	void *ret;
+
+	if (!size)
+		return ZERO_SIZE_PTR;
+
+	idx = bpf_mem_cache_idx(size + LLIST_NODE_SZ);
+	if (idx < 0)
+		return NULL;
+
+	ret = unit_alloc(this_cpu_ptr(ma->caches)->cache + idx);
+	return !ret ? NULL : ret + LLIST_NODE_SZ;
+}
+
+void notrace bpf_mem_free(struct bpf_mem_alloc *ma, void *ptr)
+{
+	int idx;
+
+	if (!ptr)
+		return;
+
+	idx = bpf_mem_cache_idx(__ksize(ptr - LLIST_NODE_SZ));
+	if (idx < 0)
+		return;
+
+	unit_free(this_cpu_ptr(ma->caches)->cache + idx, ptr);
+}
+
+void notrace *bpf_mem_cache_alloc(struct bpf_mem_alloc *ma)
+{
+	void *ret;
+
+	ret = unit_alloc(this_cpu_ptr(ma->cache));
+	return !ret ? NULL : ret + LLIST_NODE_SZ;
+}
+
+void notrace bpf_mem_cache_free(struct bpf_mem_alloc *ma, void *ptr)
+{
+	if (!ptr)
+		return;
+
+	unit_free(this_cpu_ptr(ma->cache), ptr);
+}
-- 
2.30.2



* [PATCH v4 bpf-next 02/15] bpf: Convert hash map to bpf_mem_alloc.
  2022-08-26  2:44 [PATCH v4 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
  2022-08-26  2:44 ` [PATCH v4 bpf-next 01/15] bpf: Introduce any context " Alexei Starovoitov
@ 2022-08-26  2:44 ` Alexei Starovoitov
  2022-08-26  2:44 ` [PATCH v4 bpf-next 03/15] selftests/bpf: Improve test coverage of test_maps Alexei Starovoitov
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2022-08-26  2:44 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Convert bpf hash map to use bpf memory allocator.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/hashtab.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index b301a63afa2f..bd23c8830d49 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -14,6 +14,7 @@
 #include "percpu_freelist.h"
 #include "bpf_lru_list.h"
 #include "map_in_map.h"
+#include <linux/bpf_mem_alloc.h>
 
 #define HTAB_CREATE_FLAG_MASK						\
 	(BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE |	\
@@ -92,6 +93,7 @@ struct bucket {
 
 struct bpf_htab {
 	struct bpf_map map;
+	struct bpf_mem_alloc ma;
 	struct bucket *buckets;
 	void *elems;
 	union {
@@ -563,6 +565,10 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 			if (err)
 				goto free_prealloc;
 		}
+	} else {
+		err = bpf_mem_alloc_init(&htab->ma, htab->elem_size);
+		if (err)
+			goto free_map_locked;
 	}
 
 	return &htab->map;
@@ -573,6 +579,7 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 	for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
 		free_percpu(htab->map_locked[i]);
 	bpf_map_area_free(htab->buckets);
+	bpf_mem_alloc_destroy(&htab->ma);
 free_htab:
 	lockdep_unregister_key(&htab->lockdep_key);
 	bpf_map_area_free(htab);
@@ -849,7 +856,7 @@ static void htab_elem_free(struct bpf_htab *htab, struct htab_elem *l)
 	if (htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH)
 		free_percpu(htab_elem_get_ptr(l, htab->map.key_size));
 	check_and_free_fields(htab, l);
-	kfree(l);
+	bpf_mem_cache_free(&htab->ma, l);
 }
 
 static void htab_elem_free_rcu(struct rcu_head *head)
@@ -973,9 +980,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 				l_new = ERR_PTR(-E2BIG);
 				goto dec_count;
 			}
-		l_new = bpf_map_kmalloc_node(&htab->map, htab->elem_size,
-					     GFP_NOWAIT | __GFP_NOWARN,
-					     htab->map.numa_node);
+		l_new = bpf_mem_cache_alloc(&htab->ma);
 		if (!l_new) {
 			l_new = ERR_PTR(-ENOMEM);
 			goto dec_count;
@@ -994,7 +999,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 			pptr = bpf_map_alloc_percpu(&htab->map, size, 8,
 						    GFP_NOWAIT | __GFP_NOWARN);
 			if (!pptr) {
-				kfree(l_new);
+				bpf_mem_cache_free(&htab->ma, l_new);
 				l_new = ERR_PTR(-ENOMEM);
 				goto dec_count;
 			}
@@ -1489,6 +1494,7 @@ static void htab_map_free(struct bpf_map *map)
 	bpf_map_free_kptr_off_tab(map);
 	free_percpu(htab->extra_elems);
 	bpf_map_area_free(htab->buckets);
+	bpf_mem_alloc_destroy(&htab->ma);
 	for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
 		free_percpu(htab->map_locked[i]);
 	lockdep_unregister_key(&htab->lockdep_key);
-- 
2.30.2



* [PATCH v4 bpf-next 03/15] selftests/bpf: Improve test coverage of test_maps
  2022-08-26  2:44 [PATCH v4 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
  2022-08-26  2:44 ` [PATCH v4 bpf-next 01/15] bpf: Introduce any context " Alexei Starovoitov
  2022-08-26  2:44 ` [PATCH v4 bpf-next 02/15] bpf: Convert hash map to bpf_mem_alloc Alexei Starovoitov
@ 2022-08-26  2:44 ` Alexei Starovoitov
  2022-08-26  2:44 ` [PATCH v4 bpf-next 04/15] samples/bpf: Reduce syscall overhead in map_perf_test Alexei Starovoitov
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2022-08-26  2:44 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Make test_maps more stressful with more parallelism in
update/delete/lookup/walk including different value sizes.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/testing/selftests/bpf/test_maps.c | 38 ++++++++++++++++---------
 1 file changed, 24 insertions(+), 14 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index cbebfaa7c1e8..d1ffc76814d9 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -264,10 +264,11 @@ static void test_hashmap_percpu(unsigned int task, void *data)
 	close(fd);
 }
 
+#define VALUE_SIZE 3
 static int helper_fill_hashmap(int max_entries)
 {
 	int i, fd, ret;
-	long long key, value;
+	long long key, value[VALUE_SIZE] = {};
 
 	fd = bpf_map_create(BPF_MAP_TYPE_HASH, NULL, sizeof(key), sizeof(value),
 			    max_entries, &map_opts);
@@ -276,8 +277,8 @@ static int helper_fill_hashmap(int max_entries)
 	      "err: %s, flags: 0x%x\n", strerror(errno), map_opts.map_flags);
 
 	for (i = 0; i < max_entries; i++) {
-		key = i; value = key;
-		ret = bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST);
+		key = i; value[0] = key;
+		ret = bpf_map_update_elem(fd, &key, value, BPF_NOEXIST);
 		CHECK(ret != 0,
 		      "can't update hashmap",
 		      "err: %s\n", strerror(ret));
@@ -288,8 +289,8 @@ static int helper_fill_hashmap(int max_entries)
 
 static void test_hashmap_walk(unsigned int task, void *data)
 {
-	int fd, i, max_entries = 1000;
-	long long key, value, next_key;
+	int fd, i, max_entries = 10000;
+	long long key, value[VALUE_SIZE], next_key;
 	bool next_key_valid = true;
 
 	fd = helper_fill_hashmap(max_entries);
@@ -297,7 +298,7 @@ static void test_hashmap_walk(unsigned int task, void *data)
 	for (i = 0; bpf_map_get_next_key(fd, !i ? NULL : &key,
 					 &next_key) == 0; i++) {
 		key = next_key;
-		assert(bpf_map_lookup_elem(fd, &key, &value) == 0);
+		assert(bpf_map_lookup_elem(fd, &key, value) == 0);
 	}
 
 	assert(i == max_entries);
@@ -305,9 +306,9 @@ static void test_hashmap_walk(unsigned int task, void *data)
 	assert(bpf_map_get_next_key(fd, NULL, &key) == 0);
 	for (i = 0; next_key_valid; i++) {
 		next_key_valid = bpf_map_get_next_key(fd, &key, &next_key) == 0;
-		assert(bpf_map_lookup_elem(fd, &key, &value) == 0);
-		value++;
-		assert(bpf_map_update_elem(fd, &key, &value, BPF_EXIST) == 0);
+		assert(bpf_map_lookup_elem(fd, &key, value) == 0);
+		value[0]++;
+		assert(bpf_map_update_elem(fd, &key, value, BPF_EXIST) == 0);
 		key = next_key;
 	}
 
@@ -316,8 +317,8 @@ static void test_hashmap_walk(unsigned int task, void *data)
 	for (i = 0; bpf_map_get_next_key(fd, !i ? NULL : &key,
 					 &next_key) == 0; i++) {
 		key = next_key;
-		assert(bpf_map_lookup_elem(fd, &key, &value) == 0);
-		assert(value - 1 == key);
+		assert(bpf_map_lookup_elem(fd, &key, value) == 0);
+		assert(value[0] - 1 == key);
 	}
 
 	assert(i == max_entries);
@@ -1371,16 +1372,16 @@ static void __run_parallel(unsigned int tasks,
 
 static void test_map_stress(void)
 {
+	run_parallel(100, test_hashmap_walk, NULL);
 	run_parallel(100, test_hashmap, NULL);
 	run_parallel(100, test_hashmap_percpu, NULL);
 	run_parallel(100, test_hashmap_sizes, NULL);
-	run_parallel(100, test_hashmap_walk, NULL);
 
 	run_parallel(100, test_arraymap, NULL);
 	run_parallel(100, test_arraymap_percpu, NULL);
 }
 
-#define TASKS 1024
+#define TASKS 100
 
 #define DO_UPDATE 1
 #define DO_DELETE 0
@@ -1432,6 +1433,8 @@ static void test_update_delete(unsigned int fn, void *data)
 	int fd = ((int *)data)[0];
 	int i, key, value, err;
 
+	if (fn & 1)
+		test_hashmap_walk(fn, NULL);
 	for (i = fn; i < MAP_SIZE; i += TASKS) {
 		key = value = i;
 
@@ -1455,7 +1458,7 @@ static void test_update_delete(unsigned int fn, void *data)
 
 static void test_map_parallel(void)
 {
-	int i, fd, key = 0, value = 0;
+	int i, fd, key = 0, value = 0, j = 0;
 	int data[2];
 
 	fd = bpf_map_create(BPF_MAP_TYPE_HASH, NULL, sizeof(key), sizeof(value),
@@ -1466,6 +1469,7 @@ static void test_map_parallel(void)
 		exit(1);
 	}
 
+again:
 	/* Use the same fd in children to add elements to this map:
 	 * child_0 adds key=0, key=1024, key=2048, ...
 	 * child_1 adds key=1, key=1025, key=2049, ...
@@ -1502,6 +1506,12 @@ static void test_map_parallel(void)
 	key = -1;
 	assert(bpf_map_get_next_key(fd, NULL, &key) < 0 && errno == ENOENT);
 	assert(bpf_map_get_next_key(fd, &key, &key) < 0 && errno == ENOENT);
+
+	key = 0;
+	bpf_map_delete_elem(fd, &key);
+	if (j++ < 5)
+		goto again;
+	close(fd);
 }
 
 static void test_map_rdonly(void)
-- 
2.30.2



* [PATCH v4 bpf-next 04/15] samples/bpf: Reduce syscall overhead in map_perf_test.
  2022-08-26  2:44 [PATCH v4 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (2 preceding siblings ...)
  2022-08-26  2:44 ` [PATCH v4 bpf-next 03/15] selftests/bpf: Improve test coverage of test_maps Alexei Starovoitov
@ 2022-08-26  2:44 ` Alexei Starovoitov
  2022-08-26  2:44 ` [PATCH v4 bpf-next 05/15] bpf: Relax the requirement to use preallocated hash maps in tracing progs Alexei Starovoitov
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2022-08-26  2:44 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Make map_perf_test for preallocated and non-preallocated hash map
spend more time inside the bpf program to focus the performance analysis
on the speed of update/lookup/delete operations performed by the bpf program.

It makes 'perf report' of bpf_mem_alloc look like:
 11.76%  map_perf_test    [k] _raw_spin_lock_irqsave
 11.26%  map_perf_test    [k] htab_map_update_elem
  9.70%  map_perf_test    [k] _raw_spin_lock
  9.47%  map_perf_test    [k] htab_map_delete_elem
  8.57%  map_perf_test    [k] memcpy_erms
  5.58%  map_perf_test    [k] alloc_htab_elem
  4.09%  map_perf_test    [k] __htab_map_lookup_elem
  3.44%  map_perf_test    [k] syscall_exit_to_user_mode
  3.13%  map_perf_test    [k] lookup_nulls_elem_raw
  3.05%  map_perf_test    [k] migrate_enable
  3.04%  map_perf_test    [k] memcmp
  2.67%  map_perf_test    [k] unit_free
  2.39%  map_perf_test    [k] lookup_elem_raw

Reduce default iteration count as well to make 'map_perf_test' quick enough
even on debug kernels.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 samples/bpf/map_perf_test_kern.c | 44 ++++++++++++++++++++------------
 samples/bpf/map_perf_test_user.c |  2 +-
 2 files changed, 29 insertions(+), 17 deletions(-)

diff --git a/samples/bpf/map_perf_test_kern.c b/samples/bpf/map_perf_test_kern.c
index 8773f22b6a98..7342c5b2f278 100644
--- a/samples/bpf/map_perf_test_kern.c
+++ b/samples/bpf/map_perf_test_kern.c
@@ -108,11 +108,14 @@ int stress_hmap(struct pt_regs *ctx)
 	u32 key = bpf_get_current_pid_tgid();
 	long init_val = 1;
 	long *value;
+	int i;
 
-	bpf_map_update_elem(&hash_map, &key, &init_val, BPF_ANY);
-	value = bpf_map_lookup_elem(&hash_map, &key);
-	if (value)
-		bpf_map_delete_elem(&hash_map, &key);
+	for (i = 0; i < 10; i++) {
+		bpf_map_update_elem(&hash_map, &key, &init_val, BPF_ANY);
+		value = bpf_map_lookup_elem(&hash_map, &key);
+		if (value)
+			bpf_map_delete_elem(&hash_map, &key);
+	}
 
 	return 0;
 }
@@ -123,11 +126,14 @@ int stress_percpu_hmap(struct pt_regs *ctx)
 	u32 key = bpf_get_current_pid_tgid();
 	long init_val = 1;
 	long *value;
+	int i;
 
-	bpf_map_update_elem(&percpu_hash_map, &key, &init_val, BPF_ANY);
-	value = bpf_map_lookup_elem(&percpu_hash_map, &key);
-	if (value)
-		bpf_map_delete_elem(&percpu_hash_map, &key);
+	for (i = 0; i < 10; i++) {
+		bpf_map_update_elem(&percpu_hash_map, &key, &init_val, BPF_ANY);
+		value = bpf_map_lookup_elem(&percpu_hash_map, &key);
+		if (value)
+			bpf_map_delete_elem(&percpu_hash_map, &key);
+	}
 	return 0;
 }
 
@@ -137,11 +143,14 @@ int stress_hmap_alloc(struct pt_regs *ctx)
 	u32 key = bpf_get_current_pid_tgid();
 	long init_val = 1;
 	long *value;
+	int i;
 
-	bpf_map_update_elem(&hash_map_alloc, &key, &init_val, BPF_ANY);
-	value = bpf_map_lookup_elem(&hash_map_alloc, &key);
-	if (value)
-		bpf_map_delete_elem(&hash_map_alloc, &key);
+	for (i = 0; i < 10; i++) {
+		bpf_map_update_elem(&hash_map_alloc, &key, &init_val, BPF_ANY);
+		value = bpf_map_lookup_elem(&hash_map_alloc, &key);
+		if (value)
+			bpf_map_delete_elem(&hash_map_alloc, &key);
+	}
 	return 0;
 }
 
@@ -151,11 +160,14 @@ int stress_percpu_hmap_alloc(struct pt_regs *ctx)
 	u32 key = bpf_get_current_pid_tgid();
 	long init_val = 1;
 	long *value;
+	int i;
 
-	bpf_map_update_elem(&percpu_hash_map_alloc, &key, &init_val, BPF_ANY);
-	value = bpf_map_lookup_elem(&percpu_hash_map_alloc, &key);
-	if (value)
-		bpf_map_delete_elem(&percpu_hash_map_alloc, &key);
+	for (i = 0; i < 10; i++) {
+		bpf_map_update_elem(&percpu_hash_map_alloc, &key, &init_val, BPF_ANY);
+		value = bpf_map_lookup_elem(&percpu_hash_map_alloc, &key);
+		if (value)
+			bpf_map_delete_elem(&percpu_hash_map_alloc, &key);
+	}
 	return 0;
 }
 
diff --git a/samples/bpf/map_perf_test_user.c b/samples/bpf/map_perf_test_user.c
index b6fc174ab1f2..1bb53f4b29e1 100644
--- a/samples/bpf/map_perf_test_user.c
+++ b/samples/bpf/map_perf_test_user.c
@@ -72,7 +72,7 @@ static int test_flags = ~0;
 static uint32_t num_map_entries;
 static uint32_t inner_lru_hash_size;
 static int lru_hash_lookup_test_entries = 32;
-static uint32_t max_cnt = 1000000;
+static uint32_t max_cnt = 10000;
 
 static int check_test_flags(enum test_type t)
 {
-- 
2.30.2



* [PATCH v4 bpf-next 05/15] bpf: Relax the requirement to use preallocated hash maps in tracing progs.
  2022-08-26  2:44 [PATCH v4 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (3 preceding siblings ...)
  2022-08-26  2:44 ` [PATCH v4 bpf-next 04/15] samples/bpf: Reduce syscall overhead in map_perf_test Alexei Starovoitov
@ 2022-08-26  2:44 ` Alexei Starovoitov
  2022-08-26  2:44 ` [PATCH v4 bpf-next 06/15] bpf: Optimize element count in non-preallocated hash map Alexei Starovoitov
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2022-08-26  2:44 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Since bpf hash map was converted to use bpf_mem_alloc it is safe to use
from tracing programs and in RT kernels.
But per-cpu hash map is still using dynamic allocation for per-cpu map
values, hence keep the warning for this map type.
In the future alloc_percpu_gfp can be front-end-ed with bpf_mem_cache
and this restriction will be completely lifted.
perf_event (NMI) bpf programs have to use preallocated hash maps,
because free_htab_elem() is using call_rcu which might crash if re-entered.

Sleepable bpf programs have to use preallocated hash maps, because the
lifetime of the map elements is not protected by rcu_read_lock/unlock.
This restriction can be lifted in the future as well.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/verifier.c | 31 ++++++++++++++++++++++---------
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 0194a36d0b36..3dce3166855f 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -12629,10 +12629,12 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
 	 * For programs attached to PERF events this is mandatory as the
 	 * perf NMI can hit any arbitrary code sequence.
 	 *
-	 * All other trace types using preallocated hash maps are unsafe as
-	 * well because tracepoint or kprobes can be inside locked regions
-	 * of the memory allocator or at a place where a recursion into the
-	 * memory allocator would see inconsistent state.
+	 * All other trace types using non-preallocated per-cpu hash maps are
+	 * unsafe as well because tracepoint or kprobes can be inside locked
+	 * regions of the per-cpu memory allocator or at a place where a
+	 * recursion into the per-cpu memory allocator would see inconsistent
+	 * state. Non per-cpu hash maps are using bpf_mem_alloc-tor which is
+	 * safe to use from kprobe/fentry and in RT.
 	 *
 	 * On RT enabled kernels run-time allocation of all trace type
 	 * programs is strictly prohibited due to lock type constraints. On
@@ -12642,15 +12644,26 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
 	 */
 	if (is_tracing_prog_type(prog_type) && !is_preallocated_map(map)) {
 		if (prog_type == BPF_PROG_TYPE_PERF_EVENT) {
+			/* perf_event bpf progs have to use preallocated hash maps
+			 * because non-prealloc is still relying on call_rcu to free
+			 * elements.
+			 */
 			verbose(env, "perf_event programs can only use preallocated hash map\n");
 			return -EINVAL;
 		}
-		if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
-			verbose(env, "trace type programs can only use preallocated hash map\n");
-			return -EINVAL;
+		if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
+		    (map->inner_map_meta &&
+		     map->inner_map_meta->map_type == BPF_MAP_TYPE_PERCPU_HASH)) {
+			if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
+				verbose(env,
+					"trace type programs can only use preallocated per-cpu hash map\n");
+				return -EINVAL;
+			}
+			WARN_ONCE(1, "trace type BPF program uses run-time allocation\n");
+			verbose(env,
+				"trace type programs with run-time allocated per-cpu hash maps are unsafe."
+				" Switch to preallocated hash maps.\n");
 		}
-		WARN_ONCE(1, "trace type BPF program uses run-time allocation\n");
-		verbose(env, "trace type programs with run-time allocated hash maps are unsafe. Switch to preallocated hash maps.\n");
 	}
 
 	if (map_value_has_spin_lock(map)) {
-- 
2.30.2



* [PATCH v4 bpf-next 06/15] bpf: Optimize element count in non-preallocated hash map.
  2022-08-26  2:44 [PATCH v4 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (4 preceding siblings ...)
  2022-08-26  2:44 ` [PATCH v4 bpf-next 05/15] bpf: Relax the requirement to use preallocated hash maps in tracing progs Alexei Starovoitov
@ 2022-08-26  2:44 ` Alexei Starovoitov
  2022-08-29 21:47   ` Daniel Borkmann
  2022-08-26  2:44 ` [PATCH v4 bpf-next 07/15] bpf: Optimize call_rcu " Alexei Starovoitov
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2022-08-26  2:44 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

The atomic_inc/dec might cause extreme cache line bouncing when multiple cpus
access the same bpf map. Based on specified max_entries for the hash map
calculate when percpu_counter becomes faster than atomic_t and use it for such
maps. For example samples/bpf/map_perf_test is using hash map with max_entries
1000. On a system with 16 cpus the 'map_perf_test 4' shows 14k events per
second using atomic_t. On a system with 15 cpus it shows 100k events per second
using percpu. map_perf_test is an extreme case where all cpus collide on the
same atomic_t, which causes extreme cache line bouncing. Note that the slow
path of percpu_counter is 5k events per second vs 14k for atomic, so the
heuristic is necessary. See the comment in the code for why the heuristic is
based on num_online_cpus().
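
To make the heuristic concrete, with PERCPU_COUNTER_BATCH = 32 as defined in
this patch and the map_perf_test map of max_entries = 1000:

	use_percpu_counter is set when  max_entries / 2 > num_online_cpus() * 32

	16 cpus: 1000 / 2 = 500 <= 16 * 32 = 512  -> atomic_t        (14k events/sec)
	15 cpus: 1000 / 2 = 500 >  15 * 32 = 480  -> percpu_counter  (100k events/sec)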

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/hashtab.c | 70 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 62 insertions(+), 8 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index bd23c8830d49..8f68c6e13339 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -101,7 +101,12 @@ struct bpf_htab {
 		struct bpf_lru lru;
 	};
 	struct htab_elem *__percpu *extra_elems;
-	atomic_t count;	/* number of elements in this hashtable */
+	/* number of elements in non-preallocated hashtable are kept
+	 * in either pcount or count
+	 */
+	struct percpu_counter pcount;
+	atomic_t count;
+	bool use_percpu_counter;
 	u32 n_buckets;	/* number of hash buckets */
 	u32 elem_size;	/* size of each element in bytes */
 	u32 hashrnd;
@@ -552,6 +557,29 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 
 	htab_init_buckets(htab);
 
+/* compute_batch_value() computes batch value as num_online_cpus() * 2
+ * and __percpu_counter_compare() needs
+ * htab->max_entries - cur_number_of_elems to be more than batch * num_online_cpus()
+ * for percpu_counter to be faster than atomic_t. In practice the average bpf
+ * hash map size is 10k, which means that a system with 64 cpus will fill
+ * hashmap to 20% of 10k before percpu_counter becomes ineffective. Therefore
+ * define our own batch count as 32 then 10k hash map can be filled up to 80%:
+ * 10k - 8k > 32 _batch_ * 64 _cpus_
+ * and __percpu_counter_compare() will still be fast. At that point hash map
+ * collisions will dominate its performance anyway. Assume that hash map filled
+ * to 50+% isn't going to be O(1) and use the following formula to choose
+ * between percpu_counter and atomic_t.
+ */
+#define PERCPU_COUNTER_BATCH 32
+	if (attr->max_entries / 2 > num_online_cpus() * PERCPU_COUNTER_BATCH)
+		htab->use_percpu_counter = true;
+
+	if (htab->use_percpu_counter) {
+		err = percpu_counter_init(&htab->pcount, 0, GFP_KERNEL);
+		if (err)
+			goto free_map_locked;
+	}
+
 	if (prealloc) {
 		err = prealloc_init(htab);
 		if (err)
@@ -878,6 +906,31 @@ static void htab_put_fd_value(struct bpf_htab *htab, struct htab_elem *l)
 	}
 }
 
+static bool is_map_full(struct bpf_htab *htab)
+{
+	if (htab->use_percpu_counter)
+		return __percpu_counter_compare(&htab->pcount, htab->map.max_entries,
+						PERCPU_COUNTER_BATCH) >= 0;
+	return atomic_read(&htab->count) >= htab->map.max_entries;
+}
+
+static void inc_elem_count(struct bpf_htab *htab)
+{
+	if (htab->use_percpu_counter)
+		percpu_counter_add_batch(&htab->pcount, 1, PERCPU_COUNTER_BATCH);
+	else
+		atomic_inc(&htab->count);
+}
+
+static void dec_elem_count(struct bpf_htab *htab)
+{
+	if (htab->use_percpu_counter)
+		percpu_counter_add_batch(&htab->pcount, -1, PERCPU_COUNTER_BATCH);
+	else
+		atomic_dec(&htab->count);
+}
+
+
 static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
 {
 	htab_put_fd_value(htab, l);
@@ -886,7 +939,7 @@ static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
 		check_and_free_fields(htab, l);
 		__pcpu_freelist_push(&htab->freelist, &l->fnode);
 	} else {
-		atomic_dec(&htab->count);
+		dec_elem_count(htab);
 		l->htab = htab;
 		call_rcu(&l->rcu, htab_elem_free_rcu);
 	}
@@ -970,16 +1023,15 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 			l_new = container_of(l, struct htab_elem, fnode);
 		}
 	} else {
-		if (atomic_inc_return(&htab->count) > htab->map.max_entries)
-			if (!old_elem) {
+		if (is_map_full(htab))
+			if (!old_elem)
 				/* when map is full and update() is replacing
 				 * old element, it's ok to allocate, since
 				 * old element will be freed immediately.
 				 * Otherwise return an error
 				 */
-				l_new = ERR_PTR(-E2BIG);
-				goto dec_count;
-			}
+				return ERR_PTR(-E2BIG);
+		inc_elem_count(htab);
 		l_new = bpf_mem_cache_alloc(&htab->ma);
 		if (!l_new) {
 			l_new = ERR_PTR(-ENOMEM);
@@ -1021,7 +1073,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 	l_new->hash = hash;
 	return l_new;
 dec_count:
-	atomic_dec(&htab->count);
+	dec_elem_count(htab);
 	return l_new;
 }
 
@@ -1495,6 +1547,8 @@ static void htab_map_free(struct bpf_map *map)
 	free_percpu(htab->extra_elems);
 	bpf_map_area_free(htab->buckets);
 	bpf_mem_alloc_destroy(&htab->ma);
+	if (htab->use_percpu_counter)
+		percpu_counter_destroy(&htab->pcount);
 	for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
 		free_percpu(htab->map_locked[i]);
 	lockdep_unregister_key(&htab->lockdep_key);
-- 
2.30.2



* [PATCH v4 bpf-next 07/15] bpf: Optimize call_rcu in non-preallocated hash map.
  2022-08-26  2:44 [PATCH v4 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (5 preceding siblings ...)
  2022-08-26  2:44 ` [PATCH v4 bpf-next 06/15] bpf: Optimize element count in non-preallocated hash map Alexei Starovoitov
@ 2022-08-26  2:44 ` Alexei Starovoitov
  2022-08-26  2:44 ` [PATCH v4 bpf-next 08/15] bpf: Adjust low/high watermarks in bpf_mem_cache Alexei Starovoitov
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2022-08-26  2:44 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Doing call_rcu() a million times a second becomes a bottleneck.
Convert non-preallocated hash map from call_rcu to SLAB_TYPESAFE_BY_RCU.
The rcu critical section is no longer observed for one htab element,
which makes the non-preallocated hash map behave just like the preallocated one.
The map elements are released back to kernel memory after observing
rcu critical section.
This improves 'map_perf_test 4' performance from 100k events per second
to 250k events per second.

bpf_mem_alloc + percpu_counter + typesafe_by_rcu provide a 10x performance
boost to the non-preallocated hash map and bring it within a few % of the
preallocated map while consuming a fraction of the memory.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/hashtab.c                      |  8 ++++++--
 kernel/bpf/memalloc.c                     |  2 +-
 tools/testing/selftests/bpf/progs/timer.c | 11 -----------
 3 files changed, 7 insertions(+), 14 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 8f68c6e13339..299ab98f9811 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -940,8 +940,12 @@ static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
 		__pcpu_freelist_push(&htab->freelist, &l->fnode);
 	} else {
 		dec_elem_count(htab);
-		l->htab = htab;
-		call_rcu(&l->rcu, htab_elem_free_rcu);
+		if (htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH) {
+			l->htab = htab;
+			call_rcu(&l->rcu, htab_elem_free_rcu);
+		} else {
+			htab_elem_free(htab, l);
+		}
 	}
 }
 
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 29f340016e9e..c1817f14c25a 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -277,7 +277,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size)
 			return -ENOMEM;
 		size += LLIST_NODE_SZ; /* room for llist_node */
 		snprintf(buf, sizeof(buf), "bpf-%u", size);
-		kmem_cache = kmem_cache_create(buf, size, 8, 0, NULL);
+		kmem_cache = kmem_cache_create(buf, size, 8, SLAB_TYPESAFE_BY_RCU, NULL);
 		if (!kmem_cache) {
 			free_percpu(pc);
 			return -ENOMEM;
diff --git a/tools/testing/selftests/bpf/progs/timer.c b/tools/testing/selftests/bpf/progs/timer.c
index 5f5309791649..0053c5402173 100644
--- a/tools/testing/selftests/bpf/progs/timer.c
+++ b/tools/testing/selftests/bpf/progs/timer.c
@@ -208,17 +208,6 @@ static int timer_cb2(void *map, int *key, struct hmap_elem *val)
 		 */
 		bpf_map_delete_elem(map, key);
 
-		/* in non-preallocated hashmap both 'key' and 'val' are RCU
-		 * protected and still valid though this element was deleted
-		 * from the map. Arm this timer for ~35 seconds. When callback
-		 * finishes the call_rcu will invoke:
-		 * htab_elem_free_rcu
-		 *   check_and_free_timer
-		 *     bpf_timer_cancel_and_free
-		 * to cancel this 35 second sleep and delete the timer for real.
-		 */
-		if (bpf_timer_start(&val->timer, 1ull << 35, 0) != 0)
-			err |= 256;
 		ok |= 4;
 	}
 	return 0;
-- 
2.30.2



* [PATCH v4 bpf-next 08/15] bpf: Adjust low/high watermarks in bpf_mem_cache
  2022-08-26  2:44 [PATCH v4 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (6 preceding siblings ...)
  2022-08-26  2:44 ` [PATCH v4 bpf-next 07/15] bpf: Optimize call_rcu " Alexei Starovoitov
@ 2022-08-26  2:44 ` Alexei Starovoitov
  2022-08-26  2:44 ` [PATCH v4 bpf-next 09/15] bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU Alexei Starovoitov
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2022-08-26  2:44 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

The same low/high watermarks for every bucket in bpf_mem_cache consume a
significant amount of memory. Preallocating 64 elements of 4096 bytes each in
the free list is not efficient. Make low/high watermarks and batching value
dependent on element size. This change brings significant memory savings.
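
For illustration, the per-size formula introduced below
(batch = (high - low) / 4 * 3, clamped to at least 1) works out to:

	unit_size <= 256:  low = 32, high = 96, batch = 48  (same as the old constants)
	unit_size == 1024: low =  8, high = 24, batch = 12
	unit_size == 4096: low =  2, high =  6, batch =  3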

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/memalloc.c | 50 +++++++++++++++++++++++++++++++------------
 1 file changed, 36 insertions(+), 14 deletions(-)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index c1817f14c25a..775c38132c4d 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -100,6 +100,7 @@ struct bpf_mem_cache {
 	int unit_size;
 	/* count of objects in free_llist */
 	int free_cnt;
+	int low_watermark, high_watermark, batch;
 };
 
 struct bpf_mem_caches {
@@ -118,14 +119,6 @@ static struct llist_node notrace *__llist_del_first(struct llist_head *head)
 	return entry;
 }
 
-#define BATCH 48
-#define LOW_WATERMARK 32
-#define HIGH_WATERMARK 96
-/* Assuming the average number of elements per bucket is 64, when all buckets
- * are used the total memory will be: 64*16*32 + 64*32*32 + 64*64*32 + ... +
- * 64*4096*32 ~ 20Mbyte
- */
-
 static void *__alloc(struct bpf_mem_cache *c, int node)
 {
 	/* Allocate, but don't deplete atomic reserves that typical
@@ -216,7 +209,7 @@ static void free_bulk(struct bpf_mem_cache *c)
 		if (IS_ENABLED(CONFIG_PREEMPT_RT))
 			local_irq_restore(flags);
 		free_one(c, llnode);
-	} while (cnt > (HIGH_WATERMARK + LOW_WATERMARK) / 2);
+	} while (cnt > (c->high_watermark + c->low_watermark) / 2);
 
 	/* and drain free_llist_extra */
 	llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra))
@@ -230,12 +223,12 @@ static void bpf_mem_refill(struct irq_work *work)
 
 	/* Racy access to free_cnt. It doesn't need to be 100% accurate */
 	cnt = c->free_cnt;
-	if (cnt < LOW_WATERMARK)
+	if (cnt < c->low_watermark)
 		/* irq_work runs on this cpu and kmalloc will allocate
 		 * from the current numa node which is what we want here.
 		 */
-		alloc_bulk(c, BATCH, NUMA_NO_NODE);
-	else if (cnt > HIGH_WATERMARK)
+		alloc_bulk(c, c->batch, NUMA_NO_NODE);
+	else if (cnt > c->high_watermark)
 		free_bulk(c);
 }
 
@@ -244,9 +237,38 @@ static void notrace irq_work_raise(struct bpf_mem_cache *c)
 	irq_work_queue(&c->refill_work);
 }
 
+/* For typical bpf map case that uses bpf_mem_cache_alloc and single bucket
+ * the freelist cache will be elem_size * 64 (or less) on each cpu.
+ *
+ * For bpf programs that don't have statically known allocation sizes and
+ * assuming (low_mark + high_mark) / 2 as an average number of elements per
+ * bucket and all buckets are used the total amount of memory in freelists
+ * on each cpu will be:
+ * 64*16 + 64*32 + 64*64 + 64*96 + 64*128 + 64*196 + 64*256 + 32*512 + 16*1024 + 8*2048 + 4*4096
+ * == ~ 116 Kbyte using below heuristic.
+ * Initialized, but unused bpf allocator (not bpf map specific one) will
+ * consume ~ 11 Kbyte per cpu.
+ * Typical case will be between 11K and 116K closer to 11K.
+ * bpf progs can and should share bpf_mem_cache when possible.
+ */
+
 static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu)
 {
 	init_irq_work(&c->refill_work, bpf_mem_refill);
+	if (c->unit_size <= 256) {
+		c->low_watermark = 32;
+		c->high_watermark = 96;
+	} else {
+		/* When page_size == 4k, order-0 cache will have low_mark == 2
+		 * and high_mark == 6 with batch alloc of 3 individual pages at
+		 * a time.
+		 * 8k allocs and above low == 1, high == 3, batch == 1.
+		 */
+		c->low_watermark = max(32 * 256 / c->unit_size, 1);
+		c->high_watermark = max(96 * 256 / c->unit_size, 3);
+	}
+	c->batch = max((c->high_watermark - c->low_watermark) / 4 * 3, 1);
+
 	/* To avoid consuming memory assume that 1st run of bpf
 	 * prog won't be doing more than 4 map_update_elem from
 	 * irq disabled region
@@ -388,7 +410,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
 
 	WARN_ON(cnt < 0);
 
-	if (cnt < LOW_WATERMARK)
+	if (cnt < c->low_watermark)
 		irq_work_raise(c);
 	return llnode;
 }
@@ -421,7 +443,7 @@ static void notrace unit_free(struct bpf_mem_cache *c, void *ptr)
 	local_dec(&c->active);
 	local_irq_restore(flags);
 
-	if (cnt > HIGH_WATERMARK)
+	if (cnt > c->high_watermark)
 		/* free few objects from current cpu into global kmalloc pool */
 		irq_work_raise(c);
 }
-- 
2.30.2



* [PATCH v4 bpf-next 09/15] bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
  2022-08-26  2:44 [PATCH v4 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (7 preceding siblings ...)
  2022-08-26  2:44 ` [PATCH v4 bpf-next 08/15] bpf: Adjust low/high watermarks in bpf_mem_cache Alexei Starovoitov
@ 2022-08-26  2:44 ` Alexei Starovoitov
  2022-08-26  2:44 ` [PATCH v4 bpf-next 10/15] bpf: Add percpu allocation support to bpf_mem_alloc Alexei Starovoitov
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2022-08-26  2:44 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

SLAB_TYPESAFE_BY_RCU makes kmem_caches non-mergeable and slows down
kmem_cache_destroy. All bpf_mem_caches are safe to share across different maps
and programs. Convert SLAB_TYPESAFE_BY_RCU to batched call_rcu. This change
solves the memory consumption issue, avoids kmem_cache_destroy latency and
keeps bpf hash map performance the same.
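
In terms of the helpers added below the free path now looks roughly like:

	unit_free()             /* push element onto the per-cpu free_llist */
	  -> free_bulk()        /* from irq_work once free_cnt exceeds high_watermark */
	       enque_to_free()  /* move elements onto free_by_rcu */
	       do_call_rcu()    /* splice free_by_rcu into waiting_for_gp and
	                         * do one call_rcu(__free_rcu) per batch */
	  -> __free_rcu()       /* after a grace period: kfree/kmem_cache_free */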

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/memalloc.c | 64 +++++++++++++++++++++++++++++++++++++++++--
 kernel/bpf/syscall.c  |  5 +++-
 2 files changed, 65 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 775c38132c4d..6a252d495f6c 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -101,6 +101,11 @@ struct bpf_mem_cache {
 	/* count of objects in free_llist */
 	int free_cnt;
 	int low_watermark, high_watermark, batch;
+
+	struct rcu_head rcu;
+	struct llist_head free_by_rcu;
+	struct llist_head waiting_for_gp;
+	atomic_t call_rcu_in_progress;
 };
 
 struct bpf_mem_caches {
@@ -189,6 +194,45 @@ static void free_one(struct bpf_mem_cache *c, void *obj)
 		kfree(obj);
 }
 
+static void __free_rcu(struct rcu_head *head)
+{
+	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
+	struct llist_node *llnode = llist_del_all(&c->waiting_for_gp);
+	struct llist_node *pos, *t;
+
+	llist_for_each_safe(pos, t, llnode)
+		free_one(c, pos);
+	atomic_set(&c->call_rcu_in_progress, 0);
+}
+
+static void enque_to_free(struct bpf_mem_cache *c, void *obj)
+{
+	struct llist_node *llnode = obj;
+
+	/* bpf_mem_cache is a per-cpu object. Freeing happens in irq_work.
+	 * Nothing races to add to free_by_rcu list.
+	 */
+	__llist_add(llnode, &c->free_by_rcu);
+}
+
+static void do_call_rcu(struct bpf_mem_cache *c)
+{
+	struct llist_node *llnode, *t;
+
+	if (atomic_xchg(&c->call_rcu_in_progress, 1))
+		return;
+
+	WARN_ON_ONCE(!llist_empty(&c->waiting_for_gp));
+	llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu))
+		/* There is no concurrent __llist_add(waiting_for_gp) access.
+		 * It doesn't race with llist_del_all either.
+		 * But there could be two concurrent llist_del_all(waiting_for_gp):
+		 * from __free_rcu() and from drain_mem_cache().
+		 */
+		__llist_add(llnode, &c->waiting_for_gp);
+	call_rcu(&c->rcu, __free_rcu);
+}
+
 static void free_bulk(struct bpf_mem_cache *c)
 {
 	struct llist_node *llnode, *t;
@@ -208,12 +252,13 @@ static void free_bulk(struct bpf_mem_cache *c)
 		local_dec(&c->active);
 		if (IS_ENABLED(CONFIG_PREEMPT_RT))
 			local_irq_restore(flags);
-		free_one(c, llnode);
+		enque_to_free(c, llnode);
 	} while (cnt > (c->high_watermark + c->low_watermark) / 2);
 
 	/* and drain free_llist_extra */
 	llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra))
-		free_one(c, llnode);
+		enque_to_free(c, llnode);
+	do_call_rcu(c);
 }
 
 static void bpf_mem_refill(struct irq_work *work)
@@ -299,7 +344,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size)
 			return -ENOMEM;
 		size += LLIST_NODE_SZ; /* room for llist_node */
 		snprintf(buf, sizeof(buf), "bpf-%u", size);
-		kmem_cache = kmem_cache_create(buf, size, 8, SLAB_TYPESAFE_BY_RCU, NULL);
+		kmem_cache = kmem_cache_create(buf, size, 8, 0, NULL);
 		if (!kmem_cache) {
 			free_percpu(pc);
 			return -ENOMEM;
@@ -341,6 +386,15 @@ static void drain_mem_cache(struct bpf_mem_cache *c)
 {
 	struct llist_node *llnode, *t;
 
+	/* The caller has done rcu_barrier() and no progs are using this
+	 * bpf_mem_cache, but htab_map_free() called bpf_mem_cache_free() for
+	 * all remaining elements and they can be in free_by_rcu or in
+	 * waiting_for_gp lists, so drain those lists now.
+	 */
+	llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu))
+		free_one(c, llnode);
+	llist_for_each_safe(llnode, t, llist_del_all(&c->waiting_for_gp))
+		free_one(c, llnode);
 	llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist))
 		free_one(c, llnode);
 	llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra))
@@ -362,6 +416,10 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
 		kmem_cache_destroy(c->kmem_cache);
 		if (c->objcg)
 			obj_cgroup_put(c->objcg);
+		/* c->waiting_for_gp list was drained, but __free_rcu might
+		 * still execute. Wait for it now before we free 'c'.
+		 */
+		rcu_barrier();
 		free_percpu(ma->cache);
 		ma->cache = NULL;
 	}
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 4e9d4622aef7..074c901fbb4e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -638,7 +638,10 @@ static void __bpf_map_put(struct bpf_map *map, bool do_idr_lock)
 		bpf_map_free_id(map, do_idr_lock);
 		btf_put(map->btf);
 		INIT_WORK(&map->work, bpf_map_free_deferred);
-		schedule_work(&map->work);
+		/* Avoid spawning kworkers, since they all might contend
+		 * for the same mutex like slab_mutex.
+		 */
+		queue_work(system_unbound_wq, &map->work);
 	}
 }
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v4 bpf-next 10/15] bpf: Add percpu allocation support to bpf_mem_alloc.
  2022-08-26  2:44 [PATCH v4 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (8 preceding siblings ...)
  2022-08-26  2:44 ` [PATCH v4 bpf-next 09/15] bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU Alexei Starovoitov
@ 2022-08-26  2:44 ` Alexei Starovoitov
  2022-08-26  2:44 ` [PATCH v4 bpf-next 11/15] bpf: Convert percpu hash map to per-cpu bpf_mem_alloc Alexei Starovoitov
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2022-08-26  2:44 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Extend bpf_mem_alloc to cache a free list of fixed size per-cpu allocations.
Once such a cache is created bpf_mem_cache_alloc() will return per-cpu objects.
bpf_mem_cache_free() will free them back into the global per-cpu pool after
observing an RCU grace period.
The per-cpu flavor of bpf_mem_alloc is going to be used by per-cpu hash maps.

The free list cache consists of tuples { llist_node, per-cpu pointer }.
Unlike alloc_percpu(), which returns a per-cpu pointer,
bpf_mem_cache_alloc() returns a pointer to that per-cpu pointer and
bpf_mem_cache_free() expects to receive it back.
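
A minimal sketch of how a caller is expected to handle that pointer. This is
illustration only: it paraphrases the hash map usage added in a follow-up
patch, where htab->pcpu_ma is the per-cpu allocator instance.

  void *ptr_to_pptr = bpf_mem_cache_alloc(&htab->pcpu_ma);
  void __percpu *pptr;

  if (!ptr_to_pptr)
          return -ENOMEM;
  /* the cached object is { llist_node, per-cpu pointer }; the returned
   * address points at the per-cpu pointer slot
   */
  pptr = *(void __percpu **)ptr_to_pptr;
  /* ... access per-cpu data via per_cpu_ptr(pptr, cpu) ... */
  bpf_mem_cache_free(&htab->pcpu_ma, ptr_to_pptr); /* same pointer back */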

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf_mem_alloc.h |  2 +-
 kernel/bpf/hashtab.c          |  2 +-
 kernel/bpf/memalloc.c         | 44 +++++++++++++++++++++++++++++++----
 3 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
index 804733070f8d..653ed1584a03 100644
--- a/include/linux/bpf_mem_alloc.h
+++ b/include/linux/bpf_mem_alloc.h
@@ -12,7 +12,7 @@ struct bpf_mem_alloc {
 	struct bpf_mem_cache __percpu *cache;
 };
 
-int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size);
+int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu);
 void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma);
 
 /* kmalloc/kfree equivalent: */
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 299ab98f9811..8daa1132d43c 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -594,7 +594,7 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 				goto free_prealloc;
 		}
 	} else {
-		err = bpf_mem_alloc_init(&htab->ma, htab->elem_size);
+		err = bpf_mem_alloc_init(&htab->ma, htab->elem_size, false);
 		if (err)
 			goto free_map_locked;
 	}
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 6a252d495f6c..54455a64699b 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -101,6 +101,7 @@ struct bpf_mem_cache {
 	/* count of objects in free_llist */
 	int free_cnt;
 	int low_watermark, high_watermark, batch;
+	bool percpu;
 
 	struct rcu_head rcu;
 	struct llist_head free_by_rcu;
@@ -133,6 +134,19 @@ static void *__alloc(struct bpf_mem_cache *c, int node)
 	 */
 	gfp_t flags = GFP_NOWAIT | __GFP_NOWARN | __GFP_ACCOUNT;
 
+	if (c->percpu) {
+		void **obj = kmem_cache_alloc_node(c->kmem_cache, flags, node);
+		void *pptr = __alloc_percpu_gfp(c->unit_size, 8, flags);
+
+		if (!obj || !pptr) {
+			free_percpu(pptr);
+			kfree(obj);
+			return NULL;
+		}
+		obj[1] = pptr;
+		return obj;
+	}
+
 	if (c->kmem_cache)
 		return kmem_cache_alloc_node(c->kmem_cache, flags, node);
 
@@ -188,6 +202,12 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
 
 static void free_one(struct bpf_mem_cache *c, void *obj)
 {
+	if (c->percpu) {
+		free_percpu(((void **)obj)[1]);
+		kmem_cache_free(c->kmem_cache, obj);
+		return;
+	}
+
 	if (c->kmem_cache)
 		kmem_cache_free(c->kmem_cache, obj);
 	else
@@ -328,21 +348,30 @@ static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu)
  * kmalloc/kfree. Max allocation size is 4096 in this case.
  * This is bpf_dynptr and bpf_kptr use case.
  */
-int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size)
+int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
 {
 	static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
 	struct bpf_mem_caches *cc, __percpu *pcc;
 	struct bpf_mem_cache *c, __percpu *pc;
-	struct kmem_cache *kmem_cache;
+	struct kmem_cache *kmem_cache = NULL;
 	struct obj_cgroup *objcg = NULL;
 	char buf[32];
-	int cpu, i;
+	int cpu, i, unit_size;
 
 	if (size) {
 		pc = __alloc_percpu_gfp(sizeof(*pc), 8, GFP_KERNEL);
 		if (!pc)
 			return -ENOMEM;
-		size += LLIST_NODE_SZ; /* room for llist_node */
+
+		if (percpu) {
+			unit_size = size;
+			/* room for llist_node and per-cpu pointer */
+			size = LLIST_NODE_SZ + sizeof(void *);
+		} else {
+			size += LLIST_NODE_SZ; /* room for llist_node */
+			unit_size = size;
+		}
+
 		snprintf(buf, sizeof(buf), "bpf-%u", size);
 		kmem_cache = kmem_cache_create(buf, size, 8, 0, NULL);
 		if (!kmem_cache) {
@@ -355,14 +384,19 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size)
 		for_each_possible_cpu(cpu) {
 			c = per_cpu_ptr(pc, cpu);
 			c->kmem_cache = kmem_cache;
-			c->unit_size = size;
+			c->unit_size = unit_size;
 			c->objcg = objcg;
+			c->percpu = percpu;
 			prefill_mem_cache(c, cpu);
 		}
 		ma->cache = pc;
 		return 0;
 	}
 
+	/* size == 0 && percpu is an invalid combination */
+	if (WARN_ON_ONCE(percpu))
+		return -EINVAL;
+
 	pcc = __alloc_percpu_gfp(sizeof(*cc), 8, GFP_KERNEL);
 	if (!pcc)
 		return -ENOMEM;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v4 bpf-next 11/15] bpf: Convert percpu hash map to per-cpu bpf_mem_alloc.
  2022-08-26  2:44 [PATCH v4 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (9 preceding siblings ...)
  2022-08-26  2:44 ` [PATCH v4 bpf-next 10/15] bpf: Add percpu allocation support to bpf_mem_alloc Alexei Starovoitov
@ 2022-08-26  2:44 ` Alexei Starovoitov
  2022-08-26  2:44 ` [PATCH v4 bpf-next 12/15] bpf: Remove tracing program restriction on map types Alexei Starovoitov
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2022-08-26  2:44 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Convert dynamic allocations in the percpu hash map from alloc_percpu() to
bpf_mem_cache_alloc() from a per-cpu bpf_mem_alloc. Since bpf_mem_alloc frees
objects after an RCU grace period, the call_rcu() is removed. pcpu_init_value()
now needs to zero-fill per-cpu allocations, because dynamically allocated map
elements now behave like fully preallocated ones: alloc_percpu() is no longer
called inline and the elements are reused from the freelist.
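
For reference, a sketch of what the zero-fill path amounts to. It is
reconstructed from the existing pcpu_init_value() in hashtab.c purely for
illustration; the diff below only relaxes its prealloc-only condition.

  static void pcpu_init_value(struct bpf_htab *htab, void __percpu *pptr,
                              void *value, bool onallcpus)
  {
          if (!onallcpus) {
                  u32 size = round_up(htab->map.value_size, 8);
                  int current_cpu = raw_smp_processor_id();
                  int cpu;

                  for_each_possible_cpu(cpu) {
                          if (cpu == current_cpu)
                                  bpf_long_memcpy(per_cpu_ptr(pptr, cpu),
                                                  value, size);
                          else
                                  /* zero-fill cpus other than the current one */
                                  memset(per_cpu_ptr(pptr, cpu), 0, size);
                  }
          } else {
                  pcpu_copy_value(htab, pptr, value, onallcpus);
          }
  }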

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/hashtab.c | 45 +++++++++++++++++++-------------------------
 1 file changed, 19 insertions(+), 26 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 8daa1132d43c..89f26cbddef5 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -94,6 +94,7 @@ struct bucket {
 struct bpf_htab {
 	struct bpf_map map;
 	struct bpf_mem_alloc ma;
+	struct bpf_mem_alloc pcpu_ma;
 	struct bucket *buckets;
 	void *elems;
 	union {
@@ -121,14 +122,14 @@ struct htab_elem {
 		struct {
 			void *padding;
 			union {
-				struct bpf_htab *htab;
 				struct pcpu_freelist_node fnode;
 				struct htab_elem *batch_flink;
 			};
 		};
 	};
 	union {
-		struct rcu_head rcu;
+		/* pointer to per-cpu pointer */
+		void *ptr_to_pptr;
 		struct bpf_lru_node lru_node;
 	};
 	u32 hash;
@@ -435,8 +436,6 @@ static int htab_map_alloc_check(union bpf_attr *attr)
 	bool zero_seed = (attr->map_flags & BPF_F_ZERO_SEED);
 	int numa_node = bpf_map_attr_numa_node(attr);
 
-	BUILD_BUG_ON(offsetof(struct htab_elem, htab) !=
-		     offsetof(struct htab_elem, hash_node.pprev));
 	BUILD_BUG_ON(offsetof(struct htab_elem, fnode.next) !=
 		     offsetof(struct htab_elem, hash_node.pprev));
 
@@ -597,6 +596,12 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 		err = bpf_mem_alloc_init(&htab->ma, htab->elem_size, false);
 		if (err)
 			goto free_map_locked;
+		if (percpu) {
+			err = bpf_mem_alloc_init(&htab->pcpu_ma,
+						 round_up(htab->map.value_size, 8), true);
+			if (err)
+				goto free_map_locked;
+		}
 	}
 
 	return &htab->map;
@@ -607,6 +612,7 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 	for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
 		free_percpu(htab->map_locked[i]);
 	bpf_map_area_free(htab->buckets);
+	bpf_mem_alloc_destroy(&htab->pcpu_ma);
 	bpf_mem_alloc_destroy(&htab->ma);
 free_htab:
 	lockdep_unregister_key(&htab->lockdep_key);
@@ -882,19 +888,11 @@ static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 static void htab_elem_free(struct bpf_htab *htab, struct htab_elem *l)
 {
 	if (htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH)
-		free_percpu(htab_elem_get_ptr(l, htab->map.key_size));
+		bpf_mem_cache_free(&htab->pcpu_ma, l->ptr_to_pptr);
 	check_and_free_fields(htab, l);
 	bpf_mem_cache_free(&htab->ma, l);
 }
 
-static void htab_elem_free_rcu(struct rcu_head *head)
-{
-	struct htab_elem *l = container_of(head, struct htab_elem, rcu);
-	struct bpf_htab *htab = l->htab;
-
-	htab_elem_free(htab, l);
-}
-
 static void htab_put_fd_value(struct bpf_htab *htab, struct htab_elem *l)
 {
 	struct bpf_map *map = &htab->map;
@@ -940,12 +938,7 @@ static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
 		__pcpu_freelist_push(&htab->freelist, &l->fnode);
 	} else {
 		dec_elem_count(htab);
-		if (htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH) {
-			l->htab = htab;
-			call_rcu(&l->rcu, htab_elem_free_rcu);
-		} else {
-			htab_elem_free(htab, l);
-		}
+		htab_elem_free(htab, l);
 	}
 }
 
@@ -970,13 +963,12 @@ static void pcpu_copy_value(struct bpf_htab *htab, void __percpu *pptr,
 static void pcpu_init_value(struct bpf_htab *htab, void __percpu *pptr,
 			    void *value, bool onallcpus)
 {
-	/* When using prealloc and not setting the initial value on all cpus,
-	 * zero-fill element values for other cpus (just as what happens when
-	 * not using prealloc). Otherwise, bpf program has no way to ensure
+	/* When not setting the initial value on all cpus, zero-fill element
+	 * values for other cpus. Otherwise, bpf program has no way to ensure
 	 * known initial values for cpus other than current one
 	 * (onallcpus=false always when coming from bpf prog).
 	 */
-	if (htab_is_prealloc(htab) && !onallcpus) {
+	if (!onallcpus) {
 		u32 size = round_up(htab->map.value_size, 8);
 		int current_cpu = raw_smp_processor_id();
 		int cpu;
@@ -1047,18 +1039,18 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 
 	memcpy(l_new->key, key, key_size);
 	if (percpu) {
-		size = round_up(size, 8);
 		if (prealloc) {
 			pptr = htab_elem_get_ptr(l_new, key_size);
 		} else {
 			/* alloc_percpu zero-fills */
-			pptr = bpf_map_alloc_percpu(&htab->map, size, 8,
-						    GFP_NOWAIT | __GFP_NOWARN);
+			pptr = bpf_mem_cache_alloc(&htab->pcpu_ma);
 			if (!pptr) {
 				bpf_mem_cache_free(&htab->ma, l_new);
 				l_new = ERR_PTR(-ENOMEM);
 				goto dec_count;
 			}
+			l_new->ptr_to_pptr = pptr;
+			pptr = *(void **)pptr;
 		}
 
 		pcpu_init_value(htab, pptr, value, onallcpus);
@@ -1550,6 +1542,7 @@ static void htab_map_free(struct bpf_map *map)
 	bpf_map_free_kptr_off_tab(map);
 	free_percpu(htab->extra_elems);
 	bpf_map_area_free(htab->buckets);
+	bpf_mem_alloc_destroy(&htab->pcpu_ma);
 	bpf_mem_alloc_destroy(&htab->ma);
 	if (htab->use_percpu_counter)
 		percpu_counter_destroy(&htab->pcount);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v4 bpf-next 12/15] bpf: Remove tracing program restriction on map types
  2022-08-26  2:44 [PATCH v4 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (10 preceding siblings ...)
  2022-08-26  2:44 ` [PATCH v4 bpf-next 11/15] bpf: Convert percpu hash map to per-cpu bpf_mem_alloc Alexei Starovoitov
@ 2022-08-26  2:44 ` Alexei Starovoitov
  2022-08-26  2:44 ` [PATCH v4 bpf-next 13/15] bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs Alexei Starovoitov
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2022-08-26  2:44 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

The hash map is now fully converted to bpf_mem_alloc. Its implementation no
longer allocates synchronously and no longer calls call_rcu() directly. It's
now safe to use non-preallocated hash maps in all types of tracing programs,
including BPF_PROG_TYPE_PERF_EVENT, which can run in NMI context.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/verifier.c | 42 ------------------------------------------
 1 file changed, 42 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 3dce3166855f..57ec06b1d09d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -12623,48 +12623,6 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
 
 {
 	enum bpf_prog_type prog_type = resolve_prog_type(prog);
-	/*
-	 * Validate that trace type programs use preallocated hash maps.
-	 *
-	 * For programs attached to PERF events this is mandatory as the
-	 * perf NMI can hit any arbitrary code sequence.
-	 *
-	 * All other trace types using non-preallocated per-cpu hash maps are
-	 * unsafe as well because tracepoint or kprobes can be inside locked
-	 * regions of the per-cpu memory allocator or at a place where a
-	 * recursion into the per-cpu memory allocator would see inconsistent
-	 * state. Non per-cpu hash maps are using bpf_mem_alloc-tor which is
-	 * safe to use from kprobe/fentry and in RT.
-	 *
-	 * On RT enabled kernels run-time allocation of all trace type
-	 * programs is strictly prohibited due to lock type constraints. On
-	 * !RT kernels it is allowed for backwards compatibility reasons for
-	 * now, but warnings are emitted so developers are made aware of
-	 * the unsafety and can fix their programs before this is enforced.
-	 */
-	if (is_tracing_prog_type(prog_type) && !is_preallocated_map(map)) {
-		if (prog_type == BPF_PROG_TYPE_PERF_EVENT) {
-			/* perf_event bpf progs have to use preallocated hash maps
-			 * because non-prealloc is still relying on call_rcu to free
-			 * elements.
-			 */
-			verbose(env, "perf_event programs can only use preallocated hash map\n");
-			return -EINVAL;
-		}
-		if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
-		    (map->inner_map_meta &&
-		     map->inner_map_meta->map_type == BPF_MAP_TYPE_PERCPU_HASH)) {
-			if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
-				verbose(env,
-					"trace type programs can only use preallocated per-cpu hash map\n");
-				return -EINVAL;
-			}
-			WARN_ONCE(1, "trace type BPF program uses run-time allocation\n");
-			verbose(env,
-				"trace type programs with run-time allocated per-cpu hash maps are unsafe."
-				" Switch to preallocated hash maps.\n");
-		}
-	}
 
 	if (map_value_has_spin_lock(map)) {
 		if (prog_type == BPF_PROG_TYPE_SOCKET_FILTER) {
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v4 bpf-next 13/15] bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.
  2022-08-26  2:44 [PATCH v4 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (11 preceding siblings ...)
  2022-08-26  2:44 ` [PATCH v4 bpf-next 12/15] bpf: Remove tracing program restriction on map types Alexei Starovoitov
@ 2022-08-26  2:44 ` Alexei Starovoitov
  2022-08-26  2:44 ` [PATCH v4 bpf-next 14/15] bpf: Remove prealloc-only restriction for " Alexei Starovoitov
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2022-08-26  2:44 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
Then use call_rcu() to wait for normal progs to finish
and finally do free_one() on each element when freeing objects
into the global memory pool.
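
In short, the freeing path now chains both grace period flavors, as a sketch
of the flow implemented by the diff below:

  /* enque_to_free(obj)                              obj moved onto free_by_rcu
   * do_call_rcu()                                   free_by_rcu -> waiting_for_gp
   *   call_rcu_tasks_trace(__free_rcu_tasks_trace)  wait for sleepable progs
   *     call_rcu(__free_rcu)                        wait for non-sleepable progs
   *       free_one() on each object in waiting_for_gp
   */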

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/memalloc.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 54455a64699b..9caeeaaf9bcb 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -225,6 +225,13 @@ static void __free_rcu(struct rcu_head *head)
 	atomic_set(&c->call_rcu_in_progress, 0);
 }
 
+static void __free_rcu_tasks_trace(struct rcu_head *head)
+{
+	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
+
+	call_rcu(&c->rcu, __free_rcu);
+}
+
 static void enque_to_free(struct bpf_mem_cache *c, void *obj)
 {
 	struct llist_node *llnode = obj;
@@ -250,7 +257,11 @@ static void do_call_rcu(struct bpf_mem_cache *c)
 		 * from __free_rcu() and from drain_mem_cache().
 		 */
 		__llist_add(llnode, &c->waiting_for_gp);
-	call_rcu(&c->rcu, __free_rcu);
+	/* Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
+	 * Then use call_rcu() to wait for normal progs to finish
+	 * and finally do free_one() on each element.
+	 */
+	call_rcu_tasks_trace(&c->rcu, __free_rcu_tasks_trace);
 }
 
 static void free_bulk(struct bpf_mem_cache *c)
@@ -453,6 +464,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
 		/* c->waiting_for_gp list was drained, but __free_rcu might
 		 * still execute. Wait for it now before we free 'c'.
 		 */
+		rcu_barrier_tasks_trace();
 		rcu_barrier();
 		free_percpu(ma->cache);
 		ma->cache = NULL;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v4 bpf-next 14/15] bpf: Remove prealloc-only restriction for sleepable bpf programs.
  2022-08-26  2:44 [PATCH v4 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (12 preceding siblings ...)
  2022-08-26  2:44 ` [PATCH v4 bpf-next 13/15] bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs Alexei Starovoitov
@ 2022-08-26  2:44 ` Alexei Starovoitov
  2022-08-26  2:44 ` [PATCH v4 bpf-next 15/15] bpf: Introduce sysctl kernel.bpf_force_dyn_alloc Alexei Starovoitov
  2022-08-27 16:57 ` [PATCH v4 bpf-next 00/15] bpf: BPF specific memory allocator Andrii Nakryiko
  15 siblings, 0 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2022-08-26  2:44 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Since the hash map is now converted to bpf_mem_alloc, which waits for both rcu
and rcu_tasks_trace grace periods before freeing elements back into the global
memory slabs, it's safe to use dynamically allocated hash maps in sleepable bpf
programs.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/verifier.c | 23 -----------------------
 1 file changed, 23 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 57ec06b1d09d..068b20ed34d2 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -12586,14 +12586,6 @@ static int check_pseudo_btf_id(struct bpf_verifier_env *env,
 	return err;
 }
 
-static int check_map_prealloc(struct bpf_map *map)
-{
-	return (map->map_type != BPF_MAP_TYPE_HASH &&
-		map->map_type != BPF_MAP_TYPE_PERCPU_HASH &&
-		map->map_type != BPF_MAP_TYPE_HASH_OF_MAPS) ||
-		!(map->map_flags & BPF_F_NO_PREALLOC);
-}
-
 static bool is_tracing_prog_type(enum bpf_prog_type type)
 {
 	switch (type) {
@@ -12608,15 +12600,6 @@ static bool is_tracing_prog_type(enum bpf_prog_type type)
 	}
 }
 
-static bool is_preallocated_map(struct bpf_map *map)
-{
-	if (!check_map_prealloc(map))
-		return false;
-	if (map->inner_map_meta && !check_map_prealloc(map->inner_map_meta))
-		return false;
-	return true;
-}
-
 static int check_map_prog_compatibility(struct bpf_verifier_env *env,
 					struct bpf_map *map,
 					struct bpf_prog *prog)
@@ -12669,12 +12652,6 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
 		case BPF_MAP_TYPE_LRU_PERCPU_HASH:
 		case BPF_MAP_TYPE_ARRAY_OF_MAPS:
 		case BPF_MAP_TYPE_HASH_OF_MAPS:
-			if (!is_preallocated_map(map)) {
-				verbose(env,
-					"Sleepable programs can only use preallocated maps\n");
-				return -EINVAL;
-			}
-			break;
 		case BPF_MAP_TYPE_RINGBUF:
 		case BPF_MAP_TYPE_INODE_STORAGE:
 		case BPF_MAP_TYPE_SK_STORAGE:
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v4 bpf-next 15/15] bpf: Introduce sysctl kernel.bpf_force_dyn_alloc.
  2022-08-26  2:44 [PATCH v4 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (13 preceding siblings ...)
  2022-08-26  2:44 ` [PATCH v4 bpf-next 14/15] bpf: Remove prealloc-only restriction for " Alexei Starovoitov
@ 2022-08-26  2:44 ` Alexei Starovoitov
  2022-08-29 22:02   ` Daniel Borkmann
  2022-08-27 16:57 ` [PATCH v4 bpf-next 00/15] bpf: BPF specific memory allocator Andrii Nakryiko
  15 siblings, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2022-08-26  2:44 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Introduce the sysctl kernel.bpf_force_dyn_alloc to force dynamic allocation in
the bpf hash map. All selftests/bpf should pass with bpf_force_dyn_alloc set to
0 or 1, and bpf programs (both sleepable and not) should not see any functional
difference. The only observable effect of the sysctl should be improved memory
usage.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/filter.h | 2 ++
 kernel/bpf/core.c      | 2 ++
 kernel/bpf/hashtab.c   | 5 +++++
 kernel/bpf/syscall.c   | 9 +++++++++
 4 files changed, 18 insertions(+)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index a5f21dc3c432..eb4d4a0c0bde 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1009,6 +1009,8 @@ bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk,
 }
 #endif
 
+extern int bpf_force_dyn_alloc;
+
 #ifdef CONFIG_BPF_JIT
 extern int bpf_jit_enable;
 extern int bpf_jit_harden;
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 639437f36928..a13e78ea4b90 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -533,6 +533,8 @@ void bpf_prog_kallsyms_del_all(struct bpf_prog *fp)
 	bpf_prog_kallsyms_del(fp);
 }
 
+int bpf_force_dyn_alloc __read_mostly;
+
 #ifdef CONFIG_BPF_JIT
 /* All BPF JIT sysctl knobs here. */
 int bpf_jit_enable   __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_DEFAULT_ON);
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 89f26cbddef5..f68a3400939e 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -505,6 +505,11 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 
 	bpf_map_init_from_attr(&htab->map, attr);
 
+	if (!lru && bpf_force_dyn_alloc) {
+		prealloc = false;
+		htab->map.map_flags |= BPF_F_NO_PREALLOC;
+	}
+
 	if (percpu_lru) {
 		/* ensure each CPU's lru list has >=1 elements.
 		 * since we are at it, make each lru list has the same
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 074c901fbb4e..5c631244b63b 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -5299,6 +5299,15 @@ static struct ctl_table bpf_syscall_table[] = {
 		.mode		= 0644,
 		.proc_handler	= bpf_stats_handler,
 	},
+	{
+		.procname	= "bpf_force_dyn_alloc",
+		.data		= &bpf_force_dyn_alloc,
+		.maxlen		= sizeof(int),
+		.mode		= 0600,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 	{ }
 };
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 bpf-next 00/15] bpf: BPF specific memory allocator.
  2022-08-26  2:44 [PATCH v4 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (14 preceding siblings ...)
  2022-08-26  2:44 ` [PATCH v4 bpf-next 15/15] bpf: Introduce sysctl kernel.bpf_force_dyn_alloc Alexei Starovoitov
@ 2022-08-27 16:57 ` Andrii Nakryiko
  2022-08-27 22:53   ` Kumar Kartikeya Dwivedi
  15 siblings, 1 reply; 32+ messages in thread
From: Andrii Nakryiko @ 2022-08-27 16:57 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: davem, daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

On Thu, Aug 25, 2022 at 7:44 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> [...]

It's great to lift all those NMI restrictions on non-prealloc hashmap!
This should also open up new maps (like qp-trie) that can't be
pre-sized to the NMI world as well.

But just to clarify, in NMI mode we can exhaust memory in caches (and
thus if we do a lot of allocation in single BPF program execution we
can fail some operations). That's unavoidable. But it's not 100% clear
what's the behavior in IRQ mode and separately from that in "usual"
less restrictive mode. Is my understanding correct that we shouldn't
run out of memory (assuming there is memory available, of course)
because replenishing of caches will interrupt BPF program execution?
Or am I wrong and we can still run out of memory if we don't have
enough pre-cached memory. I think it would be good to clearly state
such things (unless I missed them somewhere in patches). I'm trying to
understand if in non-restrictive mode we can still fail to allocate a
bunch of hashmap elements in a loop just because of the design of
bpf_mem_alloc?

But it looks great otherwise. For the series:

Acked-by: Andrii Nakryiko <andrii@kernel.org>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 bpf-next 00/15] bpf: BPF specific memory allocator.
  2022-08-27 16:57 ` [PATCH v4 bpf-next 00/15] bpf: BPF specific memory allocator Andrii Nakryiko
@ 2022-08-27 22:53   ` Kumar Kartikeya Dwivedi
  2022-08-29 15:47     ` Alexei Starovoitov
  0 siblings, 1 reply; 32+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-27 22:53 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Alexei Starovoitov, davem, daniel, andrii, tj, delyank, linux-mm,
	bpf, kernel-team

On Sat, 27 Aug 2022 at 18:57, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Aug 25, 2022 at 7:44 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > [...]
>
> It's great to lift all those NMI restrictions on non-prealloc hashmap!
> This should also open up new maps (like qp-trie) that can't be
> pre-sized to the NMI world as well.
>
> But just to clarify, in NMI mode we can exhaust memory in caches (and
> thus if we do a lot of allocation in single BPF program execution we
> can fail some operations). That's unavoidable. But it's not 100% clear
> what's the behavior in IRQ mode and separately from that in "usual"
> less restrictive mode. Is my understanding correct that we shouldn't
> run out of memory (assuming there is memory available, of course)
> because replenishing of caches will interrupt BPF program execution?

When I was reviewing the code what I understood was as follows:

There are two ways work is queued. On non-RT, it is queued for
execution in hardirq context (raised_list), on RT it will instead be
executed by per-CPU pinned irq_work kthreads (from lazy_list).

We cannot set the IRQ_WORK_HARD_IRQ to force RT to execute them in
hardirq context, as bpf_mem_refill may take sleepable non-raw
spinlocks when calling into kmalloc, which is disallowed.

So, to summarize the behavior:
In NMI context:
- for both RT and non-RT, once we deplete the cache we get -ENOMEM.
In IRQ context:
- for RT, it will fill it asynchronously by waking up the irq_work
kthread, so you may still get -ENOMEM (also depends on if bpf prog is
in hardirq or threaded irq context, since hardirq context would be
non-preemptible, delaying refilling from irq_work kthread context).
- for non-RT, it is already inside the interrupt handler hence you
will get -ENOMEM. Interrupt handlers keep interrupts disabled, so IPI
execution is delayed until the handler returns.
In softirq and task context:
- for RT, it will fill it asynchronously by waking up the irq_work
kthread, so you may still get -ENOMEM.
- for non-RT, it will send IPI to local cpu, which will execute the
work synchronously, so you will refill the cache by interrupting the
program. Even when executing softirq inside the exit path of
interrupts, at that point interrupts are enabled so it will refill
synchronously by raising local IPI.

For the last case (say task context), the problem of kmalloc
reentrancy comes to mind again, e.g if we are tracing in guts of
kmalloc and send local IPI which eventually calls kmalloc again (which
may deadlock). But remember that such cases are already possible
without BPF, interrupts which allocate may come in at any time, so the
kmalloc code itself will keep IRQs disabled at these places, hence we
are fine from BPF side as well.

Please let me know of any inaccuracies in the above description.
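
To make the practical consequence concrete, a minimal BPF-side sketch (not
taken from this series; the map, program and counter names are made up) that
treats every allocation-backed update as fallible regardless of context:

  #include <vmlinux.h>
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  struct val { u64 ts; };

  struct {
          __uint(type, BPF_MAP_TYPE_HASH);
          __uint(map_flags, BPF_F_NO_PREALLOC);
          __uint(max_entries, 10240);
          __type(key, u64);
          __type(value, struct val);
  } events SEC(".maps");

  u64 dropped;

  SEC("fentry/kfree")
  int BPF_PROG(track_kfree, const void *ptr)
  {
          u64 key = (u64)(long)ptr;
          struct val v = { .ts = bpf_ktime_get_ns() };

          /* The update may allocate from the per-cpu cache. -ENOMEM is
           * expected when the cache is depleted and cannot be refilled
           * synchronously (NMI, or hardirq/threaded irq on RT).
           */
          if (bpf_map_update_elem(&events, &key, &v, BPF_ANY))
                  __sync_fetch_and_add(&dropped, 1);
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";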


> Or am I wrong and we can still run out of memory if we don't have
> enough pre-cached memory. I think it would be good to clearly state
> such things (unless I missed them somewhere in patches). I'm trying to
> understand if in non-restrictive mode we can still fail to allocate a
> bunch of hashmap elements in a loop just because of the design of
> bpf_mem_alloc?
>
> But it looks great otherwise. For the series:
>
> Acked-by: Andrii Nakryiko <andrii@kernel.org>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 bpf-next 00/15] bpf: BPF specific memory allocator.
  2022-08-27 22:53   ` Kumar Kartikeya Dwivedi
@ 2022-08-29 15:47     ` Alexei Starovoitov
  2022-09-09 20:10       ` Andrii Nakryiko
  0 siblings, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2022-08-29 15:47 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Andrii Nakryiko, davem, daniel, andrii, tj, delyank, linux-mm,
	bpf, kernel-team

On Sun, Aug 28, 2022 at 12:53:48AM +0200, Kumar Kartikeya Dwivedi wrote:
> On Sat, 27 Aug 2022 at 18:57, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> >
> > On Thu, Aug 25, 2022 at 7:44 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > [...]
> >
> > It's great to lift all those NMI restrictions on non-prealloc hashmap!
> > This should also open up new maps (like qp-trie) that can't be
> > pre-sized to the NMI world as well.
> >
> > But just to clarify, in NMI mode we can exhaust memory in caches (and
> > thus if we do a lot of allocation in single BPF program execution we
> > can fail some operations). That's unavoidable. But it's not 100% clear
> > what's the behavior in IRQ mode and separately from that in "usual"
> > less restrictive mode. Is my understanding correct that we shouldn't
> > run out of memory (assuming there is memory available, of course)
> > because replenishing of caches will interrupt BPF program execution?
> 
> When I was reviewing the code what I understood was as follows:
> 
> There are two ways work is queued. On non-RT, it is queued for
> execution in hardirq context (raised_list), on RT it will instead be
> executed by per-CPU pinned irq_work kthreads (from lazy_list).
> 
> We cannot set the IRQ_WORK_HARD_IRQ to force RT to execute them in
> hardirq context, as bpf_mem_refill may take sleepable non-raw
> spinlocks when calling into kmalloc, which is disallowed.

Correct.

> So, to summarize the behavior:
> In NMI context:
> - for both RT and non-RT, once we deplete the cache we get -ENOMEM.
> In IRQ context:
> - for RT, it will fill it asynchronously by waking up the irq_work
> kthread, so you may still get -ENOMEM (also depends on if bpf prog is
> in hardirq or threaded irq context, since hardirq context would be
> non-preemptible, delaying refilling from irq_work kthread context).
> - for non-RT, it is already inside the interrupt handler hence you
> will get -ENOMEM. Interrupt handlers keep interrupts disabled, so IPI
> execution is delayed until the handler returns.
> In softirq and task context:
> - for RT, it will fill it asynchronously by waking up the irq_work
> kthread, so you may still get -ENOMEM.
> - for non-RT, it will send IPI to local cpu, which will execute the
> work synchronously, so you will refill the cache by interrupting the
> program. Even when executing softirq inside the exit path of
> interrupts, at that point interrupts are enabled so it will refill
> synchronously by raising local IPI.

Correct.

> For the last case (say task context), the problem of kmalloc
> reentrancy comes to mind again, e.g if we are tracing in guts of
> kmalloc and send local IPI which eventually calls kmalloc again (which
> may deadlock). But remember that such cases are already possible
> without BPF, interrupts which allocate may come in at any time, so the
> kmalloc code itself will keep IRQs disabled at these places, hence we
> are fine from BPF side as well.

Exactly.

> Please let me know of any inaccuracies in the above description.

All looks correct.
I'm hesitant to add such low level details to the bpf documentation though,
because things will change soon and program writers often don't have
control over the execution context. When a bpf prog is attached to a kernel
function 'foo' that function may be called out of all possible contexts.
Just like right now the bpf progs call bpf_map_update_elem (which does
some kind of memory allocation inside) and don't think twice.
Smarter thresholds are on the todo list.
When bpf_mem_alloc is exposed to bpf progs directly they will have
an ability to prefill the cache to guarantee availability of objects later.
Sleepable progs might skip the cache altogether, or try direct kmalloc and
fall back to the cache, or the other way around.
In other words the program writers should not rely on specific
implementation details of bpf_mem_alloc, RT or non-RT, IRQs on or off, etc.
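
As a purely hypothetical sketch of that future prefill flow (bpf_mem_prefill()
and struct my_obj are invented here for illustration; only bpf_mem_alloc_init()
and bpf_mem_cache_alloc() exist in this series):

  struct bpf_mem_alloc ma;
  struct my_obj *obj;

  /* setup path, sleepable context: create the allocator and top up the
   * per-cpu cache so the hot path has a guaranteed reserve
   */
  if (bpf_mem_alloc_init(&ma, sizeof(struct my_obj), false))
          return -ENOMEM;
  bpf_mem_prefill(&ma, 128);    /* hypothetical helper, not in this series */

  /* hot path, any context: succeeds while the reserve lasts */
  obj = bpf_mem_cache_alloc(&ma);
  if (!obj)
          return -ENOMEM;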

> > Or am I wrong and we can still run out of memory if we don't have
> > enough pre-cached memory. I think it would be good to clearly state
> > such things (unless I missed them somewhere in patches). I'm trying to
> > understand if in non-restrictive mode we can still fail to allocate a
> > bunch of hashmap elements in a loop just because of the design of
> > bpf_mem_alloc?

The users shouldn't rely on internal details. bpf_mem_alloc can
return NULL. That's all they need to know.
If it's not acceptable the prog should prefill the cache. Then the prog
will have a guaranteed reserve at the time of allocation.
That api is work in progress. It's another step after Delyan's work.

> > But it looks great otherwise. For the series:
> >
> > Acked-by: Andrii Nakryiko <andrii@kernel.org>

Thanks for the review!

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 bpf-next 01/15] bpf: Introduce any context BPF specific memory allocator.
  2022-08-26  2:44 ` [PATCH v4 bpf-next 01/15] bpf: Introduce any context " Alexei Starovoitov
@ 2022-08-29 21:30   ` Daniel Borkmann
  2022-08-29 21:45     ` Alexei Starovoitov
  2022-08-29 21:59   ` Daniel Borkmann
  2022-08-29 22:39   ` Martin KaFai Lau
  2 siblings, 1 reply; 32+ messages in thread
From: Daniel Borkmann @ 2022-08-29 21:30 UTC (permalink / raw)
  To: Alexei Starovoitov, davem
  Cc: andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

On 8/26/22 4:44 AM, Alexei Starovoitov wrote:
[...]
> +
> +/* Called from BPF program or from sys_bpf syscall.
> + * In both cases migration is disabled.
> + */
> +void notrace *bpf_mem_alloc(struct bpf_mem_alloc *ma, size_t size)
> +{
> +	int idx;
> +	void *ret;
> +
> +	if (!size)
> +		return ZERO_SIZE_PTR;
> +
> +	idx = bpf_mem_cache_idx(size + LLIST_NODE_SZ);
> +	if (idx < 0)
> +		return NULL;
> +
> +	ret = unit_alloc(this_cpu_ptr(ma->caches)->cache + idx);
> +	return !ret ? NULL : ret + LLIST_NODE_SZ;
> +}
> +
> +void notrace bpf_mem_free(struct bpf_mem_alloc *ma, void *ptr)
> +{
> +	int idx;
> +
> +	if (!ptr)
> +		return;
> +
> +	idx = bpf_mem_cache_idx(__ksize(ptr - LLIST_NODE_SZ));
> +	if (idx < 0)
> +		return;
> +
> +	unit_free(this_cpu_ptr(ma->caches)->cache + idx, ptr);
> +}
> +
> +void notrace *bpf_mem_cache_alloc(struct bpf_mem_alloc *ma)
> +{
> +	void *ret;
> +
> +	ret = unit_alloc(this_cpu_ptr(ma->cache));
> +	return !ret ? NULL : ret + LLIST_NODE_SZ;
> +}
> +
> +void notrace bpf_mem_cache_free(struct bpf_mem_alloc *ma, void *ptr)
> +{
> +	if (!ptr)
> +		return;
> +
> +	unit_free(this_cpu_ptr(ma->cache), ptr);
> +}

Looks like smp_processor_id() needs to be made aware that preemption might
be ok just not migration to a different CPU?

   [...]
   [  593.639025] BUG: using smp_processor_id() in preemptible [00000000] code: kworker/u16:246/1946
   [  593.639026] BUG: using smp_processor_id() in preemptible [00000000] code: kworker/u16:83/1642
   [  593.639138] caller is bpf_mem_cache_free+0x14/0x40
   [  593.640971] caller is bpf_mem_cache_free+0x14/0x40
   [  593.641060] CPU: 6 PID: 1642 Comm: kworker/u16:83 Not tainted 5.19.0-gf0d7b67fb6f8 #1
   [  593.641874] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
   [  593.641874] Workqueue: events_unbound bpf_map_free_deferred
   [  593.641874] Call Trace:
   [  593.641874]  <TASK>
   [  593.641874]  dump_stack_lvl+0x69/0x9b
   [  593.641874]  check_preemption_disabled+0x10b/0x120
   [  593.641874]  bpf_mem_cache_free+0x14/0x40
   [  593.641874]  htab_map_free+0x13f/0x2a0
   [  593.641874]  process_one_work+0x28e/0x580
   [  593.641874]  worker_thread+0x1fb/0x420
   [  593.641874]  ? rcu_lock_release+0x20/0x20
   [  593.641874]  kthread+0xf1/0x110
   [  593.641874]  ? kthread_blkcg+0x40/0x40
   [  593.641874]  ret_from_fork+0x22/0x30
   [  593.641874]  </TASK>
   [  593.654117] CPU: 5 PID: 1946 Comm: kworker/u16:246 Not tainted 5.19.0-gf0d7b67fb6f8 #1
   [  593.654317] BUG: using smp_processor_id() in preemptible [00000000] code: kworker/u16:83/1642
   [  593.654560] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
   [  593.654560] Workqueue: events_unbound bpf_map_free_deferred
   [  593.654560] Call Trace:
   [  593.654560]  <TASK>
   [  593.654560]  dump_stack_lvl+0x69/0x9b
   [  593.658872] caller is bpf_mem_cache_free+0x14/0x40
   [  593.654560]  check_preemption_disabled+0x10b/0x120
   [  593.654560]  bpf_mem_cache_free+0x14/0x40
   [  593.654560]  htab_map_free+0x13f/0x2a0
   [  593.654560]  process_one_work+0x28e/0x580
   [  593.654560]  worker_thread+0x1fb/0x420
   [  593.654560]  ? rcu_lock_release+0x20/0x20
   [  593.654560]  kthread+0xf1/0x110
   [  593.654560]  ? kthread_blkcg+0x40/0x40
   [  593.654560]  ret_from_fork+0x22/0x30
   [  593.654560]  </TASK>
   [...]
   [ 1158.399989] test_maps invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
   [ 1158.401948] CPU: 1 PID: 4147 Comm: test_maps Not tainted 5.19.0-gf0d7b67fb6f8 #1
   [ 1158.402612] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
   [ 1158.402666] Call Trace:
   [ 1158.402666]  <TASK>
   [ 1158.402666]  dump_stack_lvl+0x69/0x9b
   [ 1158.402666]  dump_header+0x4d/0x370
   [ 1158.402666]  out_of_memory+0x5a2/0x5e0
   [ 1158.402666]  __alloc_pages_slowpath+0xbf4/0x1140
   [ 1158.402666]  __alloc_pages+0x222/0x2c0
   [ 1158.402666]  folio_alloc+0x14/0x40
   [ 1158.402666]  filemap_alloc_folio+0x43/0x150
   [ 1158.402666]  __filemap_get_folio+0x263/0x3a0
   [ 1158.402666]  filemap_fault+0x20f/0x540
   [ 1158.410527]  __do_fault+0x4a/0x100
   [ 1158.410527]  do_fault+0x13a/0x630
   [ 1158.410527]  handle_mm_fault+0x83d/0x13d0
   [ 1158.410527]  do_user_addr_fault+0x33c/0x4e0
   [ 1158.410527]  exc_page_fault+0x72/0x280
   [ 1158.410527]  asm_exc_page_fault+0x22/0x30
   [ 1158.410527] RIP: 0033:0x7fa4d11a6ffc
   [ 1158.410527] Code: Unable to access opcode bytes at RIP 0x7fa4d11a6fd2.
   [ 1158.410527] RSP: 002b:00007ffd4fa8ac00 EFLAGS: 00000206
   [ 1158.410527] RAX: 0000000000000240 RBX: 0000000000000075 RCX: 00007fa4d11d4610
   [ 1158.410527] RDX: 0000000000000065 RSI: 00007fa4d11d42a0 RDI: 000055d328f91ac8
   [ 1158.410527] RBP: 00007ffd4fa8ace0 R08: 00007fa4d1198aa0 R09: 0000000000000001
   [ 1158.419624] R10: 00007ffd4fa8ace0 R11: 0000000000000246 R12: 000055d328f91ac8
   [ 1158.420234] R13: 000055d328f9e4e0 R14: 0000000000000000 R15: 00007fa4d11d3000
   [ 1158.420234]  </TASK>
   [...]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 bpf-next 01/15] bpf: Introduce any context BPF specific memory allocator.
  2022-08-29 21:30   ` Daniel Borkmann
@ 2022-08-29 21:45     ` Alexei Starovoitov
  0 siblings, 0 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2022-08-29 21:45 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David S. Miller, Andrii Nakryiko, Tejun Heo,
	Kumar Kartikeya Dwivedi, Delyan Kratunov, linux-mm, bpf,
	Kernel Team

On Mon, Aug 29, 2022 at 2:30 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 8/26/22 4:44 AM, Alexei Starovoitov wrote:
> [...]
> > +
> > +/* Called from BPF program or from sys_bpf syscall.
> > + * In both cases migration is disabled.
> > + */
> > +void notrace *bpf_mem_alloc(struct bpf_mem_alloc *ma, size_t size)
> > +{
> > +     int idx;
> > +     void *ret;
> > +
> > +     if (!size)
> > +             return ZERO_SIZE_PTR;
> > +
> > +     idx = bpf_mem_cache_idx(size + LLIST_NODE_SZ);
> > +     if (idx < 0)
> > +             return NULL;
> > +
> > +     ret = unit_alloc(this_cpu_ptr(ma->caches)->cache + idx);
> > +     return !ret ? NULL : ret + LLIST_NODE_SZ;
> > +}
> > +
> > +void notrace bpf_mem_free(struct bpf_mem_alloc *ma, void *ptr)
> > +{
> > +     int idx;
> > +
> > +     if (!ptr)
> > +             return;
> > +
> > +     idx = bpf_mem_cache_idx(__ksize(ptr - LLIST_NODE_SZ));
> > +     if (idx < 0)
> > +             return;
> > +
> > +     unit_free(this_cpu_ptr(ma->caches)->cache + idx, ptr);
> > +}
> > +
> > +void notrace *bpf_mem_cache_alloc(struct bpf_mem_alloc *ma)
> > +{
> > +     void *ret;
> > +
> > +     ret = unit_alloc(this_cpu_ptr(ma->cache));
> > +     return !ret ? NULL : ret + LLIST_NODE_SZ;
> > +}
> > +
> > +void notrace bpf_mem_cache_free(struct bpf_mem_alloc *ma, void *ptr)
> > +{
> > +     if (!ptr)
> > +             return;
> > +
> > +     unit_free(this_cpu_ptr(ma->cache), ptr);
> > +}
>
> Looks like smp_processor_id() needs to be made aware that preemption might
> be ok here, just not migration to a different CPU?

ahh, migration is not disabled when the map is freed from the worker.
this_cpu_ptr() above and the local_irq_save() shortly after need to
happen on the same cpu, so I'm thinking of adding migrate_disable()
to the htab free path.
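
For illustration, a minimal sketch of that idea (hypothetical, not the
actual follow-up patch): pin the teardown path to one CPU so that the
this_cpu_ptr() inside bpf_mem_cache_free() and the local_irq_save()
that follows it operate on the same per-cpu cache.

   /* Hypothetical sketch only -- the real fix may look different. */
   #include <linux/bpf_mem_alloc.h>
   #include <linux/preempt.h>   /* migrate_disable()/migrate_enable() */

   static void free_one_elem(struct bpf_mem_alloc *ma, void *elem)
   {
           /* The workqueue worker is preemptible and migratable, so
            * keep the free pinned to one CPU for the duration of the
            * call into the per-cpu cache.
            */
           migrate_disable();
           bpf_mem_cache_free(ma, elem);
           migrate_enable();
   }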

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 bpf-next 06/15] bpf: Optimize element count in non-preallocated hash map.
  2022-08-26  2:44 ` [PATCH v4 bpf-next 06/15] bpf: Optimize element count in non-preallocated hash map Alexei Starovoitov
@ 2022-08-29 21:47   ` Daniel Borkmann
  2022-08-29 21:57     ` Alexei Starovoitov
  0 siblings, 1 reply; 32+ messages in thread
From: Daniel Borkmann @ 2022-08-29 21:47 UTC (permalink / raw)
  To: Alexei Starovoitov, davem
  Cc: andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

On 8/26/22 4:44 AM, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> The atomic_inc/dec might cause extreme cache line bouncing when multiple cpus
> access the same bpf map. Based on specified max_entries for the hash map
> calculate when percpu_counter becomes faster than atomic_t and use it for such
> maps. For example samples/bpf/map_perf_test is using hash map with max_entries
> 1000. On a system with 16 cpus the 'map_perf_test 4' shows 14k events per
> second using atomic_t. On a system with 15 cpus it shows 100k events per second
> using percpu. map_perf_test is an extreme case where all cpus are colliding on
> atomic_t, which causes extreme cache bouncing. Note that the slow path of
> percpu_counter is 5k events per second vs 14k for atomic, so the heuristic is
> necessary. See comment in the code why the heuristic is based on
> num_online_cpus().

nit: Could we include this logic inside the percpu_counter logic, or as an extended
version of it? Except for the heuristic of attr->max_entries / 2 > num_online_cpus() *
PERCPU_COUNTER_BATCH, which toggles between plain atomic vs percpu_counter, the
rest feels generic enough that it could also be applicable outside bpf.

> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>   kernel/bpf/hashtab.c | 70 +++++++++++++++++++++++++++++++++++++++-----
>   1 file changed, 62 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index bd23c8830d49..8f68c6e13339 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -101,7 +101,12 @@ struct bpf_htab {
>   		struct bpf_lru lru;
>   	};
>   	struct htab_elem *__percpu *extra_elems;
> -	atomic_t count;	/* number of elements in this hashtable */
> +	/* number of elements in non-preallocated hashtable are kept
> +	 * in either pcount or count
> +	 */
> +	struct percpu_counter pcount;
> +	atomic_t count;
> +	bool use_percpu_counter;
>   	u32 n_buckets;	/* number of hash buckets */
>   	u32 elem_size;	/* size of each element in bytes */
>   	u32 hashrnd;
> @@ -552,6 +557,29 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
>   
>   	htab_init_buckets(htab);
>   
> +/* compute_batch_value() computes batch value as num_online_cpus() * 2
> + * and __percpu_counter_compare() needs
> + * htab->max_entries - cur_number_of_elems to be more than batch * num_online_cpus()
> + * for percpu_counter to be faster than atomic_t. In practice the average bpf
> + * hash map size is 10k, which means that a system with 64 cpus will fill
> + * hashmap to 20% of 10k before percpu_counter becomes ineffective. Therefore
> + * define our own batch count as 32 then 10k hash map can be filled up to 80%:
> + * 10k - 8k > 32 _batch_ * 64 _cpus_
> + * and __percpu_counter_compare() will still be fast. At that point hash map
> + * collisions will dominate its performance anyway. Assume that hash map filled
> + * to 50+% isn't going to be O(1) and use the following formula to choose
> + * between percpu_counter and atomic_t.
> + */
> +#define PERCPU_COUNTER_BATCH 32
> +	if (attr->max_entries / 2 > num_online_cpus() * PERCPU_COUNTER_BATCH)
> +		htab->use_percpu_counter = true;
> +
> +	if (htab->use_percpu_counter) {
> +		err = percpu_counter_init(&htab->pcount, 0, GFP_KERNEL);
> +		if (err)
> +			goto free_map_locked;
> +	}
> +
>   	if (prealloc) {
>   		err = prealloc_init(htab);
>   		if (err)
> @@ -878,6 +906,31 @@ static void htab_put_fd_value(struct bpf_htab *htab, struct htab_elem *l)
>   	}
>   }
>   
> +static bool is_map_full(struct bpf_htab *htab)
> +{
> +	if (htab->use_percpu_counter)
> +		return __percpu_counter_compare(&htab->pcount, htab->map.max_entries,
> +						PERCPU_COUNTER_BATCH) >= 0;
> +	return atomic_read(&htab->count) >= htab->map.max_entries;
> +}
> +
> +static void inc_elem_count(struct bpf_htab *htab)
> +{
> +	if (htab->use_percpu_counter)
> +		percpu_counter_add_batch(&htab->pcount, 1, PERCPU_COUNTER_BATCH);
> +	else
> +		atomic_inc(&htab->count);
> +}
> +
> +static void dec_elem_count(struct bpf_htab *htab)
> +{
> +	if (htab->use_percpu_counter)
> +		percpu_counter_add_batch(&htab->pcount, -1, PERCPU_COUNTER_BATCH);
> +	else
> +		atomic_dec(&htab->count);
> +}
> +
> +

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 bpf-next 06/15] bpf: Optimize element count in non-preallocated hash map.
  2022-08-29 21:47   ` Daniel Borkmann
@ 2022-08-29 21:57     ` Alexei Starovoitov
  0 siblings, 0 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2022-08-29 21:57 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David S. Miller, Andrii Nakryiko, Tejun Heo,
	Kumar Kartikeya Dwivedi, Delyan Kratunov, linux-mm, bpf,
	Kernel Team

On Mon, Aug 29, 2022 at 2:47 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 8/26/22 4:44 AM, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > The atomic_inc/dec might cause extreme cache line bouncing when multiple cpus
> > access the same bpf map. Based on specified max_entries for the hash map
> > calculate when percpu_counter becomes faster than atomic_t and use it for such
> > maps. For example samples/bpf/map_perf_test is using hash map with max_entries
> > 1000. On a system with 16 cpus the 'map_perf_test 4' shows 14k events per
> > second using atomic_t. On a system with 15 cpus it shows 100k events per second
> > using percpu. map_perf_test is an extreme case where all cpus are colliding on
> > atomic_t, which causes extreme cache bouncing. Note that the slow path of
> > percpu_counter is 5k events per second vs 14k for atomic, so the heuristic is
> > necessary. See comment in the code why the heuristic is based on
> > num_online_cpus().
>
> nit: Could we include this logic inside the percpu_counter logic, or as an extended
> version of it? Except for the heuristic of attr->max_entries / 2 > num_online_cpus() *
> PERCPU_COUNTER_BATCH, which toggles between plain atomic vs percpu_counter, the
> rest feels generic enough that it could also be applicable outside bpf.

The heuristic is probably not generic enough and this optimization
is a stop gap. It helps many cases, but doesn't solve all of them.
It's ok for this specific large hash map to count max_entries,
but we shouldn't claim generality and suggest this heuristic
to anyone else. I was thinking of doing a follow-up and creating
a truly generic combined percpu and atomic counter, similar
to percpu_ref, which can switch from percpu to atomic.
But it's more of a wish-list task atm.
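
As a rough illustration of that wish-list idea (the names below are
invented; this is not an existing kernel API), such a counter could
mirror the inc_elem_count()/is_map_full() pattern from the patch:

   #include <linux/atomic.h>
   #include <linux/percpu_counter.h>
   #include <linux/types.h>

   #define HYBRID_BATCH 32

   /* Invented sketch: percpu when contention is expected, plain atomic
    * otherwise. The runtime percpu->atomic switch a la percpu_ref is
    * left out for brevity.
    */
   struct hybrid_counter {
           struct percpu_counter pcount;
           atomic_t count;
           bool use_percpu;
   };

   static inline void hybrid_counter_inc(struct hybrid_counter *c)
   {
           if (c->use_percpu)
                   percpu_counter_add_batch(&c->pcount, 1, HYBRID_BATCH);
           else
                   atomic_inc(&c->count);
   }

   static inline bool hybrid_counter_ge(struct hybrid_counter *c, s64 limit)
   {
           if (c->use_percpu)
                   return __percpu_counter_compare(&c->pcount, limit,
                                                   HYBRID_BATCH) >= 0;
           return atomic_read(&c->count) >= limit;
   }

With the numbers from the code comment quoted above, max_entries = 10000
and 64 online cpus: 10000 / 2 = 5000 > 64 * 32 = 2048, so such a map
would pick the percpu flavor.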

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 bpf-next 01/15] bpf: Introduce any context BPF specific memory allocator.
  2022-08-26  2:44 ` [PATCH v4 bpf-next 01/15] bpf: Introduce any context " Alexei Starovoitov
  2022-08-29 21:30   ` Daniel Borkmann
@ 2022-08-29 21:59   ` Daniel Borkmann
  2022-08-29 22:04     ` Alexei Starovoitov
  2022-08-29 22:39   ` Martin KaFai Lau
  2 siblings, 1 reply; 32+ messages in thread
From: Daniel Borkmann @ 2022-08-29 21:59 UTC (permalink / raw)
  To: Alexei Starovoitov, davem
  Cc: andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

On 8/26/22 4:44 AM, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Tracing BPF programs can attach to kprobe and fentry. Hence they
> run in unknown context where calling plain kmalloc() might not be safe.
> 
> Front-end kmalloc() with minimal per-cpu cache of free elements.
> Refill this cache asynchronously from irq_work.
> 
> BPF programs always run with migration disabled.
> It's safe to allocate from cache of the current cpu with irqs disabled.
> Free-ing is always done into bucket of the current cpu as well.
> irq_work trims extra free elements from buckets with kfree
> and refills them with kmalloc, so global kmalloc logic takes care
> of freeing objects allocated by one cpu and freed on another.
> 
> struct bpf_mem_alloc supports two modes:
> - When size != 0 create kmem_cache and bpf_mem_cache for each cpu.
>    This is typical bpf hash map use case when all elements have equal size.
> - When size == 0 allocate 11 bpf_mem_cache-s for each cpu, then rely on
>    kmalloc/kfree. Max allocation size is 4096 in this case.
>    This is bpf_dynptr and bpf_kptr use case.
> 
> bpf_mem_alloc/bpf_mem_free are bpf specific 'wrappers' of kmalloc/kfree.
> bpf_mem_cache_alloc/bpf_mem_cache_free are 'wrappers' of kmem_cache_alloc/kmem_cache_free.
> 
> The allocators are NMI-safe from bpf programs only. They are not NMI-safe in general.
> 
> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>   include/linux/bpf_mem_alloc.h |  26 ++
>   kernel/bpf/Makefile           |   2 +-
>   kernel/bpf/memalloc.c         | 476 ++++++++++++++++++++++++++++++++++
>   3 files changed, 503 insertions(+), 1 deletion(-)
>   create mode 100644 include/linux/bpf_mem_alloc.h
>   create mode 100644 kernel/bpf/memalloc.c
> 
[...]
> +#define NUM_CACHES 11
> +
> +struct bpf_mem_cache {
> +	/* per-cpu list of free objects of size 'unit_size'.
> +	 * All accesses are done with interrupts disabled and 'active' counter
> +	 * protection with __llist_add() and __llist_del_first().
> +	 */
> +	struct llist_head free_llist;
> +	local_t active;
> +
> +	/* Operations on the free_list from unit_alloc/unit_free/bpf_mem_refill
> +	 * are sequenced by per-cpu 'active' counter. But unit_free() cannot
> +	 * fail. When 'active' is busy the unit_free() will add an object to
> +	 * free_llist_extra.
> +	 */
> +	struct llist_head free_llist_extra;
> +
> +	/* kmem_cache != NULL when bpf_mem_alloc was created for specific
> +	 * element size.
> +	 */
> +	struct kmem_cache *kmem_cache;
> +	struct irq_work refill_work;
> +	struct obj_cgroup *objcg;
> +	int unit_size;
> +	/* count of objects in free_llist */
> +	int free_cnt;
> +};
> +
> +struct bpf_mem_caches {
> +	struct bpf_mem_cache cache[NUM_CACHES];
> +};
> +

Could we now also completely get rid of the current map prealloc infra (pcpu_freelist*
I mean), and replace it with above variant altogether? Would be nice to make it work
for this case, too, and then get rid of percpu_freelist.{h,c} .. it's essentially a
superset wrt functionality iiuc?

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 bpf-next 15/15] bpf: Introduce sysctl kernel.bpf_force_dyn_alloc.
  2022-08-26  2:44 ` [PATCH v4 bpf-next 15/15] bpf: Introduce sysctl kernel.bpf_force_dyn_alloc Alexei Starovoitov
@ 2022-08-29 22:02   ` Daniel Borkmann
  2022-08-29 22:27     ` Alexei Starovoitov
  0 siblings, 1 reply; 32+ messages in thread
From: Daniel Borkmann @ 2022-08-29 22:02 UTC (permalink / raw)
  To: Alexei Starovoitov, davem
  Cc: andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

On 8/26/22 4:44 AM, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Introduce sysctl kernel.bpf_force_dyn_alloc to force dynamic allocation in bpf
> hash map. All selftests/bpf should pass with bpf_force_dyn_alloc 0 or 1 and all
> bpf programs (both sleepable and not) should not see any functional difference.
> The sysctl's observable behavior should only be improved memory usage.
> 
> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>   include/linux/filter.h | 2 ++
>   kernel/bpf/core.c      | 2 ++
>   kernel/bpf/hashtab.c   | 5 +++++
>   kernel/bpf/syscall.c   | 9 +++++++++
>   4 files changed, 18 insertions(+)
> 
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index a5f21dc3c432..eb4d4a0c0bde 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -1009,6 +1009,8 @@ bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk,
>   }
>   #endif
>   
> +extern int bpf_force_dyn_alloc;
> +
>   #ifdef CONFIG_BPF_JIT
>   extern int bpf_jit_enable;
>   extern int bpf_jit_harden;
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 639437f36928..a13e78ea4b90 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -533,6 +533,8 @@ void bpf_prog_kallsyms_del_all(struct bpf_prog *fp)
>   	bpf_prog_kallsyms_del(fp);
>   }
>   
> +int bpf_force_dyn_alloc __read_mostly;
> +
>   #ifdef CONFIG_BPF_JIT
>   /* All BPF JIT sysctl knobs here. */
>   int bpf_jit_enable   __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_DEFAULT_ON);
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 89f26cbddef5..f68a3400939e 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -505,6 +505,11 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
>   
>   	bpf_map_init_from_attr(&htab->map, attr);
>   
> +	if (!lru && bpf_force_dyn_alloc) {
> +		prealloc = false;
> +		htab->map.map_flags |= BPF_F_NO_PREALLOC;
> +	}
> +

The rationale is essentially for testing, right? Would be nice to avoid
making this patch uapi. It will just confuse users with implementation
details, imho, and then it's hard to remove it again.

>   	if (percpu_lru) {
>   		/* ensure each CPU's lru list has >=1 elements.
>   		 * since we are at it, make each lru list has the same
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 074c901fbb4e..5c631244b63b 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -5299,6 +5299,15 @@ static struct ctl_table bpf_syscall_table[] = {
>   		.mode		= 0644,
>   		.proc_handler	= bpf_stats_handler,
>   	},
> +	{
> +		.procname	= "bpf_force_dyn_alloc",
> +		.data		= &bpf_force_dyn_alloc,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0600,
> +		.proc_handler	= proc_dointvec_minmax,
> +		.extra1		= SYSCTL_ZERO,
> +		.extra2		= SYSCTL_ONE,
> +	},
>   	{ }
>   };
>   
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 bpf-next 01/15] bpf: Introduce any context BPF specific memory allocator.
  2022-08-29 21:59   ` Daniel Borkmann
@ 2022-08-29 22:04     ` Alexei Starovoitov
  0 siblings, 0 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2022-08-29 22:04 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David S. Miller, Andrii Nakryiko, Tejun Heo,
	Kumar Kartikeya Dwivedi, Delyan Kratunov, linux-mm, bpf,
	Kernel Team

On Mon, Aug 29, 2022 at 2:59 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 8/26/22 4:44 AM, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Tracing BPF programs can attach to kprobe and fentry. Hence they
> > run in unknown context where calling plain kmalloc() might not be safe.
> >
> > Front-end kmalloc() with minimal per-cpu cache of free elements.
> > Refill this cache asynchronously from irq_work.
> >
> > BPF programs always run with migration disabled.
> > It's safe to allocate from cache of the current cpu with irqs disabled.
> > Free-ing is always done into bucket of the current cpu as well.
> > irq_work trims extra free elements from buckets with kfree
> > and refills them with kmalloc, so global kmalloc logic takes care
> > of freeing objects allocated by one cpu and freed on another.
> >
> > struct bpf_mem_alloc supports two modes:
> > - When size != 0 create kmem_cache and bpf_mem_cache for each cpu.
> >    This is typical bpf hash map use case when all elements have equal size.
> > - When size == 0 allocate 11 bpf_mem_cache-s for each cpu, then rely on
> >    kmalloc/kfree. Max allocation size is 4096 in this case.
> >    This is bpf_dynptr and bpf_kptr use case.
> >
> > bpf_mem_alloc/bpf_mem_free are bpf specific 'wrappers' of kmalloc/kfree.
> > bpf_mem_cache_alloc/bpf_mem_cache_free are 'wrappers' of kmem_cache_alloc/kmem_cache_free.
> >
> > The allocators are NMI-safe from bpf programs only. They are not NMI-safe in general.
> >
> > Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
> >   include/linux/bpf_mem_alloc.h |  26 ++
> >   kernel/bpf/Makefile           |   2 +-
> >   kernel/bpf/memalloc.c         | 476 ++++++++++++++++++++++++++++++++++
> >   3 files changed, 503 insertions(+), 1 deletion(-)
> >   create mode 100644 include/linux/bpf_mem_alloc.h
> >   create mode 100644 kernel/bpf/memalloc.c
> >
> [...]
> > +#define NUM_CACHES 11
> > +
> > +struct bpf_mem_cache {
> > +     /* per-cpu list of free objects of size 'unit_size'.
> > +      * All accesses are done with interrupts disabled and 'active' counter
> > +      * protection with __llist_add() and __llist_del_first().
> > +      */
> > +     struct llist_head free_llist;
> > +     local_t active;
> > +
> > +     /* Operations on the free_list from unit_alloc/unit_free/bpf_mem_refill
> > +      * are sequenced by per-cpu 'active' counter. But unit_free() cannot
> > +      * fail. When 'active' is busy the unit_free() will add an object to
> > +      * free_llist_extra.
> > +      */
> > +     struct llist_head free_llist_extra;
> > +
> > +     /* kmem_cache != NULL when bpf_mem_alloc was created for specific
> > +      * element size.
> > +      */
> > +     struct kmem_cache *kmem_cache;
> > +     struct irq_work refill_work;
> > +     struct obj_cgroup *objcg;
> > +     int unit_size;
> > +     /* count of objects in free_llist */
> > +     int free_cnt;
> > +};
> > +
> > +struct bpf_mem_caches {
> > +     struct bpf_mem_cache cache[NUM_CACHES];
> > +};
> > +
>
> Could we now also completely get rid of the current map prealloc infra (pcpu_freelist*
> I mean), and replace it with above variant altogether? Would be nice to make it work
> for this case, too, and then get rid of percpu_freelist.{h,c} .. it's essentially a
> superset wrt functionality iiuc?

Eventually it would be possible to get rid of the prealloc logic completely,
but not so fast. The LRU map needs to be converted first.
Then a lot of production testing is necessary to gain confidence
and make sure we didn't miss any corner cases.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 bpf-next 15/15] bpf: Introduce sysctl kernel.bpf_force_dyn_alloc.
  2022-08-29 22:02   ` Daniel Borkmann
@ 2022-08-29 22:27     ` Alexei Starovoitov
  0 siblings, 0 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2022-08-29 22:27 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David S. Miller, Andrii Nakryiko, Tejun Heo,
	Kumar Kartikeya Dwivedi, Delyan Kratunov, linux-mm, bpf,
	Kernel Team

On Mon, Aug 29, 2022 at 3:02 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 8/26/22 4:44 AM, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Introduce sysctl kernel.bpf_force_dyn_alloc to force dynamic allocation in bpf
> > hash map. All selftests/bpf should pass with bpf_force_dyn_alloc 0 or 1 and all
> > bpf programs (both sleepable and not) should not see any functional difference.
> > The sysctl's observable behavior should only be improved memory usage.
> >
> > Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
> >   include/linux/filter.h | 2 ++
> >   kernel/bpf/core.c      | 2 ++
> >   kernel/bpf/hashtab.c   | 5 +++++
> >   kernel/bpf/syscall.c   | 9 +++++++++
> >   4 files changed, 18 insertions(+)
> >
> > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > index a5f21dc3c432..eb4d4a0c0bde 100644
> > --- a/include/linux/filter.h
> > +++ b/include/linux/filter.h
> > @@ -1009,6 +1009,8 @@ bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk,
> >   }
> >   #endif
> >
> > +extern int bpf_force_dyn_alloc;
> > +
> >   #ifdef CONFIG_BPF_JIT
> >   extern int bpf_jit_enable;
> >   extern int bpf_jit_harden;
> > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > index 639437f36928..a13e78ea4b90 100644
> > --- a/kernel/bpf/core.c
> > +++ b/kernel/bpf/core.c
> > @@ -533,6 +533,8 @@ void bpf_prog_kallsyms_del_all(struct bpf_prog *fp)
> >       bpf_prog_kallsyms_del(fp);
> >   }
> >
> > +int bpf_force_dyn_alloc __read_mostly;
> > +
> >   #ifdef CONFIG_BPF_JIT
> >   /* All BPF JIT sysctl knobs here. */
> >   int bpf_jit_enable   __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_DEFAULT_ON);
> > diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> > index 89f26cbddef5..f68a3400939e 100644
> > --- a/kernel/bpf/hashtab.c
> > +++ b/kernel/bpf/hashtab.c
> > @@ -505,6 +505,11 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
> >
> >       bpf_map_init_from_attr(&htab->map, attr);
> >
> > +     if (!lru && bpf_force_dyn_alloc) {
> > +             prealloc = false;
> > +             htab->map.map_flags |= BPF_F_NO_PREALLOC;
> > +     }
> > +
>
> The rationale is essentially for testing, right? Would be nice to avoid
> making this patch uapi. It will just confuse users with implementation
> details, imho, and then it's hard to remove it again.

Not for testing, but for production.
The plan is to roll this sysctl out gradually in the fleet and
hopefully observe memory savings without negative side effects,
but map usage patterns are wild. It will take a long time to get
the confidence that the prealloc code in htab can be completely removed.
At scale usage might find all kinds of unforeseen issues.
Probably new alloc heuristics would need to be developed.
If 'git rm kernel/bpf/percpu_freelist.*' ever happens
(would be great, but who knows) then this sysctl will become a nop.
This patch is trivial enough and we could keep it internal,
but everybody else with a large fleet of servers would probably
be applying the same patch and repeating the same steps.
bpf usage in hyperscalers varies a lot.
Before 'git rm freelist' we would probably flip the default for this sysctl
to get even broader coverage.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 bpf-next 01/15] bpf: Introduce any context BPF specific memory allocator.
  2022-08-26  2:44 ` [PATCH v4 bpf-next 01/15] bpf: Introduce any context " Alexei Starovoitov
  2022-08-29 21:30   ` Daniel Borkmann
  2022-08-29 21:59   ` Daniel Borkmann
@ 2022-08-29 22:39   ` Martin KaFai Lau
  2022-08-29 22:42     ` Alexei Starovoitov
  2 siblings, 1 reply; 32+ messages in thread
From: Martin KaFai Lau @ 2022-08-29 22:39 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: davem, daniel, andrii, tj, memxor, delyank, linux-mm, bpf, kernel-team

On Thu, Aug 25, 2022 at 07:44:16PM -0700, Alexei Starovoitov wrote:
> +/* Mostly runs from irq_work except __init phase. */
> +static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
> +{
> +	struct mem_cgroup *memcg = NULL, *old_memcg;
> +	unsigned long flags;
> +	void *obj;
> +	int i;
> +
> +	memcg = get_memcg(c);
> +	old_memcg = set_active_memcg(memcg);
> +	for (i = 0; i < cnt; i++) {
> +		obj = __alloc(c, node);
> +		if (!obj)
> +			break;
> +		if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +			/* In RT irq_work runs in per-cpu kthread, so disable
> +			 * interrupts to avoid preemption and interrupts and
> +			 * reduce the chance of bpf prog executing on this cpu
> +			 * when active counter is busy.
> +			 */
> +			local_irq_save(flags);
> +		if (local_inc_return(&c->active) == 1) {
Is it because it is always '== 1' here so that there is
no need to free the obj when it is '!= 1' ?

> +			__llist_add(obj, &c->free_llist);
> +			c->free_cnt++;
> +		}
> +		local_dec(&c->active);
> +		if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +			local_irq_restore(flags);
> +	}
> +	set_active_memcg(old_memcg);
> +	mem_cgroup_put(memcg);
> +}

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 bpf-next 01/15] bpf: Introduce any context BPF specific memory allocator.
  2022-08-29 22:39   ` Martin KaFai Lau
@ 2022-08-29 22:42     ` Alexei Starovoitov
  2022-08-29 22:59       ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2022-08-29 22:42 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: David S. Miller, Daniel Borkmann, Andrii Nakryiko, Tejun Heo,
	Kumar Kartikeya Dwivedi, Delyan Kratunov, linux-mm, bpf,
	Kernel Team

On Mon, Aug 29, 2022 at 3:39 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Thu, Aug 25, 2022 at 07:44:16PM -0700, Alexei Starovoitov wrote:
> > +/* Mostly runs from irq_work except __init phase. */
> > +static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
> > +{
> > +     struct mem_cgroup *memcg = NULL, *old_memcg;
> > +     unsigned long flags;
> > +     void *obj;
> > +     int i;
> > +
> > +     memcg = get_memcg(c);
> > +     old_memcg = set_active_memcg(memcg);
> > +     for (i = 0; i < cnt; i++) {
> > +             obj = __alloc(c, node);
> > +             if (!obj)
> > +                     break;
> > +             if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > +                     /* In RT irq_work runs in per-cpu kthread, so disable
> > +                      * interrupts to avoid preemption and interrupts and
> > +                      * reduce the chance of bpf prog executing on this cpu
> > +                      * when active counter is busy.
> > +                      */
> > +                     local_irq_save(flags);
> > +             if (local_inc_return(&c->active) == 1) {
> Is it because it is always '== 1' here so that there is
> no need to free the obj when it is '!= 1' ?

Great catch. It's a bug indeed.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 bpf-next 01/15] bpf: Introduce any context BPF specific memory allocator.
  2022-08-29 22:42     ` Alexei Starovoitov
@ 2022-08-29 22:59       ` Kumar Kartikeya Dwivedi
  2022-08-29 23:13         ` Alexei Starovoitov
  0 siblings, 1 reply; 32+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-29 22:59 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Martin KaFai Lau, David S. Miller, Daniel Borkmann,
	Andrii Nakryiko, Tejun Heo, Delyan Kratunov, linux-mm, bpf,
	Kernel Team

On Tue, 30 Aug 2022 at 00:43, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Aug 29, 2022 at 3:39 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Thu, Aug 25, 2022 at 07:44:16PM -0700, Alexei Starovoitov wrote:
> > > +/* Mostly runs from irq_work except __init phase. */
> > > +static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
> > > +{
> > > +     struct mem_cgroup *memcg = NULL, *old_memcg;
> > > +     unsigned long flags;
> > > +     void *obj;
> > > +     int i;
> > > +
> > > +     memcg = get_memcg(c);
> > > +     old_memcg = set_active_memcg(memcg);
> > > +     for (i = 0; i < cnt; i++) {
> > > +             obj = __alloc(c, node);
> > > +             if (!obj)
> > > +                     break;
> > > +             if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > > +                     /* In RT irq_work runs in per-cpu kthread, so disable
> > > +                      * interrupts to avoid preemption and interrupts and
> > > +                      * reduce the chance of bpf prog executing on this cpu
> > > +                      * when active counter is busy.
> > > +                      */
> > > +                     local_irq_save(flags);
> > > +             if (local_inc_return(&c->active) == 1) {
> > Is it because it is always '== 1' here so that there is
> > no need to free the obj when it is '!= 1' ?
>
> Great catch. It's a bug indeed.

Is it a bug? It seems it will always be true for alloc_bulk. IIUC it
is only needed to prevent NMI's unit_alloc, unit_free touching
free_llist, so that NMI llist_adds atomically to free_llist_nmi
instead. Since this runs from irq_work, we run exclusive to other
unit_free, otherwise the __llist_add wouldn't be safe either.
unit_free already does local_irq_save.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 bpf-next 01/15] bpf: Introduce any context BPF specific memory allocator.
  2022-08-29 22:59       ` Kumar Kartikeya Dwivedi
@ 2022-08-29 23:13         ` Alexei Starovoitov
  0 siblings, 0 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2022-08-29 23:13 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Martin KaFai Lau, David S. Miller, Daniel Borkmann,
	Andrii Nakryiko, Tejun Heo, Delyan Kratunov, linux-mm, bpf,
	Kernel Team

On Mon, Aug 29, 2022 at 4:00 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Tue, 30 Aug 2022 at 00:43, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, Aug 29, 2022 at 3:39 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > >
> > > On Thu, Aug 25, 2022 at 07:44:16PM -0700, Alexei Starovoitov wrote:
> > > > +/* Mostly runs from irq_work except __init phase. */
> > > > +static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
> > > > +{
> > > > +     struct mem_cgroup *memcg = NULL, *old_memcg;
> > > > +     unsigned long flags;
> > > > +     void *obj;
> > > > +     int i;
> > > > +
> > > > +     memcg = get_memcg(c);
> > > > +     old_memcg = set_active_memcg(memcg);
> > > > +     for (i = 0; i < cnt; i++) {
> > > > +             obj = __alloc(c, node);
> > > > +             if (!obj)
> > > > +                     break;
> > > > +             if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > > > +                     /* In RT irq_work runs in per-cpu kthread, so disable
> > > > +                      * interrupts to avoid preemption and interrupts and
> > > > +                      * reduce the chance of bpf prog executing on this cpu
> > > > +                      * when active counter is busy.
> > > > +                      */
> > > > +                     local_irq_save(flags);
> > > > +             if (local_inc_return(&c->active) == 1) {
> > > Is it because it is always '== 1' here so that there is
> > > no need to free the obj when it is '!= 1' ?
> >
> > Great catch. It's a bug indeed.
>
> Is it a bug? It seems it will always be true for alloc_bulk. IIUC it
> is only needed to prevent NMI's unit_alloc, unit_free touching
> free_llist, so that NMI llist_adds atomically to free_llist_nmi
> instead. Since this runs from irq_work, we run exclusive to other
> unit_free, otherwise the __llist_add wouldn't be safe either.
> unit_free already does local_irq_save.

Correct. It cannot happen in practice, but the code as written
will look 'buggy' to any static analyzer. So we have to 'fix' it.
I'm thinking of doing:
WARN_ON_ONCE(local_inc_return(&c->active) != 1);
and the same in free_bulk().
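
Roughly, the alloc_bulk() hunk would then become something like this
(a sketch of the suggestion above, not necessarily the committed fix):

           if (IS_ENABLED(CONFIG_PREEMPT_RT))
                   local_irq_save(flags);
           /* irq_work runs exclusively of unit_alloc()/unit_free() on
            * this cpu, so 'active' can only be 1 here; warn if that
            * invariant is ever broken instead of silently leaking 'obj'.
            */
           WARN_ON_ONCE(local_inc_return(&c->active) != 1);
           __llist_add(obj, &c->free_llist);
           c->free_cnt++;
           local_dec(&c->active);
           if (IS_ENABLED(CONFIG_PREEMPT_RT))
                   local_irq_restore(flags);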

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 bpf-next 00/15] bpf: BPF specific memory allocator.
  2022-08-29 15:47     ` Alexei Starovoitov
@ 2022-09-09 20:10       ` Andrii Nakryiko
  0 siblings, 0 replies; 32+ messages in thread
From: Andrii Nakryiko @ 2022-09-09 20:10 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kumar Kartikeya Dwivedi, davem, daniel, andrii, tj, delyank,
	linux-mm, bpf, kernel-team

On Mon, Aug 29, 2022 at 8:47 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Sun, Aug 28, 2022 at 12:53:48AM +0200, Kumar Kartikeya Dwivedi wrote:
> > On Sat, 27 Aug 2022 at 18:57, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Thu, Aug 25, 2022 at 7:44 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > [...]
> > >
> > > It's great to lift all those NMI restrictions on non-prealloc hashmap!
> > > This should also open up new maps (like qp-trie) that can't be
> > > pre-sized to the NMI world as well.
> > >
> > > But just to clarify, in NMI mode we can exhaust memory in caches (and
> > > thus if we do a lot of allocation in single BPF program execution we
> > > can fail some operations). That's unavoidable. But it's not 100% clear
> > > what's the behavior in IRQ mode and separately from that in "usual"
> > > less restrictive mode. Is my understanding correct that we shouldn't
> > > run out of memory (assuming there is memory available, of course)
> > > because replenishing of caches will interrupt BPF program execution?
> >
> > When I was reviewing the code what I understood was as follows:
> >
> > There are two ways work is queued. On non-RT, it is queued for
> > execution in hardirq context (raised_list), on RT it will instead be
> > executed by per-CPU pinned irq_work kthreads (from lazy_list).
> >
> > We cannot set the IRQ_WORK_HARD_IRQ to force RT to execute them in
> > hardirq context, as bpf_mem_refill may take sleepable non-raw
> > spinlocks when calling into kmalloc, which is disallowed.
>
> Correct.
>
> > So, to summarize the behavior:
> > In NMI context:
> > - for both RT and non-RT, once we deplete the cache we get -ENOMEM.
> > In IRQ context:
> > - for RT, it will fill it asynchronously by waking up the irq_work
> > kthread, so you may still get -ENOMEM (also depends on if bpf prog is
> > in hardirq or threaded irq context, since hardirq context would be
> > non-preemptible, delaying refilling from irq_work kthread context).
> > - for non-RT, it is already inside the interrupt handler hence you
> > will get -ENOMEM. Interrupt handlers keep interrupts disabled, so IPI
> > execution is delayed until the handler returns.
> > In softirq and task context:
> > - for RT, it will fill it asynchronously by waking up the irq_work
> > kthread, so you may still get -ENOMEM.
> > - for non-RT, it will send IPI to local cpu, which will execute the
> > work synchronously, so you will refill the cache by interrupting the
> > program. Even when executing softirq inside the exit path of
> > interrupts, at that point interrupts are enabled so it will refill
> > synchronously by raising local IPI.
>
> Correct.
>
> > For the last case (say task context), the problem of kmalloc
> > reentrancy comes to mind again, e.g if we are tracing in guts of
> > kmalloc and send local IPI which eventually calls kmalloc again (which
> > may deadlock). But remember that such cases are already possible
> > without BPF, interrupts which allocate may come in at any time, so the
> > kmalloc code itself will keep IRQs disabled at these places, hence we
> > are fine from BPF side as well.
>
> Exactly.
>
> > Please let me know of any inaccuracies in the above description.
>
> All looks correct.

Yep, very nice summary, thanks Kumar!

> I'm hesitant to add such low level details to a bpf documentation though
> because things will change soon and program writers often don't have
> control over the execution context. When a bpf prog is attached to a kernel
> function 'foo', that function may be called from all possible contexts.
> Just like right now the bpf progs call bpf_map_update_elem (which does
> some kind of memory allocation inside) and don't think twice.
> Smarter thresholds are on todo list.
> When bpf_mem_alloc is exposed to bpf progs directly they will have
> an ability to prefill the cache to guarantee availability of objects later.
> Sleepable progs might skip the cache altogether or try direct kmalloc and fall back
> to the cache, or the other way around.
> In other words the program writers should not rely on specific
> implementation details of bpf_mem_alloc, RT or non-RT, IRQs on or off, etc.
>
> > > Or am I wrong and we can still run out of memory if we don't have
> > > enough pre-cached memory. I think it would be good to clearly state
> > > such things (unless I missed them somewhere in patches). I'm trying to
> > > understand if in non-restrictive mode we can still fail to allocate a
> > > bunch of hashmap elements in a loop just because of the design of
> > > bpf_mem_alloc?
>
> The users shouldn't rely on internal details. bpf_mem_alloc can
> return NULL. That's all they need to know.

Not really. Sure, users have to handle failures, it's inevitable. But
there is a big difference between some update failing because the system
is critically low on memory or we exhausted the hashmap capacity, vs an
update failing because it's the 2nd or 4th or whatever update of an
almost empty hashmap within a single BPF program run. Understanding
these technical limitations is quite important in practice.
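
Purely as an illustration of why that distinction matters to program
writers (this fragment is invented, not taken from the series): with a
BPF_F_NO_PREALLOC map the program ends up with error handling like the
following, and the error code alone doesn't say whether the failure came
from the drained per-cpu cache or from the system being out of memory.

   /* Hypothetical tracing program fragment; map name, sizes and attach
    * point are made up for illustration only.
    */
   #include <linux/bpf.h>
   #include <bpf/bpf_helpers.h>

   struct {
           __uint(type, BPF_MAP_TYPE_HASH);
           __uint(map_flags, BPF_F_NO_PREALLOC);
           __uint(max_entries, 10000);
           __type(key, __u32);
           __type(value, __u64);
   } counts SEC(".maps");

   SEC("kprobe/do_nanosleep")
   int count_sleeps(void *ctx)
   {
           __u32 key = bpf_get_smp_processor_id();
           __u64 one = 1, *val;

           val = bpf_map_lookup_elem(&counts, &key);
           if (val) {
                   __sync_fetch_and_add(val, 1);
                   return 0;
           }
           /* With BPF_F_NO_PREALLOC this can fail transiently when the
            * per-cpu cache of bpf_mem_alloc is drained, not only when
            * the system is genuinely out of memory.
            */
           if (bpf_map_update_elem(&counts, &key, &one, BPF_NOEXIST))
                   bpf_printk("update failed");
           return 0;
   }

   char LICENSE[] SEC("license") = "GPL";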

> If it's not acceptable the prog should prefill the cache. Then the prog
> will have a guaranteed reserve at the time of allocation.
> That api is work in progress. It's another step after Delyan's work.
>

While I understand that we don't want to make these guarantees some
sort of set-in-stone API, it's still good to describe the conditions
under which mem alloc might fail. E.g., it would be nice to know that a
single map update is almost 100% guaranteed to succeed, while the 100th
update within a single program execution is guaranteed to fail due to
exhaustion of the pre-allocated mem cache. The exact number of
successful allocs within one program invocation is probably too much
to specify, but saying that a BPF program not running in NMI or hardirq
on a non-RT kernel can do many allocations without risk of exhausting
this small cache is important to understand. So I'd personally like to
have Kumar's summary somewhere in the docs with an explicit disclaimer
that this is general guidance and not guaranteed to stay exactly like
that in future kernel versions.

I also just realized that I'm now unclear about the 4K limit: does it
apply to the hashmap use case where all alloc'ed elements are fixed
size? I can find it in the code when the time comes, but it's just an
example of questions that users will run into (or they won't think
about this and will be bitten in production).

But it's up to you whether to document this. I just wanted to make sure
that in a "normal" context we don't have to be afraid to do multiple
hash map updates. And the 4K limit for fixed-sized elements popped up as
I wrote this. If it doesn't apply, we should probably have a selftest
with a hashmap with a >4KB map value.

But also, the behavior in hardirq/NMI (and especially on RT kernels)
means that the sysctl you had in v4 to force prealloc hashmaps to be
non-prealloc is probably too aggressive in general. I still see cases
where PREALLOC might be important and necessary.


> > > But it looks great otherwise. For the series:
> > >
> > > Acked-by: Andrii Nakryiko <andrii@kernel.org>
>
> Thanks for the review!

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2022-09-09 20:10 UTC | newest]

Thread overview: 32+ messages
2022-08-26  2:44 [PATCH v4 bpf-next 00/15] bpf: BPF specific memory allocator Alexei Starovoitov
2022-08-26  2:44 ` [PATCH v4 bpf-next 01/15] bpf: Introduce any context " Alexei Starovoitov
2022-08-29 21:30   ` Daniel Borkmann
2022-08-29 21:45     ` Alexei Starovoitov
2022-08-29 21:59   ` Daniel Borkmann
2022-08-29 22:04     ` Alexei Starovoitov
2022-08-29 22:39   ` Martin KaFai Lau
2022-08-29 22:42     ` Alexei Starovoitov
2022-08-29 22:59       ` Kumar Kartikeya Dwivedi
2022-08-29 23:13         ` Alexei Starovoitov
2022-08-26  2:44 ` [PATCH v4 bpf-next 02/15] bpf: Convert hash map to bpf_mem_alloc Alexei Starovoitov
2022-08-26  2:44 ` [PATCH v4 bpf-next 03/15] selftests/bpf: Improve test coverage of test_maps Alexei Starovoitov
2022-08-26  2:44 ` [PATCH v4 bpf-next 04/15] samples/bpf: Reduce syscall overhead in map_perf_test Alexei Starovoitov
2022-08-26  2:44 ` [PATCH v4 bpf-next 05/15] bpf: Relax the requirement to use preallocated hash maps in tracing progs Alexei Starovoitov
2022-08-26  2:44 ` [PATCH v4 bpf-next 06/15] bpf: Optimize element count in non-preallocated hash map Alexei Starovoitov
2022-08-29 21:47   ` Daniel Borkmann
2022-08-29 21:57     ` Alexei Starovoitov
2022-08-26  2:44 ` [PATCH v4 bpf-next 07/15] bpf: Optimize call_rcu " Alexei Starovoitov
2022-08-26  2:44 ` [PATCH v4 bpf-next 08/15] bpf: Adjust low/high watermarks in bpf_mem_cache Alexei Starovoitov
2022-08-26  2:44 ` [PATCH v4 bpf-next 09/15] bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU Alexei Starovoitov
2022-08-26  2:44 ` [PATCH v4 bpf-next 10/15] bpf: Add percpu allocation support to bpf_mem_alloc Alexei Starovoitov
2022-08-26  2:44 ` [PATCH v4 bpf-next 11/15] bpf: Convert percpu hash map to per-cpu bpf_mem_alloc Alexei Starovoitov
2022-08-26  2:44 ` [PATCH v4 bpf-next 12/15] bpf: Remove tracing program restriction on map types Alexei Starovoitov
2022-08-26  2:44 ` [PATCH v4 bpf-next 13/15] bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs Alexei Starovoitov
2022-08-26  2:44 ` [PATCH v4 bpf-next 14/15] bpf: Remove prealloc-only restriction for " Alexei Starovoitov
2022-08-26  2:44 ` [PATCH v4 bpf-next 15/15] bpf: Introduce sysctl kernel.bpf_force_dyn_alloc Alexei Starovoitov
2022-08-29 22:02   ` Daniel Borkmann
2022-08-29 22:27     ` Alexei Starovoitov
2022-08-27 16:57 ` [PATCH v4 bpf-next 00/15] bpf: BPF specific memory allocator Andrii Nakryiko
2022-08-27 22:53   ` Kumar Kartikeya Dwivedi
2022-08-29 15:47     ` Alexei Starovoitov
2022-09-09 20:10       ` Andrii Nakryiko
