* [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
@ 2022-06-23  0:32 Alexei Starovoitov
  2022-06-23  0:32 ` [PATCH bpf-next 1/5] bpf: Introduce any context " Alexei Starovoitov
                   ` (5 more replies)
  0 siblings, 6 replies; 72+ messages in thread
From: Alexei Starovoitov @ 2022-06-23  0:32 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, kafai, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Introduce any context BPF specific memory allocator.

Tracing BPF programs can attach to kprobe and fentry. Hence they
run in unknown context where calling plain kmalloc() might not be safe.
Front-end kmalloc() with per-cpu per-bucket cache of free elements.
Refill this cache asynchronously from irq_work.
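
A minimal usage sketch, assuming a fixed-size consumer along the lines of
the hash map conversion in patch 2; 'my_map' and its helpers are placeholder
names and error handling is trimmed:

  #include <linux/bpf_mem_alloc.h>

  struct my_map {
  	struct bpf_mem_alloc ma;
  	int elem_size;
  };

  static int my_map_init(struct my_map *m)
  {
  	/* size != 0: one kmem_cache plus a per-cpu cache of free elements */
  	return bpf_mem_alloc_init(&m->ma, m->elem_size);
  }

  static void *my_map_alloc_elem(struct my_map *m)
  {
  	/* safe from any context, including kprobe/fentry/NMI progs */
  	return bpf_mem_cache_alloc(&m->ma);
  }

  static void my_map_free_elem(struct my_map *m, void *elem)
  {
  	/* goes back into the current cpu's free list, trimmed later by irq_work */
  	bpf_mem_cache_free(&m->ma, elem);
  }

  static void my_map_destroy(struct my_map *m)
  {
  	bpf_mem_alloc_destroy(&m->ma);
  }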

There is a lot more work ahead, but this set is a useful base.
Future work:
- get rid of call_rcu in hash map
- get rid of atomic_inc/dec in hash map
- tune watermarks per allocation size
- adopt this approach for alloc_percpu_gfp
- expose bpf_mem_alloc as uapi FD to be used in dynptr_alloc, kptr_alloc
- add a sysctl to force bpf_mem_alloc in hash maps when safe, even if pre-alloc
  was requested, to reduce memory consumption
- convert lru map to bpf_mem_alloc

Alexei Starovoitov (5):
  bpf: Introduce any context BPF specific memory allocator.
  bpf: Convert hash map to bpf_mem_alloc.
  selftests/bpf: Improve test coverage of test_maps
  samples/bpf: Reduce syscall overhead in map_perf_test.
  bpf: Relax the requirement to use preallocated hash maps in tracing
    progs.

 include/linux/bpf_mem_alloc.h           |  26 ++
 kernel/bpf/Makefile                     |   2 +-
 kernel/bpf/hashtab.c                    |  16 +-
 kernel/bpf/memalloc.c                   | 512 ++++++++++++++++++++++++
 kernel/bpf/verifier.c                   |  31 +-
 samples/bpf/map_perf_test_kern.c        |  22 +-
 tools/testing/selftests/bpf/test_maps.c |  38 +-
 7 files changed, 610 insertions(+), 37 deletions(-)
 create mode 100644 include/linux/bpf_mem_alloc.h
 create mode 100644 kernel/bpf/memalloc.c

-- 
2.30.2



* [PATCH bpf-next 1/5] bpf: Introduce any context BPF specific memory allocator.
  2022-06-23  0:32 [PATCH bpf-next 0/5] bpf: BPF specific memory allocator Alexei Starovoitov
@ 2022-06-23  0:32 ` Alexei Starovoitov
  2022-06-25  1:23   ` John Fastabend
  2022-06-23  0:32 ` [PATCH bpf-next 2/5] bpf: Convert hash map to bpf_mem_alloc Alexei Starovoitov
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 72+ messages in thread
From: Alexei Starovoitov @ 2022-06-23  0:32 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, kafai, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Tracing BPF programs can attach to kprobe and fentry. Hence they
run in unknown context where calling plain kmalloc() might not be safe.

Front-end kmalloc() with per-cpu per-bucket cache of free elements.
Refill this cache asynchronously from irq_work.

BPF programs always run with migration disabled.
It's safe to allocate from cache of the current cpu with irqs disabled.
Free-ing is always done into bucket of the current cpu as well.
irq_work trims extra free elements from buckets with kfree
and refills them with kmalloc, so global kmalloc logic takes care
of freeing objects allocated by one cpu and freed on another.

struct bpf_mem_alloc supports two modes:
- When size != 0 create kmem_cache and bpf_mem_cache for each cpu.
  This is typical bpf hash map use case when all elements have equal size.
- When size == 0 allocate 11 bpf_mem_cache-s for each cpu, then rely on
  kmalloc/kfree. Max allocation size is 4096 in this case.
  This is bpf_dynptr and bpf_kptr use case.
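
A minimal sketch of the second mode, assuming a hypothetical caller; only the
functions declared in the new include/linux/bpf_mem_alloc.h are real:

  /* hypothetical user of the size == 0 mode, e.g. a future dynptr/kptr helper */
  static int demo_var_size_alloc(void)
  {
  	struct bpf_mem_alloc ma = {};
  	void *p;
  	int err;

  	/* size == 0: 11 per-cpu buckets, kmalloc/kfree style interface,
  	 * max allocation size is 4096
  	 */
  	err = bpf_mem_alloc_init(&ma, 0);
  	if (err)
  		return err;

  	p = bpf_mem_alloc(&ma, 64);	/* padded by 8 bytes internally */
  	if (p)
  		bpf_mem_free(&ma, p);

  	bpf_mem_alloc_destroy(&ma);
  	return 0;
  }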

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf_mem_alloc.h |  26 ++
 kernel/bpf/Makefile           |   2 +-
 kernel/bpf/memalloc.c         | 512 ++++++++++++++++++++++++++++++++++
 3 files changed, 539 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/bpf_mem_alloc.h
 create mode 100644 kernel/bpf/memalloc.c

diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
new file mode 100644
index 000000000000..804733070f8d
--- /dev/null
+++ b/include/linux/bpf_mem_alloc.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+#ifndef _BPF_MEM_ALLOC_H
+#define _BPF_MEM_ALLOC_H
+#include <linux/compiler_types.h>
+
+struct bpf_mem_cache;
+struct bpf_mem_caches;
+
+struct bpf_mem_alloc {
+	struct bpf_mem_caches __percpu *caches;
+	struct bpf_mem_cache __percpu *cache;
+};
+
+int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size);
+void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma);
+
+/* kmalloc/kfree equivalent: */
+void *bpf_mem_alloc(struct bpf_mem_alloc *ma, size_t size);
+void bpf_mem_free(struct bpf_mem_alloc *ma, void *ptr);
+
+/* kmem_cache_alloc/free equivalent: */
+void *bpf_mem_cache_alloc(struct bpf_mem_alloc *ma);
+void bpf_mem_cache_free(struct bpf_mem_alloc *ma, void *ptr);
+
+#endif /* _BPF_MEM_ALLOC_H */
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 057ba8e01e70..11fb9220909b 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -13,7 +13,7 @@ obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
 obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
 obj-$(CONFIG_BPF_JIT) += trampoline.o
-obj-$(CONFIG_BPF_SYSCALL) += btf.o
+obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o
 obj-$(CONFIG_BPF_JIT) += dispatcher.o
 ifeq ($(CONFIG_NET),y)
 obj-$(CONFIG_BPF_SYSCALL) += devmap.o
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
new file mode 100644
index 000000000000..01b041563fe1
--- /dev/null
+++ b/kernel/bpf/memalloc.c
@@ -0,0 +1,512 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+#include <linux/mm.h>
+#include <linux/llist.h>
+#include <linux/bpf.h>
+#include <linux/irq_work.h>
+#include <linux/bpf_mem_alloc.h>
+#include <linux/memcontrol.h>
+
+/* Any context (including NMI) BPF specific memory allocator.
+ *
+ * Tracing BPF programs can attach to kprobe and fentry. Hence they
+ * run in unknown context where calling plain kmalloc() might not be safe.
+ *
+ * Front-end kmalloc() with per-cpu per-bucket cache of free elements.
+ * Refill this cache asynchronously from irq_work.
+ *
+ * CPU_0 buckets
+ * 16 32 64 96 128 196 256 512 1024 2048 4096
+ * ...
+ * CPU_N buckets
+ * 16 32 64 96 128 196 256 512 1024 2048 4096
+ *
+ * The buckets are prefilled at the start.
+ * BPF programs always run with migration disabled.
+ * It's safe to allocate from cache of the current cpu with irqs disabled.
+ * Free-ing is always done into bucket of the current cpu as well.
+ * irq_work trims extra free elements from buckets with kfree
+ * and refills them with kmalloc, so global kmalloc logic takes care
+ * of freeing objects allocated by one cpu and freed on another.
+ *
+ * Every allocated object is padded with extra 8 bytes that contain
+ * struct llist_node.
+ */
+
+/* similar to kmalloc, but sizeof == 8 bucket is gone */
+static u8 size_index[24] __ro_after_init = {
+	3,	/* 8 */
+	3,	/* 16 */
+	4,	/* 24 */
+	4,	/* 32 */
+	5,	/* 40 */
+	5,	/* 48 */
+	5,	/* 56 */
+	5,	/* 64 */
+	1,	/* 72 */
+	1,	/* 80 */
+	1,	/* 88 */
+	1,	/* 96 */
+	6,	/* 104 */
+	6,	/* 112 */
+	6,	/* 120 */
+	6,	/* 128 */
+	2,	/* 136 */
+	2,	/* 144 */
+	2,	/* 152 */
+	2,	/* 160 */
+	2,	/* 168 */
+	2,	/* 176 */
+	2,	/* 184 */
+	2	/* 192 */
+};
+
+static int bpf_mem_cache_idx(size_t size)
+{
+	if (size <= 192) {
+		if (!size)
+			return -1;
+
+		return size_index[(size - 1) / 8] - 1;
+	} else {
+		if (size > PAGE_SIZE)
+			return -1;
+
+		return fls(size - 1) - 1;
+	}
+}
+
+#define NUM_CACHES 11
+
+struct bpf_mem_cache {
+	/* per-cpu list of free objects of size 'unit_size'.
+	 * All accesses are done with preemption disabled
+	 * with __llist_add() and __llist_del_first().
+	 */
+	struct llist_head free_llist;
+
+	/* NMI only free list.
+	 * All accesses are NMI-safe llist_add() and llist_del_first().
+	 *
+	 * Each allocated object is either on free_llist or on free_llist_nmi.
+	 * One cpu can allocate it from NMI by doing llist_del_first() from
+	 * free_llist_nmi, while another might free it back from non-NMI by
+	 * doing llist_add() into free_llist.
+	 */
+	struct llist_head free_llist_nmi;
+	/* kmem_cache != NULL when bpf_mem_alloc was created for specific
+	 * element size.
+	 */
+	struct kmem_cache *kmem_cache;
+	struct irq_work refill_work;
+	struct mem_cgroup *memcg;
+	int unit_size;
+	/* count of objects in free_llist */
+	int free_cnt;
+	/* count of objects in free_llist_nmi */
+	atomic_t free_cnt_nmi;
+	/* flag to refill nmi list too */
+	bool refill_nmi_list;
+};
+
+struct bpf_mem_caches {
+	struct bpf_mem_cache cache[NUM_CACHES];
+};
+
+static struct llist_node notrace *__llist_del_first(struct llist_head *head)
+{
+	struct llist_node *entry, *next;
+
+	entry = head->first;
+	if (!entry)
+		return NULL;
+	next = entry->next;
+	head->first = next;
+	return entry;
+}
+
+#define BATCH 48
+#define LOW_WATERMARK 32
+#define HIGH_WATERMARK 96
+/* Assuming the average number of elements per bucket is 64, when all buckets
+ * are used the total memory will be: 64*16*32 + 64*32*32 + 64*64*32 + ... +
+ * 64*4096*32 ~ 20Mbyte
+ */
+
+/* extra macro useful for testing by randomizing in_nmi condition */
+#define bpf_in_nmi() in_nmi()
+
+static void *__alloc(struct bpf_mem_cache *c, int node)
+{
+	/* Allocate, but don't deplete atomic reserves that typical
+	 * GFP_ATOMIC would do. irq_work runs on this cpu and kmalloc
+	 * will allocate from the current numa node which is what we
+	 * want here.
+	 */
+	gfp_t flags = GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN | __GFP_ACCOUNT;
+
+	if (c->kmem_cache)
+		return kmem_cache_alloc_node(c->kmem_cache, flags, node);
+
+	return kmalloc_node(c->unit_size, flags, node);
+}
+/* Mostly runs from irq_work except __init phase. */
+static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
+{
+	struct mem_cgroup *old_memcg;
+	unsigned long flags;
+	void *obj;
+	int i;
+
+	old_memcg = set_active_memcg(c->memcg);
+	for (i = 0; i < cnt; i++) {
+		obj = __alloc(c, node);
+		if (!obj)
+			break;
+		if (IS_ENABLED(CONFIG_PREEMPT_RT))
+			/* In RT irq_work runs in per-cpu kthread, so we have
+			 * to disable interrupts to avoid race with
+			 * bpf_mem_alloc/free. Note the read of free_cnt in
+			 * bpf_mem_refill is racy in RT. It's ok to do.
+			 */
+			local_irq_save(flags);
+		__llist_add(obj, &c->free_llist);
+		c->free_cnt++;
+		if (IS_ENABLED(CONFIG_PREEMPT_RT))
+			local_irq_restore(flags);
+	}
+	set_active_memcg(old_memcg);
+}
+
+/* Refill NMI specific llist. Mostly runs from irq_work except __init phase. */
+static void alloc_bulk_nmi(struct bpf_mem_cache *c, int cnt, int node)
+{
+	struct mem_cgroup *old_memcg;
+	void *obj;
+	int i;
+
+	old_memcg = set_active_memcg(c->memcg);
+	for (i = 0; i < cnt; i++) {
+		obj = __alloc(c, node);
+		if (!obj)
+			break;
+		llist_add(obj, &c->free_llist_nmi);
+		atomic_inc(&c->free_cnt_nmi);
+	}
+	set_active_memcg(old_memcg);
+}
+
+static void __free(struct bpf_mem_cache *c, void *obj)
+{
+	if (c->kmem_cache)
+		kmem_cache_free(c->kmem_cache, obj);
+	else
+		kfree(obj);
+}
+
+static void free_bulk(struct bpf_mem_cache *c)
+{
+	struct llist_node *llnode;
+	unsigned long flags;
+	int cnt;
+
+	do {
+		if (IS_ENABLED(CONFIG_PREEMPT_RT))
+			local_irq_save(flags);
+		llnode = __llist_del_first(&c->free_llist);
+		if (llnode)
+			cnt = --c->free_cnt;
+		else
+			cnt = 0;
+		if (IS_ENABLED(CONFIG_PREEMPT_RT))
+			local_irq_restore(flags);
+		__free(c, llnode);
+	} while (cnt > (HIGH_WATERMARK + LOW_WATERMARK) / 2);
+}
+
+static void free_bulk_nmi(struct bpf_mem_cache *c)
+{
+	struct llist_node *llnode;
+	int cnt;
+
+	do {
+		llnode = llist_del_first(&c->free_llist_nmi);
+		if (llnode)
+			cnt = atomic_dec_return(&c->free_cnt_nmi);
+		else
+			cnt = 0;
+		__free(c, llnode);
+	} while (cnt > (HIGH_WATERMARK + LOW_WATERMARK) / 2);
+}
+
+static void bpf_mem_refill(struct irq_work *work)
+{
+	struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache, refill_work);
+	int cnt;
+
+	cnt = c->free_cnt;
+	if (cnt < LOW_WATERMARK)
+		/* irq_work runs on this cpu and kmalloc will allocate
+		 * from the current numa node which is what we want here.
+		 */
+		alloc_bulk(c, BATCH, NUMA_NO_NODE);
+	else if (cnt > HIGH_WATERMARK)
+		free_bulk(c);
+
+	if (!c->refill_nmi_list)
+		/* don't refill NMI specific freelist
+		 * until alloc/free from NMI.
+		 */
+		return;
+	cnt = atomic_read(&c->free_cnt_nmi);
+	if (cnt < LOW_WATERMARK)
+		alloc_bulk_nmi(c, BATCH, NUMA_NO_NODE);
+	else if (cnt > HIGH_WATERMARK)
+		free_bulk_nmi(c);
+	c->refill_nmi_list = false;
+}
+
+static void notrace irq_work_raise(struct bpf_mem_cache *c, bool in_nmi)
+{
+	c->refill_nmi_list = in_nmi;
+	irq_work_queue(&c->refill_work);
+}
+
+static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu)
+{
+	init_irq_work(&c->refill_work, bpf_mem_refill);
+	/* To avoid consuming memory assume that 1st run of bpf
+	 * prog won't be doing more than 4 map_update_elem from
+	 * irq disabled region
+	 */
+	alloc_bulk(c, c->unit_size < 256 ? 4 : 1, cpu_to_node(cpu));
+
+	/* NMI progs are rare. Assume they have one map_update
+	 * per prog at the very beginning.
+	 */
+	alloc_bulk_nmi(c, 1, cpu_to_node(cpu));
+}
+
+/* When size != 0 create kmem_cache and bpf_mem_cache for each cpu.
+ * This is typical bpf hash map use case when all elements have equal size.
+ *
+ * When size == 0 allocate 11 bpf_mem_cache-s for each cpu, then rely on
+ * kmalloc/kfree. Max allocation size is 4096 in this case.
+ * This is bpf_dynptr and bpf_kptr use case.
+ */
+int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size)
+{
+	static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
+	struct bpf_mem_caches *cc, __percpu *pcc;
+	struct bpf_mem_cache *c, __percpu *pc;
+	struct kmem_cache *kmem_cache;
+	struct mem_cgroup *memcg;
+	char buf[32];
+	int cpu, i;
+
+	if (size) {
+		pc = __alloc_percpu_gfp(sizeof(*pc), 8, GFP_KERNEL);
+		if (!pc)
+			return -ENOMEM;
+		size += 8; /* room for llist_node */
+		snprintf(buf, sizeof(buf), "bpf-%u", size);
+		kmem_cache = kmem_cache_create(buf, size, 8, 0, NULL);
+		if (!kmem_cache) {
+			free_percpu(pc);
+			return -ENOMEM;
+		}
+		memcg = get_mem_cgroup_from_mm(current->mm);
+		for_each_possible_cpu(cpu) {
+			c = per_cpu_ptr(pc, cpu);
+			c->kmem_cache = kmem_cache;
+			c->unit_size = size;
+			c->memcg = memcg;
+			prefill_mem_cache(c, cpu);
+		}
+		ma->cache = pc;
+		return 0;
+	}
+
+	pcc = __alloc_percpu_gfp(sizeof(*cc), 8, GFP_KERNEL);
+	if (!pcc)
+		return -ENOMEM;
+	memcg = get_mem_cgroup_from_mm(current->mm);
+	for_each_possible_cpu(cpu) {
+		cc = per_cpu_ptr(pcc, cpu);
+		for (i = 0; i < NUM_CACHES; i++) {
+			c = &cc->cache[i];
+			c->unit_size = sizes[i];
+			c->memcg = memcg;
+			prefill_mem_cache(c, cpu);
+		}
+	}
+	ma->caches = pcc;
+	return 0;
+}
+
+static void drain_mem_cache(struct bpf_mem_cache *c)
+{
+	struct llist_node *llnode;
+
+	for (;;) {
+		llnode = llist_del_first(&c->free_llist_nmi);
+		if (!llnode)
+			break;
+		if (c->kmem_cache)
+			kmem_cache_free(c->kmem_cache, llnode);
+		else
+			kfree(llnode);
+	}
+	for (;;) {
+		llnode = __llist_del_first(&c->free_llist);
+		if (!llnode)
+			break;
+		if (c->kmem_cache)
+			kmem_cache_free(c->kmem_cache, llnode);
+		else
+			kfree(llnode);
+	}
+}
+
+void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
+{
+	struct bpf_mem_caches *cc;
+	struct bpf_mem_cache *c;
+	int cpu, i;
+
+	if (ma->cache) {
+		for_each_possible_cpu(cpu) {
+			c = per_cpu_ptr(ma->cache, cpu);
+			drain_mem_cache(c);
+		}
+		/* kmem_cache and memcg are the same across cpus */
+		kmem_cache_destroy(c->kmem_cache);
+		mem_cgroup_put(c->memcg);
+		free_percpu(ma->cache);
+		ma->cache = NULL;
+	}
+	if (ma->caches) {
+		for_each_possible_cpu(cpu) {
+			cc = per_cpu_ptr(ma->caches, cpu);
+			for (i = 0; i < NUM_CACHES; i++) {
+				c = &cc->cache[i];
+				drain_mem_cache(c);
+			}
+		}
+		mem_cgroup_put(c->memcg);
+		free_percpu(ma->caches);
+		ma->caches = NULL;
+	}
+}
+
+/* notrace is necessary here and in other functions to make sure
+ * bpf programs cannot attach to them and cause llist corruptions.
+ */
+static void notrace *unit_alloc(struct bpf_mem_cache *c)
+{
+	bool in_nmi = bpf_in_nmi();
+	struct llist_node *llnode;
+	unsigned long flags;
+	int cnt = 0;
+
+	if (unlikely(in_nmi)) {
+		llnode = llist_del_first(&c->free_llist_nmi);
+		if (llnode)
+			cnt = atomic_dec_return(&c->free_cnt_nmi);
+	} else {
+		/* Disable irqs to prevent the following race:
+		 * bpf_prog_A
+		 *   bpf_mem_alloc
+		 *      preemption or irq -> bpf_prog_B
+		 *        bpf_mem_alloc
+		 */
+		local_irq_save(flags);
+		llnode = __llist_del_first(&c->free_llist);
+		if (llnode)
+			cnt = --c->free_cnt;
+		local_irq_restore(flags);
+	}
+	WARN_ON(cnt < 0);
+
+	if (cnt < LOW_WATERMARK)
+		irq_work_raise(c, in_nmi);
+	return llnode;
+}
+
+/* Though 'ptr' object could have been allocated on a different cpu
+ * add it to the free_llist of the current cpu.
+ * Let kfree() logic deal with it when it's later called from irq_work.
+ */
+static void notrace unit_free(struct bpf_mem_cache *c, void *ptr)
+{
+	struct llist_node *llnode = ptr - 8;
+	bool in_nmi = bpf_in_nmi();
+	unsigned long flags;
+	int cnt;
+
+	BUILD_BUG_ON(sizeof(struct llist_node) > 8);
+
+	if (unlikely(in_nmi)) {
+		llist_add(llnode, &c->free_llist_nmi);
+		cnt = atomic_inc_return(&c->free_cnt_nmi);
+	} else {
+		local_irq_save(flags);
+		__llist_add(llnode, &c->free_llist);
+		cnt = ++c->free_cnt;
+		local_irq_restore(flags);
+	}
+	WARN_ON(cnt <= 0);
+
+	if (cnt > HIGH_WATERMARK)
+		/* free few objects from current cpu into global kmalloc pool */
+		irq_work_raise(c, in_nmi);
+}
+
+/* Called from BPF program or from sys_bpf syscall.
+ * In both cases migration is disabled.
+ */
+void notrace *bpf_mem_alloc(struct bpf_mem_alloc *ma, size_t size)
+{
+	int idx;
+	void *ret;
+
+	if (!size)
+		return ZERO_SIZE_PTR;
+
+	idx = bpf_mem_cache_idx(size + 8);
+	if (idx < 0)
+		return NULL;
+
+	ret = unit_alloc(this_cpu_ptr(ma->caches)->cache + idx);
+	return !ret ? NULL : ret + 8;
+}
+
+void notrace bpf_mem_free(struct bpf_mem_alloc *ma, void *ptr)
+{
+	int idx;
+
+	if (!ptr)
+		return;
+
+	idx = bpf_mem_cache_idx(__ksize(ptr - 8));
+	if (idx < 0)
+		return;
+
+	unit_free(this_cpu_ptr(ma->caches)->cache + idx, ptr);
+}
+
+void notrace *bpf_mem_cache_alloc(struct bpf_mem_alloc *ma)
+{
+	void *ret;
+
+	ret = unit_alloc(this_cpu_ptr(ma->cache));
+	return !ret ? NULL : ret + 8;
+}
+
+void notrace bpf_mem_cache_free(struct bpf_mem_alloc *ma, void *ptr)
+{
+	if (!ptr)
+		return;
+
+	unit_free(this_cpu_ptr(ma->cache), ptr);
+}
-- 
2.30.2



* [PATCH bpf-next 2/5] bpf: Convert hash map to bpf_mem_alloc.
  2022-06-23  0:32 [PATCH bpf-next 0/5] bpf: BPF specific memory allocator Alexei Starovoitov
  2022-06-23  0:32 ` [PATCH bpf-next 1/5] bpf: Introduce any context " Alexei Starovoitov
@ 2022-06-23  0:32 ` Alexei Starovoitov
  2022-06-23  0:32 ` [PATCH bpf-next 3/5] selftests/bpf: Improve test coverage of test_maps Alexei Starovoitov
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 72+ messages in thread
From: Alexei Starovoitov @ 2022-06-23  0:32 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, kafai, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Convert bpf hash map to use bpf memory allocator.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/hashtab.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 17fb69c0e0dc..62b7b6e6751b 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -14,6 +14,7 @@
 #include "percpu_freelist.h"
 #include "bpf_lru_list.h"
 #include "map_in_map.h"
+#include <linux/bpf_mem_alloc.h>
 
 #define HTAB_CREATE_FLAG_MASK						\
 	(BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE |	\
@@ -92,6 +93,7 @@ struct bucket {
 
 struct bpf_htab {
 	struct bpf_map map;
+	struct bpf_mem_alloc ma;
 	struct bucket *buckets;
 	void *elems;
 	union {
@@ -554,6 +556,10 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 
 	htab_init_buckets(htab);
 
+	err = bpf_mem_alloc_init(&htab->ma, htab->elem_size);
+	if (err)
+		goto free_map_locked;
+
 	if (prealloc) {
 		err = prealloc_init(htab);
 		if (err)
@@ -577,6 +583,7 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 	for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
 		free_percpu(htab->map_locked[i]);
 	bpf_map_area_free(htab->buckets);
+	bpf_mem_alloc_destroy(&htab->ma);
 free_htab:
 	lockdep_unregister_key(&htab->lockdep_key);
 	kfree(htab);
@@ -853,7 +860,7 @@ static void htab_elem_free(struct bpf_htab *htab, struct htab_elem *l)
 	if (htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH)
 		free_percpu(htab_elem_get_ptr(l, htab->map.key_size));
 	check_and_free_fields(htab, l);
-	kfree(l);
+	bpf_mem_cache_free(&htab->ma, l);
 }
 
 static void htab_elem_free_rcu(struct rcu_head *head)
@@ -977,9 +984,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 				l_new = ERR_PTR(-E2BIG);
 				goto dec_count;
 			}
-		l_new = bpf_map_kmalloc_node(&htab->map, htab->elem_size,
-					     GFP_ATOMIC | __GFP_NOWARN,
-					     htab->map.numa_node);
+		l_new = bpf_mem_cache_alloc(&htab->ma);
 		if (!l_new) {
 			l_new = ERR_PTR(-ENOMEM);
 			goto dec_count;
@@ -998,7 +1003,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 			pptr = bpf_map_alloc_percpu(&htab->map, size, 8,
 						    GFP_ATOMIC | __GFP_NOWARN);
 			if (!pptr) {
-				kfree(l_new);
+				bpf_mem_cache_free(&htab->ma, l_new);
 				l_new = ERR_PTR(-ENOMEM);
 				goto dec_count;
 			}
@@ -1493,6 +1498,7 @@ static void htab_map_free(struct bpf_map *map)
 	bpf_map_free_kptr_off_tab(map);
 	free_percpu(htab->extra_elems);
 	bpf_map_area_free(htab->buckets);
+	bpf_mem_alloc_destroy(&htab->ma);
 	for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
 		free_percpu(htab->map_locked[i]);
 	lockdep_unregister_key(&htab->lockdep_key);
-- 
2.30.2



* [PATCH bpf-next 3/5] selftests/bpf: Improve test coverage of test_maps
  2022-06-23  0:32 [PATCH bpf-next 0/5] bpf: BPF specific memory allocator Alexei Starovoitov
  2022-06-23  0:32 ` [PATCH bpf-next 1/5] bpf: Introduce any context " Alexei Starovoitov
  2022-06-23  0:32 ` [PATCH bpf-next 2/5] bpf: Convert hash map to bpf_mem_alloc Alexei Starovoitov
@ 2022-06-23  0:32 ` Alexei Starovoitov
  2022-06-23  0:32 ` [PATCH bpf-next 4/5] samples/bpf: Reduce syscall overhead in map_perf_test Alexei Starovoitov
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 72+ messages in thread
From: Alexei Starovoitov @ 2022-06-23  0:32 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, kafai, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Make test_maps more stressful with more parallelism in
update/delete/lookup/walk including different value sizes.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/testing/selftests/bpf/test_maps.c | 38 ++++++++++++++++---------
 1 file changed, 24 insertions(+), 14 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index cbebfaa7c1e8..d1ffc76814d9 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -264,10 +264,11 @@ static void test_hashmap_percpu(unsigned int task, void *data)
 	close(fd);
 }
 
+#define VALUE_SIZE 3
 static int helper_fill_hashmap(int max_entries)
 {
 	int i, fd, ret;
-	long long key, value;
+	long long key, value[VALUE_SIZE] = {};
 
 	fd = bpf_map_create(BPF_MAP_TYPE_HASH, NULL, sizeof(key), sizeof(value),
 			    max_entries, &map_opts);
@@ -276,8 +277,8 @@ static int helper_fill_hashmap(int max_entries)
 	      "err: %s, flags: 0x%x\n", strerror(errno), map_opts.map_flags);
 
 	for (i = 0; i < max_entries; i++) {
-		key = i; value = key;
-		ret = bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST);
+		key = i; value[0] = key;
+		ret = bpf_map_update_elem(fd, &key, value, BPF_NOEXIST);
 		CHECK(ret != 0,
 		      "can't update hashmap",
 		      "err: %s\n", strerror(ret));
@@ -288,8 +289,8 @@ static int helper_fill_hashmap(int max_entries)
 
 static void test_hashmap_walk(unsigned int task, void *data)
 {
-	int fd, i, max_entries = 1000;
-	long long key, value, next_key;
+	int fd, i, max_entries = 10000;
+	long long key, value[VALUE_SIZE], next_key;
 	bool next_key_valid = true;
 
 	fd = helper_fill_hashmap(max_entries);
@@ -297,7 +298,7 @@ static void test_hashmap_walk(unsigned int task, void *data)
 	for (i = 0; bpf_map_get_next_key(fd, !i ? NULL : &key,
 					 &next_key) == 0; i++) {
 		key = next_key;
-		assert(bpf_map_lookup_elem(fd, &key, &value) == 0);
+		assert(bpf_map_lookup_elem(fd, &key, value) == 0);
 	}
 
 	assert(i == max_entries);
@@ -305,9 +306,9 @@ static void test_hashmap_walk(unsigned int task, void *data)
 	assert(bpf_map_get_next_key(fd, NULL, &key) == 0);
 	for (i = 0; next_key_valid; i++) {
 		next_key_valid = bpf_map_get_next_key(fd, &key, &next_key) == 0;
-		assert(bpf_map_lookup_elem(fd, &key, &value) == 0);
-		value++;
-		assert(bpf_map_update_elem(fd, &key, &value, BPF_EXIST) == 0);
+		assert(bpf_map_lookup_elem(fd, &key, value) == 0);
+		value[0]++;
+		assert(bpf_map_update_elem(fd, &key, value, BPF_EXIST) == 0);
 		key = next_key;
 	}
 
@@ -316,8 +317,8 @@ static void test_hashmap_walk(unsigned int task, void *data)
 	for (i = 0; bpf_map_get_next_key(fd, !i ? NULL : &key,
 					 &next_key) == 0; i++) {
 		key = next_key;
-		assert(bpf_map_lookup_elem(fd, &key, &value) == 0);
-		assert(value - 1 == key);
+		assert(bpf_map_lookup_elem(fd, &key, value) == 0);
+		assert(value[0] - 1 == key);
 	}
 
 	assert(i == max_entries);
@@ -1371,16 +1372,16 @@ static void __run_parallel(unsigned int tasks,
 
 static void test_map_stress(void)
 {
+	run_parallel(100, test_hashmap_walk, NULL);
 	run_parallel(100, test_hashmap, NULL);
 	run_parallel(100, test_hashmap_percpu, NULL);
 	run_parallel(100, test_hashmap_sizes, NULL);
-	run_parallel(100, test_hashmap_walk, NULL);
 
 	run_parallel(100, test_arraymap, NULL);
 	run_parallel(100, test_arraymap_percpu, NULL);
 }
 
-#define TASKS 1024
+#define TASKS 100
 
 #define DO_UPDATE 1
 #define DO_DELETE 0
@@ -1432,6 +1433,8 @@ static void test_update_delete(unsigned int fn, void *data)
 	int fd = ((int *)data)[0];
 	int i, key, value, err;
 
+	if (fn & 1)
+		test_hashmap_walk(fn, NULL);
 	for (i = fn; i < MAP_SIZE; i += TASKS) {
 		key = value = i;
 
@@ -1455,7 +1458,7 @@ static void test_update_delete(unsigned int fn, void *data)
 
 static void test_map_parallel(void)
 {
-	int i, fd, key = 0, value = 0;
+	int i, fd, key = 0, value = 0, j = 0;
 	int data[2];
 
 	fd = bpf_map_create(BPF_MAP_TYPE_HASH, NULL, sizeof(key), sizeof(value),
@@ -1466,6 +1469,7 @@ static void test_map_parallel(void)
 		exit(1);
 	}
 
+again:
 	/* Use the same fd in children to add elements to this map:
 	 * child_0 adds key=0, key=1024, key=2048, ...
 	 * child_1 adds key=1, key=1025, key=2049, ...
@@ -1502,6 +1506,12 @@ static void test_map_parallel(void)
 	key = -1;
 	assert(bpf_map_get_next_key(fd, NULL, &key) < 0 && errno == ENOENT);
 	assert(bpf_map_get_next_key(fd, &key, &key) < 0 && errno == ENOENT);
+
+	key = 0;
+	bpf_map_delete_elem(fd, &key);
+	if (j++ < 5)
+		goto again;
+	close(fd);
 }
 
 static void test_map_rdonly(void)
-- 
2.30.2



* [PATCH bpf-next 4/5] samples/bpf: Reduce syscall overhead in map_perf_test.
  2022-06-23  0:32 [PATCH bpf-next 0/5] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (2 preceding siblings ...)
  2022-06-23  0:32 ` [PATCH bpf-next 3/5] selftests/bpf: Improve test coverage of test_maps Alexei Starovoitov
@ 2022-06-23  0:32 ` Alexei Starovoitov
  2022-06-23  0:32 ` [PATCH bpf-next 5/5] bpf: Relax the requirement to use preallocated hash maps in tracing progs Alexei Starovoitov
  2022-06-27  7:03 ` [PATCH bpf-next 0/5] bpf: BPF specific memory allocator Christoph Hellwig
  5 siblings, 0 replies; 72+ messages in thread
From: Alexei Starovoitov @ 2022-06-23  0:32 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, kafai, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Make map_perf_test for preallocated and non-preallocated hash maps
spend more time inside the bpf program, to focus performance analysis
on the speed of the update/lookup/delete operations performed by the
bpf program.

It makes 'perf report' of bpf_mem_alloc look like:
 11.76%  map_perf_test    [k] _raw_spin_lock_irqsave
 11.26%  map_perf_test    [k] htab_map_update_elem
  9.70%  map_perf_test    [k] _raw_spin_lock
  9.47%  map_perf_test    [k] htab_map_delete_elem
  8.57%  map_perf_test    [k] memcpy_erms
  5.58%  map_perf_test    [k] alloc_htab_elem
  4.09%  map_perf_test    [k] __htab_map_lookup_elem
  3.44%  map_perf_test    [k] syscall_exit_to_user_mode
  3.13%  map_perf_test    [k] lookup_nulls_elem_raw
  3.05%  map_perf_test    [k] migrate_enable
  3.04%  map_perf_test    [k] memcmp
  2.67%  map_perf_test    [k] unit_free
  2.39%  map_perf_test    [k] lookup_elem_raw

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 samples/bpf/map_perf_test_kern.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/samples/bpf/map_perf_test_kern.c b/samples/bpf/map_perf_test_kern.c
index 8773f22b6a98..d5af4df9d403 100644
--- a/samples/bpf/map_perf_test_kern.c
+++ b/samples/bpf/map_perf_test_kern.c
@@ -108,11 +108,14 @@ int stress_hmap(struct pt_regs *ctx)
 	u32 key = bpf_get_current_pid_tgid();
 	long init_val = 1;
 	long *value;
+	int i;
 
-	bpf_map_update_elem(&hash_map, &key, &init_val, BPF_ANY);
-	value = bpf_map_lookup_elem(&hash_map, &key);
-	if (value)
-		bpf_map_delete_elem(&hash_map, &key);
+	for (i = 0; i < 30; i++) {
+		bpf_map_update_elem(&hash_map, &key, &init_val, BPF_ANY);
+		value = bpf_map_lookup_elem(&hash_map, &key);
+		if (value)
+			bpf_map_delete_elem(&hash_map, &key);
+	}
 
 	return 0;
 }
@@ -137,11 +140,14 @@ int stress_hmap_alloc(struct pt_regs *ctx)
 	u32 key = bpf_get_current_pid_tgid();
 	long init_val = 1;
 	long *value;
+	int i;
 
-	bpf_map_update_elem(&hash_map_alloc, &key, &init_val, BPF_ANY);
-	value = bpf_map_lookup_elem(&hash_map_alloc, &key);
-	if (value)
-		bpf_map_delete_elem(&hash_map_alloc, &key);
+	for (i = 0; i < 30; i++) {
+		bpf_map_update_elem(&hash_map_alloc, &key, &init_val, BPF_ANY);
+		value = bpf_map_lookup_elem(&hash_map_alloc, &key);
+		if (value)
+			bpf_map_delete_elem(&hash_map_alloc, &key);
+	}
 	return 0;
 }
 
-- 
2.30.2



* [PATCH bpf-next 5/5] bpf: Relax the requirement to use preallocated hash maps in tracing progs.
  2022-06-23  0:32 [PATCH bpf-next 0/5] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (3 preceding siblings ...)
  2022-06-23  0:32 ` [PATCH bpf-next 4/5] samples/bpf: Reduce syscall overhead in map_perf_test Alexei Starovoitov
@ 2022-06-23  0:32 ` Alexei Starovoitov
  2022-06-27  7:03 ` [PATCH bpf-next 0/5] bpf: BPF specific memory allocator Christoph Hellwig
  5 siblings, 0 replies; 72+ messages in thread
From: Alexei Starovoitov @ 2022-06-23  0:32 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, tj, kafai, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Since bpf hash map was converted to use bpf_mem_alloc it is safe to use
from tracing programs and in RT kernels.
But per-cpu hash map is still using dynamic allocation for per-cpu map
values, hence keep the warning for this map type.
In the future alloc_percpu_gfp can be front-ended with bpf_mem_cache
and this restriction will be completely lifted.
perf_event (NMI) bpf programs have to use preallocated hash maps,
because free_htab_elem() is using call_rcu which might crash if re-entered.

Sleepable bpf programs have to use preallocated hash maps, because the
lifetime of the map elements is not protected by rcu_read_lock/unlock.
This restriction can be lifted in the future as well.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/verifier.c | 31 ++++++++++++++++++++++---------
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index a20d7736a5b2..90d70304c6f1 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -12566,10 +12566,12 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
 	 * For programs attached to PERF events this is mandatory as the
 	 * perf NMI can hit any arbitrary code sequence.
 	 *
-	 * All other trace types using preallocated hash maps are unsafe as
-	 * well because tracepoint or kprobes can be inside locked regions
-	 * of the memory allocator or at a place where a recursion into the
-	 * memory allocator would see inconsistent state.
+	 * All other trace types using non-preallocated per-cpu hash maps are
+	 * unsafe as well because tracepoint or kprobes can be inside locked
+	 * regions of the per-cpu memory allocator or at a place where a
+	 * recursion into the per-cpu memory allocator would see inconsistent
+	 * state. Non per-cpu hash maps are using bpf_mem_alloc-tor which is
+	 * safe to use from kprobe/fentry and in RT.
 	 *
 	 * On RT enabled kernels run-time allocation of all trace type
 	 * programs is strictly prohibited due to lock type constraints. On
@@ -12579,15 +12581,26 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
 	 */
 	if (is_tracing_prog_type(prog_type) && !is_preallocated_map(map)) {
 		if (prog_type == BPF_PROG_TYPE_PERF_EVENT) {
+			/* perf_event bpf progs have to use preallocated hash maps
+			 * because non-prealloc is still relying on call_rcu to free
+			 * elements.
+			 */
 			verbose(env, "perf_event programs can only use preallocated hash map\n");
 			return -EINVAL;
 		}
-		if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
-			verbose(env, "trace type programs can only use preallocated hash map\n");
-			return -EINVAL;
+		if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
+		    (map->inner_map_meta &&
+		     map->inner_map_meta->map_type == BPF_MAP_TYPE_PERCPU_HASH)) {
+			if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
+				verbose(env,
+					"trace type programs can only use preallocated per-cpu hash map\n");
+				return -EINVAL;
+			}
+			WARN_ONCE(1, "trace type BPF program uses run-time allocation\n");
+			verbose(env,
+				"trace type programs with run-time allocated per-cpu hash maps are unsafe."
+				" Switch to preallocated hash maps.\n");
 		}
-		WARN_ONCE(1, "trace type BPF program uses run-time allocation\n");
-		verbose(env, "trace type programs with run-time allocated hash maps are unsafe. Switch to preallocated hash maps.\n");
 	}
 
 	if (map_value_has_spin_lock(map)) {
-- 
2.30.2



* RE: [PATCH bpf-next 1/5] bpf: Introduce any context BPF specific memory allocator.
  2022-06-23  0:32 ` [PATCH bpf-next 1/5] bpf: Introduce any context " Alexei Starovoitov
@ 2022-06-25  1:23   ` John Fastabend
  2022-06-26 17:19     ` Alexei Starovoitov
  0 siblings, 1 reply; 72+ messages in thread
From: John Fastabend @ 2022-06-25  1:23 UTC (permalink / raw)
  To: Alexei Starovoitov, davem; +Cc: daniel, andrii, tj, kafai, bpf, kernel-team

Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Tracing BPF programs can attach to kprobe and fentry. Hence they
> run in unknown context where calling plain kmalloc() might not be safe.
> 
> Front-end kmalloc() with per-cpu per-bucket cache of free elements.
> Refill this cache asynchronously from irq_work.
> 
> BPF programs always run with migration disabled.
> It's safe to allocate from cache of the current cpu with irqs disabled.
> Free-ing is always done into bucket of the current cpu as well.
> irq_work trims extra free elements from buckets with kfree
> and refills them with kmalloc, so global kmalloc logic takes care
> of freeing objects allocated by one cpu and freed on another.
> 
> struct bpf_mem_alloc supports two modes:
> - When size != 0 create kmem_cache and bpf_mem_cache for each cpu.
>   This is typical bpf hash map use case when all elements have equal size.
> - When size == 0 allocate 11 bpf_mem_cache-s for each cpu, then rely on
>   kmalloc/kfree. Max allocation size is 4096 in this case.
>   This is bpf_dynptr and bpf_kptr use case.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---

Some initial feedback but still looking over it. Figured it made more
sense to dump current thoughts then drop it this evening for Monday.

[...]

> +static int bpf_mem_cache_idx(size_t size)
    [...]

> +#define NUM_CACHES 11
> +
> +struct bpf_mem_cache {
> +	/* per-cpu list of free objects of size 'unit_size'.
> +	 * All accesses are done with preemption disabled
> +	 * with __llist_add() and __llist_del_first().
> +	 */
> +	struct llist_head free_llist;
> +
> +	/* NMI only free list.
> +	 * All accesses are NMI-safe llist_add() and llist_del_first().
> +	 *
> +	 * Each allocated object is either on free_llist or on free_llist_nmi.
> +	 * One cpu can allocate it from NMI by doing llist_del_first() from
> +	 * free_llist_nmi, while another might free it back from non-NMI by
> +	 * doing llist_add() into free_llist.
> +	 */
> +	struct llist_head free_llist_nmi;

stupid nit but newline here helps me read this.

> +	/* kmem_cache != NULL when bpf_mem_alloc was created for specific
> +	 * element size.
> +	 */
> +	struct kmem_cache *kmem_cache;

> +	struct irq_work refill_work;
> +	struct mem_cgroup *memcg;
> +	int unit_size;
> +	/* count of objects in free_llist */
> +	int free_cnt;
> +	/* count of objects in free_llist_nmi */
> +	atomic_t free_cnt_nmi;
> +	/* flag to refill nmi list too */
> +	bool refill_nmi_list;
> +};

What about having two types, one for the fixed size cache and one for buckets?
The logic below gets a bunch of if cases with just the single type. OTOH
I messed around with it for a bit and then had to duplicate most of the
code, so I'm not sure it's entirely a good idea, but the __alloc() with
the 'if this else that' sort of made me think of it.

> +
> +static struct llist_node notrace *__llist_del_first(struct llist_head *head)
     [...]

> +
> +#define BATCH 48
> +#define LOW_WATERMARK 32
> +#define HIGH_WATERMARK 96
> +/* Assuming the average number of elements per bucket is 64, when all buckets
> + * are used the total memory will be: 64*16*32 + 64*32*32 + 64*64*32 + ... +
> + * 64*4096*32 ~ 20Mbyte
> + */
> +
> +/* extra macro useful for testing by randomizing in_nmi condition */
> +#define bpf_in_nmi() in_nmi()
> +
> +static void *__alloc(struct bpf_mem_cache *c, int node)

For example with two types this mostly drops out. Of course then the callers
have to know the type, so not sure. And you get two alloc_bulks and
so on. It's not obvious this works out well.

[...]

> +static void free_bulk_nmi(struct bpf_mem_cache *c)
> +{
> +	struct llist_node *llnode;
> +	int cnt;
> +
> +	do {
> +		llnode = llist_del_first(&c->free_llist_nmi);
> +		if (llnode)
> +			cnt = atomic_dec_return(&c->free_cnt_nmi);
> +		else
> +			cnt = 0;
> +		__free(c, llnode);
> +	} while (cnt > (HIGH_WATERMARK + LOW_WATERMARK) / 2);
> +}

Comment from irq_work_run_list,

	/*
	 * On PREEMPT_RT IRQ-work which is not marked as HARD will be processed
	 * in a per-CPU thread in preemptible context. Only the items which are
	 * marked as IRQ_WORK_HARD_IRQ will be processed in hardirq context.
	 */

Not an RT expert but I read this to mean in PREEMPT_RT case we can't assume
this is !preemptible? If I read correctly then is there a risk we get
two runners here? And by extension would need to worry about free_cnt
and friends getting corrupted.

> +
> +static void bpf_mem_refill(struct irq_work *work)
> +{
> +	struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache, refill_work);
> +	int cnt;
> +
> +	cnt = c->free_cnt;
> +	if (cnt < LOW_WATERMARK)
> +		/* irq_work runs on this cpu and kmalloc will allocate
> +		 * from the current numa node which is what we want here.
> +		 */
> +		alloc_bulk(c, BATCH, NUMA_NO_NODE);
> +	else if (cnt > HIGH_WATERMARK)
> +		free_bulk(c);
> +
> +	if (!c->refill_nmi_list)
> +		/* don't refill NMI specific freelist
> +		 * until alloc/free from NMI.
> +		 */
> +		return;
> +	cnt = atomic_read(&c->free_cnt_nmi);
> +	if (cnt < LOW_WATERMARK)
> +		alloc_bulk_nmi(c, BATCH, NUMA_NO_NODE);
> +	else if (cnt > HIGH_WATERMARK)
> +		free_bulk_nmi(c);
> +	c->refill_nmi_list = false;
> +}
> +
> +static void notrace irq_work_raise(struct bpf_mem_cache *c, bool in_nmi)
> +{
> +	c->refill_nmi_list = in_nmi;

Should this be,

	c->refill_nmi_list |= in_nmi;

this would resolve comment in unit_alloc? We don't want to clear it if
we end up calling irq_work_raise from in_nmi and then in another
context. It would be really hard to debug if the case is possible and
a busy box just doesn't refill nmi enough.

> +	irq_work_queue(&c->refill_work);
> +}
> +
> +static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu)
    [...]

> +
> +/* notrace is necessary here and in other functions to make sure
> + * bpf programs cannot attach to them and cause llist corruptions.
> + */

Thanks for the comment.

> +static void notrace *unit_alloc(struct bpf_mem_cache *c)
> +{
> +	bool in_nmi = bpf_in_nmi();
> +	struct llist_node *llnode;
> +	unsigned long flags;
> +	int cnt = 0;
> +
> +	if (unlikely(in_nmi)) {
> +		llnode = llist_del_first(&c->free_llist_nmi);
> +		if (llnode)
> +			cnt = atomic_dec_return(&c->free_cnt_nmi);

Dumb question, maybe it's Friday afternoon. If we are in_nmi() and preempt
disabled, why do we need the atomic_dec_return()?

> +	} else {
> +		/* Disable irqs to prevent the following race:
> +		 * bpf_prog_A
> +		 *   bpf_mem_alloc
> +		 *      preemption or irq -> bpf_prog_B
> +		 *        bpf_mem_alloc
> +		 */
> +		local_irq_save(flags);
> +		llnode = __llist_del_first(&c->free_llist);
> +		if (llnode)
> +			cnt = --c->free_cnt;
> +		local_irq_restore(flags);
> +	}
> +	WARN_ON(cnt < 0);
> +

Is this a problem?

  in_nmi = false
  bpf_prog_A
   bpf_mem_alloc
   irq_restore
   irq -> bpf_prog_B
            bpf_mem_alloc
               in_nmi = true
               irq_work_raise(c, true)
   irq_work_raise(c, false)

At somepoint later
 
   bpf_mem_refill()
    refill_nmi_list <- false

The timing is tight but possible, I suspect. See above; a simple fix would
be to just |= the refill_nmi_list bool? We shouldn't be clearing it
from a raise op.

> +	if (cnt < LOW_WATERMARK)
> +		irq_work_raise(c, in_nmi);
> +	return llnode;
> +}
>

OK need to drop for now. Will pick up reviewing the rest later.


* Re: [PATCH bpf-next 1/5] bpf: Introduce any context BPF specific memory allocator.
  2022-06-25  1:23   ` John Fastabend
@ 2022-06-26 17:19     ` Alexei Starovoitov
  0 siblings, 0 replies; 72+ messages in thread
From: Alexei Starovoitov @ 2022-06-26 17:19 UTC (permalink / raw)
  To: John Fastabend; +Cc: davem, daniel, andrii, tj, kafai, bpf, kernel-team

On Fri, Jun 24, 2022 at 06:23:26PM -0700, John Fastabend wrote:
> Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> > 
> > Tracing BPF programs can attach to kprobe and fentry. Hence they
> > run in unknown context where calling plain kmalloc() might not be safe.
> > 
> > Front-end kmalloc() with per-cpu per-bucket cache of free elements.
> > Refill this cache asynchronously from irq_work.
> > 
> > BPF programs always run with migration disabled.
> > It's safe to allocate from cache of the current cpu with irqs disabled.
> > Free-ing is always done into bucket of the current cpu as well.
> > irq_work trims extra free elements from buckets with kfree
> > and refills them with kmalloc, so global kmalloc logic takes care
> > of freeing objects allocated by one cpu and freed on another.
> > 
> > struct bpf_mem_alloc supports two modes:
> > - When size != 0 create kmem_cache and bpf_mem_cache for each cpu.
> >   This is typical bpf hash map use case when all elements have equal size.
> > - When size == 0 allocate 11 bpf_mem_cache-s for each cpu, then rely on
> >   kmalloc/kfree. Max allocation size is 4096 in this case.
> >   This is bpf_dynptr and bpf_kptr use case.
> > 
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
> 
> Some initial feedback but still looking over it. Figured it made more
> sense to dump current thoughts then drop it this evening for Monday.
> 
> [...]
> 
> > +static int bpf_mem_cache_idx(size_t size)
>     [...]
> 
> > +#define NUM_CACHES 11
> > +
> > +struct bpf_mem_cache {
> > +	/* per-cpu list of free objects of size 'unit_size'.
> > +	 * All accesses are done with preemption disabled
> > +	 * with __llist_add() and __llist_del_first().
> > +	 */
> > +	struct llist_head free_llist;
> > +
> > +	/* NMI only free list.
> > +	 * All accesses are NMI-safe llist_add() and llist_del_first().
> > +	 *
> > +	 * Each allocated object is either on free_llist or on free_llist_nmi.
> > +	 * One cpu can allocate it from NMI by doing llist_del_first() from
> > +	 * free_llist_nmi, while another might free it back from non-NMI by
> > +	 * doing llist_add() into free_llist.
> > +	 */
> > +	struct llist_head free_llist_nmi;
> 
> stupid nit but newline here helps me read this.

sure.

> > +	/* kmem_cache != NULL when bpf_mem_alloc was created for specific
> > +	 * element size.
> > +	 */
> > +	struct kmem_cache *kmem_cache;
> 
> > +	struct irq_work refill_work;
> > +	struct mem_cgroup *memcg;
> > +	int unit_size;
> > +	/* count of objects in free_llist */
> > +	int free_cnt;
> > +	/* count of objects in free_llist_nmi */
> > +	atomic_t free_cnt_nmi;
> > +	/* flag to refill nmi list too */
> > +	bool refill_nmi_list;
> > +};
> 
> What about having two types one for fixed size cache and one for buckets?
> The logic below gets a bunch of if cases with just the single type. OTOH
> I messed around with it for a bit and then had to duplicate most of the
> codes so I'm not sure its entirely a good idea, but the __alloc() with
> the 'if this else that' sort of made me think of it.

Two 'if's in __alloc and __free are the only difference.
The rest of bpf_mem_cache logic is exactly the same between fixed size
and buckets.
I'm not sure what 'two types' would look like.

> > +
> > +static struct llist_node notrace *__llist_del_first(struct llist_head *head)
>      [...]
> 
> > +
> > +#define BATCH 48
> > +#define LOW_WATERMARK 32
> > +#define HIGH_WATERMARK 96
> > +/* Assuming the average number of elements per bucket is 64, when all buckets
> > + * are used the total memory will be: 64*16*32 + 64*32*32 + 64*64*32 + ... +
> > + * 64*4096*32 ~ 20Mbyte
> > + */
> > +
> > +/* extra macro useful for testing by randomizing in_nmi condition */
> > +#define bpf_in_nmi() in_nmi()
> > +
> > +static void *__alloc(struct bpf_mem_cache *c, int node)
> 
> For example with two types this mostly drops out. Of course then the callers
> have to know the type so not sure. And you get two alloc_bulks and
> so on. Its not obviously this works out well.
> 
> [...]
> 
> > +static void free_bulk_nmi(struct bpf_mem_cache *c)
> > +{
> > +	struct llist_node *llnode;
> > +	int cnt;
> > +
> > +	do {
> > +		llnode = llist_del_first(&c->free_llist_nmi);
> > +		if (llnode)
> > +			cnt = atomic_dec_return(&c->free_cnt_nmi);
> > +		else
> > +			cnt = 0;
> > +		__free(c, llnode);
> > +	} while (cnt > (HIGH_WATERMARK + LOW_WATERMARK) / 2);
> > +}
> 
> Comment from irq_work_run_list,
> 
> 	/*
> 	 * On PREEMPT_RT IRQ-work which is not marked as HARD will be processed
> 	 * in a per-CPU thread in preemptible context. Only the items which are
> 	 * marked as IRQ_WORK_HARD_IRQ will be processed in hardirq context.
> 	 */
> 
> Not an RT expert but I read this to mean in PREEMPT_RT case we can't assume
> this is !preemptible? If I read correctly then is there a risk we get
> two runners here? And by extension would need to worry about free_cnt
> and friends getting corrupted.

Right. That's why there is local_irq_save:
                if (IS_ENABLED(CONFIG_PREEMPT_RT))
                        local_irq_save(flags);
                llnode = __llist_del_first(&c->free_llist);
                if (llnode)
                        cnt = --c->free_cnt;
in alloc/free_bulk specifically for RT.
So that alloc_bulk doesn't race with that kthread.
and free_cnt doesn't get corrupted.

> 
> > +
> > +static void bpf_mem_refill(struct irq_work *work)
> > +{
> > +	struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache, refill_work);
> > +	int cnt;
> > +
> > +	cnt = c->free_cnt;
> > +	if (cnt < LOW_WATERMARK)
> > +		/* irq_work runs on this cpu and kmalloc will allocate
> > +		 * from the current numa node which is what we want here.
> > +		 */
> > +		alloc_bulk(c, BATCH, NUMA_NO_NODE);
> > +	else if (cnt > HIGH_WATERMARK)
> > +		free_bulk(c);
> > +
> > +	if (!c->refill_nmi_list)
> > +		/* don't refill NMI specific freelist
> > +		 * until alloc/free from NMI.
> > +		 */
> > +		return;
> > +	cnt = atomic_read(&c->free_cnt_nmi);
> > +	if (cnt < LOW_WATERMARK)
> > +		alloc_bulk_nmi(c, BATCH, NUMA_NO_NODE);
> > +	else if (cnt > HIGH_WATERMARK)
> > +		free_bulk_nmi(c);
> > +	c->refill_nmi_list = false;
> > +}
> > +
> > +static void notrace irq_work_raise(struct bpf_mem_cache *c, bool in_nmi)
> > +{
> > +	c->refill_nmi_list = in_nmi;
> 
> Should this be,
> 
> 	c->refill_nmi_list |= in_nmi;
> 
> this would resolve comment in unit_alloc? We don't want to clear it if
> we end up calling irq_work_raise from in_nmi and then in another
> context. It would be really hard to debug if the case is possible and
> a busy box just doesn't refill nmi enough.

Excellent catch. Yes. That's a bug.

> 
> > +	irq_work_queue(&c->refill_work);
> > +}
> > +
> > +static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu)
>     [...]
> 
> > +
> > +/* notrace is necessary here and in other functions to make sure
> > + * bpf programs cannot attach to them and cause llist corruptions.
> > + */
> 
> Thanks for the comment.
> 
> > +static void notrace *unit_alloc(struct bpf_mem_cache *c)
> > +{
> > +	bool in_nmi = bpf_in_nmi();
> > +	struct llist_node *llnode;
> > +	unsigned long flags;
> > +	int cnt = 0;
> > +
> > +	if (unlikely(in_nmi)) {
> > +		llnode = llist_del_first(&c->free_llist_nmi);
> > +		if (llnode)
> > +			cnt = atomic_dec_return(&c->free_cnt_nmi);
> 
> Dumb question maybe its Friday afternoon. If we are in_nmi() and preempt
> disabled why do we need the atomic_dec_return()?

atomic instead of normal free_cnt_nmi-- ?
because it's nmi. The if(in_nmi) bits of unit_alloc can be reentrant.

> 
> > +	} else {
> > +		/* Disable irqs to prevent the following race:
> > +		 * bpf_prog_A
> > +		 *   bpf_mem_alloc
> > +		 *      preemption or irq -> bpf_prog_B
> > +		 *        bpf_mem_alloc
> > +		 */
> > +		local_irq_save(flags);
> > +		llnode = __llist_del_first(&c->free_llist);
> > +		if (llnode)
> > +			cnt = --c->free_cnt;
> > +		local_irq_restore(flags);
> > +	}
> > +	WARN_ON(cnt < 0);
> > +
> 
> Is this a problem?
> 
>   in_nmi = false
>   bpf_prog_A
>    bpf_mem_alloc
>    irq_restore
>    irq -> bpf_prog_B
>             bpf_mem_alloc
>                in_nmi = true
>                irq_work_raise(c, true)
>    irq_work_raise(c, false)
> 
> At somepoint later
>  
>    bpf_mem_refill()
>     refill_nmi_list <- false
> 
> The timing is tight but possible I suspect. See above simple fix would
> be to just | the refill_nim_list bool? We shouldn't be clearing it
> from a raise op.

Yes. refill_nmi_list |= in_nmi or similar is necessary.
This |= is not atomic, so just |= may not be good enough.
Probably something like this would be better:

if (in_nmi)
  c->refill_nmi_list = in_nmi;

so that irq_work_raise() will only set it
and bpf_mem_refill() will clear it.

> > +	if (cnt < LOW_WATERMARK)
> > +		irq_work_raise(c, in_nmi);
> > +	return llnode;
> > +}
> >
> 
> OK need to drop for now. Will pick up reviewing the rest later.

Thanks a lot for quick review! Looking forward to the rest.
There is still a ton of work to improve this algo:
- get rid of call_rcu in hash map
- get rid of atomic_inc/dec in hash map
- tune watermarks per allocation size
- adopt this approach alloc_percpu_gfp

The watermarks are hacky now. The batch of ~64 for large units feels wasteful.
Probably should be lower for any unit larger than 1k.
Also an explicit bpf_mem_prealloc(... cnt ...) is needed, so that a bpf prog
can preallocate 'cnt' elements from a safe context. This would mean that
the watermarks will be floating, so that bpf_mem_refill doesn't go and
immediately free_bulk after the user requested a prealloc.
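
For example, such a prealloc interface might look roughly like this sketch
(hypothetical, not part of this series; the floating 'low_watermark' and
'high_watermark' fields don't exist yet and would replace the LOW/HIGH
watermark constants):

  static void bpf_mem_prealloc(struct bpf_mem_cache *c, int cnt)
  {
  	/* must run from a safe (preemptible, non-tracing) context */
  	alloc_bulk(c, cnt, NUMA_NO_NODE);
  	/* float the watermarks so bpf_mem_refill() doesn't immediately
  	 * free_bulk() the elements that were just preallocated
  	 */
  	c->low_watermark += cnt;	/* hypothetical field */
  	c->high_watermark += cnt;	/* hypothetical field */
  }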


* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-06-23  0:32 [PATCH bpf-next 0/5] bpf: BPF specific memory allocator Alexei Starovoitov
                   ` (4 preceding siblings ...)
  2022-06-23  0:32 ` [PATCH bpf-next 5/5] bpf: Relax the requirement to use preallocated hash maps in tracing progs Alexei Starovoitov
@ 2022-06-27  7:03 ` Christoph Hellwig
  2022-06-28  0:17   ` Christoph Lameter
  2022-07-04 20:34   ` Matthew Wilcox
  5 siblings, 2 replies; 72+ messages in thread
From: Christoph Hellwig @ 2022-06-27  7:03 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: davem, daniel, andrii, tj, kafai, bpf, kernel-team, linux-mm,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Vlastimil Babka

I'd suggest you discuss your needs with the slab maintainers and the mm
community first.

On Wed, Jun 22, 2022 at 05:32:25PM -0700, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Introduce any context BPF specific memory allocator.
> 
> Tracing BPF programs can attach to kprobe and fentry. Hence they
> run in unknown context where calling plain kmalloc() might not be safe.
> Front-end kmalloc() with per-cpu per-bucket cache of free elements.
> Refill this cache asynchronously from irq_work.
> 
> There is a lot more work ahead, but this set is useful base.
> Future work:
> - get rid of call_rcu in hash map
> - get rid of atomic_inc/dec in hash map
> - tune watermarks per allocation size
> - adopt this approach alloc_percpu_gfp
> - expose bpf_mem_alloc as uapi FD to be used in dynptr_alloc, kptr_alloc
> - add sysctl to force bpf_mem_alloc in hash map when safe even if pre-alloc
>   requested to reduce memory consumption
> - convert lru map to bpf_mem_alloc
> 
> Alexei Starovoitov (5):
>   bpf: Introduce any context BPF specific memory allocator.
>   bpf: Convert hash map to bpf_mem_alloc.
>   selftests/bpf: Improve test coverage of test_maps
>   samples/bpf: Reduce syscall overhead in map_perf_test.
>   bpf: Relax the requirement to use preallocated hash maps in tracing
>     progs.
> 
>  include/linux/bpf_mem_alloc.h           |  26 ++
>  kernel/bpf/Makefile                     |   2 +-
>  kernel/bpf/hashtab.c                    |  16 +-
>  kernel/bpf/memalloc.c                   | 512 ++++++++++++++++++++++++
>  kernel/bpf/verifier.c                   |  31 +-
>  samples/bpf/map_perf_test_kern.c        |  22 +-
>  tools/testing/selftests/bpf/test_maps.c |  38 +-
>  7 files changed, 610 insertions(+), 37 deletions(-)
>  create mode 100644 include/linux/bpf_mem_alloc.h
>  create mode 100644 kernel/bpf/memalloc.c
> 
> -- 
> 2.30.2
> 
---end quoted text---


* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-06-27  7:03 ` [PATCH bpf-next 0/5] bpf: BPF specific memory allocator Christoph Hellwig
@ 2022-06-28  0:17   ` Christoph Lameter
  2022-06-28  5:01     ` Alexei Starovoitov
  2022-07-04 20:34   ` Matthew Wilcox
  1 sibling, 1 reply; 72+ messages in thread
From: Christoph Lameter @ 2022-06-28  0:17 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Christoph Hellwig, David Miller, daniel, andrii, tj, kafai, bpf,
	kernel-team, linux-mm, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Vlastimil Babka

> From: Alexei Starovoitov <ast@kernel.org>
>
> Introduce any context BPF specific memory allocator.
>
> Tracing BPF programs can attach to kprobe and fentry. Hence they
> run in unknown context where calling plain kmalloc() might not be safe.
> Front-end kmalloc() with per-cpu per-bucket cache of free elements.
> Refill this cache asynchronously from irq_work.

GFP_ATOMIC etc is not going to work for you?

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-06-28  0:17   ` Christoph Lameter
@ 2022-06-28  5:01     ` Alexei Starovoitov
  2022-06-28 13:57       ` Christoph Lameter
  0 siblings, 1 reply; 72+ messages in thread
From: Alexei Starovoitov @ 2022-06-28  5:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Christoph Hellwig, David Miller, Daniel Borkmann,
	Andrii Nakryiko, Tejun Heo, Martin KaFai Lau, bpf, Kernel Team,
	linux-mm, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Vlastimil Babka

On Mon, Jun 27, 2022 at 5:17 PM Christoph Lameter <cl@gentwo.de> wrote:
>
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Introduce any context BPF specific memory allocator.
> >
> > Tracing BPF programs can attach to kprobe and fentry. Hence they
> > run in unknown context where calling plain kmalloc() might not be safe.
> > Front-end kmalloc() with per-cpu per-bucket cache of free elements.
> > Refill this cache asynchronously from irq_work.
>
> GFP_ATOMIC etc is not going to work for you?

slab_alloc_node->slab_alloc->local_lock_irqsave
kprobe -> bpf prog -> slab_alloc_node -> deadlock.
In other words, the slow path of the slab allocator takes locks,
which makes it unsafe to use from tracing bpf progs.
That's why we preallocated all elements in bpf maps,
so there are no calls to mm or rcu logic.
A bpf specific allocator cannot use locks at all.
A try_lock approach could have been used in the alloc path,
but the free path cannot fail with try_lock.
Hence the algorithm in this patch is purely lockless.
A bpf prog can attach to spin_unlock_irqrestore and
safely do bpf_mem_alloc.
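
To make the 'purely lockless' part concrete, a rough sketch of the per-cpu
fast paths (illustrative struct/field/function names, not the actual patch
code; the real code keeps migration disabled and masks irqs so each cpu is
the only consumer of its own list):

        struct bpf_mem_cache {
                struct llist_head free_llist;   /* per-cpu cache of free elements */
        };

        /* alloc fast path: pop from this cpu's free list, no locks taken */
        static void *unit_alloc(struct bpf_mem_cache *c)
        {
                struct llist_node *llnode;
                unsigned long flags;

                local_irq_save(flags);          /* this cpu is the only consumer */
                llnode = llist_del_first(&c->free_llist);
                local_irq_restore(flags);
                return llnode;                  /* may be NULL; refill is async */
        }

        /* free fast path: push onto this cpu's free list, no locks taken */
        static void unit_free(struct bpf_mem_cache *c, void *ptr)
        {
                struct llist_node *llnode = ptr;

                llist_add(llnode, &c->free_llist);      /* lockless push, nmi safe */
        }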

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-06-28  5:01     ` Alexei Starovoitov
@ 2022-06-28 13:57       ` Christoph Lameter
  2022-06-28 17:03         ` Alexei Starovoitov
  0 siblings, 1 reply; 72+ messages in thread
From: Christoph Lameter @ 2022-06-28 13:57 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Christoph Hellwig, David Miller, Daniel Borkmann,
	Andrii Nakryiko, Tejun Heo, Martin KaFai Lau, bpf, Kernel Team,
	linux-mm, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Vlastimil Babka

On Mon, 27 Jun 2022, Alexei Starovoitov wrote:

> On Mon, Jun 27, 2022 at 5:17 PM Christoph Lameter <cl@gentwo.de> wrote:
> >
> > > From: Alexei Starovoitov <ast@kernel.org>
> > >
> > > Introduce any context BPF specific memory allocator.
> > >
> > > Tracing BPF programs can attach to kprobe and fentry. Hence they
> > > run in unknown context where calling plain kmalloc() might not be safe.
> > > Front-end kmalloc() with per-cpu per-bucket cache of free elements.
> > > Refill this cache asynchronously from irq_work.
> >
> > GFP_ATOMIC etc is not going to work for you?
>
> slab_alloc_node->slab_alloc->local_lock_irqsave
> kprobe -> bpf prog -> slab_alloc_node -> deadlock.
> In other words, the slow path of slab allocator takes locks.

That is a relatively new feature due to RT logic support. Without RT this
would be a simple irq disable.

Generally, doing slab allocation while debugging slab allocation is not
something that can work. Can we exempt RT locks/irqsave or slab alloc from
BPF tracing?

I would assume that other key items of kernel logic will have similar
issues.

> Which makes it unsafe to use from tracing bpf progs.
> That's why we preallocated all elements in bpf maps,
> so there are no calls to mm or rcu logic.
> bpf specific allocator cannot use locks at all.
> try_lock approach could have been used in alloc path,
> but free path cannot fail with try_lock.
> Hence the algorithm in this patch is purely lockless.
> bpf prog can attach to spin_unlock_irqrestore and
> safely do bpf_mem_alloc.

That is generally safe unless you get into reentrancy issues with memory
allocation.

Which begs the question:

What happens if I try to use BPF to trace *your* shiny new memory
allocation functions in the BPF logic like bpf_mem_alloc? How do you stop
that from happening?


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-06-28 13:57       ` Christoph Lameter
@ 2022-06-28 17:03         ` Alexei Starovoitov
  2022-06-29  2:35           ` Christoph Lameter
  0 siblings, 1 reply; 72+ messages in thread
From: Alexei Starovoitov @ 2022-06-28 17:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Christoph Hellwig, David Miller, Daniel Borkmann,
	Andrii Nakryiko, Tejun Heo, Martin KaFai Lau, bpf, Kernel Team,
	linux-mm, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Vlastimil Babka

On Tue, Jun 28, 2022 at 03:57:54PM +0200, Christoph Lameter wrote:
> On Mon, 27 Jun 2022, Alexei Starovoitov wrote:
> 
> > On Mon, Jun 27, 2022 at 5:17 PM Christoph Lameter <cl@gentwo.de> wrote:
> > >
> > > > From: Alexei Starovoitov <ast@kernel.org>
> > > >
> > > > Introduce any context BPF specific memory allocator.
> > > >
> > > > Tracing BPF programs can attach to kprobe and fentry. Hence they
> > > > run in unknown context where calling plain kmalloc() might not be safe.
> > > > Front-end kmalloc() with per-cpu per-bucket cache of free elements.
> > > > Refill this cache asynchronously from irq_work.
> > >
> > > GFP_ATOMIC etc is not going to work for you?
> >
> > slab_alloc_node->slab_alloc->local_lock_irqsave
> > kprobe -> bpf prog -> slab_alloc_node -> deadlock.
> > In other words, the slow path of slab allocator takes locks.
> 
> That is a relatively new feature due to RT logic support. without RT this
> would be a simple irq disable.

Not just RT.
It's a slow path:
        if (IS_ENABLED(CONFIG_PREEMPT_RT) ||
            unlikely(!object || !slab || !node_match(slab, node))) {
              local_unlock_irqrestore(&s->cpu_slab->lock,...);
and that's not the only lock in there.
new_slab->allocate_slab... alloc_pages grabbing more locks.

> Generally doing slab allocation  while debugging slab allocation is not
> something that can work. Can we exempt RT locks/irqsave or slab alloc from
> BPF tracing?

People started doing lock profiling with bpf back in 2017.
People do rcu profiling now and attach bpf progs to all kinds of low-level
kernel internals: page alloc, etc.

> I would assume that other key items of kernel logic will have similar
> issues.

We're _not_ asking for any changes from the mm/slab side.
Things have been working all these years. We're making them more efficient now
by getting rid of the 'let's prealloc everything' approach.

> > Which makes it unsafe to use from tracing bpf progs.
> > That's why we preallocated all elements in bpf maps,
> > so there are no calls to mm or rcu logic.
> > bpf specific allocator cannot use locks at all.
> > try_lock approach could have been used in alloc path,
> > but free path cannot fail with try_lock.
> > Hence the algorithm in this patch is purely lockless.
> > bpf prog can attach to spin_unlock_irqrestore and
> > safely do bpf_mem_alloc.
> 
> That is generally safe unless you get into reetrance issues with memory
> allocation.

Right. Generic slab/mm/page_alloc/rcu are not ready for reentrancy and
are not NMI safe either.
That's why we've added all kinds of safety mechanisms in the bpf layers.

> Which begs the question:
> 
> What happens if I try to use BPF to trace *your* shiny new memory

'shiny and new' is an overstatement. It's a trivial lockless freelist layer
on top of kmalloc. Please read the patch.

> allocation functions in the BPF logic like bpf_mem_alloc? How do you stop
> that from happening?

here is the comment in the patch:
/* notrace is necessary here and in other functions to make sure
 * bpf programs cannot attach to them and cause llist corruptions.
 */
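
The entry points themselves get the same treatment, e.g. (illustrative
shape only, wrapping something like the per-cpu cache pop sketched earlier;
this_cpu_cache() is a stand-in for the size-to-cache lookup):

        notrace void *bpf_mem_alloc(struct bpf_mem_alloc *ma, size_t size)
        {
                /* nothing reachable from here may be traceable, otherwise a
                 * prog could attach and corrupt the llist mid-operation */
                return unit_alloc(this_cpu_cache(ma, size));
        }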

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-06-28 17:03         ` Alexei Starovoitov
@ 2022-06-29  2:35           ` Christoph Lameter
  2022-06-29  2:49             ` Alexei Starovoitov
  0 siblings, 1 reply; 72+ messages in thread
From: Christoph Lameter @ 2022-06-29  2:35 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Christoph Hellwig, David Miller, Daniel Borkmann,
	Andrii Nakryiko, Tejun Heo, Martin KaFai Lau, bpf, Kernel Team,
	linux-mm, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Vlastimil Babka

On Tue, 28 Jun 2022, Alexei Starovoitov wrote:

> > That is a relatively new feature due to RT logic support. without RT this
> > would be a simple irq disable.
>
> Not just RT.
> It's a slow path:
>         if (IS_ENABLED(CONFIG_PREEMPT_RT) ||
>             unlikely(!object || !slab || !node_match(slab, node))) {
>               local_unlock_irqrestore(&s->cpu_slab->lock,...);
> and that's not the only lock in there.
> new_slab->allocate_slab... alloc_pages grabbing more locks.


It's not a lock for !RT.

The fastpath is lockless if the hardware allows that, but then we go into more
and more serialization needs as the allocation gets more into the page
allocator logic.

> > allocation functions in the BPF logic like bpf_mem_alloc? How do you stop
> > that from happening?
>
> here is the comment in the patch:
> /* notrace is necessary here and in other functions to make sure
>  * bpf programs cannot attach to them and cause llist corruptions.
>  */

"notrace".... Ok Hmmm...

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-06-29  2:35           ` Christoph Lameter
@ 2022-06-29  2:49             ` Alexei Starovoitov
  2022-07-04 16:13               ` Vlastimil Babka
  0 siblings, 1 reply; 72+ messages in thread
From: Alexei Starovoitov @ 2022-06-29  2:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Christoph Hellwig, David Miller, Daniel Borkmann,
	Andrii Nakryiko, Tejun Heo, Martin KaFai Lau, bpf, Kernel Team,
	linux-mm, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Vlastimil Babka

On Tue, Jun 28, 2022 at 7:35 PM Christoph Lameter <cl@gentwo.de> wrote:
>
> On Tue, 28 Jun 2022, Alexei Starovoitov wrote:
>
> > > That is a relatively new feature due to RT logic support. without RT this
> > > would be a simple irq disable.
> >
> > Not just RT.
> > It's a slow path:
> >         if (IS_ENABLED(CONFIG_PREEMPT_RT) ||
> >             unlikely(!object || !slab || !node_match(slab, node))) {
> >               local_unlock_irqrestore(&s->cpu_slab->lock,...);
> > and that's not the only lock in there.
> > new_slab->allocate_slab... alloc_pages grabbing more locks.
>
>
> Its not a lock for !RT.
>
> The fastpath is lockless if hardware allows that but then we go into more
> and more serialiation needs as the allocation gets more into the page
> allocator logic.

On RT the fast path == the slow path with a lock.
On !RT the fast path is lockless.
That's all correct.
The bpf side has to ensure safety in all possible paths,
therefore RT or !RT makes no difference.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-06-29  2:49             ` Alexei Starovoitov
@ 2022-07-04 16:13               ` Vlastimil Babka
  2022-07-06 17:43                 ` Alexei Starovoitov
  0 siblings, 1 reply; 72+ messages in thread
From: Vlastimil Babka @ 2022-07-04 16:13 UTC (permalink / raw)
  To: Alexei Starovoitov, Christoph Lameter
  Cc: Christoph Hellwig, David Miller, Daniel Borkmann,
	Andrii Nakryiko, Tejun Heo, Martin KaFai Lau, bpf, Kernel Team,
	linux-mm, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Matthew Wilcox, Liam R. Howlett

On 6/29/22 04:49, Alexei Starovoitov wrote:
> On Tue, Jun 28, 2022 at 7:35 PM Christoph Lameter <cl@gentwo.de> wrote:
>>
>> On Tue, 28 Jun 2022, Alexei Starovoitov wrote:
>>
>> > > That is a relatively new feature due to RT logic support. without RT this
>> > > would be a simple irq disable.
>> >
>> > Not just RT.
>> > It's a slow path:
>> >         if (IS_ENABLED(CONFIG_PREEMPT_RT) ||
>> >             unlikely(!object || !slab || !node_match(slab, node))) {
>> >               local_unlock_irqrestore(&s->cpu_slab->lock,...);
>> > and that's not the only lock in there.
>> > new_slab->allocate_slab... alloc_pages grabbing more locks.
>>
>>
>> Its not a lock for !RT.
>>
>> The fastpath is lockless if hardware allows that but then we go into more
>> and more serialiation needs as the allocation gets more into the page
>> allocator logic.

Yeah I don't think the recent RT-related changes made this much worse than
it already was. On the alloc side you could perhaps try the really lockless
fastpaths only and fail if e.g. the per-cpu slabs were empty (but would BPF
be happy with that?). On the free side though you could end up having to
move a slab from the partial to the free list as a result, and now a spin lock
is needed (even before the RT changes), and you can't really fail a free...

> On RT fast path == slow path with a lock.
> On !RT fast path is lock less.
> That's all correct.
> bpf side has to make sure safety in all possible paths
> therefore RT or !RT makes no difference.

So AFAIK we don't right now have what BPF needs - an extra-constrained kind
of GFP_ATOMIC. I don't object to you adding it privately. But it's another
reason to think about whether these things can be generalized. For example we
had a discussion about the Maple tree having kinda similar requirements, to
avoid always preallocating its tree nodes for the worst possible case.

I'm not sure whether we can sanely implement this within each of
SLAB/SLUB/SLOB, or should rather provide a generic cache on top...

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-06-27  7:03 ` [PATCH bpf-next 0/5] bpf: BPF specific memory allocator Christoph Hellwig
  2022-06-28  0:17   ` Christoph Lameter
@ 2022-07-04 20:34   ` Matthew Wilcox
  2022-07-06 17:50     ` Alexei Starovoitov
  1 sibling, 1 reply; 72+ messages in thread
From: Matthew Wilcox @ 2022-07-04 20:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alexei Starovoitov, davem, daniel, andrii, tj, kafai, bpf,
	kernel-team, linux-mm, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, Vlastimil Babka

On Mon, Jun 27, 2022 at 12:03:08AM -0700, Christoph Hellwig wrote:
> I'd suggest you discuss you needs with the slab mainainers and the mm
> community firs.
> 
> On Wed, Jun 22, 2022 at 05:32:25PM -0700, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> > 
> > Introduce any context BPF specific memory allocator.
> > 
> > Tracing BPF programs can attach to kprobe and fentry. Hence they
> > run in unknown context where calling plain kmalloc() might not be safe.
> > Front-end kmalloc() with per-cpu per-bucket cache of free elements.
> > Refill this cache asynchronously from irq_work.

I can't tell from your description whether a bump allocator would work
for you.  That is, can you tell which allocations need to persist past
program execution (and use kmalloc for them) and which can be freed as
soon as the program has finished (and can use the bump allocator)?

If so, we already have one for you, the page_frag allocator
(Documentation/vm/page_frags.rst).  It might need to be extended to meet
your needs, but it's certainly faster than the kmalloc allocator.
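
For reference, the basic usage pattern is roughly (simplified; the cache is
just a small piece of per-caller state and fragments are released with
page_frag_free()):

        struct page_frag_cache nc = {}; /* typically per-cpu or per-socket */
        void *buf;

        buf = page_frag_alloc(&nc, 256, GFP_ATOMIC);
        if (buf) {
                /* use the 256-byte fragment ... */
                page_frag_free(buf);    /* drops the ref on the backing page */
        }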

> > There is a lot more work ahead, but this set is useful base.
> > Future work:
> > - get rid of call_rcu in hash map
> > - get rid of atomic_inc/dec in hash map
> > - tune watermarks per allocation size
> > - adopt this approach alloc_percpu_gfp
> > - expose bpf_mem_alloc as uapi FD to be used in dynptr_alloc, kptr_alloc
> > - add sysctl to force bpf_mem_alloc in hash map when safe even if pre-alloc
> >   requested to reduce memory consumption
> > - convert lru map to bpf_mem_alloc
> > 
> > Alexei Starovoitov (5):
> >   bpf: Introduce any context BPF specific memory allocator.
> >   bpf: Convert hash map to bpf_mem_alloc.
> >   selftests/bpf: Improve test coverage of test_maps
> >   samples/bpf: Reduce syscall overhead in map_perf_test.
> >   bpf: Relax the requirement to use preallocated hash maps in tracing
> >     progs.
> > 
> >  include/linux/bpf_mem_alloc.h           |  26 ++
> >  kernel/bpf/Makefile                     |   2 +-
> >  kernel/bpf/hashtab.c                    |  16 +-
> >  kernel/bpf/memalloc.c                   | 512 ++++++++++++++++++++++++
> >  kernel/bpf/verifier.c                   |  31 +-
> >  samples/bpf/map_perf_test_kern.c        |  22 +-
> >  tools/testing/selftests/bpf/test_maps.c |  38 +-
> >  7 files changed, 610 insertions(+), 37 deletions(-)
> >  create mode 100644 include/linux/bpf_mem_alloc.h
> >  create mode 100644 kernel/bpf/memalloc.c
> > 
> > -- 
> > 2.30.2
> > 
> ---end quoted text---
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-04 16:13               ` Vlastimil Babka
@ 2022-07-06 17:43                 ` Alexei Starovoitov
  2022-07-19 11:52                   ` Vlastimil Babka
  0 siblings, 1 reply; 72+ messages in thread
From: Alexei Starovoitov @ 2022-07-06 17:43 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Christoph Lameter, Christoph Hellwig, David Miller,
	Daniel Borkmann, Andrii Nakryiko, Tejun Heo, Martin KaFai Lau,
	bpf, Kernel Team, linux-mm, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Andrew Morton, Matthew Wilcox, Liam R. Howlett

On Mon, Jul 04, 2022 at 06:13:17PM +0200, Vlastimil Babka wrote:
> On 6/29/22 04:49, Alexei Starovoitov wrote:
> > On Tue, Jun 28, 2022 at 7:35 PM Christoph Lameter <cl@gentwo.de> wrote:
> >>
> >> On Tue, 28 Jun 2022, Alexei Starovoitov wrote:
> >>
> >> > > That is a relatively new feature due to RT logic support. without RT this
> >> > > would be a simple irq disable.
> >> >
> >> > Not just RT.
> >> > It's a slow path:
> >> >         if (IS_ENABLED(CONFIG_PREEMPT_RT) ||
> >> >             unlikely(!object || !slab || !node_match(slab, node))) {
> >> >               local_unlock_irqrestore(&s->cpu_slab->lock,...);
> >> > and that's not the only lock in there.
> >> > new_slab->allocate_slab... alloc_pages grabbing more locks.
> >>
> >>
> >> Its not a lock for !RT.
> >>
> >> The fastpath is lockless if hardware allows that but then we go into more
> >> and more serialiation needs as the allocation gets more into the page
> >> allocator logic.
> 
> Yeah I don't think the recent RT-related changes made this much worse than
> it already was. In alloc side you could perhaps try the really lockless
> fastpaths only and fail if e.g. the per-cpu slabs were empty (but would BPF
> be happy with that?). On the free side though you could end up having to
> move a slab from partial to free list as a result, and now a spin lock is
> needed (even before the RT changes), and you can't really fail a free...
> 
> > On RT fast path == slow path with a lock.
> > On !RT fast path is lock less.
> > That's all correct.
> > bpf side has to make sure safety in all possible paths
> > therefore RT or !RT makes no difference.
> 
> So AFAIK we don't right now have what BFP needs - an extra-constrained kind
> of GFP_ATOMIC. I don't object you adding it privately. But it's another
> reason to think about if these things can be generalized. For example we had
> a discussion about the Maple tree having kinda similar kinds of requirements
> to avoid its tree node preallocations always for the worst possible case.

What kind of needs does the maple tree have? Does it need to be fully
reentrant and nmi safe?
Not really. The caller knows the context and can choose appropriate flags,
while bpf alloc doesn't know the context. The bpf prog can be called from
places where slab/page/kasan specific locks are held, which makes all these
pieces non-reentrant.
The full prealloc of bpf maps (read: waste a lot of memory) was our solution until now.
This is specific to tracing bpf programs, of course.
bpf networking, bpf security, sleepable bpf are completely different.

> I'm not sure we can sanely implement this within each of SLAB/SLUB/SLOB, or
> rather provide a generic cache on top...

Notice that all of the bpf cache functions are notrace/nokprobe/no locks.
The main difference vs all other allocators is that bpf_mem_alloc from the
cache and refill of the cache are two asynchronous operations. That allows
the former to be reentrant and nmi safe.
All in-tree allocators sooner or later synchronously call into page_alloc,
kasan, memleak and other debugging facilities that grab locks.
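
A rough sketch of that split (illustrative names and fields, not the actual
patch code; the prog-side alloc only pops from the per-cpu list, while the
refill runs later from irq_work):

        /* executes from irq_work, never synchronously from the bpf prog */
        static void bpf_mem_refill(struct irq_work *work)
        {
                struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache,
                                                       refill_work);
                int i;

                for (i = c->free_cnt; i < c->high_watermark; i++) {
                        /* no sleeping, no reclaim recursion */
                        void *obj = kmalloc_node(c->unit_size,
                                                 GFP_NOWAIT | __GFP_NOWARN, c->node);

                        if (!obj)
                                break;
                        /* push onto the lockless per-cpu free list */
                        unit_free(c, obj);
                }
        }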

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-04 20:34   ` Matthew Wilcox
@ 2022-07-06 17:50     ` Alexei Starovoitov
  2022-07-06 17:55       ` Matthew Wilcox
  0 siblings, 1 reply; 72+ messages in thread
From: Alexei Starovoitov @ 2022-07-06 17:50 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, davem, daniel, andrii, tj, kafai, bpf,
	kernel-team, linux-mm, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, Vlastimil Babka

On Mon, Jul 04, 2022 at 09:34:23PM +0100, Matthew Wilcox wrote:
> On Mon, Jun 27, 2022 at 12:03:08AM -0700, Christoph Hellwig wrote:
> > I'd suggest you discuss you needs with the slab mainainers and the mm
> > community firs.
> > 
> > On Wed, Jun 22, 2022 at 05:32:25PM -0700, Alexei Starovoitov wrote:
> > > From: Alexei Starovoitov <ast@kernel.org>
> > > 
> > > Introduce any context BPF specific memory allocator.
> > > 
> > > Tracing BPF programs can attach to kprobe and fentry. Hence they
> > > run in unknown context where calling plain kmalloc() might not be safe.
> > > Front-end kmalloc() with per-cpu per-bucket cache of free elements.
> > > Refill this cache asynchronously from irq_work.
> 
> I can't tell from your description whether a bump allocator would work
> for you.  That is, can you tell which allocations need to persist past
> program execution (and use kmalloc for them) and which can be freed as
> soon as the program has finished (and can use the bump allocator)?
> 
> If so, we already have one for you, the page_frag allocator
> (Documentation/vm/page_frags.rst).  It might need to be extended to meet
> your needs, but it's certainly faster than the kmalloc allocator.

Already looked at it, and into mempool, and everything we could find.
All 'normal' allocators sooner or later synchronously call into page_alloc,
kasan, memleak and other debugging facilities that grab locks, which
makes them unusable for bpf tracing progs.
The main difference of the bpf specific alloc vs the rest is that
bpf_mem_alloc from the cache and refill of the cache are two asynchronous
operations. That allows the former to be reentrant and nmi safe.

The speed of bpf mem alloc is secondary here.
The lockless free list is there to absorb the async refill.
irq_work_queue is not instant, while a bpf prog might do several bpf_mem_alloc
calls before the irq_work has a chance to refill.
We're thinking about self-adjusting low/high watermarks that will be tuned by
the bpf verifier, which has visibility into what the program is going to do.
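
Concretely, the prog-side fast path only does something like (illustrative
field names):

        llnode = llist_del_first(&c->free_llist);
        if (llnode)
                c->free_cnt--;
        /* cheap check in prog context; the actual refill happens later */
        if (c->free_cnt < c->low_watermark)
                irq_work_queue(&c->refill_work);

and the idea is that the verifier could size low/high per program instead of
hard-coding them.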

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-06 17:50     ` Alexei Starovoitov
@ 2022-07-06 17:55       ` Matthew Wilcox
  2022-07-06 18:05         ` Alexei Starovoitov
  0 siblings, 1 reply; 72+ messages in thread
From: Matthew Wilcox @ 2022-07-06 17:55 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Christoph Hellwig, davem, daniel, andrii, tj, kafai, bpf,
	kernel-team, linux-mm, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, Vlastimil Babka

On Wed, Jul 06, 2022 at 10:50:34AM -0700, Alexei Starovoitov wrote:
> On Mon, Jul 04, 2022 at 09:34:23PM +0100, Matthew Wilcox wrote:
> > On Mon, Jun 27, 2022 at 12:03:08AM -0700, Christoph Hellwig wrote:
> > > I'd suggest you discuss you needs with the slab mainainers and the mm
> > > community firs.
> > > 
> > > On Wed, Jun 22, 2022 at 05:32:25PM -0700, Alexei Starovoitov wrote:
> > > > From: Alexei Starovoitov <ast@kernel.org>
> > > > 
> > > > Introduce any context BPF specific memory allocator.
> > > > 
> > > > Tracing BPF programs can attach to kprobe and fentry. Hence they
> > > > run in unknown context where calling plain kmalloc() might not be safe.
> > > > Front-end kmalloc() with per-cpu per-bucket cache of free elements.
> > > > Refill this cache asynchronously from irq_work.
> > 
> > I can't tell from your description whether a bump allocator would work
> > for you.  That is, can you tell which allocations need to persist past
> > program execution (and use kmalloc for them) and which can be freed as
> > soon as the program has finished (and can use the bump allocator)?
> > 
> > If so, we already have one for you, the page_frag allocator
> > (Documentation/vm/page_frags.rst).  It might need to be extended to meet
> > your needs, but it's certainly faster than the kmalloc allocator.
> 
> Already looked at it, and into mempool, and everything we could find.
> All 'normal' allocators sooner or later synchornously call into page_alloc,

Today it does, yes.  But it might be adaptable to your needs if only I
knew what those needs were.  For example, I assume that a BPF program
has a fairly tight limit on how much memory it can cause to be allocated.
Right?


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-06 17:55       ` Matthew Wilcox
@ 2022-07-06 18:05         ` Alexei Starovoitov
  2022-07-06 18:21           ` Matthew Wilcox
  2022-07-08 13:41           ` Michal Hocko
  0 siblings, 2 replies; 72+ messages in thread
From: Alexei Starovoitov @ 2022-07-06 18:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, davem, daniel, andrii, tj, kafai, bpf,
	kernel-team, linux-mm, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, Vlastimil Babka

On Wed, Jul 06, 2022 at 06:55:36PM +0100, Matthew Wilcox wrote:
> On Wed, Jul 06, 2022 at 10:50:34AM -0700, Alexei Starovoitov wrote:
> > On Mon, Jul 04, 2022 at 09:34:23PM +0100, Matthew Wilcox wrote:
> > > On Mon, Jun 27, 2022 at 12:03:08AM -0700, Christoph Hellwig wrote:
> > > > I'd suggest you discuss you needs with the slab mainainers and the mm
> > > > community firs.
> > > > 
> > > > On Wed, Jun 22, 2022 at 05:32:25PM -0700, Alexei Starovoitov wrote:
> > > > > From: Alexei Starovoitov <ast@kernel.org>
> > > > > 
> > > > > Introduce any context BPF specific memory allocator.
> > > > > 
> > > > > Tracing BPF programs can attach to kprobe and fentry. Hence they
> > > > > run in unknown context where calling plain kmalloc() might not be safe.
> > > > > Front-end kmalloc() with per-cpu per-bucket cache of free elements.
> > > > > Refill this cache asynchronously from irq_work.
> > > 
> > > I can't tell from your description whether a bump allocator would work
> > > for you.  That is, can you tell which allocations need to persist past
> > > program execution (and use kmalloc for them) and which can be freed as
> > > soon as the program has finished (and can use the bump allocator)?
> > > 
> > > If so, we already have one for you, the page_frag allocator
> > > (Documentation/vm/page_frags.rst).  It might need to be extended to meet
> > > your needs, but it's certainly faster than the kmalloc allocator.
> > 
> > Already looked at it, and into mempool, and everything we could find.
> > All 'normal' allocators sooner or later synchornously call into page_alloc,
> 
> Today it does, yes.  But it might be adaptable to your needs if only I
> knew what those needs were.  

needs: fully reentrant, nmi safe, any context safe.
I feel you're still failing to grasp the 'any context' part.
A bpf prog can attach to all available_filter_functions and should be able
to alloc from there.
Or kprobe anywhere and everywhere in the kernel .text that is not marked notrace/nokprobe.
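
For a concrete picture, a minimal tracing prog of that kind looks roughly
like this (illustrative example; BPF_F_NO_PREALLOC means map elements are
allocated at update time, i.e. in whatever context the kprobe fired):

        #include <linux/bpf.h>
        #include <linux/ptrace.h>
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_tracing.h>

        struct {
                __uint(type, BPF_MAP_TYPE_HASH);
                __uint(map_flags, BPF_F_NO_PREALLOC);   /* alloc at runtime */
                __uint(max_entries, 10000);
                __type(key, __u64);
                __type(value, __u64);
        } hits SEC(".maps");

        SEC("kprobe/kfree")     /* could be nearly any non-notrace function */
        int count_hits(struct pt_regs *ctx)
        {
                __u64 ip = PT_REGS_IP(ctx), one = 1, *val;

                val = bpf_map_lookup_elem(&hits, &ip);
                if (val)
                        __sync_fetch_and_add(val, 1);
                else
                        /* a miss allocates a new map element right here */
                        bpf_map_update_elem(&hits, &ip, &one, BPF_NOEXIST);
                return 0;
        }

        char LICENSE[] SEC("license") = "GPL";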

> For example, I assume that a BPF program
> has a fairly tight limit on how much memory it can cause to be allocated.
> Right?

No. It's constrained by memcg limits only. It can allocate gigabytes.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-06 18:05         ` Alexei Starovoitov
@ 2022-07-06 18:21           ` Matthew Wilcox
  2022-07-06 18:26             ` Alexei Starovoitov
  2022-07-08 13:41           ` Michal Hocko
  1 sibling, 1 reply; 72+ messages in thread
From: Matthew Wilcox @ 2022-07-06 18:21 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Christoph Hellwig, davem, daniel, andrii, tj, kafai, bpf,
	kernel-team, linux-mm, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, Vlastimil Babka

On Wed, Jul 06, 2022 at 11:05:25AM -0700, Alexei Starovoitov wrote:
> On Wed, Jul 06, 2022 at 06:55:36PM +0100, Matthew Wilcox wrote:
> > For example, I assume that a BPF program
> > has a fairly tight limit on how much memory it can cause to be allocated.
> > Right?
> 
> No. It's constrained by memcg limits only. It can allocate gigabytes.

I'm confused.  A BPF program is limited to executing 4096 insns and
using a maximum of 512 bytes of stack space, but it can allocate an
unlimited amount of heap?  That seems wrong.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-06 18:21           ` Matthew Wilcox
@ 2022-07-06 18:26             ` Alexei Starovoitov
  2022-07-06 18:31               ` Matthew Wilcox
  0 siblings, 1 reply; 72+ messages in thread
From: Alexei Starovoitov @ 2022-07-06 18:26 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, davem, daniel, andrii, tj, kafai, bpf,
	kernel-team, linux-mm, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, Vlastimil Babka

On Wed, Jul 06, 2022 at 07:21:29PM +0100, Matthew Wilcox wrote:
> On Wed, Jul 06, 2022 at 11:05:25AM -0700, Alexei Starovoitov wrote:
> > On Wed, Jul 06, 2022 at 06:55:36PM +0100, Matthew Wilcox wrote:
> > > For example, I assume that a BPF program
> > > has a fairly tight limit on how much memory it can cause to be allocated.
> > > Right?
> > 
> > No. It's constrained by memcg limits only. It can allocate gigabytes.
> 
> I'm confused.  A BPF program is limited to executing 4096 insns and
> using a maximum of 512 bytes of stack space, but it can allocate an
> unlimited amount of heap?  That seems wrong.

The 4k insn limit was lifted years ago.
bpf progs are pretty close to parity with kernel modules,
except that they are safe, portable, and users have full visibility into them.
They are not a blob of bytes, unlike a .ko.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-06 18:26             ` Alexei Starovoitov
@ 2022-07-06 18:31               ` Matthew Wilcox
  2022-07-06 18:36                 ` Alexei Starovoitov
  0 siblings, 1 reply; 72+ messages in thread
From: Matthew Wilcox @ 2022-07-06 18:31 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Christoph Hellwig, davem, daniel, andrii, tj, kafai, bpf,
	kernel-team, linux-mm, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, Vlastimil Babka

On Wed, Jul 06, 2022 at 11:26:35AM -0700, Alexei Starovoitov wrote:
> On Wed, Jul 06, 2022 at 07:21:29PM +0100, Matthew Wilcox wrote:
> > On Wed, Jul 06, 2022 at 11:05:25AM -0700, Alexei Starovoitov wrote:
> > > On Wed, Jul 06, 2022 at 06:55:36PM +0100, Matthew Wilcox wrote:
> > > > For example, I assume that a BPF program
> > > > has a fairly tight limit on how much memory it can cause to be allocated.
> > > > Right?
> > > 
> > > No. It's constrained by memcg limits only. It can allocate gigabytes.
> > 
> > I'm confused.  A BPF program is limited to executing 4096 insns and
> > using a maximum of 512 bytes of stack space, but it can allocate an
> > unlimited amount of heap?  That seems wrong.
> 
> 4k insn limit was lifted years ago.

You might want to update the documentation.
https://www.kernel.org/doc/html/latest/bpf/bpf_design_QA.html
still says 4096.

> bpf progs are pretty close to be at parity with kernel modules.
> Except that they are safe, portable, and users have full visibility into them.
> It's not a blob of bytes unlike .ko.

It doesn't seem unreasonable to expect them to behave like kernel
modules, then.  If they want to allocate memory in NMI context, then
they should get to preallocate it before they go into NMI context.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-06 18:31               ` Matthew Wilcox
@ 2022-07-06 18:36                 ` Alexei Starovoitov
  2022-07-06 18:40                   ` Matthew Wilcox
  0 siblings, 1 reply; 72+ messages in thread
From: Alexei Starovoitov @ 2022-07-06 18:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, davem, daniel, andrii, tj, kafai, bpf,
	kernel-team, linux-mm, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, Vlastimil Babka

On Wed, Jul 06, 2022 at 07:31:34PM +0100, Matthew Wilcox wrote:
> On Wed, Jul 06, 2022 at 11:26:35AM -0700, Alexei Starovoitov wrote:
> > On Wed, Jul 06, 2022 at 07:21:29PM +0100, Matthew Wilcox wrote:
> > > On Wed, Jul 06, 2022 at 11:05:25AM -0700, Alexei Starovoitov wrote:
> > > > On Wed, Jul 06, 2022 at 06:55:36PM +0100, Matthew Wilcox wrote:
> > > > > For example, I assume that a BPF program
> > > > > has a fairly tight limit on how much memory it can cause to be allocated.
> > > > > Right?
> > > > 
> > > > No. It's constrained by memcg limits only. It can allocate gigabytes.
> > > 
> > > I'm confused.  A BPF program is limited to executing 4096 insns and
> > > using a maximum of 512 bytes of stack space, but it can allocate an
> > > unlimited amount of heap?  That seems wrong.
> > 
> > 4k insn limit was lifted years ago.
> 
> You might want to update the documentation.
> https://www.kernel.org/doc/html/latest/bpf/bpf_design_QA.html
> still says 4096.

No. Please read what you're quoting first.

> > bpf progs are pretty close to be at parity with kernel modules.
> > Except that they are safe, portable, and users have full visibility into them.
> > It's not a blob of bytes unlike .ko.
> 
> It doesn't seem unreasonable to expect them to behave like kernel
> modules, then.  If they want to allocate memory in NMI context, then
> they should get to preallocate it before they go into NMI context.

You're still missing the 'any context' part from the previous email.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-06 18:36                 ` Alexei Starovoitov
@ 2022-07-06 18:40                   ` Matthew Wilcox
  2022-07-06 18:51                     ` Alexei Starovoitov
  0 siblings, 1 reply; 72+ messages in thread
From: Matthew Wilcox @ 2022-07-06 18:40 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Christoph Hellwig, davem, daniel, andrii, tj, kafai, bpf,
	kernel-team, linux-mm, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, Vlastimil Babka

On Wed, Jul 06, 2022 at 11:36:19AM -0700, Alexei Starovoitov wrote:
> On Wed, Jul 06, 2022 at 07:31:34PM +0100, Matthew Wilcox wrote:
> > On Wed, Jul 06, 2022 at 11:26:35AM -0700, Alexei Starovoitov wrote:
> > > On Wed, Jul 06, 2022 at 07:21:29PM +0100, Matthew Wilcox wrote:
> > > > On Wed, Jul 06, 2022 at 11:05:25AM -0700, Alexei Starovoitov wrote:
> > > > > On Wed, Jul 06, 2022 at 06:55:36PM +0100, Matthew Wilcox wrote:
> > > > > > For example, I assume that a BPF program
> > > > > > has a fairly tight limit on how much memory it can cause to be allocated.
> > > > > > Right?
> > > > > 
> > > > > No. It's constrained by memcg limits only. It can allocate gigabytes.
> > > > 
> > > > I'm confused.  A BPF program is limited to executing 4096 insns and
> > > > using a maximum of 512 bytes of stack space, but it can allocate an
> > > > unlimited amount of heap?  That seems wrong.
> > > 
> > > 4k insn limit was lifted years ago.
> > 
> > You might want to update the documentation.
> > https://www.kernel.org/doc/html/latest/bpf/bpf_design_QA.html
> > still says 4096.
> 
> No. Please read what you're quoting first.

I did read it.  It says

: The only limit known to the user space is BPF_MAXINSNS (4096). It’s the
: maximum number of instructions that the unprivileged bpf program can have.

It really seems pretty clear to me.  You're saying my understanding
is wrong.  So it must be badly written.  Even now, I don't understand
how I've misunderstood it.

> > > bpf progs are pretty close to be at parity with kernel modules.
> > > Except that they are safe, portable, and users have full visibility into them.
> > > It's not a blob of bytes unlike .ko.
> > 
> > It doesn't seem unreasonable to expect them to behave like kernel
> > modules, then.  If they want to allocate memory in NMI context, then
> > they should get to preallocate it before they go into NMI context.
> 
> You're still missing 'any context' part from the previous email.

Really, this is not a productive way to respond.  I want to help
and you're just snarling at me.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-06 18:40                   ` Matthew Wilcox
@ 2022-07-06 18:51                     ` Alexei Starovoitov
  2022-07-06 18:55                       ` Matthew Wilcox
  0 siblings, 1 reply; 72+ messages in thread
From: Alexei Starovoitov @ 2022-07-06 18:51 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, davem, daniel, andrii, tj, kafai, bpf,
	kernel-team, linux-mm, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, Vlastimil Babka

On Wed, Jul 06, 2022 at 07:40:44PM +0100, Matthew Wilcox wrote:
> On Wed, Jul 06, 2022 at 11:36:19AM -0700, Alexei Starovoitov wrote:
> > On Wed, Jul 06, 2022 at 07:31:34PM +0100, Matthew Wilcox wrote:
> > > On Wed, Jul 06, 2022 at 11:26:35AM -0700, Alexei Starovoitov wrote:
> > > > On Wed, Jul 06, 2022 at 07:21:29PM +0100, Matthew Wilcox wrote:
> > > > > On Wed, Jul 06, 2022 at 11:05:25AM -0700, Alexei Starovoitov wrote:
> > > > > > On Wed, Jul 06, 2022 at 06:55:36PM +0100, Matthew Wilcox wrote:
> > > > > > > For example, I assume that a BPF program
> > > > > > > has a fairly tight limit on how much memory it can cause to be allocated.
> > > > > > > Right?
> > > > > > 
> > > > > > No. It's constrained by memcg limits only. It can allocate gigabytes.
> > > > > 
> > > > > I'm confused.  A BPF program is limited to executing 4096 insns and
> > > > > using a maximum of 512 bytes of stack space, but it can allocate an
> > > > > unlimited amount of heap?  That seems wrong.
> > > > 
> > > > 4k insn limit was lifted years ago.
> > > 
> > > You might want to update the documentation.
> > > https://www.kernel.org/doc/html/latest/bpf/bpf_design_QA.html
> > > still says 4096.
> > 
> > No. Please read what you're quoting first.
> 
> I did read it.  It says
> 
> : The only limit known to the user space is BPF_MAXINSNS (4096). It’s the
> : maximum number of instructions that the unprivileged bpf program can have.
> 
> It really seems pretty clear to me.  You're saying my understanding
> is wrong.  So it must be badly written.  Even now, I don't understand
> how I've misunderstood it.

Huh? Still missing 'unprivileged' in the above?
And completely ignoring the rest of the paragraph in the link above?

> > > > bpf progs are pretty close to be at parity with kernel modules.
> > > > Except that they are safe, portable, and users have full visibility into them.
> > > > It's not a blob of bytes unlike .ko.
> > > 
> > > It doesn't seem unreasonable to expect them to behave like kernel
> > > modules, then.  If they want to allocate memory in NMI context, then
> > > they should get to preallocate it before they go into NMI context.
> > 
> > You're still missing 'any context' part from the previous email.
> 
> Really, this is not a productive way to respond.  I want to help
> and you're just snarling at me.

To help you need to understand what bpf is while you're failing
to read the doc.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-06 18:51                     ` Alexei Starovoitov
@ 2022-07-06 18:55                       ` Matthew Wilcox
  0 siblings, 0 replies; 72+ messages in thread
From: Matthew Wilcox @ 2022-07-06 18:55 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Christoph Hellwig, davem, daniel, andrii, tj, kafai, bpf,
	kernel-team, linux-mm, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, Vlastimil Babka

On Wed, Jul 06, 2022 at 11:51:03AM -0700, Alexei Starovoitov wrote:
> On Wed, Jul 06, 2022 at 07:40:44PM +0100, Matthew Wilcox wrote:
> > On Wed, Jul 06, 2022 at 11:36:19AM -0700, Alexei Starovoitov wrote:
> > > On Wed, Jul 06, 2022 at 07:31:34PM +0100, Matthew Wilcox wrote:
> > > > On Wed, Jul 06, 2022 at 11:26:35AM -0700, Alexei Starovoitov wrote:
> > > > > On Wed, Jul 06, 2022 at 07:21:29PM +0100, Matthew Wilcox wrote:
> > > > > > On Wed, Jul 06, 2022 at 11:05:25AM -0700, Alexei Starovoitov wrote:
> > > > > > > On Wed, Jul 06, 2022 at 06:55:36PM +0100, Matthew Wilcox wrote:
> > > > > > > > For example, I assume that a BPF program
> > > > > > > > has a fairly tight limit on how much memory it can cause to be allocated.
> > > > > > > > Right?
> > > > > > > 
> > > > > > > No. It's constrained by memcg limits only. It can allocate gigabytes.
> > > > > > 
> > > > > > I'm confused.  A BPF program is limited to executing 4096 insns and
> > > > > > using a maximum of 512 bytes of stack space, but it can allocate an
> > > > > > unlimited amount of heap?  That seems wrong.
> > > > > 
> > > > > 4k insn limit was lifted years ago.
> > > > 
> > > > You might want to update the documentation.
> > > > https://www.kernel.org/doc/html/latest/bpf/bpf_design_QA.html
> > > > still says 4096.
> > > 
> > > No. Please read what you're quoting first.
> > 
> > I did read it.  It says
> > 
> > : The only limit known to the user space is BPF_MAXINSNS (4096). It’s the
> > : maximum number of instructions that the unprivileged bpf program can have.
> > 
> > It really seems pretty clear to me.  You're saying my understanding
> > is wrong.  So it must be badly written.  Even now, I don't understand
> > how I've misunderstood it.
> 
> huh? Still missing 'unprivileged' in the above ?
> and completely ignoring the rest of the paragraph in the link above?

Are you saying that unprivileged programs are not allowed to allocate
memory?  Or that there is a limit on the amount of heap memory that
unprivileged programs are allowed to allocate?

> > > > > bpf progs are pretty close to be at parity with kernel modules.
> > > > > Except that they are safe, portable, and users have full visibility into them.
> > > > > It's not a blob of bytes unlike .ko.
> > > > 
> > > > It doesn't seem unreasonable to expect them to behave like kernel
> > > > modules, then.  If they want to allocate memory in NMI context, then
> > > > they should get to preallocate it before they go into NMI context.
> > > 
> > > You're still missing 'any context' part from the previous email.
> > 
> > Really, this is not a productive way to respond.  I want to help
> > and you're just snarling at me.
> 
> To help you need to understand what bpf is while you're failing
> to read the doc.

Part of working with other people is understanding that they don't know
as much as you do about your thing, and at the same time understanding
that you don't know as much as they do about their thing.  That means
explaining things in a way which makes sense to them instead of getting
all huffy when they don't understand the details of your thing.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-06 18:05         ` Alexei Starovoitov
  2022-07-06 18:21           ` Matthew Wilcox
@ 2022-07-08 13:41           ` Michal Hocko
  2022-07-08 17:48             ` Alexei Starovoitov
  1 sibling, 1 reply; 72+ messages in thread
From: Michal Hocko @ 2022-07-08 13:41 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Matthew Wilcox, Christoph Hellwig, davem, daniel, andrii, tj,
	kafai, bpf, kernel-team, linux-mm, Christoph Lameter,
	Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton,
	Vlastimil Babka

On Wed 06-07-22 11:05:25, Alexei Starovoitov wrote:
> On Wed, Jul 06, 2022 at 06:55:36PM +0100, Matthew Wilcox wrote:
[...]
> > For example, I assume that a BPF program
> > has a fairly tight limit on how much memory it can cause to be allocated.
> > Right?
> 
> No. It's constrained by memcg limits only. It can allocate gigabytes.
 
I have very briefly had a look at the core allocator parts (please note
that my understanding of BPF is really close to zero so I might be
missing a lot of implicit stuff). So by constrained by memcg you mean
__GFP_ACCOUNT done from the allocation context (irq_work). The complete
gfp mask is GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN | __GFP_ACCOUNT,
which means this allocation is not allowed to sleep, and GFP_ATOMIC
implies __GFP_HIGH to say that access to memory reserves is allowed.
The memcg charging code interprets this as meaning the hard limit can be
breached under the assumption that such breaches are rare and will be
compensated for in some way. The bulk allocator implemented here, however,
doesn't reflect that and continues allocating as long as it sees success,
so the breach of the limit is only bounded by the number of objects to be
allocated. If that number can be really large then this is a clear problem
and the __GFP_HIGH usage is not really appropriate.
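
To spell the concern out with a sketch (illustrative shape, not the actual
patch code; push_to_free_list() is a stand-in for the per-cpu list insert):

        gfp_t gfp = GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN | __GFP_ACCOUNT;
        int i;

        for (i = 0; i < cnt; i++) {
                void *obj = kmalloc_node(size, gfp, node);

                if (!obj)
                        break;
                /* success is the only signal checked, so the memcg hard limit
                 * can be breached (via __GFP_HIGH) up to cnt times in a row */
                push_to_free_list(c, obj);
        }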

Also, I do not see any tracking of the overall memory sitting in these
pools, and I think that would be really appropriate. As there doesn't
seem to be any reclaim mechanism implemented, this can hide quite a lot
of unreachable memory.

Finally, it is not really clear what kind of entity the lifetime of
these caches is bound to. Let's say the system goes OOM: is any process
responsible for it, and would a clean up be done if it gets killed?

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-08 13:41           ` Michal Hocko
@ 2022-07-08 17:48             ` Alexei Starovoitov
  2022-07-08 20:13               ` Yosry Ahmed
                                 ` (2 more replies)
  0 siblings, 3 replies; 72+ messages in thread
From: Alexei Starovoitov @ 2022-07-08 17:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Matthew Wilcox, Christoph Hellwig, davem, daniel, andrii, tj,
	kafai, bpf, kernel-team, linux-mm, Christoph Lameter,
	Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton,
	Vlastimil Babka

On Fri, Jul 08, 2022 at 03:41:47PM +0200, Michal Hocko wrote:
> On Wed 06-07-22 11:05:25, Alexei Starovoitov wrote:
> > On Wed, Jul 06, 2022 at 06:55:36PM +0100, Matthew Wilcox wrote:
> [...]
> > > For example, I assume that a BPF program
> > > has a fairly tight limit on how much memory it can cause to be allocated.
> > > Right?
> > 
> > No. It's constrained by memcg limits only. It can allocate gigabytes.
>  
> I have very briefly had a look at the core allocator parts (please note
> that my understanding of BPF is really close to zero so I might be
> missing a lot of implicit stuff). So by constrained by memcg you mean
> __GFP_ACCOUNT done from the allocation context (irq_work). The complete
> gfp mask is GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN | __GFP_ACCOUNT
> which means this allocation is not allowed to sleep and GFP_ATOMIC
> implies __GFP_HIGH to say that access to memory reserves is allowed.
> Memcg charging code interprets this that the hard limit can be breached
> under assumption that these are rare and will be compensated in some
> way. The bulk allocator implemented here, however, doesn't reflect that
> and continues allocating as it sees a success so the breach of the limit
> is only bound by the number of objects to be allocated. If those can be
> really large then this is a clear problem and __GFP_HIGH usage is not
> really appropriate.

That was a copy-paste from the networking stack. See kmalloc_reserve().
Not sure whether it's a bug there or not.
In a separate thread we've agreed to convert all bpf allocations
to GFP_NOWAIT. For this patch set I've already fixed it in my branch.

> Also, I do not see any tracking of the overall memory sitting in these
> pools and I think this would be really appropriate. As there doesn't
> seem to be any reclaim mechanism implemented this can hide quite some
> unreachable memory.
> 
> Finally it is not really clear to what kind of entity is the life time
> of these caches bound to. Let's say the system goes OOM, is any process
> responsible for it and a clean up would be done if it gets killed?

We've been asking these questions for years and have been trying to
come up with a solution.
bpf progs are not analogous to user space processes.
There are bpf progs that function completely without a user space component.
bpf progs are pretty close to being full featured kernel modules, with
the difference that bpf progs are safe, portable, and users have
full visibility into them (source code, line info, type info, etc).
They are not binary blobs, unlike kernel modules.
But from an OOM perspective they're pretty much like .ko-s.
Which kernel module would you force unload when the system is OOMing?
Force unloading ko-s will likely crash the system.
Force unloading bpf progs may be equally bad. The system won't crash,
but it may be in a sorry state. The bpf prog could have been doing security
enforcement or network firewalling or providing key insights to critical
user space components like systemd or a health check daemon.
We've been discussing ideas on how to rank and auto-clean-up
the system state when progs have to be unloaded. Some sort of
destructor mechanism. Fingers crossed we will have it eventually.
The bpf infra keeps track of everything, of course.
Technically we can detach, unpin and unload everything, and all memory
will be returned back to the system.
Anyhow, not a new problem. Orthogonal to this patch set.
bpf progs have been doing memory allocation from day one, 8 years ago.
This patch set is trying to make it 100% safe.
Currently it's 99% safe.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-08 17:48             ` Alexei Starovoitov
@ 2022-07-08 20:13               ` Yosry Ahmed
  2022-07-08 21:55               ` Shakeel Butt
  2022-07-11 12:22               ` Michal Hocko
  2 siblings, 0 replies; 72+ messages in thread
From: Yosry Ahmed @ 2022-07-08 20:13 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Michal Hocko, Matthew Wilcox, Christoph Hellwig, davem,
	Daniel Borkmann, Andrii Nakryiko, Tejun Heo, Martin KaFai Lau,
	bpf, Kernel Team, Linux-MM, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, Vlastimil Babka

On Fri, Jul 8, 2022 at 10:49 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Jul 08, 2022 at 03:41:47PM +0200, Michal Hocko wrote:
> > On Wed 06-07-22 11:05:25, Alexei Starovoitov wrote:
> > > On Wed, Jul 06, 2022 at 06:55:36PM +0100, Matthew Wilcox wrote:
> > [...]
> > > > For example, I assume that a BPF program
> > > > has a fairly tight limit on how much memory it can cause to be allocated.
> > > > Right?
> > >
> > > No. It's constrained by memcg limits only. It can allocate gigabytes.
> >
> > I have very briefly had a look at the core allocator parts (please note
> > that my understanding of BPF is really close to zero so I might be
> > missing a lot of implicit stuff). So by constrained by memcg you mean
> > __GFP_ACCOUNT done from the allocation context (irq_work). The complete
> > gfp mask is GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN | __GFP_ACCOUNT
> > which means this allocation is not allowed to sleep and GFP_ATOMIC
> > implies __GFP_HIGH to say that access to memory reserves is allowed.
> > Memcg charging code interprets this that the hard limit can be breached
> > under assumption that these are rare and will be compensated in some
> > way. The bulk allocator implemented here, however, doesn't reflect that
> > and continues allocating as it sees a success so the breach of the limit
> > is only bound by the number of objects to be allocated. If those can be
> > really large then this is a clear problem and __GFP_HIGH usage is not
> > really appropriate.
>
> That was a copy paste from the networking stack. See kmalloc_reserve().
> Not sure whether it's a bug there or not.
> In a separate thread we've agreed to convert all of bpf allocations
> to GFP_NOWAIT. For this patch set I've already fixed it in my branch.
>
> > Also, I do not see any tracking of the overall memory sitting in these
> > pools and I think this would be really appropriate. As there doesn't
> > seem to be any reclaim mechanism implemented this can hide quite some
> > unreachable memory.
> >
> > Finally it is not really clear to what kind of entity is the life time
> > of these caches bound to. Let's say the system goes OOM, is any process
> > responsible for it and a clean up would be done if it gets killed?
>
> We've been asking these questions for years and have been trying to
> come up with a solution.
> bpf progs are not analogous to user space processes.
> There are bpf progs that function completely without user space component.
> bpf progs are pretty close to be full featured kernel modules with
> the difference that bpf progs are safe, portable and users have
> full visibility into them (source code, line info, type info, etc)
> They are not binary blobs unlike kernel modules.
> But from OOM perspective they're pretty much like .ko-s.
> Which kernel module would you force unload when system is OOMing ?
> Force unloading ko-s will likely crash the system.
> Force unloading bpf progs maybe equally bad. The system won't crash,
> but it may be a sorrow state. The bpf could have been doing security
> enforcement or network firewall or providing key insights to critical
> user space components like systemd or health check daemon.
> We've been discussing ideas on how to rank and auto cleanup
> the system state when progs have to be unloaded. Some sort of
> destructor mechanism. Fingers crossed we will have it eventually.
> bpf infra keeps track of everything, of course.
> Technically we can detach, unpin and unload everything and all memory
> will be returned back to the system.
> Anyhow not a new problem. Orthogonal to this patch set.
> bpf progs have been doing memory allocation from day one. 8 years ago.
> This patch set is trying to make it 100% safe.
> Currently it's 99% safe.
>

I think part of Michal's concern here is about memory sitting in
caches that is not yet used by any bpf allocation. I honestly didn't
look at the patches, so I don't know, but if the amount of cached
memory in the bpf allocator is significant then maybe it's worth
reclaiming it on memory pressure? Just thinking out loud.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-08 17:48             ` Alexei Starovoitov
  2022-07-08 20:13               ` Yosry Ahmed
@ 2022-07-08 21:55               ` Shakeel Butt
  2022-07-10  5:26                 ` Alexei Starovoitov
  2022-07-11 12:22               ` Michal Hocko
  2 siblings, 1 reply; 72+ messages in thread
From: Shakeel Butt @ 2022-07-08 21:55 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Michal Hocko, Matthew Wilcox, Christoph Hellwig, davem, daniel,
	andrii, tj, kafai, bpf, kernel-team, linux-mm, Christoph Lameter,
	Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton,
	Vlastimil Babka

On Fri, Jul 08, 2022 at 10:48:58AM -0700, Alexei Starovoitov wrote:
> On Fri, Jul 08, 2022 at 03:41:47PM +0200, Michal Hocko wrote:
> > On Wed 06-07-22 11:05:25, Alexei Starovoitov wrote:
> > > On Wed, Jul 06, 2022 at 06:55:36PM +0100, Matthew Wilcox wrote:
> > [...]
> > > > For example, I assume that a BPF program
> > > > has a fairly tight limit on how much memory it can cause to be allocated.
> > > > Right?
> > > 
> > > No. It's constrained by memcg limits only. It can allocate gigabytes.
> >  
> > I have very briefly had a look at the core allocator parts (please note
> > that my understanding of BPF is really close to zero so I might be
> > missing a lot of implicit stuff). So by constrained by memcg you mean
> > __GFP_ACCOUNT done from the allocation context (irq_work). The complete
> > gfp mask is GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN | __GFP_ACCOUNT
> > which means this allocation is not allowed to sleep and GFP_ATOMIC
> > implies __GFP_HIGH to say that access to memory reserves is allowed.
> > Memcg charging code interprets this that the hard limit can be breached
> > under assumption that these are rare and will be compensated in some
> > way. The bulk allocator implemented here, however, doesn't reflect that
> > and continues allocating as it sees a success so the breach of the limit
> > is only bound by the number of objects to be allocated. If those can be
> > really large then this is a clear problem and __GFP_HIGH usage is not
> > really appropriate.
> 
> That was a copy paste from the networking stack. See kmalloc_reserve().
> Not sure whether it's a bug there or not.

kmalloc_reserve() is good. Most calls to kmalloc_reserve() are for
skbs, and we don't use __GFP_ACCOUNT for skbs. Actually skbs are charged
to the memcg through a separate interface (i.e. mem_cgroup_charge_skmem()).
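
A rough sketch of the two charging paths being contrasted here; the call
sites are simplified and are not the exact kernel code:

	/* bpf_mem_alloc refill: the memcg charge rides on the gfp mask */
	static void *bpf_style_alloc(size_t size)
	{
		return kmalloc(size, GFP_ATOMIC | __GFP_NOMEMALLOC |
				     __GFP_NOWARN | __GFP_ACCOUNT);
	}

	/* skb data: allocated without __GFP_ACCOUNT; the socket's memcg is
	 * charged explicitly instead, roughly like this */
	static bool skb_style_charge(struct sock *sk, unsigned int nr_pages)
	{
		return mem_cgroup_charge_skmem(sk->sk_memcg, nr_pages, GFP_NOWAIT);
	}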

> In a separate thread we've agreed to convert all of bpf allocations
> to GFP_NOWAIT. For this patch set I've already fixed it in my branch.
> 
> > Also, I do not see any tracking of the overall memory sitting in these
> > pools and I think this would be really appropriate. As there doesn't
> > seem to be any reclaim mechanism implemented this can hide quite some
> > unreachable memory.
> > 
> > Finally it is not really clear to what kind of entity is the life time
> > of these caches bound to. Let's say the system goes OOM, is any process
> > responsible for it and a clean up would be done if it gets killed?
> 
> We've been asking these questions for years and have been trying to
> come up with a solution.
> bpf progs are not analogous to user space processes. 
> There are bpf progs that function completely without user space component.
> bpf progs are pretty close to be full featured kernel modules with
> the difference that bpf progs are safe, portable and users have
> full visibility into them (source code, line info, type info, etc)
> They are not binary blobs unlike kernel modules.
> But from OOM perspective they're pretty much like .ko-s.
> Which kernel module would you force unload when system is OOMing ?
> Force unloading ko-s will likely crash the system.
> Force unloading bpf progs maybe equally bad. The system won't crash,
> but it may be a sorrow state. The bpf could have been doing security
> enforcement or network firewall or providing key insights to critical
> user space components like systemd or health check daemon.
> We've been discussing ideas on how to rank and auto cleanup
> the system state when progs have to be unloaded. Some sort of
> destructor mechanism. Fingers crossed we will have it eventually.
> bpf infra keeps track of everything, of course.
> Technically we can detach, unpin and unload everything and all memory
> will be returned back to the system.
> Anyhow not a new problem. Orthogonal to this patch set.
> bpf progs have been doing memory allocation from day one. 8 years ago.
> This patch set is trying to make it 100% safe.
> Currently it's 99% safe.

Most probably Michal's comment was on free objects sitting in the caches
(also pointed out by Yosry). Should we drain them on memory pressure /
OOM or should we ignore them as the amount of memory is not significant?
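
Purely to illustrate the kind of drain hook being asked about here (this is
not something the patch set implements, it assumes the register_shrinker()
interface of that time, and both helpers are hypothetical):

	static unsigned long bpf_ma_count(struct shrinker *s, struct shrink_control *sc)
	{
		return bpf_ma_nr_free_objs();		/* hypothetical: free objs across all caches */
	}

	static unsigned long bpf_ma_scan(struct shrinker *s, struct shrink_control *sc)
	{
		return bpf_ma_drain(sc->nr_to_scan);	/* hypothetical: kfree that many free objs */
	}

	static struct shrinker bpf_ma_shrinker = {
		.count_objects = bpf_ma_count,
		.scan_objects  = bpf_ma_scan,
		.seeks         = DEFAULT_SEEKS,
	};
	/* register_shrinker(&bpf_ma_shrinker); */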

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-08 21:55               ` Shakeel Butt
@ 2022-07-10  5:26                 ` Alexei Starovoitov
  2022-07-10  7:32                   ` Shakeel Butt
  0 siblings, 1 reply; 72+ messages in thread
From: Alexei Starovoitov @ 2022-07-10  5:26 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Hocko, Matthew Wilcox, Christoph Hellwig, David S. Miller,
	Daniel Borkmann, Andrii Nakryiko, Tejun Heo, Martin KaFai Lau,
	bpf, Kernel Team, linux-mm, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, Vlastimil Babka

On Fri, Jul 8, 2022 at 2:55 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Fri, Jul 08, 2022 at 10:48:58AM -0700, Alexei Starovoitov wrote:
> > On Fri, Jul 08, 2022 at 03:41:47PM +0200, Michal Hocko wrote:
> > > On Wed 06-07-22 11:05:25, Alexei Starovoitov wrote:
> > > > On Wed, Jul 06, 2022 at 06:55:36PM +0100, Matthew Wilcox wrote:
> > > [...]
> > > > > For example, I assume that a BPF program
> > > > > has a fairly tight limit on how much memory it can cause to be allocated.
> > > > > Right?
> > > >
> > > > No. It's constrained by memcg limits only. It can allocate gigabytes.
> > >
> > > I have very briefly had a look at the core allocator parts (please note
> > > that my understanding of BPF is really close to zero so I might be
> > > missing a lot of implicit stuff). So by constrained by memcg you mean
> > > __GFP_ACCOUNT done from the allocation context (irq_work). The complete
> > > gfp mask is GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN | __GFP_ACCOUNT
> > > which means this allocation is not allowed to sleep and GFP_ATOMIC
> > > implies __GFP_HIGH to say that access to memory reserves is allowed.
> > > Memcg charging code interprets this that the hard limit can be breached
> > > under assumption that these are rare and will be compensated in some
> > > way. The bulk allocator implemented here, however, doesn't reflect that
> > > and continues allocating as it sees a success so the breach of the limit
> > > is only bound by the number of objects to be allocated. If those can be
> > > really large then this is a clear problem and __GFP_HIGH usage is not
> > > really appropriate.
> >
> > That was a copy paste from the networking stack. See kmalloc_reserve().
> > Not sure whether it's a bug there or not.
>
> kmalloc_reserve() is good. Most of calls to kmalloc_reserve() are for
> skbs and we don't use __GFP_ACCOUNT for skbs. Actually skbs are charged
> to memcg through a separate interface (i.e. mem_cgroup_charge_skmem())
>
> > In a separate thread we've agreed to convert all of bpf allocations
> > to GFP_NOWAIT. For this patch set I've already fixed it in my branch.
> >
> > > Also, I do not see any tracking of the overall memory sitting in these
> > > pools and I think this would be really appropriate. As there doesn't
> > > seem to be any reclaim mechanism implemented this can hide quite some
> > > unreachable memory.
> > >
> > > Finally it is not really clear to what kind of entity is the life time
> > > of these caches bound to. Let's say the system goes OOM, is any process
> > > responsible for it and a clean up would be done if it gets killed?
> >
> > We've been asking these questions for years and have been trying to
> > come up with a solution.
> > bpf progs are not analogous to user space processes.
> > There are bpf progs that function completely without user space component.
> > bpf progs are pretty close to be full featured kernel modules with
> > the difference that bpf progs are safe, portable and users have
> > full visibility into them (source code, line info, type info, etc)
> > They are not binary blobs unlike kernel modules.
> > But from OOM perspective they're pretty much like .ko-s.
> > Which kernel module would you force unload when system is OOMing ?
> > Force unloading ko-s will likely crash the system.
> > Force unloading bpf progs maybe equally bad. The system won't crash,
> > but it may be a sorrow state. The bpf could have been doing security
> > enforcement or network firewall or providing key insights to critical
> > user space components like systemd or health check daemon.
> > We've been discussing ideas on how to rank and auto cleanup
> > the system state when progs have to be unloaded. Some sort of
> > destructor mechanism. Fingers crossed we will have it eventually.
> > bpf infra keeps track of everything, of course.
> > Technically we can detach, unpin and unload everything and all memory
> > will be returned back to the system.
> > Anyhow not a new problem. Orthogonal to this patch set.
> > bpf progs have been doing memory allocation from day one. 8 years ago.
> > This patch set is trying to make it 100% safe.
> > Currently it's 99% safe.
>
> Most probably Michal's comment was on free objects sitting in the caches
> (also pointed out by Yosry). Should we drain them on memory pressure /
> OOM or should we ignore them as the amount of memory is not significant?

Are you suggesting to design a shrinker for 0.01% of the memory
consumed by bpf?
And such drain would help... how?

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-10  5:26                 ` Alexei Starovoitov
@ 2022-07-10  7:32                   ` Shakeel Butt
  2022-07-11 12:15                     ` Michal Hocko
  0 siblings, 1 reply; 72+ messages in thread
From: Shakeel Butt @ 2022-07-10  7:32 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Michal Hocko, Matthew Wilcox, Christoph Hellwig, David S. Miller,
	Daniel Borkmann, Andrii Nakryiko, Tejun Heo, Martin KaFai Lau,
	bpf, Kernel Team, linux-mm, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, Vlastimil Babka

On Sat, Jul 09, 2022 at 10:26:23PM -0700, Alexei Starovoitov wrote:
> On Fri, Jul 8, 2022 at 2:55 PM Shakeel Butt <shakeelb@google.com> wrote:
[...]
> >
> > Most probably Michal's comment was on free objects sitting in the caches
> > (also pointed out by Yosry). Should we drain them on memory pressure /
> > OOM or should we ignore them as the amount of memory is not significant?
> 
> Are you suggesting to design a shrinker for 0.01% of the memory
> consumed by bpf?

No, just claim that the memory sitting on such caches is insignificant.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-10  7:32                   ` Shakeel Butt
@ 2022-07-11 12:15                     ` Michal Hocko
  2022-07-12  4:39                       ` Alexei Starovoitov
  0 siblings, 1 reply; 72+ messages in thread
From: Michal Hocko @ 2022-07-11 12:15 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Alexei Starovoitov, Matthew Wilcox, Christoph Hellwig,
	David S. Miller, Daniel Borkmann, Andrii Nakryiko, Tejun Heo,
	Martin KaFai Lau, bpf, Kernel Team, linux-mm, Christoph Lameter,
	Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton,
	Vlastimil Babka

On Sun 10-07-22 07:32:13, Shakeel Butt wrote:
> On Sat, Jul 09, 2022 at 10:26:23PM -0700, Alexei Starovoitov wrote:
> > On Fri, Jul 8, 2022 at 2:55 PM Shakeel Butt <shakeelb@google.com> wrote:
> [...]
> > >
> > > Most probably Michal's comment was on free objects sitting in the caches
> > > (also pointed out by Yosry). Should we drain them on memory pressure /
> > > OOM or should we ignore them as the amount of memory is not significant?
> > 
> > Are you suggesting to design a shrinker for 0.01% of the memory
> > consumed by bpf?
> 
> No, just claim that the memory sitting on such caches is insignificant.

yes, that is not really clear from the patch description. Earlier you
have said that the memory consumed might go into GBs. If that is a
memory that is actively used and not really reclaimable then bad luck.
There are other users like that in the kernel and this is not a new
problem. I think it would really help to add a counter to describe both
the overall memory claimed by the bpf allocator and actively used
portion of it. If you use our standard vmstat infrastructure then we can
easily show that information in the OOM report.
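
A minimal sketch of what such a counter could look like, assuming a
hypothetical NR_BPF_MEMALLOC node_stat_item (the name and placement are
illustrative only, not an existing kernel counter):

	/* include/linux/mmzone.h: new entry in enum node_stat_item, plus the
	 * matching string in vmstat_text[] in mm/vmstat.c */
	NR_BPF_MEMALLOC,	/* pages sitting in bpf allocator caches */

	/* kernel/bpf/memalloc.c: adjust it as pages enter/leave the caches */
	mod_node_page_state(page_pgdat(virt_to_page(obj)), NR_BPF_MEMALLOC, nr_pages);

The item would then show up in /proc/vmstat and could be printed as part of
the OOM report.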
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-08 17:48             ` Alexei Starovoitov
  2022-07-08 20:13               ` Yosry Ahmed
  2022-07-08 21:55               ` Shakeel Butt
@ 2022-07-11 12:22               ` Michal Hocko
  2 siblings, 0 replies; 72+ messages in thread
From: Michal Hocko @ 2022-07-11 12:22 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Matthew Wilcox, Christoph Hellwig, davem, daniel, andrii, tj,
	kafai, bpf, kernel-team, linux-mm, Christoph Lameter,
	Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton,
	Vlastimil Babka

On Fri 08-07-22 10:48:58, Alexei Starovoitov wrote:
> On Fri, Jul 08, 2022 at 03:41:47PM +0200, Michal Hocko wrote:
[...]
> > Finally it is not really clear to what kind of entity is the life time
> > of these caches bound to. Let's say the system goes OOM, is any process
> > responsible for it and a clean up would be done if it gets killed?
> 
> We've been asking these questions for years and have been trying to
> come up with a solution.
> bpf progs are not analogous to user space processes. 
> There are bpf progs that function completely without user space component.
> bpf progs are pretty close to be full featured kernel modules with
> the difference that bpf progs are safe, portable and users have
> full visibility into them (source code, line info, type info, etc)
> They are not binary blobs unlike kernel modules.
> But from OOM perspective they're pretty much like .ko-s.
> Which kernel module would you force unload when system is OOMing ?
> Force unloading ko-s will likely crash the system.
> Force unloading bpf progs maybe equally bad. The system won't crash,
> but it may be a sorrow state. The bpf could have been doing security
> enforcement or network firewall or providing key insights to critical
> user space components like systemd or health check daemon.
> We've been discussing ideas on how to rank and auto cleanup
> the system state when progs have to be unloaded. Some sort of
> destructor mechanism. Fingers crossed we will have it eventually.
> bpf infra keeps track of everything, of course.
> Technically we can detach, unpin and unload everything and all memory
> will be returned back to the system.
> Anyhow not a new problem. Orthogonal to this patch set.
> bpf progs have been doing memory allocation from day one. 8 years ago.
> This patch set is trying to make it 100% safe.
> Currently it's 99% safe.

OK, thanks for the clarification. There is still one thing that is not
really clear to me. Without a proper ownership bound to any process why
is it desired/helpful to account the memory to a memcg?

We have discussed something similar in a different email thread and I
still didn't manage to find time to put all the parts together. But if
the initiator (or however you call the process which loads the program)
exits then this might be the last process in the specific cgroup and so
it can be offlined and mostly invisible to an admin.

As you have explained there is nothing really actionable on this memory
by the OOM killer either. So does it actually buy us much to account?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-11 12:15                     ` Michal Hocko
@ 2022-07-12  4:39                       ` Alexei Starovoitov
  2022-07-12  7:40                         ` Michal Hocko
  0 siblings, 1 reply; 72+ messages in thread
From: Alexei Starovoitov @ 2022-07-12  4:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Shakeel Butt, Matthew Wilcox, Christoph Hellwig, David S. Miller,
	Daniel Borkmann, Andrii Nakryiko, Tejun Heo, Martin KaFai Lau,
	bpf, Kernel Team, linux-mm, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, Vlastimil Babka

On Mon, Jul 11, 2022 at 02:15:07PM +0200, Michal Hocko wrote:
> On Sun 10-07-22 07:32:13, Shakeel Butt wrote:
> > On Sat, Jul 09, 2022 at 10:26:23PM -0700, Alexei Starovoitov wrote:
> > > On Fri, Jul 8, 2022 at 2:55 PM Shakeel Butt <shakeelb@google.com> wrote:
> > [...]
> > > >
> > > > Most probably Michal's comment was on free objects sitting in the caches
> > > > (also pointed out by Yosry). Should we drain them on memory pressure /
> > > > OOM or should we ignore them as the amount of memory is not significant?
> > > 
> > > Are you suggesting to design a shrinker for 0.01% of the memory
> > > consumed by bpf?
> > 
> > No, just claim that the memory sitting on such caches is insignificant.
> 
> yes, that is not really clear from the patch description. Earlier you
> have said that the memory consumed might go into GBs. If that is a
> memory that is actively used and not really reclaimable then bad luck.
> There are other users like that in the kernel and this is not a new
> problem. I think it would really help to add a counter to describe both
> the overall memory claimed by the bpf allocator and actively used
> portion of it. If you use our standard vmstat infrastructure then we can
> easily show that information in the OOM report.

OOM report can potentially be extended with info about bpf consumed
memory, but it's not clear whether it will help OOM analysis.
bpftool map show
prints all map data already.
Some devs use bpf to inspect bpf maps for finer details in run-time.
drgn scripts pull that data from crash dumps.
There is no need for new counters.
The idea of bpf specific counters/limits was rejected by memcg folks.

> OK, thanks for the clarification. There is still one thing that is not
> really clear to me. Without a proper ownership bound to any process why
> is it desired/helpful to account the memory to a memcg?

The first step is to have a limit. memcg provides it.

> We have discussed something similar in a different email thread and I
> still didn't manage to find time to put all the parts together. But if
> the initiator (or however you call the process which loads the program)
> exits then this might be the last process in the specific cgroup and so
> it can be offlined and mostly invisible to an admin.

Roman already sent reparenting fix:
https://patchwork.kernel.org/project/netdevbpf/patch/20220711162827.184743-1-roman.gushchin@linux.dev/

> As you have explained there is nothing really actionable on this memory
> by the OOM killer either. So does it actually buy us much to account?

It will be actionable. One step at a time.
In the other thread we've discussed an idea to make memcg selectable when
bpf objects are created. The user might create a special memcg and use it for
all things bpf. This might be the way to provide bpf specific accounting
and limits.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-12  4:39                       ` Alexei Starovoitov
@ 2022-07-12  7:40                         ` Michal Hocko
  2022-07-12  8:39                           ` Yafang Shao
                                             ` (2 more replies)
  0 siblings, 3 replies; 72+ messages in thread
From: Michal Hocko @ 2022-07-12  7:40 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Shakeel Butt, Matthew Wilcox, Christoph Hellwig, David S. Miller,
	Daniel Borkmann, Andrii Nakryiko, Tejun Heo, Martin KaFai Lau,
	bpf, Kernel Team, linux-mm, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, Vlastimil Babka

On Mon 11-07-22 21:39:14, Alexei Starovoitov wrote:
> On Mon, Jul 11, 2022 at 02:15:07PM +0200, Michal Hocko wrote:
> > On Sun 10-07-22 07:32:13, Shakeel Butt wrote:
> > > On Sat, Jul 09, 2022 at 10:26:23PM -0700, Alexei Starovoitov wrote:
> > > > On Fri, Jul 8, 2022 at 2:55 PM Shakeel Butt <shakeelb@google.com> wrote:
> > > [...]
> > > > >
> > > > > Most probably Michal's comment was on free objects sitting in the caches
> > > > > (also pointed out by Yosry). Should we drain them on memory pressure /
> > > > > OOM or should we ignore them as the amount of memory is not significant?
> > > > 
> > > > Are you suggesting to design a shrinker for 0.01% of the memory
> > > > consumed by bpf?
> > > 
> > > No, just claim that the memory sitting on such caches is insignificant.
> > 
> > yes, that is not really clear from the patch description. Earlier you
> > have said that the memory consumed might go into GBs. If that is a
> > memory that is actively used and not really reclaimable then bad luck.
> > There are other users like that in the kernel and this is not a new
> > problem. I think it would really help to add a counter to describe both
> > the overall memory claimed by the bpf allocator and actively used
> > portion of it. If you use our standard vmstat infrastructure then we can
> > easily show that information in the OOM report.
> 
> OOM report can potentially be extended with info about bpf consumed
> memory, but it's not clear whether it will help OOM analysis.

If GBs of memory can be sitting there then it is surely an interesting
information to have when seeing OOM. One of the big shortcomings of the
OOM analysis is unaccounted memory.

> bpftool map show
> prints all map data already.
> Some devs use bpf to inspect bpf maps for finer details in run-time.
> drgn scripts pull that data from crash dumps.
> There is no need for new counters.
> The idea of bpf specific counters/limits was rejected by memcg folks.

I would argue that integration into vmstat is useful not only for oom
analysis but also for regular health check scripts watching /proc/vmstat
content. I do not think most of those generic tools are BPF aware. So
unless there is a good reason to not account this memory there then I
would vote for adding them. They are cheap and easy to integrate.
 
> > OK, thanks for the clarification. There is still one thing that is not
> > really clear to me. Without a proper ownership bound to any process why
> > is it desired/helpful to account the memory to a memcg?
> 
> The first step is to have a limit. memcg provides it.

I am sorry but this doesn't really explain it. Could you elaborate
please? Is the limit supposed to protect against adversaries? Or is it
just to prevent from accidental runaways? Is it purely for accounting
purposes?

> > We have discussed something similar in a different email thread and I
> > still didn't manage to find time to put all the parts together. But if
> > the initiator (or however you call the process which loads the program)
> > exits then this might be the last process in the specific cgroup and so
> > it can be offlined and mostly invisible to an admin.
> 
> Roman already sent reparenting fix:
> https://patchwork.kernel.org/project/netdevbpf/patch/20220711162827.184743-1-roman.gushchin@linux.dev/

Reparenting is nice but not a silver bullet. Consider a shallow
hierarchy where the charging happens in the first level under the root
memcg. Reparenting to the root is just pushing everything under the
system resources category.
 
> > As you have explained there is nothing really actionable on this memory
> > by the OOM killer either. So does it actually buy us much to account?
> 
> It will be actionable. One step at a time.
> In the other thread we've discussed an idea to make memcg selectable when
> bpf objects are created. The user might create a special memcg and use it for
> all things bpf. This might be the way to provide bpf specific accounting
> and limits.

Do you have a reference for those discussions?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-12  7:40                         ` Michal Hocko
@ 2022-07-12  8:39                           ` Yafang Shao
  2022-07-12  9:52                             ` Michal Hocko
  2022-07-12 18:40                           ` [PATCH bpf-next 0/5] bpf: BPF specific memory allocator Alexei Starovoitov
  2022-07-13  2:27                           ` Roman Gushchin
  2 siblings, 1 reply; 72+ messages in thread
From: Yafang Shao @ 2022-07-12  8:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexei Starovoitov, Shakeel Butt, Matthew Wilcox,
	Christoph Hellwig, David S. Miller, Daniel Borkmann,
	Andrii Nakryiko, Tejun Heo, Martin KaFai Lau, bpf, Kernel Team,
	linux-mm, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Andrew Morton, Vlastimil Babka

On Tue, Jul 12, 2022 at 3:40 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 11-07-22 21:39:14, Alexei Starovoitov wrote:
> > On Mon, Jul 11, 2022 at 02:15:07PM +0200, Michal Hocko wrote:
> > > On Sun 10-07-22 07:32:13, Shakeel Butt wrote:
> > > > On Sat, Jul 09, 2022 at 10:26:23PM -0700, Alexei Starovoitov wrote:
> > > > > On Fri, Jul 8, 2022 at 2:55 PM Shakeel Butt <shakeelb@google.com> wrote:
> > > > [...]
> > > > > >
> > > > > > Most probably Michal's comment was on free objects sitting in the caches
> > > > > > (also pointed out by Yosry). Should we drain them on memory pressure /
> > > > > > OOM or should we ignore them as the amount of memory is not significant?
> > > > >
> > > > > Are you suggesting to design a shrinker for 0.01% of the memory
> > > > > consumed by bpf?
> > > >
> > > > No, just claim that the memory sitting on such caches is insignificant.
> > >
> > > yes, that is not really clear from the patch description. Earlier you
> > > have said that the memory consumed might go into GBs. If that is a
> > > memory that is actively used and not really reclaimable then bad luck.
> > > There are other users like that in the kernel and this is not a new
> > > problem. I think it would really help to add a counter to describe both
> > > the overall memory claimed by the bpf allocator and actively used
> > > portion of it. If you use our standard vmstat infrastructure then we can
> > > easily show that information in the OOM report.
> >
> > OOM report can potentially be extended with info about bpf consumed
> > memory, but it's not clear whether it will help OOM analysis.
>
> If GBs of memory can be sitting there then it is surely an interesting
> information to have when seeing OOM. One of the big shortcomings of the
> OOM analysis is unaccounted memory.
>
> > bpftool map show
> > prints all map data already.
> > Some devs use bpf to inspect bpf maps for finer details in run-time.
> > drgn scripts pull that data from crash dumps.
> > There is no need for new counters.
> > The idea of bpf specific counters/limits was rejected by memcg folks.
>
> I would argue that integration into vmstat is useful not only for oom
> analysis but also for regular health check scripts watching /proc/vmstat
> content. I do not think most of those generic tools are BPF aware. So
> unless there is a good reason to not account this memory there then I
> would vote for adding them. They are cheap and easy to integrate.
>
> > > OK, thanks for the clarification. There is still one thing that is not
> > > really clear to me. Without a proper ownership bound to any process why
> > > is it desired/helpful to account the memory to a memcg?
> >
> > The first step is to have a limit. memcg provides it.
>
> I am sorry but this doesn't really explain it. Could you elaborate
> please? Is the limit supposed to protect against adversaries? Or is it
> just to prevent from accidental runaways? Is it purely for accounting
> purposes?
>
> > > We have discussed something similar in a different email thread and I
> > > still didn't manage to find time to put all the parts together. But if
> > > the initiator (or however you call the process which loads the program)
> > > exits then this might be the last process in the specific cgroup and so
> > > it can be offlined and mostly invisible to an admin.
> >
> > Roman already sent reparenting fix:
> > https://patchwork.kernel.org/project/netdevbpf/patch/20220711162827.184743-1-roman.gushchin@linux.dev/
>
> Reparenting is nice but not a silver bullet. Consider a shallow
> hierarchy where the charging happens in the first level under the root
> memcg. Reparenting to the root is just pushing everything under the
> system resources category.
>

Agreed. That's why I don't like reparenting.
Reparenting just reparents the charged pages and then redirects the new
charges, but it can't reparent the 'limit' of the original memcg.
So it is a risk if the original memcg is still being charged. We have
to forbid the destruction of the original memcg.

> > > As you have explained there is nothing really actionable on this memory
> > > by the OOM killer either. So does it actually buy us much to account?
> >
> > It will be actionable. One step at a time.
> > In the other thread we've discussed an idea to make memcg selectable when
> > bpf objects are created. The user might create a special memcg and use it for
> > all things bpf. This might be the way to provide bpf specific accounting
> > and limits.
>
> Do you have a reference for those discussions?
>

I think it is
https://lore.kernel.org/bpf/CALOAHbCM=ZxwutQOPmJx2LKY3Pd_hs+8v8r4-ybwPbBNBuNjXA@mail.gmail.com/
Introducing an independent memcg to manage pinned bpf programs and maps,
and forbidding the user from destroying it while bpf programs are still
pinned, is the best workaround so far, per my analysis.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-12  8:39                           ` Yafang Shao
@ 2022-07-12  9:52                             ` Michal Hocko
  2022-07-12 15:25                               ` Shakeel Butt
                                                 ` (2 more replies)
  0 siblings, 3 replies; 72+ messages in thread
From: Michal Hocko @ 2022-07-12  9:52 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Alexei Starovoitov, Shakeel Butt, Matthew Wilcox,
	Christoph Hellwig, David S. Miller, Daniel Borkmann,
	Andrii Nakryiko, Tejun Heo, Martin KaFai Lau, bpf, Kernel Team,
	linux-mm, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Andrew Morton, Vlastimil Babka

On Tue 12-07-22 16:39:48, Yafang Shao wrote:
> On Tue, Jul 12, 2022 at 3:40 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > > Roman already sent reparenting fix:
> > > https://patchwork.kernel.org/project/netdevbpf/patch/20220711162827.184743-1-roman.gushchin@linux.dev/
> >
> > Reparenting is nice but not a silver bullet. Consider a shallow
> > hierarchy where the charging happens in the first level under the root
> > memcg. Reparenting to the root is just pushing everything under the
> > system resources category.
> >
> 
> Agreed. That's why I don't like reparenting.
> Reparenting just reparent the charged pages and then redirect the new
> charge, but can't reparents the 'limit' of the original memcg.
> So it is a risk if the original memcg is still being charged. We have
> to forbid the destruction of the original memcg.

yes, I was toying with an idea like that. I guess we really want a
measure to keep cgroups around if they are bound to a resource which is
sticky itself. I am not sure how many other resources like BPF (aka
module like) we already do charge for memcg but considering the
potential memory consumption just reparenting will not help in general
case I am afraid.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-12  9:52                             ` Michal Hocko
@ 2022-07-12 15:25                               ` Shakeel Butt
  2022-07-12 16:32                                 ` Tejun Heo
  2022-07-12 16:24                               ` Tejun Heo
  2022-07-13  2:39                               ` Roman Gushchin
  2 siblings, 1 reply; 72+ messages in thread
From: Shakeel Butt @ 2022-07-12 15:25 UTC (permalink / raw)
  To: Michal Hocko, Yosry Ahmed, Muchun Song, Johannes Weiner
  Cc: Yafang Shao, Alexei Starovoitov, Matthew Wilcox,
	Christoph Hellwig, David S. Miller, Daniel Borkmann,
	Andrii Nakryiko, Tejun Heo, Martin KaFai Lau, bpf, Kernel Team,
	linux-mm, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Andrew Morton, Vlastimil Babka

CCing Yosry, Muchun & Johannes

On Tue, Jul 12, 2022 at 2:52 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 12-07-22 16:39:48, Yafang Shao wrote:
> > On Tue, Jul 12, 2022 at 3:40 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > > Roman already sent reparenting fix:
> > > > https://patchwork.kernel.org/project/netdevbpf/patch/20220711162827.184743-1-roman.gushchin@linux.dev/
> > >
> > > Reparenting is nice but not a silver bullet. Consider a shallow
> > > hierarchy where the charging happens in the first level under the root
> > > memcg. Reparenting to the root is just pushing everything under the
> > > system resources category.
> > >
> >
> > Agreed. That's why I don't like reparenting.
> > Reparenting just reparent the charged pages and then redirect the new
> > charge, but can't reparents the 'limit' of the original memcg.
> > So it is a risk if the original memcg is still being charged. We have
> > to forbid the destruction of the original memcg.
>
> yes, I was toying with an idea like that. I guess we really want a
> measure to keep cgroups around if they are bound to a resource which is
> sticky itself. I am not sure how many other resources like BPF (aka
> module like) we already do charge for memcg but considering the
> potential memory consumption just reparenting will not help in general
> case I am afraid.

Another very obvious example is the filesystem shared between multiple
jobs. We had a similar discussion [1] on LRU reparenting patch series.

For this use-case internally we have a memcg= mount option where the
given memcg is the common ancestor (think of pod in k8s environment)
of the jobs who are sharing the filesystem.

[1] https://lore.kernel.org/all/20210330101531.82752-1-songmuchun@bytedance.com/

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-12  9:52                             ` Michal Hocko
  2022-07-12 15:25                               ` Shakeel Butt
@ 2022-07-12 16:24                               ` Tejun Heo
  2022-07-18 14:13                                 ` Michal Hocko
  2022-07-13  2:39                               ` Roman Gushchin
  2 siblings, 1 reply; 72+ messages in thread
From: Tejun Heo @ 2022-07-12 16:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Yafang Shao, Alexei Starovoitov, Shakeel Butt, Matthew Wilcox,
	Christoph Hellwig, David S. Miller, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, bpf, Kernel Team, linux-mm,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Vlastimil Babka

Hello, Michal.

On Tue, Jul 12, 2022 at 11:52:11AM +0200, Michal Hocko wrote:
> > Agreed. That's why I don't like reparenting.
> > Reparenting just reparent the charged pages and then redirect the new
> > charge, but can't reparents the 'limit' of the original memcg.
> > So it is a risk if the original memcg is still being charged. We have
> > to forbid the destruction of the original memcg.
> 
> yes, I was toying with an idea like that. I guess we really want a
> measure to keep cgroups around if they are bound to a resource which is
> sticky itself. I am not sure how many other resources like BPF (aka
> module like) we already do charge for memcg but considering the
> potential memory consumption just reparenting will not help in general
> case I am afraid.

I think the solution here is an extra cgroup layering to represent
persistent resource tracking. In systemd-speak, a service should have a
cgroup representing a persistent service type and a cgroup representing the
current running instance. This way, the user (or system agent) can clearly
distinguish all resources that have ever been attributed to the service and
the resources that are accounted to the current instance while also giving
visibility into residual resources for services that are no longer running.
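
For example (layout purely illustrative):

	foo.service/			<- persistent cgroup for the service type;
					   sticky resources (pinned bpf maps, tmpfs, ...)
					   stay attributed here across restarts
	foo.service/instance-42/	<- cgroup for the currently running instance;
					   torn down and recreated on each restart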

This gives userspace control over what to track and for how long, and also fits
what the kernel can do in terms of resource tracking. If we try to do
something smart from the kernel side, there are cases which are inherently
unsolvable. E.g. if a service instance creates tmpfs / shmem / whatever and
leaves it pinned one way or another and then exits, and there's no one who
actively accesses it afterwards, there is no userland-visible entity we can
reasonably attribute that memory to other than the parent cgroup.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-12 15:25                               ` Shakeel Butt
@ 2022-07-12 16:32                                 ` Tejun Heo
  2022-07-12 17:26                                   ` Shakeel Butt
  0 siblings, 1 reply; 72+ messages in thread
From: Tejun Heo @ 2022-07-12 16:32 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Hocko, Yosry Ahmed, Muchun Song, Johannes Weiner,
	Yafang Shao, Alexei Starovoitov, Matthew Wilcox,
	Christoph Hellwig, David S. Miller, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, bpf, Kernel Team, linux-mm,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Vlastimil Babka

Hello,

On Tue, Jul 12, 2022 at 08:25:24AM -0700, Shakeel Butt wrote:
> Another very obvious example is the filesystem shared between multiple
> jobs. We had a similar discussion [1] on LRU reparenting patch series.

Hmm... if I'm understanding correctly, what's discussed in [1] can be solved
with proper reparenting and nesting, right?

> For this use-case internally we have a memcg= mount option where the
> given memcg is the common ancestor (think of pod in k8s environment)
> of the jobs who are sharing the filesystem.

Can you elaborate a bit more on this? We've never really supported correctly
accounting pages shared across cgroups because it can be very complicating
and the use cases aren't that wide-spread. What's being shared? How big is
the shared portion in relation to total memory usage? What's the cgroup
topology like?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-12 16:32                                 ` Tejun Heo
@ 2022-07-12 17:26                                   ` Shakeel Butt
  2022-07-12 17:36                                     ` Tejun Heo
  0 siblings, 1 reply; 72+ messages in thread
From: Shakeel Butt @ 2022-07-12 17:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Yosry Ahmed, Muchun Song, Johannes Weiner,
	Yafang Shao, Alexei Starovoitov, Matthew Wilcox,
	Christoph Hellwig, David S. Miller, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, bpf, Kernel Team, linux-mm,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Vlastimil Babka

On Tue, Jul 12, 2022 at 9:32 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Tue, Jul 12, 2022 at 08:25:24AM -0700, Shakeel Butt wrote:
> > Another very obvious example is the filesystem shared between multiple
> > jobs. We had a similar discussion [1] on LRU reparenting patch series.
>
> Hmm... if I'm understanding correctly, what's discussed in [1] can be solved
> with proper reparenting and nesting, right?
>

To some extent, i.e. the zombies will go away, but the accounting/stats
of the sub-jobs will be nondeterministic until all the potentially shared
stuff is reparented. Let me give a more concrete example below.

> > For this use-case internally we have a memcg= mount option where the
> > given memcg is the common ancestor (think of pod in k8s environment)
> > of the jobs who are sharing the filesystem.
>
> Can you elaborate a bit more on this? We've never really supported correctly
> accounting pages shared across cgroups because it can be very complicating
> and the use cases aren't that wide-spread. What's being shared? How big is
> the shared portion in relation to total memory usage? What's the cgroup
> topology like?
>

One use-case we have is a build & test service which runs independent
builds and tests but all the build utilities (compiler, linker,
libraries) are shared between those builds and tests.

In terms of topology, the service has a top level cgroup (P) and all
independent builds and tests run in their own cgroup under P. These
builds/tests continuously come and go.

This service continuously monitors all the builds/tests running and
may kill some based on some criteria which includes memory usage.
However the memory usage is nondeterministic and killing a specific
build/test may not really free memory if most of the memory charged to
it is from shared build utilities.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-12 17:26                                   ` Shakeel Butt
@ 2022-07-12 17:36                                     ` Tejun Heo
  2022-07-12 18:11                                       ` Shakeel Butt
  0 siblings, 1 reply; 72+ messages in thread
From: Tejun Heo @ 2022-07-12 17:36 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Hocko, Yosry Ahmed, Muchun Song, Johannes Weiner,
	Yafang Shao, Alexei Starovoitov, Matthew Wilcox,
	Christoph Hellwig, David S. Miller, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, bpf, Kernel Team, linux-mm,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Vlastimil Babka

Hello,

On Tue, Jul 12, 2022 at 10:26:22AM -0700, Shakeel Butt wrote:
> One use-case we have is a build & test service which runs independent
> builds and tests but all the build utilities (compiler, linker,
> libraries) are shared between those builds and tests.
> 
> In terms of topology, the service has a top level cgroup (P) and all
> independent builds and tests run in their own cgroup under P. These
> builds/tests continuously come and go.
> 
> This service continuously monitors all the builds/tests running and
> may kill some based on some criteria which includes memory usage.
> However the memory usage is nondeterministic and killing a specific
> build/test may not really free memory if most of the memory charged to
> it is from shared build utilities.

That doesn't sound too unusual. So, one saving grace here is that the memory
pressure in the stressed cgroup should trigger reclaim of the shared memory
which will be likely picked up by someone else, hopefully, under less memory
pressure. Can you give more concrete details? I.e. describe a failing
scenario with actual ballpark memory numbers?

FWIW, at least from a generic resource control standpoint, I think it may
make sense to have a way to escape certain resources to an ancestor for
shared resources provided that we can come up with a sane interface.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-12 17:36                                     ` Tejun Heo
@ 2022-07-12 18:11                                       ` Shakeel Butt
  2022-07-12 18:43                                         ` Alexei Starovoitov
  2022-07-12 19:11                                         ` Mina Almasry
  0 siblings, 2 replies; 72+ messages in thread
From: Shakeel Butt @ 2022-07-12 18:11 UTC (permalink / raw)
  To: Tejun Heo, Mina Almasry
  Cc: Michal Hocko, Yosry Ahmed, Muchun Song, Johannes Weiner,
	Yafang Shao, Alexei Starovoitov, Matthew Wilcox,
	Christoph Hellwig, David S. Miller, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, bpf, Kernel Team, linux-mm,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Vlastimil Babka

Ccing Mina who actually worked on upstreaming this. See [1] for
previous discussion and more use-cases.

[1] https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/

On Tue, Jul 12, 2022 at 10:36 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Tue, Jul 12, 2022 at 10:26:22AM -0700, Shakeel Butt wrote:
> > One use-case we have is a build & test service which runs independent
> > builds and tests but all the build utilities (compiler, linker,
> > libraries) are shared between those builds and tests.
> >
> > In terms of topology, the service has a top level cgroup (P) and all
> > independent builds and tests run in their own cgroup under P. These
> > builds/tests continuously come and go.
> >
> > This service continuously monitors all the builds/tests running and
> > may kill some based on some criteria which includes memory usage.
> > However the memory usage is nondeterministic and killing a specific
> > build/test may not really free memory if most of the memory charged to
> > it is from shared build utilities.
>
> That doesn't sound too unusual. So, one saving grace here is that the memory
> pressure in the stressed cgroup should trigger reclaim of the shared memory
> which will be likely picked up by someone else, hopefully, under less memory
> pressure. Can you give more concrete details? I.e. describe a failing
> scenario with actual ballpark memory numbers?

Mina, can you please provide details requested by Tejun?

>
> FWIW, at least from a generic resource control standpoint, I think it may
> make sense to have a way to escape certain resources to an ancestor for
> shared resources provided that we can come up with a sane interface.
>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-12  7:40                         ` Michal Hocko
  2022-07-12  8:39                           ` Yafang Shao
@ 2022-07-12 18:40                           ` Alexei Starovoitov
  2022-07-18 12:27                             ` Michal Hocko
  2022-07-13  2:27                           ` Roman Gushchin
  2 siblings, 1 reply; 72+ messages in thread
From: Alexei Starovoitov @ 2022-07-12 18:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Shakeel Butt, Matthew Wilcox, Christoph Hellwig, David S. Miller,
	Daniel Borkmann, Andrii Nakryiko, Tejun Heo, Martin KaFai Lau,
	bpf, Kernel Team, linux-mm, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, Vlastimil Babka

On Tue, Jul 12, 2022 at 09:40:13AM +0200, Michal Hocko wrote:
> On Mon 11-07-22 21:39:14, Alexei Starovoitov wrote:
> > On Mon, Jul 11, 2022 at 02:15:07PM +0200, Michal Hocko wrote:
> > > On Sun 10-07-22 07:32:13, Shakeel Butt wrote:
> > > > On Sat, Jul 09, 2022 at 10:26:23PM -0700, Alexei Starovoitov wrote:
> > > > > On Fri, Jul 8, 2022 at 2:55 PM Shakeel Butt <shakeelb@google.com> wrote:
> > > > [...]
> > > > > >
> > > > > > Most probably Michal's comment was on free objects sitting in the caches
> > > > > > (also pointed out by Yosry). Should we drain them on memory pressure /
> > > > > > OOM or should we ignore them as the amount of memory is not significant?
> > > > > 
> > > > > Are you suggesting to design a shrinker for 0.01% of the memory
> > > > > consumed by bpf?
> > > > 
> > > > No, just claim that the memory sitting on such caches is insignificant.
> > > 
> > > yes, that is not really clear from the patch description. Earlier you
> > > have said that the memory consumed might go into GBs. If that is a
> > > memory that is actively used and not really reclaimable then bad luck.
> > > There are other users like that in the kernel and this is not a new
> > > problem. I think it would really help to add a counter to describe both
> > > the overall memory claimed by the bpf allocator and actively used
> > > portion of it. If you use our standard vmstat infrastructure then we can
> > > easily show that information in the OOM report.
> > 
> > OOM report can potentially be extended with info about bpf consumed
> > memory, but it's not clear whether it will help OOM analysis.
> 
> If GBs of memory can be sitting there then it is surely an interesting
> information to have when seeing OOM. One of the big shortcomings of the
> OOM analysis is unaccounted memory.
> 
> > bpftool map show
> > prints all map data already.
> > Some devs use bpf to inspect bpf maps for finer details in run-time.
> > drgn scripts pull that data from crash dumps.
> > There is no need for new counters.
> > The idea of bpf specific counters/limits was rejected by memcg folks.
> 
> I would argue that integration into vmstat is useful not only for oom
> analysis but also for regular health check scripts watching /proc/vmstat
> content. I do not think most of those generic tools are BPF aware. So
> unless there is a good reason to not account this memory there then I
> would vote for adding them. They are cheap and easy to integrate.

We've seen enough performance issues with such counters.
So, no, they are not cheap.
Remember, bpf has to be optimized for all cases.
Some of them process millions of packets per second.
Others do millions of map updates/deletes per second, which means
millions of allocs/frees.
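
For illustration only, the kind of fast-path cost being referred to
(the counter names are made up):

	static atomic64_t bpf_mem_bytes;		/* one shared cacheline */
	static DEFINE_PER_CPU(s64, bpf_mem_bytes_pcpu);	/* per-cpu, readers must sum */

	static inline void acct_alloc(size_t size)
	{
		atomic64_add(size, &bpf_mem_bytes);	/* bounces the cacheline at millions of ops/sec */
		/* vs. */
		this_cpu_add(bpf_mem_bytes_pcpu, size);	/* cheap, but the read side must sum all cpus */
	}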

> > > OK, thanks for the clarification. There is still one thing that is not
> > > really clear to me. Without a proper ownership bound to any process why
> > > is it desired/helpful to account the memory to a memcg?
> > 
> > The first step is to have a limit. memcg provides it.
> 
> I am sorry but this doesn't really explain it. Could you elaborate
> please? Is the limit supposed to protect against adversaries? Or is it
> just to prevent from accidental runaways? 

Yes to both of the above.

> Is it purely for accounting
> purposes?

Also a soft yes. Once the user is able to select a memcg it will become
a strong yes.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-12 18:11                                       ` Shakeel Butt
@ 2022-07-12 18:43                                         ` Alexei Starovoitov
  2022-07-13 13:56                                           ` Yafang Shao
  2022-07-12 19:11                                         ` Mina Almasry
  1 sibling, 1 reply; 72+ messages in thread
From: Alexei Starovoitov @ 2022-07-12 18:43 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Tejun Heo, Mina Almasry, Michal Hocko, Yosry Ahmed, Muchun Song,
	Johannes Weiner, Yafang Shao, Matthew Wilcox, Christoph Hellwig,
	David S. Miller, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, bpf, Kernel Team, linux-mm, Christoph Lameter,
	Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton,
	Vlastimil Babka

On Tue, Jul 12, 2022 at 11:11:25AM -0700, Shakeel Butt wrote:
> Ccing Mina who actually worked on upstreaming this. See [1] for
> previous discussion and more use-cases.
> 
> [1] https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/

Doesn't look like it landed upstream?

For the bpf side we're thinking of something similar.
We cannot do a memcg= mount option, of course.
Instead a memcg path or FD will be passed to the bpf side to be used later.
So the user can select a memcg instead of taking it from current.
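
A sketch of the idea only; the memcg_fd field below does not exist in the
current uapi and is purely hypothetical, while the rest are standard
BPF_MAP_CREATE attributes:

	union bpf_attr attr = {
		.map_type    = BPF_MAP_TYPE_HASH,
		.key_size    = 4,
		.value_size  = 8,
		.max_entries = 1024,
		.memcg_fd    = memcg_fd,	/* hypothetical: fd of a cgroup directory to charge */
	};
	/* syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr)); */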

Yafang,
I'm assuming you're working on something like this?

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-12 18:11                                       ` Shakeel Butt
  2022-07-12 18:43                                         ` Alexei Starovoitov
@ 2022-07-12 19:11                                         ` Mina Almasry
  1 sibling, 0 replies; 72+ messages in thread
From: Mina Almasry @ 2022-07-12 19:11 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Tejun Heo, Michal Hocko, Yosry Ahmed, Muchun Song,
	Johannes Weiner, Yafang Shao, Alexei Starovoitov, Matthew Wilcox,
	Christoph Hellwig, David S. Miller, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, bpf, Kernel Team, linux-mm,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Vlastimil Babka

On Tue, Jul 12, 2022 at 11:11 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> Ccing Mina who actually worked on upstreaming this. See [1] for
> previous discussion and more use-cases.
>
> [1] https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/
>
> On Tue, Jul 12, 2022 at 10:36 AM Tejun Heo <tj@kernel.org> wrote:
> >
> > Hello,
> >
> > On Tue, Jul 12, 2022 at 10:26:22AM -0700, Shakeel Butt wrote:
> > > One use-case we have is a build & test service which runs independent
> > > builds and tests but all the build utilities (compiler, linker,
> > > libraries) are shared between those builds and tests.
> > >
> > > In terms of topology, the service has a top level cgroup (P) and all
> > > independent builds and tests run in their own cgroup under P. These
> > > builds/tests continuously come and go.
> > >
> > > This service continuously monitors all the builds/tests running and
> > > may kill some based on some criteria which includes memory usage.
> > > However the memory usage is nondeterministic and killing a specific
> > > build/test may not really free memory if most of the memory charged to
> > > it is from shared build utilities.
> >
> > That doesn't sound too unusual. So, one saving grace here is that the memory
> > pressure in the stressed cgroup should trigger reclaim of the shared memory
> > which will be likely picked up by someone else, hopefully, under less memory
> > pressure. Can you give more concrete details? I.e. describe a failing
> > scenario with actual ballpark memory numbers?
>
> Mina, can you please provide details requested by Tejun?
>

As far as I am aware the builds/tests service Shakeel mentioned is a
theoretical use case we're considering, but the actual use cases we're
running are the 3 I listed in my cover letter in my original proposal:

https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/

Still, the use case Shakeel is talking about is almost identical to
use case #2 in that proposal:
"Our infrastructure has large meta jobs such as kubernetes which spawn
multiple subtasks which share a tmpfs mount. These jobs and its
subtasks use that tmpfs mount for various purposes such as data
sharing or persistent data between the subtask restarts. In kubernetes
terminology, the meta job is similar to pods and subtasks are
containers under pods. We want the shared memory to be
deterministically charged to the kubernetes's pod and independent to
the lifetime of containers under the pod."

To run such a job we do the following:

- We setup a hierarchy like so:
                   pod_container
                  /           |                 \
container_a    container_b     container_c

- We set up a tmpfs mount with memcg= pod_container. This instructs
the kernel to charge all of this tmpfs user data to pod_container,
instead of the memcg of the task which faults in the shared memory.

- We set up the pod_container.max to be the maximum amount of memory
allowed to the _entire_ job.

- We set up container_a.max, container_b.max, and container_c.max to
be the limit of each of sub-tasks a, b, and c respectively, not
including the shared memory, which is allocated via the tmpfs mount
and charged directly to pod_container.


For some rough numbers, you can imagine a scenario:

tmpfs memcg=pod_container,size=100MB

                      pod_container.max=130MB
          /                      |                       \
container_a.max=10MB    container_b.max=20MB    container_c.max=30MB


Thanks to memcg=pod_container, none of tasks a, b, and c are charged
for the shared memory, so they can stay within their 10MB, 20MB, and
30MB limits respectively. This gives us fine-grained control to
deterministically charge the shared memory and to apply limits on the
memory usage of the individual sub-tasks and on the overall amount of
memory the entire pod should consume.
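
A rough shell sketch of the setup described above (cgroup v2 paths; the
tmpfs memcg= option is the proposed interface from the series linked
earlier, not a merged feature; paths and numbers are illustrative):

	mkdir -p /sys/fs/cgroup/pod_container/container_{a,b,c}
	echo $((130 << 20)) > /sys/fs/cgroup/pod_container/memory.max
	echo $((10 << 20))  > /sys/fs/cgroup/pod_container/container_a/memory.max
	echo $((20 << 20))  > /sys/fs/cgroup/pod_container/container_b/memory.max
	echo $((30 << 20))  > /sys/fs/cgroup/pod_container/container_c/memory.max
	mount -t tmpfs -o size=100M,memcg=/sys/fs/cgroup/pod_container tmpfs /mnt/shared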

For transparency's sake, these are Johannes's comments on the API:
https://lore.kernel.org/linux-mm/YZvppKvUPTIytM%2Fc@cmpxchg.org/

As Tejun puts it:

"it may make sense to have a way to escape certain resources to an ancestor for
shared resources provided that we can come up with a sane interface"

The interface Johannes has opted for is to reparent memory to the
common ancestor _when it is accessed by a task in another memcg_. This
doesn't work for us for a few reasons, one of which is that, in the
example above, container_a may get charged for all 100MB of the shared
memory if it's the unlucky task that faults in all the shared memory.


> >
> > FWIW, at least from a generic resource control standpoint, I think it may
> > make sense to have a way to escape certain resources to an ancestor for
> > shared resources provided that we can come up with a sane interface.
> >
> > Thanks.
> >
> > --
> > tejun

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-12  7:40                         ` Michal Hocko
  2022-07-12  8:39                           ` Yafang Shao
  2022-07-12 18:40                           ` [PATCH bpf-next 0/5] bpf: BPF specific memory allocator Alexei Starovoitov
@ 2022-07-13  2:27                           ` Roman Gushchin
  2 siblings, 0 replies; 72+ messages in thread
From: Roman Gushchin @ 2022-07-13  2:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexei Starovoitov, Shakeel Butt, Matthew Wilcox,
	Christoph Hellwig, David S. Miller, Daniel Borkmann,
	Andrii Nakryiko, Tejun Heo, Martin KaFai Lau, bpf, Kernel Team,
	linux-mm, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Andrew Morton, Vlastimil Babka

On Tue, Jul 12, 2022 at 09:40:13AM +0200, Michal Hocko wrote:
> On Mon 11-07-22 21:39:14, Alexei Starovoitov wrote:
> > On Mon, Jul 11, 2022 at 02:15:07PM +0200, Michal Hocko wrote:
> > > On Sun 10-07-22 07:32:13, Shakeel Butt wrote:
> > > > On Sat, Jul 09, 2022 at 10:26:23PM -0700, Alexei Starovoitov wrote:
> > > > > On Fri, Jul 8, 2022 at 2:55 PM Shakeel Butt <shakeelb@google.com> wrote:
> > > > [...]
> > > > > >
> > > > > > Most probably Michal's comment was on free objects sitting in the caches
> > > > > > (also pointed out by Yosry). Should we drain them on memory pressure /
> > > > > > OOM or should we ignore them as the amount of memory is not significant?
> > > > > 
> > > > > Are you suggesting to design a shrinker for 0.01% of the memory
> > > > > consumed by bpf?
> > > > 
> > > > No, just claim that the memory sitting on such caches is insignificant.
> > > 
> > > yes, that is not really clear from the patch description. Earlier you
> > > have said that the memory consumed might go into GBs. If that is a
> > > memory that is actively used and not really reclaimable then bad luck.
> > > There are other users like that in the kernel and this is not a new
> > > problem. I think it would really help to add a counter to describe both
> > > the overall memory claimed by the bpf allocator and actively used
> > > portion of it. If you use our standard vmstat infrastructure then we can
> > > easily show that information in the OOM report.
> > 
> > OOM report can potentially be extended with info about bpf consumed
> > memory, but it's not clear whether it will help OOM analysis.
> 
> If GBs of memory can be sitting there then it is surely an interesting
> information to have when seeing OOM. One of the big shortcomings of the
> OOM analysis is unaccounted memory.
> 
> > bpftool map show
> > prints all map data already.
> > Some devs use bpf to inspect bpf maps for finer details in run-time.
> > drgn scripts pull that data from crash dumps.
> > There is no need for new counters.
> > The idea of bpf specific counters/limits was rejected by memcg folks.
> 
> I would argue that integration into vmstat is useful not only for oom
> analysis but also for regular health check scripts watching /proc/vmstat
> content. I do not think most of those generic tools are BPF aware. So
> unless there is a good reason to not account this memory there then I
> would vote for adding them. They are cheap and easy to integrate.
>  
> > > OK, thanks for the clarification. There is still one thing that is not
> > > really clear to me. Without a proper ownership bound to any process why
> > > is it desired/helpful to account the memory to a memcg?
> > 
> > The first step is to have a limit. memcg provides it.
> 
> I am sorry but this doesn't really explain it. Could you elaborate
> please? Is the limit supposed to protect against adversaries? Or is it
> just to prevent from accidental runaways? Is it purely for accounting
> purposes?
> 
> > > We have discussed something similar in a different email thread and I
> > > still didn't manage to find time to put all the parts together. But if
> > > the initiator (or however you call the process which loads the program)
> > > exits then this might be the last process in the specific cgroup and so
> > > it can be offlined and mostly invisible to an admin.
> > 
> > Roman already sent reparenting fix:
> > https://patchwork.kernel.org/project/netdevbpf/patch/20220711162827.184743-1-roman.gushchin@linux.dev/

Just to be clear:
for the actual memory backing bpf maps (slab, percpu or vmalloc allocations),
reparenting was implemented several years ago. Nothing is changing here.

This patch only adds reparenting to the map->memcg pointer (by replacing it
with an objcg), which affects *new* allocations happening after the deletion
of the cgroup. This helps reduce the number of dying cgroups, but likely not
significantly, which is why it wasn't implemented from the start.
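
A rough sketch of the indirection this describes, assuming a map-like
structure that holds an obj_cgroup instead of a mem_cgroup; the
obj_cgroup_memcg()/css_tryget() primitives are the in-tree ones and the
pattern mirrors the linked patch, but the surrounding structure and
function are illustrative only:

#include <linux/memcontrol.h>
#include <linux/rcupdate.h>

/* hypothetical stand-in: the map holds an objcg, not a memcg */
struct objcg_holder {
	struct obj_cgroup *objcg;
};

/*
 * Resolve the objcg to the memcg that new allocations should be charged
 * to. When the original cgroup is deleted, the objcg is reparented, so
 * this returns the parent instead of pinning the dying memcg forever.
 */
static struct mem_cgroup *holder_get_memcg(struct objcg_holder *h)
{
	struct mem_cgroup *memcg;

	rcu_read_lock();
	do {
		memcg = obj_cgroup_memcg(h->objcg);
	} while (!css_tryget(&memcg->css));
	rcu_read_unlock();

	return memcg;	/* caller drops the reference with css_put() */
}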

Thanks!

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-12  9:52                             ` Michal Hocko
  2022-07-12 15:25                               ` Shakeel Butt
  2022-07-12 16:24                               ` Tejun Heo
@ 2022-07-13  2:39                               ` Roman Gushchin
  2022-07-13 14:24                                 ` Yafang Shao
  2022-07-18 17:55                                 ` Yosry Ahmed
  2 siblings, 2 replies; 72+ messages in thread
From: Roman Gushchin @ 2022-07-13  2:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Yafang Shao, Alexei Starovoitov, Shakeel Butt, Matthew Wilcox,
	Christoph Hellwig, David S. Miller, Daniel Borkmann,
	Andrii Nakryiko, Tejun Heo, Martin KaFai Lau, bpf, Kernel Team,
	linux-mm, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Andrew Morton, Vlastimil Babka

On Tue, Jul 12, 2022 at 11:52:11AM +0200, Michal Hocko wrote:
> On Tue 12-07-22 16:39:48, Yafang Shao wrote:
> > On Tue, Jul 12, 2022 at 3:40 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > > Roman already sent reparenting fix:
> > > > https://patchwork.kernel.org/project/netdevbpf/patch/20220711162827.184743-1-roman.gushchin@linux.dev/
> > >
> > > Reparenting is nice but not a silver bullet. Consider a shallow
> > > hierarchy where the charging happens in the first level under the root
> > > memcg. Reparenting to the root is just pushing everything under the
> > > system resources category.
> > >
> > 
> > Agreed. That's why I don't like reparenting.
> > Reparenting just reparent the charged pages and then redirect the new
> > charge, but can't reparents the 'limit' of the original memcg.
> > So it is a risk if the original memcg is still being charged. We have
> > to forbid the destruction of the original memcg.

I agree, I also don't like reparenting for the !kmem case. For kmem (with
bpf maps *maybe* being an exception), I don't think there is a better choice.

> yes, I was toying with an idea like that. I guess we really want a
> measure to keep cgroups around if they are bound to a resource which is
> sticky itself. I am not sure how many other resources like BPF (aka
> module like) we already do charge for memcg but considering the
> potential memory consumption just reparenting will not help in general
> case I am afraid.

Well, then we have to make these objects first-class citizens in the cgroup
API, like processes, e.g. introduce cgroup.bpf.maps, cgroup.mounts.tmpfs etc.
I can easily see some value here, but it's a big API change.

With the current approach, where a bpf map pins the memory cgroup of the
creator process (which I think is completely transparent to most bpf users),
I don't think preventing the deletion of such a cgroup is possible. It would
break too many things.

But honestly I don't see why userspace can't handle it. If there is a cgroup
which contains shared bpf maps, why would userspace delete it? It's a weird
use case, and I don't think we have to optimize for it. Also, we do a ton of
optimizations for live cgroups (e.g. css refcounting being percpu) which do
not work for a deleted cgroup. So no one should really expect any particular
properties from dying cgroups.

Thanks!

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-12 18:43                                         ` Alexei Starovoitov
@ 2022-07-13 13:56                                           ` Yafang Shao
  0 siblings, 0 replies; 72+ messages in thread
From: Yafang Shao @ 2022-07-13 13:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Shakeel Butt, Tejun Heo, Mina Almasry, Michal Hocko, Yosry Ahmed,
	Muchun Song, Johannes Weiner, Matthew Wilcox, Christoph Hellwig,
	David S. Miller, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, bpf, Kernel Team, linux-mm, Christoph Lameter,
	Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton,
	Vlastimil Babka

On Wed, Jul 13, 2022 at 2:43 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Jul 12, 2022 at 11:11:25AM -0700, Shakeel Butt wrote:
> > Ccing Mina who actually worked on upstreaming this. See [1] for
> > previous discussion and more use-cases.
> >
> > [1] https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/
>
> Doesn't look like that it landed upstream?
>
> For bpf side we're thinking of something similar.
> We cannot do memcg= mount option, of course.
> Instead memcg path or FD will passed to bpf side to be used later.
> So the user can select a memcg instead of taking it from current.
>
> Yafang,
> I'm assuming you're working on something like this?

Yes, I'm working on it. I will post it once it is ready, but probably
not soon.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-13  2:39                               ` Roman Gushchin
@ 2022-07-13 14:24                                 ` Yafang Shao
  2022-07-13 16:24                                   ` Tejun Heo
  2022-07-18 17:55                                 ` Yosry Ahmed
  1 sibling, 1 reply; 72+ messages in thread
From: Yafang Shao @ 2022-07-13 14:24 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Hocko, Alexei Starovoitov, Shakeel Butt, Matthew Wilcox,
	Christoph Hellwig, David S. Miller, Daniel Borkmann,
	Andrii Nakryiko, Tejun Heo, Martin KaFai Lau, bpf, Kernel Team,
	linux-mm, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Andrew Morton, Vlastimil Babka

On Wed, Jul 13, 2022 at 10:39 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> On Tue, Jul 12, 2022 at 11:52:11AM +0200, Michal Hocko wrote:
> > On Tue 12-07-22 16:39:48, Yafang Shao wrote:
> > > On Tue, Jul 12, 2022 at 3:40 PM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> > > > > Roman already sent reparenting fix:
> > > > > https://patchwork.kernel.org/project/netdevbpf/patch/20220711162827.184743-1-roman.gushchin@linux.dev/
> > > >
> > > > Reparenting is nice but not a silver bullet. Consider a shallow
> > > > hierarchy where the charging happens in the first level under the root
> > > > memcg. Reparenting to the root is just pushing everything under the
> > > > system resources category.
> > > >
> > >
> > > Agreed. That's why I don't like reparenting.
> > > Reparenting just reparent the charged pages and then redirect the new
> > > charge, but can't reparents the 'limit' of the original memcg.
> > > So it is a risk if the original memcg is still being charged. We have
> > > to forbid the destruction of the original memcg.
>
> I agree, I also don't like reparenting for !kmem case. For kmem (and *maybe*
> bpf maps is an exception), I don't think there is a better choice.
>
> > yes, I was toying with an idea like that. I guess we really want a
> > measure to keep cgroups around if they are bound to a resource which is
> > sticky itself. I am not sure how many other resources like BPF (aka
> > module like) we already do charge for memcg but considering the
> > potential memory consumption just reparenting will not help in general
> > case I am afraid.
>
> Well, then we have to make these objects a first-class citizens in cgroup API,
> like processes. E.g. introduce cgroup.bpf.maps, cgroup.mounts.tmpfs etc.
> I easily can see some value here, but it's a big API change.
>
> With the current approach when a bpf map pins a memory cgroup of the creator
> process (which I think is completely transparent for most bpf users), I don't
> think preventing the deletion of a such cgroup is possible. It will break too
> many things.
>
> But honestly I don't see why userspace can't handle it. If there is a cgroup which
> contains shared bpf maps, why would it delete it? It's a weird use case, I don't
> think we have to optimize for it. Also, we do a ton of optimizations for live
> cgroups (e.g. css refcounting being percpu) which are not working for a deleted
> cgroup. So noone really should expect any properties from dying cgroups.
>

I think we have discussed why the user can't handle it easily.
Actually it's NOT a weird use case if you are a k8s user.  (Of course
it may seem weird to a systemd user, but unfortunately systemd
doesn't rule the whole world.)
I have said that it is not reasonable to refuse to let a containerized
process pin bpf programs, but if you are not familiar with k8s it is
not easy to explain clearly why this is a problem for deployment.
But I can try to explain it from a *systemd user's* perspective.

                      bpf-memcg                  (must be persistent)
                     /         \
        bpf-foo-memcg           bpf-bar-memcg    (must be persistent; limits set here)
  --------------------------------------------------------------------
             |                        |
        bpf-foo pod              bpf-bar pod     (created and destroyed, not limited)

I assume the above hierarchy is what you expect.
But, as you know, in the k8s environment everything is pod-based. That
means that if we used the above hierarchy, k8s's limiting, monitoring
and debugging would all have to change accordingly. That could be a
full-stack change in k8s, a major refactor.

So the hierarchy below is a reasonable solution,

      bpf-foo pod                        bpf-memcg
      /         \                            |
 (charged)   (not charged)             bpf-foo-memcg   (limited)
     |             \                     (charged)
     |              \                      /
  proc-foo           `------ bpf-foo -----'

and then keep the bpf memcgs persistent.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-13 14:24                                 ` Yafang Shao
@ 2022-07-13 16:24                                   ` Tejun Heo
  2022-07-14  6:15                                     ` Yafang Shao
  0 siblings, 1 reply; 72+ messages in thread
From: Tejun Heo @ 2022-07-13 16:24 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Roman Gushchin, Michal Hocko, Alexei Starovoitov, Shakeel Butt,
	Matthew Wilcox, Christoph Hellwig, David S. Miller,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, bpf,
	Kernel Team, linux-mm, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, Vlastimil Babka

Hello,

On Wed, Jul 13, 2022 at 10:24:05PM +0800, Yafang Shao wrote:
> I have told you that it is not reasonable to refuse a containerized
> process to pin bpf programs, but if you are not familiar with k8s, it
> is not easy to explain clearly why it is a trouble for deployment.
> But I can try to explain to you from a *systemd user's* perspective.

The way systemd currently sets up the cgroup hierarchy doesn't work for
persistent per-service resource tracking. It needs to introduce an extra
layer for that, which would be a significant change for systemd too.

> I assume the above hierarchy is what you expect.
> But you know, in the k8s environment, everything is pod-based, that
> means if we use the above hierarchy in the k8s environment, the k8s's
> limiting, monitoring, debugging must be changed consequently.  That
> means it may be a fullstack change in k8s, a great refactor.
> 
> So below hierarchy is a reasonable solution,
>                                           bpf-memcg
>                                                 |
>   bpf-foo pod                    bpf-foo-memcg     (limited)
>        /          \                                /
> (charge)     (not-charged)      (charged)
> proc-foo                     bpf-foo
> 
> And then keep the bpf-memgs persistent.

It looks like you drew the diagram with a variable-width font and it's
difficult to tell what you're trying to say. That said, I don't think the
argument you're making is a good one in general. The topic at hand is the
future architectural direction for handling shared resources, which was
never well supported before, i.e. we're not talking about breaking existing
behaviors.

We don't want to architect kernel features to suit the expectations of one
particular application. It has to be longer term than that, and it can't be
a one-way road. Sometimes the kernel adapts to existing applications
because the expectations make sense. At other times, the kernel takes a
direction which may require some work from applications to use new
capabilities, because that makes more sense in the long term.

Let's keep the discussion more focused on technical merits.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-13 16:24                                   ` Tejun Heo
@ 2022-07-14  6:15                                     ` Yafang Shao
  0 siblings, 0 replies; 72+ messages in thread
From: Yafang Shao @ 2022-07-14  6:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Roman Gushchin, Michal Hocko, Alexei Starovoitov, Shakeel Butt,
	Matthew Wilcox, Christoph Hellwig, David S. Miller,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, bpf,
	Kernel Team, linux-mm, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, Vlastimil Babka

On Thu, Jul 14, 2022 at 12:24 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Wed, Jul 13, 2022 at 10:24:05PM +0800, Yafang Shao wrote:
> > I have told you that it is not reasonable to refuse a containerized
> > process to pin bpf programs, but if you are not familiar with k8s, it
> > is not easy to explain clearly why it is a trouble for deployment.
> > But I can try to explain to you from a *systemd user's* perspective.
>
> The way systemd currently sets up cgroup hierarchy doesn't work for
> persistent per-service resource tracking. It needs to introduce an extra
> layer for that which woudl be a significant change for systemd too.
>
> > I assume the above hierarchy is what you expect.
> > But you know, in the k8s environment, everything is pod-based, that
> > means if we use the above hierarchy in the k8s environment, the k8s's
> > limiting, monitoring, debugging must be changed consequently.  That
> > means it may be a fullstack change in k8s, a great refactor.
> >
> > So below hierarchy is a reasonable solution,
> >                                           bpf-memcg
> >                                                 |
> >   bpf-foo pod                    bpf-foo-memcg     (limited)
> >        /          \                                /
> > (charge)     (not-charged)      (charged)
> > proc-foo                     bpf-foo
> >
> > And then keep the bpf-memgs persistent.
>
> It looks like you draw the diagram with variable width font and it's
> difficult to tell what you're trying to say.

Maybe the diagram below is clearer?

      bpf-foo pod                        bpf-memcg
      /         \                            |
 (charged)   (not charged)             bpf-foo-memcg   (limited)
     |             \                     (charged)
     |              \                      /
  proc-foo           `------ bpf-foo -----'

bpf-foo is loaded by proc-foo, but it is not charged to the bpf-foo
pod; instead it is remotely charged to bpf-foo-memcg.

>  That said, I don't think the
> argument you're making is a good one in general. The topic at hand is future
> architectural direction in handling shared resources, which was never well
> supported before. ie. We're not talking about breaking existing behaviors.
>
> We don't want to architect kernel features to suit the expectations of one
> particular application. It has to be longer term than that and it can't be
> an one way road. Sometimes the kernel adapts to existing applications
> because the expectations make sense. At other times, kernel takes a
> direction which may require some work from applications to use new
> capabilities because that makes more sense in the long term.
>

Shared resources and remote charging are not a new issue; see also
task->active_memcg. The case we are handling now (map->memcg or
map->objcg) is similar to task->active_memcg. If we want to make it
generic, I think we can start with task->active_memcg.
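
For reference, a minimal sketch of that existing remote-charging pattern:
set_active_memcg() and __GFP_ACCOUNT are the in-tree pieces, while the
object structure and the wrapper function are illustrative only.

#include <linux/memcontrol.h>
#include <linux/sched/mm.h>
#include <linux/slab.h>

/* hypothetical stand-in for an object that remembers "its" memcg */
struct remote_charged_obj {
	struct mem_cgroup *memcg;
};

static void *obj_alloc_charged(struct remote_charged_obj *obj, size_t size)
{
	struct mem_cgroup *old;
	void *p;

	old = set_active_memcg(obj->memcg);	/* redirect charging away from current */
	p = kmalloc(size, GFP_KERNEL | __GFP_ACCOUNT);
	set_active_memcg(old);			/* restore the previous active memcg */

	return p;
}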

To make it generic, I have some superficial thoughts on the cgroup side:
1) Can we extend the cgroup tree into a cgroup graph?
2) Can we extend cgroups from being process-based (cgroup.procs) to
resource-based (cgroup.resources)?

Regarding question 1):
Originally the charging direction is vertical, so it looks like a tree:

      parent
         ^
         |
      cgroup

But with task->active_memcg there is now also a horizontal charge:

      parent
         ^
         |
      cgroup  ---->  friend

They share a common ancestor, so the end result looks like a graph:

             ancestor
             /       \
           ...        ...
           /            \
      cgroup  ----  friend


Regarding question 2):
The lifecycle of a leaf cgroup is the same as that of the processes inside
it. But once remote charging is introduced, the lifecycle of a leaf cgroup
may depend on processes in other cgroups. So it is no longer sufficient to
treat it as purely process-based; what it really cares about is the
resources, so maybe we should extend it to be resource-based.

> Let's keep the discussion more focused on technical merits.
>
> Thanks.
>
> --
> tejun



-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-12 18:40                           ` [PATCH bpf-next 0/5] bpf: BPF specific memory allocator Alexei Starovoitov
@ 2022-07-18 12:27                             ` Michal Hocko
  0 siblings, 0 replies; 72+ messages in thread
From: Michal Hocko @ 2022-07-18 12:27 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Shakeel Butt, Matthew Wilcox, Christoph Hellwig, David S. Miller,
	Daniel Borkmann, Andrii Nakryiko, Tejun Heo, Martin KaFai Lau,
	bpf, Kernel Team, linux-mm, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, Vlastimil Babka

On Tue 12-07-22 11:40:18, Alexei Starovoitov wrote:
> On Tue, Jul 12, 2022 at 09:40:13AM +0200, Michal Hocko wrote:
> > On Mon 11-07-22 21:39:14, Alexei Starovoitov wrote:
> > > On Mon, Jul 11, 2022 at 02:15:07PM +0200, Michal Hocko wrote:
> > > > On Sun 10-07-22 07:32:13, Shakeel Butt wrote:
> > > > > On Sat, Jul 09, 2022 at 10:26:23PM -0700, Alexei Starovoitov wrote:
> > > > > > On Fri, Jul 8, 2022 at 2:55 PM Shakeel Butt <shakeelb@google.com> wrote:
> > > > > [...]
> > > > > > >
> > > > > > > Most probably Michal's comment was on free objects sitting in the caches
> > > > > > > (also pointed out by Yosry). Should we drain them on memory pressure /
> > > > > > > OOM or should we ignore them as the amount of memory is not significant?
> > > > > > 
> > > > > > Are you suggesting to design a shrinker for 0.01% of the memory
> > > > > > consumed by bpf?
> > > > > 
> > > > > No, just claim that the memory sitting on such caches is insignificant.
> > > > 
> > > > yes, that is not really clear from the patch description. Earlier you
> > > > have said that the memory consumed might go into GBs. If that is a
> > > > memory that is actively used and not really reclaimable then bad luck.
> > > > There are other users like that in the kernel and this is not a new
> > > > problem. I think it would really help to add a counter to describe both
> > > > the overall memory claimed by the bpf allocator and actively used
> > > > portion of it. If you use our standard vmstat infrastructure then we can
> > > > easily show that information in the OOM report.
> > > 
> > > OOM report can potentially be extended with info about bpf consumed
> > > memory, but it's not clear whether it will help OOM analysis.
> > 
> > If GBs of memory can be sitting there then it is surely an interesting
> > information to have when seeing OOM. One of the big shortcomings of the
> > OOM analysis is unaccounted memory.
> > 
> > > bpftool map show
> > > prints all map data already.
> > > Some devs use bpf to inspect bpf maps for finer details in run-time.
> > > drgn scripts pull that data from crash dumps.
> > > There is no need for new counters.
> > > The idea of bpf specific counters/limits was rejected by memcg folks.
> > 
> > I would argue that integration into vmstat is useful not only for oom
> > analysis but also for regular health check scripts watching /proc/vmstat
> > content. I do not think most of those generic tools are BPF aware. So
> > unless there is a good reason to not account this memory there then I
> > would vote for adding them. They are cheap and easy to integrate.
> 
> We've seen enough performance issues with such counters.

Not sure we are talking about the same thing. These counters are used by
the page allocator as well (e.g. PGALLOC, PGFREE) without a noticeable
overhead.

> So, no, they are not cheap.
> Remember bpf has to be optimized for all cases.
> Some of them process millions of packets per second.
> Others do millions of map update/delete per second which means
> millions of alloc/free.

I thought the whole point is to allocate from a different context than
the one where the memory is used.

In any case, these were my few cents to help with the "usual pains when OOM
is hit". I can see you are not much interested in discussing this in more
detail, so I won't burn much more of our time here.

Let me just reiterate that OOM reports with a large part of the
consumption outside of the usual counters are a PITA and essentially
undebuggable without a local reproducer, which can get pretty tricky with
custom eBPF programs running on an affected system you do not have
access to. So I believe that large in-kernel memory consumers should
be accounted somewhere, so that it is at least clear where one should look.
If the numbers show that the accounting is prohibitively expensive, then
it would be great to have at least an estimate.
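
To make the vmstat suggestion concrete, a minimal sketch of what such
accounting could look like. NR_BPF_ALLOC_PAGES is a made-up node_stat_item
(adding it to enum node_stat_item is not shown), and whether per-page
granularity is appropriate for this allocator is exactly the open question:

#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/vmstat.h>

/*
 * Called wherever the bpf allocator acquires or releases pages
 * (nr_pages negative on release). The counter would then show up in
 * /proc/vmstat and could be printed in the OOM report.
 */
static void bpf_mem_account_pages(struct page *page, long nr_pages)
{
	mod_node_page_state(page_pgdat(page), NR_BPF_ALLOC_PAGES, nr_pages);
}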

Thanks

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-12 16:24                               ` Tejun Heo
@ 2022-07-18 14:13                                 ` Michal Hocko
  0 siblings, 0 replies; 72+ messages in thread
From: Michal Hocko @ 2022-07-18 14:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Yafang Shao, Alexei Starovoitov, Shakeel Butt, Matthew Wilcox,
	Christoph Hellwig, David S. Miller, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, bpf, Kernel Team, linux-mm,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Vlastimil Babka

On Tue 12-07-22 06:24:20, Tejun Heo wrote:
> Hello, Michal.
> 
> On Tue, Jul 12, 2022 at 11:52:11AM +0200, Michal Hocko wrote:
> > > Agreed. That's why I don't like reparenting.
> > > Reparenting just reparent the charged pages and then redirect the new
> > > charge, but can't reparents the 'limit' of the original memcg.
> > > So it is a risk if the original memcg is still being charged. We have
> > > to forbid the destruction of the original memcg.
> > 
> > yes, I was toying with an idea like that. I guess we really want a
> > measure to keep cgroups around if they are bound to a resource which is
> > sticky itself. I am not sure how many other resources like BPF (aka
> > module like) we already do charge for memcg but considering the
> > potential memory consumption just reparenting will not help in general
> > case I am afraid.
> 
> I think the solution here is an extra cgroup layering to represent
> persistent resource tracking. In systemd-speak, a service should have a
> cgroup representing a persistent service type and a cgroup representing the
> current running instance. This way, the user (or system agent) can clearly
> distinguish all resources that have ever been attributed to the service and
> the resources that are accounted to the current instance while also giving
> visibility into residual resources for services that are no longer running.

Just to make sure I follow what you mean here: you are suggesting that
the cgroup manager would ensure that there is a parent responsible
for the resource, and that cgroup would not go away as long as the
resources exist (well, modulo admin interference, which is not a really
interesting use case).

I do agree that this approach would be easier from the kernel
perspective. It is also more error-prone, because some resources might be
so runtime-specific that it is hard to predict and configure them in
advance. My thinking about "sticky" cgroups was based on a
reference-counting approach where the kernel knows that the resource
requires some sort of teardown which is not scoped to a process's
lifetime, and would take a reference on the cgroup to keep it alive. I can
see a concern that this can get quite subtle in many cases though.

> This gives userspace control over what to track for how long and also fits
> what the kernel can do in terms of resource tracking. If we try to do
> something smart from kernel side, there are cases which are inherently
> insolvable. e.g. if a service instance creates tmpfs / shmem / whawtever and
> leaves it pinned one way or another and then exits, and there's no one who
> actively accessed it afterwards, there is no userland visible entity we can
> reasonably attribute that memory to other than the parent cgroup.

yeah, tmpfs would be another example which is even more complex because
a single file can "belong" to different memcgs.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-13  2:39                               ` Roman Gushchin
  2022-07-13 14:24                                 ` Yafang Shao
@ 2022-07-18 17:55                                 ` Yosry Ahmed
  2022-07-19 11:30                                   ` cgroup specific sticky resources (was: Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.) Michal Hocko
  1 sibling, 1 reply; 72+ messages in thread
From: Yosry Ahmed @ 2022-07-18 17:55 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Hocko, Yafang Shao, Alexei Starovoitov, Shakeel Butt,
	Matthew Wilcox, Christoph Hellwig, David S. Miller,
	Daniel Borkmann, Andrii Nakryiko, Tejun Heo, Martin KaFai Lau,
	bpf, Kernel Team, linux-mm, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, Vlastimil Babka

On Tue, Jul 12, 2022 at 7:39 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Tue, Jul 12, 2022 at 11:52:11AM +0200, Michal Hocko wrote:
> > On Tue 12-07-22 16:39:48, Yafang Shao wrote:
> > > On Tue, Jul 12, 2022 at 3:40 PM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> > > > > Roman already sent reparenting fix:
> > > > > https://patchwork.kernel.org/project/netdevbpf/patch/20220711162827.184743-1-roman.gushchin@linux.dev/
> > > >
> > > > Reparenting is nice but not a silver bullet. Consider a shallow
> > > > hierarchy where the charging happens in the first level under the root
> > > > memcg. Reparenting to the root is just pushing everything under the
> > > > system resources category.
> > > >
> > >
> > > Agreed. That's why I don't like reparenting.
> > > Reparenting just reparent the charged pages and then redirect the new
> > > charge, but can't reparents the 'limit' of the original memcg.
> > > So it is a risk if the original memcg is still being charged. We have
> > > to forbid the destruction of the original memcg.
>
> I agree, I also don't like reparenting for !kmem case. For kmem (and *maybe*
> bpf maps is an exception), I don't think there is a better choice.
>
> > yes, I was toying with an idea like that. I guess we really want a
> > measure to keep cgroups around if they are bound to a resource which is
> > sticky itself. I am not sure how many other resources like BPF (aka
> > module like) we already do charge for memcg but considering the
> > potential memory consumption just reparenting will not help in general
> > case I am afraid.
>
> Well, then we have to make these objects a first-class citizens in cgroup API,
> like processes. E.g. introduce cgroup.bpf.maps, cgroup.mounts.tmpfs etc.
> I easily can see some value here, but it's a big API change.
>
> With the current approach when a bpf map pins a memory cgroup of the creator
> process (which I think is completely transparent for most bpf users), I don't
> think preventing the deletion of a such cgroup is possible. It will break too
> many things.
>
> But honestly I don't see why userspace can't handle it. If there is a cgroup which
> contains shared bpf maps, why would it delete it? It's a weird use case, I don't
> think we have to optimize for it. Also, we do a ton of optimizations for live
> cgroups (e.g. css refcounting being percpu) which are not working for a deleted
> cgroup. So noone really should expect any properties from dying cgroups.
>

Just a random thought here, and I can easily be wrong (and this can
easily be the wrong thread for this), but what if we introduced a more
generic concept to explicitly tie a resource to a cgroup
(tmpfs, bpf maps, etc.) using cgroupfs interfaces, and then prevented the
cgroup from being deleted unless the resource is freed or moved to a
different cgroup?

This would be optional, so the current status quo remains available, but
it also gives admins the flexibility to assign resources to cgroups to
make sure nothing is unaccounted, accounted to a zombie memcg, or
reparented to an unrelated parent. This might be too fine-grained to
be practical, but I just thought it might be useful. We would also need
to define an OOM behavior for such resources. Things like bpf maps
will be unreclaimable, but tmpfs memory can be swapped out.

I think this also partially addresses Johannes's concerns that the
memcg= mount option uses file system mounts to create shareable
resource domains outside of the cgroup hierarchy.

> Thanks!
>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* cgroup specific sticky resources (was: Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.)
  2022-07-18 17:55                                 ` Yosry Ahmed
@ 2022-07-19 11:30                                   ` Michal Hocko
  2022-07-19 18:00                                     ` Yosry Ahmed
  0 siblings, 1 reply; 72+ messages in thread
From: Michal Hocko @ 2022-07-19 11:30 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Roman Gushchin, Yafang Shao, Alexei Starovoitov, Shakeel Butt,
	Matthew Wilcox, Christoph Hellwig, David S. Miller,
	Daniel Borkmann, Andrii Nakryiko, Tejun Heo, Martin KaFai Lau,
	bpf, Kernel Team, linux-mm, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, Vlastimil Babka

On Mon 18-07-22 10:55:59, Yosry Ahmed wrote:
> On Tue, Jul 12, 2022 at 7:39 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > On Tue, Jul 12, 2022 at 11:52:11AM +0200, Michal Hocko wrote:
> > > On Tue 12-07-22 16:39:48, Yafang Shao wrote:
> > > > On Tue, Jul 12, 2022 at 3:40 PM Michal Hocko <mhocko@suse.com> wrote:
> > > [...]
> > > > > > Roman already sent reparenting fix:
> > > > > > https://patchwork.kernel.org/project/netdevbpf/patch/20220711162827.184743-1-roman.gushchin@linux.dev/
> > > > >
> > > > > Reparenting is nice but not a silver bullet. Consider a shallow
> > > > > hierarchy where the charging happens in the first level under the root
> > > > > memcg. Reparenting to the root is just pushing everything under the
> > > > > system resources category.
> > > > >
> > > >
> > > > Agreed. That's why I don't like reparenting.
> > > > Reparenting just reparent the charged pages and then redirect the new
> > > > charge, but can't reparents the 'limit' of the original memcg.
> > > > So it is a risk if the original memcg is still being charged. We have
> > > > to forbid the destruction of the original memcg.
> >
> > I agree, I also don't like reparenting for !kmem case. For kmem (and *maybe*
> > bpf maps is an exception), I don't think there is a better choice.
> >
> > > yes, I was toying with an idea like that. I guess we really want a
> > > measure to keep cgroups around if they are bound to a resource which is
> > > sticky itself. I am not sure how many other resources like BPF (aka
> > > module like) we already do charge for memcg but considering the
> > > potential memory consumption just reparenting will not help in general
> > > case I am afraid.
> >
> > Well, then we have to make these objects a first-class citizens in cgroup API,
> > like processes. E.g. introduce cgroup.bpf.maps, cgroup.mounts.tmpfs etc.
> > I easily can see some value here, but it's a big API change.
> >
> > With the current approach when a bpf map pins a memory cgroup of the creator
> > process (which I think is completely transparent for most bpf users), I don't
> > think preventing the deletion of a such cgroup is possible. It will break too
> > many things.
> >
> > But honestly I don't see why userspace can't handle it. If there is a cgroup which
> > contains shared bpf maps, why would it delete it? It's a weird use case, I don't
> > think we have to optimize for it. Also, we do a ton of optimizations for live
> > cgroups (e.g. css refcounting being percpu) which are not working for a deleted
> > cgroup. So noone really should expect any properties from dying cgroups.
> >
> 
> Just a random thought here, and I can easily be wrong (and this can
> easily be the wrong thread for this), but if we introduce a more
> generic concept to generally tie a resource explicitly to a cgroup
> (tmpfs, bpf maps, etc) using cgroupfs interfaces, and then prevent the
> cgroup from being deleted unless the resource is freed or moved to a
> different cgroup?

My understanding is that Tejun would prefer a user-space-defined policy
via proper layering. And I would tend to agree that this is less prone
to corner cases.

Anyway, how would you envision such an interface?

> This would be optional, so the current status quo is maintainable, but
> also gives flexibility to admins to assign resources to cgroups to
> make sure nothing is ( unaccounted / accounted to a zombie memcg /
> reparented to an unrelated parent ). This might be too fine-grained to
> be practical but I just thought it might be useful. We will also need
> to define an OOM behavior for such resources. Things like bpf maps
> will be unreclaimable, but tmpfs memory can be swapped out.

Keep in mind that swap is a shared resource in itself, so tmpfs is
essentially a sticky resource as well. A tmpfs file is not bound to any
process's lifetime, the same way a BPF program isn't. You might need fewer
privileges to remove a file, but in principle both consume
resources without any explicit owner.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
  2022-07-06 17:43                 ` Alexei Starovoitov
@ 2022-07-19 11:52                   ` Vlastimil Babka
  0 siblings, 0 replies; 72+ messages in thread
From: Vlastimil Babka @ 2022-07-19 11:52 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Christoph Lameter, Christoph Hellwig, David Miller,
	Daniel Borkmann, Andrii Nakryiko, Tejun Heo, Martin KaFai Lau,
	bpf, Kernel Team, linux-mm, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Andrew Morton, Matthew Wilcox, Liam R. Howlett,
	Michal Hocko, Shakeel Butt, Yafang Shao

On 7/6/22 19:43, Alexei Starovoitov wrote:
> On Mon, Jul 04, 2022 at 06:13:17PM +0200, Vlastimil Babka wrote:
>> 
>> > On RT fast path == slow path with a lock.
>> > On !RT fast path is lock less.
>> > That's all correct.
>> > bpf side has to make sure safety in all possible paths
>> > therefore RT or !RT makes no difference.
>> 
>> So AFAIK we don't right now have what BFP needs - an extra-constrained kind
>> of GFP_ATOMIC. I don't object you adding it privately. But it's another
>> reason to think about if these things can be generalized. For example we had
>> a discussion about the Maple tree having kinda similar kinds of requirements
>> to avoid its tree node preallocations always for the worst possible case.
> 
> What kind of maple tree needs? Does it need to be fully reentrant and nmi safe?
> Not really. The caller knows the context and can choose appropriate flags.
> While bpf alloc doesn't know the context. The bpf prog can be called from
> places where slab/page/kasan specific locks are held which makes all these
> pieces non-reentrant.

Sure, the context restrictions can differ between bpf, maple tree and other
users, but I think there's a common need not to depend on slab/page
allocator implementation internals and their locking. So the common
allocator/cache on top would need to be implemented in a way that supports
the most restricted context (e.g. bpf), thus be lockless and whatnot.
But then the individual users would be able to specify different details such as
- how much to preallocate in order to not run out of the cache
- what is allowed if we run out of the cache - only async refill (bpf?) or also
e.g. GFP_NOWAIT for less restricted users?
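
A hypothetical shape such a common cache could take; all names are made
up, and the point is only that the cache core stays lockless with
asynchronous refill while the watermarks and fallback policy become
per-user parameters:

#include <linux/llist.h>
#include <linux/irq_work.h>
#include <linux/types.h>

/* per-user tuning of a hypothetical restricted-context cache */
struct rcache_params {
	unsigned int	obj_size;
	unsigned int	low_watermark;	/* schedule async refill below this */
	unsigned int	high_watermark;	/* trim excess objects above this */
	bool		allow_nowait;	/* may a miss fall back to a synchronous
					 * GFP_NOWAIT allocation, or async refill only? */
};

/*
 * Per-cpu part: alloc/free only touch the local lockless list, so they
 * stay usable in the most restricted contexts; refill and trim run later
 * from irq_work.
 */
struct rcache_pcpu {
	struct llist_head	free_objs;
	unsigned int		nr_free;
	struct irq_work		refill_work;
};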

> The full prealloc of bpf maps (read: waste a lot of memory) was our solution until now.
> This is specific to tracing bpf programs, of course.
> bpf networking, bpf security, sleepable bpf are completely different.
> 
>> I'm not sure we can sanely implement this within each of SLAB/SLUB/SLOB, or
>> rather provide a generic cache on top...
> 
> Notice that all of bpf cache functions are notrace/nokprobe/no locks.
> The main difference vs all other allocators is bpf_mem_alloc from cache
> and refill of the cache are two asynchronous operations. It allows the former
> to be reentrant and nmi safe.
> All in-tree allocators sooner or later synchronously call into page_alloc,
> kasan, memleak and other debugging facilities that grab locks.
> 


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: cgroup specific sticky resources (was: Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.)
  2022-07-19 11:30                                   ` cgroup specific sticky resources (was: Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.) Michal Hocko
@ 2022-07-19 18:00                                     ` Yosry Ahmed
  2022-07-19 18:01                                       ` Yosry Ahmed
  2022-07-19 18:46                                       ` Mina Almasry
  0 siblings, 2 replies; 72+ messages in thread
From: Yosry Ahmed @ 2022-07-19 18:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Yafang Shao, Alexei Starovoitov, Shakeel Butt,
	Matthew Wilcox, Christoph Hellwig, David S. Miller,
	Daniel Borkmann, Andrii Nakryiko, Tejun Heo, Martin KaFai Lau,
	bpf, Kernel Team, linux-mm, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, Vlastimil Babka

On Tue, Jul 19, 2022 at 4:30 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 18-07-22 10:55:59, Yosry Ahmed wrote:
> > On Tue, Jul 12, 2022 at 7:39 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > >
> > > On Tue, Jul 12, 2022 at 11:52:11AM +0200, Michal Hocko wrote:
> > > > On Tue 12-07-22 16:39:48, Yafang Shao wrote:
> > > > > On Tue, Jul 12, 2022 at 3:40 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > [...]
> > > > > > > Roman already sent reparenting fix:
> > > > > > > https://patchwork.kernel.org/project/netdevbpf/patch/20220711162827.184743-1-roman.gushchin@linux.dev/
> > > > > >
> > > > > > Reparenting is nice but not a silver bullet. Consider a shallow
> > > > > > hierarchy where the charging happens in the first level under the root
> > > > > > memcg. Reparenting to the root is just pushing everything under the
> > > > > > system resources category.
> > > > > >
> > > > >
> > > > > Agreed. That's why I don't like reparenting.
> > > > > Reparenting just reparent the charged pages and then redirect the new
> > > > > charge, but can't reparents the 'limit' of the original memcg.
> > > > > So it is a risk if the original memcg is still being charged. We have
> > > > > to forbid the destruction of the original memcg.
> > >
> > > I agree, I also don't like reparenting for !kmem case. For kmem (and *maybe*
> > > bpf maps is an exception), I don't think there is a better choice.
> > >
> > > > yes, I was toying with an idea like that. I guess we really want a
> > > > measure to keep cgroups around if they are bound to a resource which is
> > > > sticky itself. I am not sure how many other resources like BPF (aka
> > > > module like) we already do charge for memcg but considering the
> > > > potential memory consumption just reparenting will not help in general
> > > > case I am afraid.
> > >
> > > Well, then we have to make these objects a first-class citizens in cgroup API,
> > > like processes. E.g. introduce cgroup.bpf.maps, cgroup.mounts.tmpfs etc.
> > > I easily can see some value here, but it's a big API change.
> > >
> > > With the current approach when a bpf map pins a memory cgroup of the creator
> > > process (which I think is completely transparent for most bpf users), I don't
> > > think preventing the deletion of a such cgroup is possible. It will break too
> > > many things.
> > >
> > > But honestly I don't see why userspace can't handle it. If there is a cgroup which
> > > contains shared bpf maps, why would it delete it? It's a weird use case, I don't
> > > think we have to optimize for it. Also, we do a ton of optimizations for live
> > > cgroups (e.g. css refcounting being percpu) which are not working for a deleted
> > > cgroup. So noone really should expect any properties from dying cgroups.
> > >
> >
> > Just a random thought here, and I can easily be wrong (and this can
> > easily be the wrong thread for this), but if we introduce a more
> > generic concept to generally tie a resource explicitly to a cgroup
> > (tmpfs, bpf maps, etc) using cgroupfs interfaces, and then prevent the
> > cgroup from being deleted unless the resource is freed or moved to a
> > different cgroup?
>
> My understanding is that Tejun would prefer a user space defined policy
> by a proper layering. And I would tend to agree that this is less prone
> to corner cases.
>
> Anyway, how would you envision such an interface?

I imagine something like cgroup.sticky.[bpf/tmpfs/..] (I suck at naming things).

The file can contain a list of bpf map ids or tmpfs inode ids or mount
paths. You can write a new id/path to the file to add one, or read the
file to see a list of them. Basically very similar semantics to
cgroup.procs. Instead of processes, we have different types of sticky
resources here that usually outlive processes. The charging of such
resources would be deterministically attributed to this cgroup
(instead of the cgroup of the process that created the map or touched
the tmpfs file first).

Since the system admin has explicitly attributed these resources to
this cgroup, it makes sense that the cgroup cannot be deleted as long
as these resources exist and are charged to it, similar to
cgroup.procs. This also addresses some of the zombie cgroups problems.

The obvious question is: to maintain the current behavior, bpf maps
and tmpfs mounts would by default still be charged the same way as
today: bpf maps to the cgroup of the process that created them, and
tmpfs pages on a first-touch basis. How do we then move their charging
to a cgroup.sticky file after they have been created the current way?

For things like bpf maps, it should be relatively simple to move
charges the same way we do when we move processes, because the bpf map
is by default charged to one memcg. For tmpfs, it would be challenging
to move the charges because pages could already be individually
attributed to different cgroups. For this, we can impose an (imo
reasonable) restriction for tmpfs (and similar resources that are
currently not attributed to a single cgroup): a tmpfs mount can only be
added to cgroup.sticky.tmpfs for the first time if no pages have been
charged yet (no files created). Once the tmpfs mount lives in any
cgroup.sticky.tmpfs file, we know that it is charged to one cgroup,
and we can move the charges more easily.

The system admin would basically mount the tmpfs directory and then
directly write it to a cgroup.sticky.tmpfs file. From that point
onwards, all tmpfs pages will be charged to that cgroup, and they need
to be freed or moved before the cgroup can be removed.

I could be missing something here, but I imagine such interface(s)
would address both the bpf prog/map charging concerns and our
previous memcg= mount attempts. Please correct me if I am wrong.
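
Purely as an illustration of this proposed (non-existent) interface,
usage could look something like the following; the file names, the
id/path format, and the cgroup path are all made up:

#include <stdio.h>

static void write_file(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (f) {
		fputs(val, f);
		fclose(f);
	}
}

int main(void)
{
	/* attribute bpf map id 42 to this cgroup, like writing a pid to cgroup.procs */
	write_file("/sys/fs/cgroup/pod/cgroup.sticky.bpf", "42");

	/* attribute a freshly mounted, still-empty tmpfs to the same cgroup */
	write_file("/sys/fs/cgroup/pod/cgroup.sticky.tmpfs", "/mnt/pod_shared");

	/* from now on the cgroup cannot be removed until these are freed or moved */
	return 0;
}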

>
> > This would be optional, so the current status quo is maintainable, but
> > also gives flexibility to admins to assign resources to cgroups to
> > make sure nothing is ( unaccounted / accounted to a zombie memcg /
> > reparented to an unrelated parent ). This might be too fine-grained to
> > be practical but I just thought it might be useful. We will also need
> > to define an OOM behavior for such resources. Things like bpf maps
> > will be unreclaimable, but tmpfs memory can be swapped out.
>
> Keep in mind that the swap is a shared resource in itself. So tmpfs is
> essentially a sticky resource as well. A tmpfs file is not bound to any
> proces life time the same way BPF program is. You might need less
> priviledges to remove a file but in principle they are consuming
> resources without any explicit owner.

I fully agree, which is exactly why I am suggesting having a defined
way to handle such "sticky" resources that outlive processes.
Currently cgroups operate in terms of processes and this can be a
limitation for such resources.

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: cgroup specific sticky resources (was: Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.)
  2022-07-19 18:00                                     ` Yosry Ahmed
@ 2022-07-19 18:01                                       ` Yosry Ahmed
  2022-07-19 18:46                                       ` Mina Almasry
  1 sibling, 0 replies; 72+ messages in thread
From: Yosry Ahmed @ 2022-07-19 18:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Yafang Shao, Alexei Starovoitov, Shakeel Butt,
	Matthew Wilcox, Christoph Hellwig, David S. Miller,
	Daniel Borkmann, Andrii Nakryiko, Tejun Heo, Martin KaFai Lau,
	bpf, Kernel Team, linux-mm, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, Vlastimil Babka,
	Mina Almasry

+Mina Almasry

On Tue, Jul 19, 2022 at 11:00 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Tue, Jul 19, 2022 at 4:30 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 18-07-22 10:55:59, Yosry Ahmed wrote:
> > > On Tue, Jul 12, 2022 at 7:39 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > >
> > > > On Tue, Jul 12, 2022 at 11:52:11AM +0200, Michal Hocko wrote:
> > > > > On Tue 12-07-22 16:39:48, Yafang Shao wrote:
> > > > > > On Tue, Jul 12, 2022 at 3:40 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > [...]
> > > > > > > > Roman already sent reparenting fix:
> > > > > > > > https://patchwork.kernel.org/project/netdevbpf/patch/20220711162827.184743-1-roman.gushchin@linux.dev/
> > > > > > >
> > > > > > > Reparenting is nice but not a silver bullet. Consider a shallow
> > > > > > > hierarchy where the charging happens in the first level under the root
> > > > > > > memcg. Reparenting to the root is just pushing everything under the
> > > > > > > system resources category.
> > > > > > >
> > > > > >
> > > > > > Agreed. That's why I don't like reparenting.
> > > > > > Reparenting just reparent the charged pages and then redirect the new
> > > > > > charge, but can't reparents the 'limit' of the original memcg.
> > > > > > So it is a risk if the original memcg is still being charged. We have
> > > > > > to forbid the destruction of the original memcg.
> > > >
> > > > I agree, I also don't like reparenting for !kmem case. For kmem (and *maybe*
> > > > bpf maps is an exception), I don't think there is a better choice.
> > > >
> > > > > yes, I was toying with an idea like that. I guess we really want a
> > > > > measure to keep cgroups around if they are bound to a resource which is
> > > > > sticky itself. I am not sure how many other resources like BPF (aka
> > > > > module like) we already do charge for memcg but considering the
> > > > > potential memory consumption just reparenting will not help in general
> > > > > case I am afraid.
> > > >
> > > > Well, then we have to make these objects a first-class citizens in cgroup API,
> > > > like processes. E.g. introduce cgroup.bpf.maps, cgroup.mounts.tmpfs etc.
> > > > I easily can see some value here, but it's a big API change.
> > > >
> > > > With the current approach when a bpf map pins a memory cgroup of the creator
> > > > process (which I think is completely transparent for most bpf users), I don't
> > > > think preventing the deletion of a such cgroup is possible. It will break too
> > > > many things.
> > > >
> > > > But honestly I don't see why userspace can't handle it. If there is a cgroup which
> > > > contains shared bpf maps, why would it delete it? It's a weird use case, I don't
> > > > think we have to optimize for it. Also, we do a ton of optimizations for live
> > > > cgroups (e.g. css refcounting being percpu) which are not working for a deleted
> > > > cgroup. So noone really should expect any properties from dying cgroups.
> > > >
> > >
> > > Just a random thought here, and I can easily be wrong (and this can
> > > easily be the wrong thread for this), but if we introduce a more
> > > generic concept to generally tie a resource explicitly to a cgroup
> > > (tmpfs, bpf maps, etc) using cgroupfs interfaces, and then prevent the
> > > cgroup from being deleted unless the resource is freed or moved to a
> > > different cgroup?
> >
> > My understanding is that Tejun would prefer a user space defined policy
> > by a proper layering. And I would tend to agree that this is less prone
> > to corner cases.
> >
> > Anyway, how would you envision such an interface?
>
> I imagine something like cgroup.sticky.[bpf/tmpfs/..] (I suck at naming things).
>
> The file can contain a list of bpf map ids or tmpfs inode ids or mount
> paths. You can write a new id/path to the file to add one, or read the
> file to see a list of them. Basically very similar semantics to
> cgroup.procs. Instead of processes, we have different types of sticky
> resources here that usually outlive processes. The charging of such
> resources would be deterministically attributed to this cgroup
> (instead of the cgroup of the process that created the map or touched
> the tmpfs file first).
>
> Since the system admin has explicitly attributed these resources to
> this cgroup, it makes sense that the cgroup cannot be deleted as long
> as these resources exist and are charged to it, similar to
> cgroup.procs. This also addresses some of the zombie cgroups problems.
>
> The obvious question is: to maintain the current behavior, bpf maps
> and tmpfs mounts will be by default initially charged the same way as
> today. bpf maps for the cgroup of the process that created them, and
> tmpfs pages on first touch basis. How do we move their charging after
> their creation in the current way to a cgroup.sticky file?
>
> For things like bpf maps, it should be relatively simple to move
> charges the same way we do when we move processes, because the bpf map
> is by default charged to one memcg. For tmpfs, it would be challenging
> to move the charges because pages could be individually attributed to
> different cgroups already. For this, we can impose a (imo reasonable)
> restriction that for tmpfs (and similar resources that are currently
> not attributed to a single cgroup), that you can only add a tmpfs
> mount to cgroup.sticky.tmpfs for the first time if no pages have been
> charged yet (no files created). Once the tmpfs mount lives in any
> cgroup.sticky.tmpfs file, we know that it is charged to one cgroup,
> and we can move the charges more easily.
>
> The system admin would basically mount the tmpfs directory and then
> directly write it to a cgroup.sticky.tmpfs file. For that point
> onwards, all tmpfs pages will be charged to that cgroup, and they need
> to be freed or moved before the cgroup can be removed.
>
> I could be missing something here, but I imagine such an interface(s)
> would address both the bpf progs/maps charging concerns, and our
> previous memcg= mount attempts. Please correct me if I am wrong.
>
> >
> > > This would be optional, so the current status quo is maintainable, but
> > > also gives flexibility to admins to assign resources to cgroups to
> > > make sure nothing is ( unaccounted / accounted to a zombie memcg /
> > > reparented to an unrelated parent ). This might be too fine-grained to
> > > be practical but I just thought it might be useful. We will also need
> > > to define an OOM behavior for such resources. Things like bpf maps
> > > will be unreclaimable, but tmpfs memory can be swapped out.
> >
> > Keep in mind that the swap is a shared resource in itself. So tmpfs is
> > essentially a sticky resource as well. A tmpfs file is not bound to any
> > process lifetime the same way a BPF program is. You might need fewer
> > privileges to remove a file but in principle they are consuming
> > resources without any explicit owner.
>
> I fully agree, which is exactly why I am suggesting having a defined
> way to handle such "sticky" resources that outlive processes.
> Currently cgroups operate in terms of processes and this can be a
> limitation for such resources.
>
> > --
> > Michal Hocko
> > SUSE Labs

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: cgroup specific sticky resources (was: Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.)
  2022-07-19 18:00                                     ` Yosry Ahmed
  2022-07-19 18:01                                       ` Yosry Ahmed
@ 2022-07-19 18:46                                       ` Mina Almasry
  2022-07-19 19:16                                         ` Tejun Heo
  2022-07-20 12:26                                         ` Michal Hocko
  1 sibling, 2 replies; 72+ messages in thread
From: Mina Almasry @ 2022-07-19 18:46 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Michal Hocko, Roman Gushchin, Yafang Shao, Alexei Starovoitov,
	Shakeel Butt, Matthew Wilcox, Christoph Hellwig, David S. Miller,
	Daniel Borkmann, Andrii Nakryiko, Tejun Heo, Martin KaFai Lau,
	bpf, Kernel Team, linux-mm, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, Vlastimil Babka

On Tue, Jul 19, 2022 at 11:01 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Tue, Jul 19, 2022 at 4:30 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 18-07-22 10:55:59, Yosry Ahmed wrote:
> > > On Tue, Jul 12, 2022 at 7:39 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > >
> > > > On Tue, Jul 12, 2022 at 11:52:11AM +0200, Michal Hocko wrote:
> > > > > On Tue 12-07-22 16:39:48, Yafang Shao wrote:
> > > > > > On Tue, Jul 12, 2022 at 3:40 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > [...]
> > > > > > > > Roman already sent reparenting fix:
> > > > > > > > https://patchwork.kernel.org/project/netdevbpf/patch/20220711162827.184743-1-roman.gushchin@linux.dev/
> > > > > > >
> > > > > > > Reparenting is nice but not a silver bullet. Consider a shallow
> > > > > > > hierarchy where the charging happens in the first level under the root
> > > > > > > memcg. Reparenting to the root is just pushing everything under the
> > > > > > > system resources category.
> > > > > > >
> > > > > >
> > > > > > Agreed. That's why I don't like reparenting.
> > > > > > Reparenting just reparent the charged pages and then redirect the new
> > > > > > charge, but can't reparents the 'limit' of the original memcg.
> > > > > > So it is a risk if the original memcg is still being charged. We have
> > > > > > to forbid the destruction of the original memcg.
> > > >
> > > > I agree, I also don't like reparenting for !kmem case. For kmem (and *maybe*
> > > > bpf maps is an exception), I don't think there is a better choice.
> > > >
> > > > > yes, I was toying with an idea like that. I guess we really want a
> > > > > measure to keep cgroups around if they are bound to a resource which is
> > > > > sticky itself. I am not sure how many other resources like BPF (aka
> > > > > module like) we already do charge for memcg but considering the
> > > > > potential memory consumption just reparenting will not help in general
> > > > > case I am afraid.
> > > >
> > > > Well, then we have to make these objects a first-class citizens in cgroup API,
> > > > like processes. E.g. introduce cgroup.bpf.maps, cgroup.mounts.tmpfs etc.
> > > > I easily can see some value here, but it's a big API change.
> > > >
> > > > With the current approach when a bpf map pins a memory cgroup of the creator
> > > > process (which I think is completely transparent for most bpf users), I don't
> > > > think preventing the deletion of a such cgroup is possible. It will break too
> > > > many things.
> > > >
> > > > But honestly I don't see why userspace can't handle it. If there is a cgroup which
> > > > contains shared bpf maps, why would it delete it? It's a weird use case, I don't
> > > > think we have to optimize for it. Also, we do a ton of optimizations for live
> > > > cgroups (e.g. css refcounting being percpu) which are not working for a deleted
> > > > cgroup. So no one really should expect any properties from dying cgroups.
> > > >
> > >
> > > Just a random thought here, and I can easily be wrong (and this can
> > > easily be the wrong thread for this), but what if we introduced a more
> > > generic concept to explicitly tie a resource to a cgroup
> > > (tmpfs, bpf maps, etc) using cgroupfs interfaces, and then prevented the
> > > cgroup from being deleted unless the resource is freed or moved to a
> > > different cgroup?
> >
> > My understanding is that Tejun would prefer a user space defined policy
> > by a proper layering. And I would tend to agree that this is less prone
> > to corner cases.
> >
> > Anyway, how would you envision such an interface?
>
> I imagine something like cgroup.sticky.[bpf/tmpfs/..] (I suck at naming things).
>
> The file can contain a list of bpf map ids or tmpfs inode ids or mount
> paths. You can write a new id/path to the file to add one, or read the
> file to see a list of them. Basically very similar semantics to
> cgroup.procs. Instead of processes, we have different types of sticky
> resources here that usually outlive processes. The charging of such
> resources would be deterministically attributed to this cgroup
> (instead of the cgroup of the process that created the map or touched
> the tmpfs file first).
>
> Since the system admin has explicitly attributed these resources to
> this cgroup, it makes sense that the cgroup cannot be deleted as long
> as these resources exist and are charged to it, similar to
> cgroup.procs. This also addresses some of the zombie cgroups problems.
>
> The obvious question is: to maintain the current behavior, bpf maps
> and tmpfs mounts will be by default initially charged the same way as
> today. bpf maps for the cgroup of the process that created them, and
> tmpfs pages on first touch basis. How do we move their charging after
> their creation in the current way to a cgroup.sticky file?
>
> For things like bpf maps, it should be relatively simple to move
> charges the same way we do when we move processes, because the bpf map
> is by default charged to one memcg. For tmpfs, it would be challenging
> to move the charges because pages could be individually attributed to
> different cgroups already. For this, we can impose an (imo reasonable)
> restriction for tmpfs (and similar resources that are currently
> not attributed to a single cgroup): you can only add a tmpfs
> mount to cgroup.sticky.tmpfs for the first time if no pages have been
> charged yet (no files created). Once the tmpfs mount lives in any
> cgroup.sticky.tmpfs file, we know that it is charged to one cgroup,
> and we can move the charges more easily.
>
> The system admin would basically mount the tmpfs directory and then
> directly write it to a cgroup.sticky.tmpfs file. From that point
> onwards, all tmpfs pages will be charged to that cgroup, and they need
> to be freed or moved before the cgroup can be removed.
>
> I could be missing something here, but I imagine such an interface(s)
> would address both the bpf progs/maps charging concerns, and our
> previous memcg= mount attempts. Please correct me if I am wrong.
>
> >
> > > This would be optional, so the current status quo is maintainable, but
> > > also gives flexibility to admins to assign resources to cgroups to
> > > make sure nothing is ( unaccounted / accounted to a zombie memcg /
> > > reparented to an unrelated parent ). This might be too fine-grained to
> > > be practical but I just thought it might be useful. We will also need
> > > to define an OOM behavior for such resources. Things like bpf maps
> > > will be unreclaimable, but tmpfs memory can be swapped out.
> >
> > Keep in mind that the swap is a shared resource in itself. So tmpfs is
> > essentially a sticky resource as well. A tmpfs file is not bound to any
> > process lifetime the same way a BPF program is. You might need fewer
> > privileges to remove a file but in principle they are consuming
> > resources without any explicit owner.
>
> I fully agree, which is exactly why I am suggesting having a defined
> way to handle such "sticky" resources that outlive processes.
> Currently cgroups operate in terms of processes and this can be a
> limitation for such resources.
>
> > --
> > Michal Hocko
> > SUSE Labs
>

An interface like cgroup.sticky.[bpf/tmpfs/..] would work for us
similar to tmpfs memcg= mount option. I would maybe rename it to
cgroup.charge_for.[bpf/tmpfs/etc] or something.
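
To make that a bit more concrete, usage could look roughly like the
following (nothing below exists today; the file name and the map id are
made up for illustration):

  # attribute an existing bpf map (by id) to this cgroup
  echo 42 > /sys/fs/cgroup/mycg/cgroup.charge_for.bpf

  # read back the list of maps charged here, cgroup.procs style
  cat /sys/fs/cgroup/mycg/cgroup.charge_for.bpf
  42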

With regards to OOM, my proposal on this patchset is to return ENOSPC
to the caller if we hit the limit of the remote memcg and there is
nothing to kill:
https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/

There is some precedent to doing this in the kernel. If a hugetlb
allocation hits the hugetlb_cgroup limit, we return ENOSPC to the
caller (and SIGBUS in the charge path). The reason there being that we
don't support oom-kill or reclaim or swap for hugetlb pages.
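
Seen from the caller's side, the proposed behavior would be a plain
allocation failure. A purely illustrative sketch (the charge target, map
parameters and exact error text are made up; only the ENOSPC errno is
part of the proposal):

  # the map's memory is charged to a remote memcg that is already at its
  # limit and has nothing to kill, so creation fails with ENOSPC
  bpftool map create /sys/fs/bpf/mymap type hash key 4 value 8 \
          entries 1048576 name mymap
  # Error: map create failed: No space left on device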

I think it is also reasonable to prevent removing the memcg if there
is cgroup.charge_for.[bpf/tmpfs/etc] still alive. Currently we prevent
removing the memcg if there are tasks attached. So we can also prevent
removing the memcg if there are bpf/tmpfs charge sources pending.
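
Roughly along these lines (purely illustrative; the cgroup and file
names are made up):

  # the cgroup still has a sticky bpf map attributed to it
  cat /sys/fs/cgroup/mycg/cgroup.charge_for.bpf
  42
  rmdir /sys/fs/cgroup/mycg
  rmdir: failed to remove '/sys/fs/cgroup/mycg': Device or resource busy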

Would love to hear from Tejun/Michal/Johannes if this is the interface
you're looking for.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: cgroup specific sticky resources (was: Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.)
  2022-07-19 18:46                                       ` Mina Almasry
@ 2022-07-19 19:16                                         ` Tejun Heo
  2022-07-19 19:30                                           ` Yosry Ahmed
  2022-07-20 12:26                                         ` Michal Hocko
  1 sibling, 1 reply; 72+ messages in thread
From: Tejun Heo @ 2022-07-19 19:16 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Yosry Ahmed, Michal Hocko, Roman Gushchin, Yafang Shao,
	Alexei Starovoitov, Shakeel Butt, Matthew Wilcox,
	Christoph Hellwig, David S. Miller, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, bpf, Kernel Team, linux-mm,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Vlastimil Babka

Hello,

On Tue, Jul 19, 2022 at 11:46:41AM -0700, Mina Almasry wrote:
> An interface like cgroup.sticky.[bpf/tmpfs/..] would work for us
> similar to tmpfs memcg= mount option. I would maybe rename it to
> cgroup.charge_for.[bpf/tmpfs/etc] or something.

So, I'm not a fan because having this in cgroupfs would create the
expectation that these resources can be moved across cgroups dynamically
(and that's the only way the interface can be useful, right?). I'd much
prefer something a lot more minimal - e.g. temporarily allow assuming an
ancestor identity while creating a resource or sth along that line, and to
add something like that, I think we need pretty strong arguments for why it
can't be handled through cgroup layering in userspace.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: cgroup specific sticky resources (was: Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.)
  2022-07-19 19:16                                         ` Tejun Heo
@ 2022-07-19 19:30                                           ` Yosry Ahmed
  2022-07-19 19:38                                             ` Tejun Heo
  0 siblings, 1 reply; 72+ messages in thread
From: Yosry Ahmed @ 2022-07-19 19:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Mina Almasry, Michal Hocko, Roman Gushchin, Yafang Shao,
	Alexei Starovoitov, Shakeel Butt, Matthew Wilcox,
	Christoph Hellwig, David S. Miller, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, bpf, Kernel Team, linux-mm,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Vlastimil Babka

On Tue, Jul 19, 2022 at 12:16 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Tue, Jul 19, 2022 at 11:46:41AM -0700, Mina Almasry wrote:
> > An interface like cgroup.sticky.[bpf/tmpfs/..] would work for us
> > similar to tmpfs memcg= mount option. I would maybe rename it to
> > cgroup.charge_for.[bpf/tmpfs/etc] or something.
>
> So, I'm not a fan because having this in cgroupfs would create the
> expectation that these resources can be moved across cgroups dynamically
> (and that's the only way the interface can be useful, right?). I'd much

Is there a reason why these resources cannot be moved across cgroups
dynamically? The only scenario I imagine is if you already have tmpfs
mounted and files charged to different cgroups, but once you attribute
tmpfs to one cgroup.charge_for.tmpfs (or sticky,..), I assume that we
can dynamically move the resources, right?

In fact, is there a reason why we can't move the tmpfs charges in that
scenario as well? When we move processes we loop over their page tables
and move pages and their stats, is there a reason why we wouldn't be
able to do this with tmpfs mounts or bpf maps as well?


> prefer something a lot more minimal - e.g. temporarily allow assuming an
> ancestor identity while creating a resource or sth along that line, and to
> add something like that, I think we need pretty strong arguments for why it
> can't be handled through cgroup layering in userspace.
>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: cgroup specific sticky resources (was: Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.)
  2022-07-19 19:30                                           ` Yosry Ahmed
@ 2022-07-19 19:38                                             ` Tejun Heo
  2022-07-19 19:40                                               ` Yosry Ahmed
  2022-07-19 19:47                                               ` Mina Almasry
  0 siblings, 2 replies; 72+ messages in thread
From: Tejun Heo @ 2022-07-19 19:38 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Mina Almasry, Michal Hocko, Roman Gushchin, Yafang Shao,
	Alexei Starovoitov, Shakeel Butt, Matthew Wilcox,
	Christoph Hellwig, David S. Miller, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, bpf, Kernel Team, linux-mm,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Vlastimil Babka

On Tue, Jul 19, 2022 at 12:30:17PM -0700, Yosry Ahmed wrote:
> Is there a reason why these resources cannot be moved across cgroups
> dynamically? The only scenario I imagine is if you already have tmpfs
> mounted and files charged to different cgroups, but once you attribute
> tmpfs to one cgroup.charge_for.tmpfs (or sticky,..), I assume that we
> can dynamically move the resources, right?
> 
> In fact, is there a reason why we can't move the tmpfs charges in that
> scenario as well? When we move processes we loop over their page tables
> and move pages and their stats, is there a reason why we wouldn't be
> able to do this with tmpfs mounts or bpf maps as well?

Nothing is impossible but nothing is free as well. Moving charges around
traditionally caused a lot of headaches in the past and never became
reliable. There are inherent trade-offs here. You can make things more
dynamic usually by making hot paths more expensive or doing some
synchronization dancing which tends to be pretty hairy. People generally
don't wanna make hot paths slower, so we tend to end up with something
twisted which unfortunately turns out to be a headache in the long term.

In general, I'd rather keep resource associations as static as possible.
It's okay if we do something neat inside the kernel but if we create
userspace expectation that resources can be moved around dynamically, we'll
be stuck with that for a long time likely forfeiting future simplification /
optimization opportunities.

So, that's gonna be a fairly strong nack from my end.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: cgroup specific sticky resources (was: Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.)
  2022-07-19 19:38                                             ` Tejun Heo
@ 2022-07-19 19:40                                               ` Yosry Ahmed
  2022-07-19 19:47                                               ` Mina Almasry
  1 sibling, 0 replies; 72+ messages in thread
From: Yosry Ahmed @ 2022-07-19 19:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Mina Almasry, Michal Hocko, Roman Gushchin, Yafang Shao,
	Alexei Starovoitov, Shakeel Butt, Matthew Wilcox,
	Christoph Hellwig, David S. Miller, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, bpf, Kernel Team, linux-mm,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Vlastimil Babka

On Tue, Jul 19, 2022 at 12:38 PM Tejun Heo <tj@kernel.org> wrote:
>
> On Tue, Jul 19, 2022 at 12:30:17PM -0700, Yosry Ahmed wrote:
> > Is there a reason why these resources cannot be moved across cgroups
> > dynamically? The only scenario I imagine is if you already have tmpfs
> > mounted and files charged to different cgroups, but once you attribute
> > tmpfs to one cgroup.charge_for.tmpfs (or sticky,..), I assume that we
> > can dynamically move the resources, right?
> >
> > In fact, is there a reason why we can't move the tmpfs charges in that
> > scenario as well? When we move processes we loop over their page tables
> > and move pages and their stats, is there a reason why we wouldn't be
> > able to do this with tmpfs mounts or bpf maps as well?
>
> Nothing is impossible but nothing is free as well. Moving charges around
> traditionally caused a lot of headaches in the past and never became
> reliable. There are inherent trade-offs here. You can make things more
> dynamic usually by making hot paths more expensive or doing some
> synchronization dancing which tends to be pretty hairy. People generally
> don't wanna make hot paths slower, so we tend to end up with something
> twisted which unfortunately turns out to be a headache in the long term.
>
> In general, I'd rather keep resource associations as static as possible.
> It's okay if we do something neat inside the kernel but if we create
> userspace expectation that resources can be moved around dynamically, we'll
> be stuck with that for a long time likely forfeiting future simplification /
> optimization opportunities.
>
> So, that's gonna be a fairly strong nack from my end.

Fair enough :)

Thanks for elaborating your point of view and the challenges that come
with this!

>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: cgroup specific sticky resources (was: Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.)
  2022-07-19 19:38                                             ` Tejun Heo
  2022-07-19 19:40                                               ` Yosry Ahmed
@ 2022-07-19 19:47                                               ` Mina Almasry
  2022-07-19 19:54                                                 ` Tejun Heo
  1 sibling, 1 reply; 72+ messages in thread
From: Mina Almasry @ 2022-07-19 19:47 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Yosry Ahmed, Michal Hocko, Roman Gushchin, Yafang Shao,
	Alexei Starovoitov, Shakeel Butt, Matthew Wilcox,
	Christoph Hellwig, David S. Miller, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, bpf, Kernel Team, linux-mm,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Vlastimil Babka

On Tue, Jul 19, 2022 at 12:38 PM Tejun Heo <tj@kernel.org> wrote:
>
> On Tue, Jul 19, 2022 at 12:30:17PM -0700, Yosry Ahmed wrote:
> > Is there a reason why these resources cannot be moved across cgroups
> > dynamically? The only scenario I imagine is if you already have tmpfs
> > mounted and files charged to different cgroups, but once you attribute
> > tmpfs to one cgroup.charge_for.tmpfs (or sticky,..), I assume that we
> > can dynamically move the resources, right?
> >
> > In fact, is there a reason why we can't move the tmpfs charges in that
> > scenario as well? When we move processes we loop over their page tables
> > and move pages and their stats, is there a reason why we wouldn't be
> > able to do this with tmpfs mounts or bpf maps as well?
>
> Nothing is impossible but nothing is free as well. Moving charges around
> traditionally caused a lot of headaches in the past and never became
> reliable. There are inherent trade-offs here. You can make things more
> dynamic usually by making hot paths more expensive or doing some
> synchronization dancing which tends to be pretty hairy. People generally
> don't wanna make hot paths slower, so we tend to end up with something
> twisted which unfortunately turns out to be a headache in the long term.
>
> In general, I'd rather keep resource associations as static as possible.
> It's okay if we do something neat inside the kernel but if we create
> userspace expectation that resources can be moved around dynamically, we'll
> be stuck with that for a long time likely forfeiting future simplification /
> optimization opportunities.
>
> So, that's gonna be a fairly strong nack from my end.
>

Hmm, sorry I might be missing something but I don't think we have the
same thing in mind?

My understanding is that the sysadmin can do something like this which
is relatively inexpensive to implement in the kernel:


mount -t tmpfs /mnt/mymountpoint
echo "/mnt/mymountpoint" > /path/to/cgroup/cgroup.charge_for.tmpfs


At that point all tmpfs charges for this tmpfs are directed to
/path/to/cgroup/memory.current.

Then the sysadmin can do something like:


echo "/mnt/mymountpoint" > /path/to/cgroup2/cgroup.charge_for.tmpfs


At that point all _future_ charges of that tmpfs will go to
cgroup2/memory.current. All existing charges remain at
cgroup/memory.current and get uncharged from there. Per my
understanding there is no need to move all the _existing_ charges from
cgroup/memory.current to cgroup2/memory.current.

Sorry, I don't mean to be insistent, just wanted to make sure we have
the same thing in mind. Speaking for ourselves we have a very similar
> implementation locally and it is perfectly usable (and in fact addresses
a number of pain points related to shared memory charging) without
dynamically moving existing charges on reassignment (the second echo
in my example).

> Thanks.

>
> --
> tejun

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: cgroup specific sticky resources (was: Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.)
  2022-07-19 19:47                                               ` Mina Almasry
@ 2022-07-19 19:54                                                 ` Tejun Heo
  2022-07-19 20:16                                                   ` Mina Almasry
  0 siblings, 1 reply; 72+ messages in thread
From: Tejun Heo @ 2022-07-19 19:54 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Yosry Ahmed, Michal Hocko, Roman Gushchin, Yafang Shao,
	Alexei Starovoitov, Shakeel Butt, Matthew Wilcox,
	Christoph Hellwig, David S. Miller, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, bpf, Kernel Team, linux-mm,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Vlastimil Babka

Hello,

On Tue, Jul 19, 2022 at 12:47:39PM -0700, Mina Almasry wrote:
> Hmm, sorry I might be missing something but I don't think we have the
> same thing in mind?
> 
> My understanding is that the sysadmin can do something like this which
> is relatively inexpensive to implement in the kernel:
> 
> 
> mount -t tmpfs /mnt/mymountpoint
> echo "/mnt/mymountpoint" > /path/to/cgroup/cgroup.charge_for.tmpfs
> 
> 
> At that point all tmpfs charges for this tmpfs are directed to
> /path/to/cgroup/memory.current.
> 
> Then the sysadmin can do something like:
> 
> 
> echo "/mnt/mymountpoint" > /path/to/cgroup2/cgroup.charge_for.tmpfs
> 
> 
> At that point all _future_ charges of that tmpfs will go to
> cgroup2/memory.current. All existing charges remain at
> cgroup/memory.current and get uncharged from there. Per my
> understanding there is no need to move all the _existing_ charges from
> cgroup/memory.current to cgroup2/memory.current.

So, it's a lot better if the existing charges aren't moved around but it's
also kinda confusing if something can be moved around the tree arbitrarily
leaving charges behind. We already do get that from moving processes around
but most common usages are pretty static at this point and I think it'd be
better to avoid expanding the interface in that direction.

I'd much prefer something along the lines of `mount -t tmpfs -o cgroup=XXX`
where the tmpfs code checks whether the specified cgroup is one of the
ancestors and the mounting task has enough permission to shift the resource
there.
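
IOW, something along these lines (the option name and paths are only for
illustration, none of this exists today):

  # mounted from inside .../workload/service, but the tmpfs pages are
  # charged to the ancestor cgroup "workload" instead
  mount -t tmpfs -o cgroup=/sys/fs/cgroup/workload tmpfs /mnt/shared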

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: cgroup specific sticky resources (was: Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.)
  2022-07-19 19:54                                                 ` Tejun Heo
@ 2022-07-19 20:16                                                   ` Mina Almasry
  2022-07-19 20:29                                                     ` Tejun Heo
  0 siblings, 1 reply; 72+ messages in thread
From: Mina Almasry @ 2022-07-19 20:16 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Yosry Ahmed, Michal Hocko, Roman Gushchin, Yafang Shao,
	Alexei Starovoitov, Shakeel Butt, Matthew Wilcox,
	Christoph Hellwig, David S. Miller, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, bpf, Kernel Team, linux-mm,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Vlastimil Babka, Johannes Weiner

On Tue, Jul 19, 2022 at 12:54 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Tue, Jul 19, 2022 at 12:47:39PM -0700, Mina Almasry wrote:
> > Hmm, sorry I might be missing something but I don't think we have the
> > same thing in mind?
> >
> > My understanding is that the sysadmin can do something like this which
> > is relatively inexpensive to implement in the kernel:
> >
> >
> > mount -t tmpfs /mnt/mymountpoint
> > echo "/mnt/mymountpoint" > /path/to/cgroup/cgroup.charge_for.tmpfs
> >
> >
> > At that point all tmpfs charges for this tmpfs are directed to
> > /path/to/cgroup/memory.current.
> >
> > Then the sysadmin can do something like:
> >
> >
> > echo "/mnt/mymountpoint" > /path/to/cgroup2/cgroup.charge_for.tmpfs
> >
> >
> > At that point all _future_ charges of that tmpfs will go to
> > cgroup2/memory.current. All existing charges remain at
> > cgroup/memory.current and get uncharged from there. Per my
> > understanding there is no need to move all the _existing_ charges from
> > cgroup/memory.current to cgroup2/memory.current.
>
> So, it's a lot better if the existing charges aren't moved around but it's
> also kinda confusing if something can be moved around the tree arbitrarily
> leaving charges behind. We already do get that from moving processes around
> but most common usages are pretty static at this point and I think it'd be
> better to avoid expanding the interface in that direction.
>

I think I'm flexible in this sense. Would you like the kernel to
prevent reattaching the tmpfs to a different cgroup? To be honest we
have a use case for that, but I'm not going to die on this hill. I
guess the worst case scenario is that I can carry a local patch on our
kernel which allows reattaching to a different cgroup and directs
future charges there...

> I'd much prefer something along the lines of `mount -t tmpfs -o cgroup=XXX`
> where the tmpfs code checks whether the specified cgroup is one of the
> ancestors and the mounting task has enough permission to shift the resource
> there.
>

Actually this is pretty much the same interface I opted for in my
original proposal (except I named it memcg= rather than cgroup=):
https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/

Curious, why do we need to check if the cgroup= is an ancestor? We
actually do have a use case where the cgroups are unrelated and the
common ancestor is root. Again, I'm not sure I want to die on this
hill. At worst I can remove the restriction in a local patch for our
kernel again...

Before I get too excited and implement these changes and submit
another iteration of my proposal above, I'd love to hear from
Johannes/Michal/Roman. My previous proposal got a pretty strong nack
from Johannes and Michal in particular.

> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: cgroup specific sticky resources (was: Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.)
  2022-07-19 20:16                                                   ` Mina Almasry
@ 2022-07-19 20:29                                                     ` Tejun Heo
  0 siblings, 0 replies; 72+ messages in thread
From: Tejun Heo @ 2022-07-19 20:29 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Yosry Ahmed, Michal Hocko, Roman Gushchin, Yafang Shao,
	Alexei Starovoitov, Shakeel Butt, Matthew Wilcox,
	Christoph Hellwig, David S. Miller, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, bpf, Kernel Team, linux-mm,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Vlastimil Babka, Johannes Weiner

Hello,

On Tue, Jul 19, 2022 at 01:16:02PM -0700, Mina Almasry wrote:
> I think I'm flexible in this sense. Would you like the kernel to
> prevent reattaching the tmpfs to a different cgroup? To be honest we
> have a use case for that, but I'm not going to die on this hill. I
> guess the worst case scenario is that I can carry a local patch on our
> kernel which allows reattaching to a different cgroup and directs
> future charges there...

If it's not gonna be dynamic, putting it in a writable cgroupfs file feels
awkward to me, prone to creating incorrect expectations.

> > I'd much prefer something along the lines of `mount -t tmpfs -o cgroup=XXX`
> > where the tmpfs code checks whether the specified cgroup is one of the
> > ancestors and the mounting task has enough permission to shift the resource
> > there.
> 
> Actually this is pretty much the same interface I opted for in my
> original proposal (except I named it memcg= rather than cgroup=):
> https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/

Right. Hmm... the discussion w/ Johannes didn't seem to have concluded, so
let's continue that here, I guess?

If the consensus becomes that we want to allow charging resources to an
ancestor, I like the mount option interface a lot more than other
proposals.

> Curious, why do we need to check if the cgroup= is an ancestor? We
> actually do have a use case where the cgroups are unrelated and the
> common ancestor is root. Again, I'm not sure I want to die on this
> hill. At worst I can remove the restriction in a local patch for our
> kernel again...

First of all, permission doesn't capture the whole picture due to delegation
boundaries. Maybe we can use the same perm check as delegation checks but
keeping it inside the subtree seems simpler and less confusing to me. We can
always relax if necessary in the future.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: cgroup specific sticky resources (was: Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.)
  2022-07-19 18:46                                       ` Mina Almasry
  2022-07-19 19:16                                         ` Tejun Heo
@ 2022-07-20 12:26                                         ` Michal Hocko
  1 sibling, 0 replies; 72+ messages in thread
From: Michal Hocko @ 2022-07-20 12:26 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Yosry Ahmed, Roman Gushchin, Yafang Shao, Alexei Starovoitov,
	Shakeel Butt, Matthew Wilcox, Christoph Hellwig, David S. Miller,
	Daniel Borkmann, Andrii Nakryiko, Tejun Heo, Martin KaFai Lau,
	bpf, Kernel Team, linux-mm, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, Vlastimil Babka

On Tue 19-07-22 11:46:41, Mina Almasry wrote:
[...]
> An interface like cgroup.sticky.[bpf/tmpfs/..] would work for us
> similar to tmpfs memcg= mount option. I would maybe rename it to
> cgroup.charge_for.[bpf/tmpfs/etc] or something.
> 
> With regards to OOM, my proposal on this patchset is to return ENOSPC
> to the caller if we hit the limit of the remote memcg and there is
> nothing to kill:
> https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/

That would imply SIGBUS on the #PF path. Is this really how we
want to tell userspace that something they are not aware of, like a limit
in a completely different resource domain, has triggered?

> There is some precedent to doing this in the kernel. If a hugetlb
> allocation hits the hugetlb_cgroup limit, we return ENOSPC to the
> caller (and SIGBUS in the charge path). The reason there being that we
> don't support oom-kill or reclaim or swap for hugetlb pages.

Following hugetlb is not really a great idea because hugetlb has always
been quite special and its users are aware of that. The same doesn't
really apply to other resources like tmpfs.
 
> I think it is also reasonable to prevent removing the memcg if there
> is cgroup.charge_for.[bpf/tmpfs/etc] still alive. Currently we prevent
> removing the memcg if there are tasks attached. So we can also prevent
> removing the memcg if there are bpf/tmpfs charge sources pending.

I can imagine some way of keeping cgroups active even without tasks but
so far I haven't really seen a good way to achieve that.

The cgroup.sticky.[bpf/tmpfs/..] interface is really weird if you ask me.
For one thing I have a hard time imagining how to identify those resources.
tmpfs by path is really strange because the same mount point can be
referenced through many paths. Not to mention the path can be
remounted/redirected to anything after the configuration which would just
lead to a lot of confusion.

Exposing internal ids is also far from great. It would also put an
additional burden on the kernel implementation to ensure there is no
overlap in resources among different cgroups. Also, how many of those
sticky resources do we want to grow?

To me this has way too many red flags; it sounds like an interface
which would break really easily.

The more I think about it, the more I agree with Tejun that corner
cases are just waiting to jump out at us. 
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 72+ messages in thread

end of thread, other threads:[~2022-07-20 12:27 UTC | newest]

Thread overview: 72+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-23  0:32 [PATCH bpf-next 0/5] bpf: BPF specific memory allocator Alexei Starovoitov
2022-06-23  0:32 ` [PATCH bpf-next 1/5] bpf: Introduce any context " Alexei Starovoitov
2022-06-25  1:23   ` John Fastabend
2022-06-26 17:19     ` Alexei Starovoitov
2022-06-23  0:32 ` [PATCH bpf-next 2/5] bpf: Convert hash map to bpf_mem_alloc Alexei Starovoitov
2022-06-23  0:32 ` [PATCH bpf-next 3/5] selftests/bpf: Improve test coverage of test_maps Alexei Starovoitov
2022-06-23  0:32 ` [PATCH bpf-next 4/5] samples/bpf: Reduce syscall overhead in map_perf_test Alexei Starovoitov
2022-06-23  0:32 ` [PATCH bpf-next 5/5] bpf: Relax the requirement to use preallocated hash maps in tracing progs Alexei Starovoitov
2022-06-27  7:03 ` [PATCH bpf-next 0/5] bpf: BPF specific memory allocator Christoph Hellwig
2022-06-28  0:17   ` Christoph Lameter
2022-06-28  5:01     ` Alexei Starovoitov
2022-06-28 13:57       ` Christoph Lameter
2022-06-28 17:03         ` Alexei Starovoitov
2022-06-29  2:35           ` Christoph Lameter
2022-06-29  2:49             ` Alexei Starovoitov
2022-07-04 16:13               ` Vlastimil Babka
2022-07-06 17:43                 ` Alexei Starovoitov
2022-07-19 11:52                   ` Vlastimil Babka
2022-07-04 20:34   ` Matthew Wilcox
2022-07-06 17:50     ` Alexei Starovoitov
2022-07-06 17:55       ` Matthew Wilcox
2022-07-06 18:05         ` Alexei Starovoitov
2022-07-06 18:21           ` Matthew Wilcox
2022-07-06 18:26             ` Alexei Starovoitov
2022-07-06 18:31               ` Matthew Wilcox
2022-07-06 18:36                 ` Alexei Starovoitov
2022-07-06 18:40                   ` Matthew Wilcox
2022-07-06 18:51                     ` Alexei Starovoitov
2022-07-06 18:55                       ` Matthew Wilcox
2022-07-08 13:41           ` Michal Hocko
2022-07-08 17:48             ` Alexei Starovoitov
2022-07-08 20:13               ` Yosry Ahmed
2022-07-08 21:55               ` Shakeel Butt
2022-07-10  5:26                 ` Alexei Starovoitov
2022-07-10  7:32                   ` Shakeel Butt
2022-07-11 12:15                     ` Michal Hocko
2022-07-12  4:39                       ` Alexei Starovoitov
2022-07-12  7:40                         ` Michal Hocko
2022-07-12  8:39                           ` Yafang Shao
2022-07-12  9:52                             ` Michal Hocko
2022-07-12 15:25                               ` Shakeel Butt
2022-07-12 16:32                                 ` Tejun Heo
2022-07-12 17:26                                   ` Shakeel Butt
2022-07-12 17:36                                     ` Tejun Heo
2022-07-12 18:11                                       ` Shakeel Butt
2022-07-12 18:43                                         ` Alexei Starovoitov
2022-07-13 13:56                                           ` Yafang Shao
2022-07-12 19:11                                         ` Mina Almasry
2022-07-12 16:24                               ` Tejun Heo
2022-07-18 14:13                                 ` Michal Hocko
2022-07-13  2:39                               ` Roman Gushchin
2022-07-13 14:24                                 ` Yafang Shao
2022-07-13 16:24                                   ` Tejun Heo
2022-07-14  6:15                                     ` Yafang Shao
2022-07-18 17:55                                 ` Yosry Ahmed
2022-07-19 11:30                                   ` cgroup specific sticky resources (was: Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.) Michal Hocko
2022-07-19 18:00                                     ` Yosry Ahmed
2022-07-19 18:01                                       ` Yosry Ahmed
2022-07-19 18:46                                       ` Mina Almasry
2022-07-19 19:16                                         ` Tejun Heo
2022-07-19 19:30                                           ` Yosry Ahmed
2022-07-19 19:38                                             ` Tejun Heo
2022-07-19 19:40                                               ` Yosry Ahmed
2022-07-19 19:47                                               ` Mina Almasry
2022-07-19 19:54                                                 ` Tejun Heo
2022-07-19 20:16                                                   ` Mina Almasry
2022-07-19 20:29                                                     ` Tejun Heo
2022-07-20 12:26                                         ` Michal Hocko
2022-07-12 18:40                           ` [PATCH bpf-next 0/5] bpf: BPF specific memory allocator Alexei Starovoitov
2022-07-18 12:27                             ` Michal Hocko
2022-07-13  2:27                           ` Roman Gushchin
2022-07-11 12:22               ` Michal Hocko
