* [PATCH bpf-next 0/7] bpf, mm: bpf memory usage
@ 2023-02-02  1:41 Yafang Shao
  2023-02-02  1:41 ` [PATCH bpf-next 1/7] mm: percpu: fix incorrect size in pcpu_obj_full_size() Yafang Shao
                   ` (7 more replies)
  0 siblings, 8 replies; 27+ messages in thread
From: Yafang Shao @ 2023-02-02  1:41 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, tj, dennis, cl, akpm, penberg,
	rientjes, iamjoonsoo.kim, roman.gushchin, 42.hyeyoo, vbabka,
	urezki
  Cc: linux-mm, bpf, Yafang Shao

Currently we can't get bpf memory usage reliably. bpftool now shows the
bpf memory footprint, which is different from the bpf memory usage. In
some cases the difference between the footprint shown in bpftool and the
memory actually allocated by bpf can be quite large, for example,

- non-preallocated bpf map
  The memory usage of a non-preallocated bpf map changes dynamically. The
  allocated element count can be anywhere from 0 to max_entries, but the
  memory footprint in bpftool only shows a fixed number.
- bpf metadata consumes more memory than bpf elements
  In some corner cases, the bpf metadata can consume a lot more memory
  than the bpf elements themselves, for example when the element size is
  quite small.

We need a way to get the bpf memory usage, especially as more and more
bpf programs are running in production environments and thus the bpf
memory usage is no longer trivial.

This patchset introduces a new map op, ->map_mem_usage, to get the memory
usage. In this op, the memory usage is derived from the pointers that have
already been allocated by a bpf map. To keep the code simple, we ignore
some small pointers as their size is negligible compared with the total
usage.

In order to get the memory size from the pointers, some generic mm helpers
are introduced first, namely percpu_size(), vsize() and kvsize().
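
As a rough illustration of how the pieces fit together (this sketch is not
part of the diffs in this series; the struct and field names are
hypothetical), a map's ->map_mem_usage callback simply sums up the sizes
of the allocations it owns via these helpers:

  /* Hypothetical map layout, for illustration only. */
  struct example_map {
          struct bpf_map map;
          void *buckets;                  /* kmalloc'ed or vmalloc'ed */
          void __percpu *pcpu_data;       /* percpu allocation */
  };

  static unsigned long example_map_mem_usage(const struct bpf_map *map)
  {
          struct example_map *emap = container_of(map, struct example_map, map);
          unsigned long size = 0;

          size += kvsize(emap->buckets);          /* full kmalloc/vmalloc size */
          size += percpu_size(emap->pcpu_data);   /* size across all possible CPUs */
          return size;
  }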

This patchset only implements the bpf memory usage for hashtab. I will
extend it to other maps and bpf progs (bpf progs can dynamically allocate
memory via bpf_obj_new()) in the future.

The detailed result can be found in patch #7.

Patch #1~#4: Generic mm helpers
Patch #5   : Introduce new ops
Patch #6   : Helpers for bpf_mem_alloc
Patch #7   : hashtab memory usage

Future works:
- extend it to other maps
- extend it to bpf prog
- per-container bpf memory usage 

Historical discussions,
- RFC PATCH v1 mm, bpf: Add BPF into /proc/meminfo
  https://lwn.net/Articles/917647/  
- RFC PATCH v2 mm, bpf: Add BPF into /proc/meminfo
  https://lwn.net/Articles/919848/

Yafang Shao (7):
  mm: percpu: fix incorrect size in pcpu_obj_full_size()
  mm: percpu: introduce percpu_size()
  mm: vmalloc: introduce vsize()
  mm: util: introduce kvsize()
  bpf: add new map ops ->map_mem_usage
  bpf: introduce bpf_mem_alloc_size()
  bpf: hashtab memory usage

 include/linux/bpf.h           |  2 ++
 include/linux/bpf_mem_alloc.h |  2 ++
 include/linux/percpu.h        |  1 +
 include/linux/slab.h          |  1 +
 include/linux/vmalloc.h       |  1 +
 kernel/bpf/hashtab.c          | 80 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/bpf/memalloc.c         | 70 +++++++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c          | 18 ++++++----
 mm/percpu-internal.h          |  4 ++-
 mm/percpu.c                   | 35 +++++++++++++++++++
 mm/util.c                     | 15 ++++++++
 mm/vmalloc.c                  | 17 +++++++++
 12 files changed, 237 insertions(+), 9 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH bpf-next 1/7] mm: percpu: fix incorrect size in pcpu_obj_full_size()
  2023-02-02  1:41 [PATCH bpf-next 0/7] bpf, mm: bpf memory usage Yafang Shao
@ 2023-02-02  1:41 ` Yafang Shao
  2023-02-02  1:41 ` [PATCH bpf-next 2/7] mm: percpu: introduce percpu_size() Yafang Shao
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 27+ messages in thread
From: Yafang Shao @ 2023-02-02  1:41 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, tj, dennis, cl, akpm, penberg,
	rientjes, iamjoonsoo.kim, roman.gushchin, 42.hyeyoo, vbabka,
	urezki
  Cc: linux-mm, bpf, Yafang Shao, Vasily Averin

The extra space which is used to store the obj_cgroup membership is only
valid when kmemcg is enabled. kmemcg can be disabled via the kernel
parameter "cgroup.memory=nokmem", in which case that extra space must not
be counted. This helper is also used in non-memcg code, for example in
the tracepoint, so we should fix it.

This was found by code review; no real issue has been observed in the
production environment.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Vasily Averin <vvs@openvz.org>
---
 mm/percpu-internal.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 70b1ea2..2a95b1f 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -4,6 +4,7 @@
 
 #include <linux/types.h>
 #include <linux/percpu.h>
+#include <linux/memcontrol.h>
 
 /*
  * pcpu_block_md is the metadata block struct.
@@ -125,7 +126,8 @@ static inline size_t pcpu_obj_full_size(size_t size)
 	size_t extra_size = 0;
 
 #ifdef CONFIG_MEMCG_KMEM
-	extra_size += size / PCPU_MIN_ALLOC_SIZE * sizeof(struct obj_cgroup *);
+	if (!mem_cgroup_kmem_disabled())
+		extra_size += size / PCPU_MIN_ALLOC_SIZE * sizeof(struct obj_cgroup *);
 #endif
 
 	return size * num_possible_cpus() + extra_size;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH bpf-next 2/7] mm: percpu: introduce percpu_size()
  2023-02-02  1:41 [PATCH bpf-next 0/7] bpf, mm: bpf memory usage Yafang Shao
  2023-02-02  1:41 ` [PATCH bpf-next 1/7] mm: percpu: fix incorrect size in pcpu_obj_full_size() Yafang Shao
@ 2023-02-02  1:41 ` Yafang Shao
  2023-02-02 14:32   ` Christoph Lameter
  2023-02-02  1:41 ` [PATCH bpf-next 3/7] mm: vmalloc: introduce vsize() Yafang Shao
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Yafang Shao @ 2023-02-02  1:41 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, tj, dennis, cl, akpm, penberg,
	rientjes, iamjoonsoo.kim, roman.gushchin, 42.hyeyoo, vbabka,
	urezki
  Cc: linux-mm, bpf, Yafang Shao

Introduce a new helper, percpu_size(), to report the full size of the
underlying allocation of a percpu address.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
---
 include/linux/percpu.h |  1 +
 mm/percpu.c            | 35 +++++++++++++++++++++++++++++++++++
 2 files changed, 36 insertions(+)

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 1338ea2..7be4234 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -137,5 +137,6 @@ extern int __init pcpu_page_first_chunk(size_t reserved_size,
 						__alignof__(type))
 
 extern unsigned long pcpu_nr_pages(void);
+extern size_t percpu_size(void __percpu *ptr);
 
 #endif /* __LINUX_PERCPU_H */
diff --git a/mm/percpu.c b/mm/percpu.c
index acd78da..5580688 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -2302,6 +2302,41 @@ void free_percpu(void __percpu *ptr)
 }
 EXPORT_SYMBOL_GPL(free_percpu);
 
+/**
+ * percpu_size - report full size of underlying allocation of percpu addr
+ * @ptr: pointer to percpu area
+ *
+ * CONTEXT:
+ * Can be called from atomic context.
+ */
+size_t percpu_size(void __percpu *ptr)
+{
+	int bit_off, bits, end, off, size;
+	struct pcpu_chunk *chunk;
+	unsigned long flags;
+	void *addr;
+
+	if (!ptr)
+		return 0;
+
+	addr = __pcpu_ptr_to_addr(ptr);
+
+	spin_lock_irqsave(&pcpu_lock, flags);
+	chunk = pcpu_chunk_addr_search(addr);
+	off = addr - chunk->base_addr;
+	bit_off = off / PCPU_MIN_ALLOC_SIZE;
+
+	/* find end index */
+	end = find_next_bit(chunk->bound_map, pcpu_chunk_map_bits(chunk),
+			    bit_off + 1);
+	spin_unlock_irqrestore(&pcpu_lock, flags);
+
+	bits = end - bit_off;
+	size = bits * PCPU_MIN_ALLOC_SIZE;
+
+	return pcpu_obj_full_size(size);
+}
+
 bool __is_kernel_percpu_address(unsigned long addr, unsigned long *can_addr)
 {
 #ifdef CONFIG_SMP
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH bpf-next 3/7] mm: vmalloc: introduce vsize()
  2023-02-02  1:41 [PATCH bpf-next 0/7] bpf, mm: bpf memory usage Yafang Shao
  2023-02-02  1:41 ` [PATCH bpf-next 1/7] mm: percpu: fix incorrect size in pcpu_obj_full_size() Yafang Shao
  2023-02-02  1:41 ` [PATCH bpf-next 2/7] mm: percpu: introduce percpu_size() Yafang Shao
@ 2023-02-02  1:41 ` Yafang Shao
  2023-02-02 10:23   ` Christoph Hellwig
  2023-02-02  1:41 ` [PATCH bpf-next 4/7] mm: util: introduce kvsize() Yafang Shao
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Yafang Shao @ 2023-02-02  1:41 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, tj, dennis, cl, akpm, penberg,
	rientjes, iamjoonsoo.kim, roman.gushchin, 42.hyeyoo, vbabka,
	urezki
  Cc: linux-mm, bpf, Yafang Shao, Christoph Hellwig

Introduce a helper, vsize(), to report the full size of the underlying
allocation of a vmalloc'ed address.

Suggested-by: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
---
 include/linux/vmalloc.h |  1 +
 mm/vmalloc.c            | 17 +++++++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 096d48a..7fbd390 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -297,4 +297,5 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 static inline bool vmalloc_dump_obj(void *object) { return false; }
 #endif
 
+size_t vsize(void *addr);
 #endif /* _LINUX_VMALLOC_H */
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ca71de7..8499eba 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -4057,6 +4057,23 @@ bool vmalloc_dump_obj(void *object)
 }
 #endif
 
+/* Report full size of underlying allocation of a vmalloc'ed addr */
+size_t vsize(void *addr)
+{
+	struct vmap_area *va;
+	size_t va_size = 0;
+
+	if (!addr)
+		return 0;
+
+	spin_lock(&vmap_area_lock);
+	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
+	if (va && va->vm)
+		va_size = va->vm->size;
+	spin_unlock(&vmap_area_lock);
+	return va_size;
+}
+
 #ifdef CONFIG_PROC_FS
 static void *s_start(struct seq_file *m, loff_t *pos)
 	__acquires(&vmap_purge_lock)
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH bpf-next 4/7] mm: util: introduce kvsize()
  2023-02-02  1:41 [PATCH bpf-next 0/7] bpf, mm: bpf memory usage Yafang Shao
                   ` (2 preceding siblings ...)
  2023-02-02  1:41 ` [PATCH bpf-next 3/7] mm: vmalloc: introduce vsize() Yafang Shao
@ 2023-02-02  1:41 ` Yafang Shao
  2023-02-02  1:41 ` [PATCH bpf-next 5/7] bpf: add new map ops ->map_mem_usage Yafang Shao
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 27+ messages in thread
From: Yafang Shao @ 2023-02-02  1:41 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, tj, dennis, cl, akpm, penberg,
	rientjes, iamjoonsoo.kim, roman.gushchin, 42.hyeyoo, vbabka,
	urezki
  Cc: linux-mm, bpf, Yafang Shao

Introduce a new helper, kvsize(), to report the full size of the
underlying allocation of a kmalloc'ed or vmalloc'ed address.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/slab.h |  1 +
 mm/util.c            | 15 +++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 45af703..cccb4d8 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -762,6 +762,7 @@ static inline __alloc_size(1, 2) void *kvcalloc(size_t n, size_t size, gfp_t fla
 
 extern void *kvrealloc(const void *p, size_t oldsize, size_t newsize, gfp_t flags)
 		      __realloc_size(3);
+extern size_t kvsize(void *addr);
 extern void kvfree(const void *addr);
 extern void kvfree_sensitive(const void *addr, size_t len);
 
diff --git a/mm/util.c b/mm/util.c
index b56c92f..f77d0cc 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -610,6 +610,21 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
 EXPORT_SYMBOL(kvmalloc_node);
 
 /**
+ * kvsize() - Report full size of underlying allocation of addr
+ * @addr: Pointer to kmalloc'ed or vmalloc'ed memory
+ *
+ * kvsize reports full size of underlying allocation of a kmalloc'ed addr
+ * or a vmalloc'ed addr.
+ */
+size_t kvsize(void *addr)
+{
+	if (is_vmalloc_addr(addr))
+		return vsize(addr);
+
+	return ksize(addr);
+}
+
+/**
  * kvfree() - Free memory.
  * @addr: Pointer to allocated memory.
  *
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH bpf-next 5/7] bpf: add new map ops ->map_mem_usage
  2023-02-02  1:41 [PATCH bpf-next 0/7] bpf, mm: bpf memory usage Yafang Shao
                   ` (3 preceding siblings ...)
  2023-02-02  1:41 ` [PATCH bpf-next 4/7] mm: util: introduce kvsize() Yafang Shao
@ 2023-02-02  1:41 ` Yafang Shao
  2023-02-02  1:41 ` [PATCH bpf-next 6/7] bpf: introduce bpf_mem_alloc_size() Yafang Shao
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 27+ messages in thread
From: Yafang Shao @ 2023-02-02  1:41 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, tj, dennis, cl, akpm, penberg,
	rientjes, iamjoonsoo.kim, roman.gushchin, 42.hyeyoo, vbabka,
	urezki
  Cc: linux-mm, bpf, Yafang Shao

Add a new map op, ->map_mem_usage, to report the memory usage of a
bpf map.

->map_mem_usage gets the map memory usage from the pointers which will be
freed in ->map_free. So it is very similar to ->map_free, except that it
only gets the underlying memory size from the pointers rather than freeing
them. We just need to keep the pointers used in ->map_mem_usage in sync
with the pointers in ->map_free.

This is a preparation for the followup change.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/bpf.h  |  2 ++
 kernel/bpf/syscall.c | 18 +++++++++++-------
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e11db75..10eb8e9 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -160,6 +160,8 @@ struct bpf_map_ops {
 				     bpf_callback_t callback_fn,
 				     void *callback_ctx, u64 flags);
 
+	unsigned long (*map_mem_usage)(const struct bpf_map *map);
+
 	/* BTF id of struct allocated by map_alloc */
 	int *map_btf_id;
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 99417b3..df52853 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -758,16 +758,20 @@ static fmode_t map_get_sys_perms(struct bpf_map *map, struct fd f)
 }
 
 #ifdef CONFIG_PROC_FS
-/* Provides an approximation of the map's memory footprint.
- * Used only to provide a backward compatibility and display
- * a reasonable "memlock" info.
- */
-static unsigned long bpf_map_memory_footprint(const struct bpf_map *map)
+/* Show the memory usage of a bpf map */
+static unsigned long bpf_map_memory_usage(const struct bpf_map *map)
 {
 	unsigned long size;
 
-	size = round_up(map->key_size + bpf_map_value_size(map), 8);
+	/* ->map_mem_usage will get the map memory size from the pointers
+	 * which will be freed in ->map_free. So it is very similar to
+	 * ->map_free except that it only gets the underlying memory size
+	 * from the pointers rather than freeing them.
+	 */
+	if (map->ops->map_mem_usage)
+		return map->ops->map_mem_usage(map);
 
+	size = round_up(map->key_size + bpf_map_value_size(map), 8);
 	return round_up(map->max_entries * size, PAGE_SIZE);
 }
 
@@ -799,7 +803,7 @@ static void bpf_map_show_fdinfo(struct seq_file *m, struct file *filp)
 		   map->max_entries,
 		   map->map_flags,
 		   (unsigned long long)map->map_extra,
-		   bpf_map_memory_footprint(map),
+		   bpf_map_memory_usage(map),
 		   map->id,
 		   READ_ONCE(map->frozen));
 	if (type) {
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH bpf-next 6/7] bpf: introduce bpf_mem_alloc_size()
  2023-02-02  1:41 [PATCH bpf-next 0/7] bpf, mm: bpf memory usage Yafang Shao
                   ` (4 preceding siblings ...)
  2023-02-02  1:41 ` [PATCH bpf-next 5/7] bpf: add new map ops ->map_mem_usage Yafang Shao
@ 2023-02-02  1:41 ` Yafang Shao
  2023-02-02  4:53   ` kernel test robot
  2023-02-02  1:41 ` [PATCH bpf-next 7/7] bpf: hashtab memory usage Yafang Shao
  2023-02-04  2:15 ` [PATCH bpf-next 0/7] bpf, mm: bpf " John Fastabend
  7 siblings, 1 reply; 27+ messages in thread
From: Yafang Shao @ 2023-02-02  1:41 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, tj, dennis, cl, akpm, penberg,
	rientjes, iamjoonsoo.kim, roman.gushchin, 42.hyeyoo, vbabka,
	urezki
  Cc: linux-mm, bpf, Yafang Shao

Introduce helpers to get the memory usage of a bpf_mem_alloc, including
the size of the bpf_mem_alloc pool and of the in-use elements. Note that
we only count the free list size in the bpf_mem_alloc pool and don't count
the other lists, because there won't be many elements on them; ignoring
the other lists keeps the code simple.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/bpf_mem_alloc.h |  2 ++
 kernel/bpf/memalloc.c         | 70 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 72 insertions(+)

diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
index 3e164b8..86d8dcf 100644
--- a/include/linux/bpf_mem_alloc.h
+++ b/include/linux/bpf_mem_alloc.h
@@ -24,5 +24,7 @@ struct bpf_mem_alloc {
 /* kmem_cache_alloc/free equivalent: */
 void *bpf_mem_cache_alloc(struct bpf_mem_alloc *ma);
 void bpf_mem_cache_free(struct bpf_mem_alloc *ma, void *ptr);
+unsigned long bpf_mem_alloc_size(struct bpf_mem_alloc *ma);
+unsigned long bpf_mem_cache_elem_size(struct bpf_mem_alloc *ma, void *ptr);
 
 #endif /* _BPF_MEM_ALLOC_H */
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index ebcc3dd..ebf8964 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -224,6 +224,22 @@ static void free_one(struct bpf_mem_cache *c, void *obj)
 	kfree(obj);
 }
 
+unsigned long bpf_mem_cache_size(struct bpf_mem_cache *c, void *obj)
+{
+	unsigned long size;
+
+	if (!obj)
+		return 0;
+
+	if (c->percpu_size) {
+		size = percpu_size(((void **)obj)[1]);
+		size += ksize(obj);
+		return size;
+	}
+
+	return ksize(obj);
+}
+
 static void __free_rcu(struct rcu_head *head)
 {
 	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
@@ -559,6 +575,41 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
 	}
 }
 
+/* We only account the elements on the free list */
+static unsigned long bpf_mem_cache_free_size(struct bpf_mem_cache *c)
+{
+	return c->unit_size * c->free_cnt;
+}
+
+/* Get the free list size of a bpf_mem_alloc pool. */
+unsigned long bpf_mem_alloc_size(struct bpf_mem_alloc *ma)
+{
+	struct bpf_mem_caches *cc;
+	struct bpf_mem_cache *c;
+	unsigned long size = 0;
+	int cpu, i;
+
+	if (ma->cache) {
+		for_each_possible_cpu(cpu) {
+			c = per_cpu_ptr(ma->cache, cpu);
+			size += bpf_mem_cache_free_size(c);
+		}
+		size += percpu_size(ma->cache);
+	}
+	if (ma->caches) {
+		for_each_possible_cpu(cpu) {
+			cc = per_cpu_ptr(ma->caches, cpu);
+			for (i = 0; i < NUM_CACHES; i++) {
+				c = &cc->cache[i];
+				size += bpf_mem_cache_free_size(c);
+			}
+		}
+		size += percpu_size(ma->caches);
+	}
+
+	return size;
+}
+
 /* notrace is necessary here and in other functions to make sure
  * bpf programs cannot attach to them and cause llist corruptions.
  */
@@ -675,3 +726,22 @@ void notrace bpf_mem_cache_free(struct bpf_mem_alloc *ma, void *ptr)
 
 	unit_free(this_cpu_ptr(ma->cache), ptr);
 }
+
+/* Get the element size from the element pointer @ptr */
+unsigned long notrace bpf_mem_cache_elem_size(struct bpf_mem_alloc *ma, void *ptr)
+{
+	struct llist_node *llnode;
+	struct bpf_mem_cache *c;
+	unsigned long size;
+
+	if (!ptr)
+		return 0;
+
+	llnode = ptr - LLIST_NODE_SZ;
+	migrate_disable();
+	c = this_cpu_ptr(ma->cache);
+	size = bpf_mem_cache_size(c, llnode);
+	migrate_enable();
+
+	return size;
+}
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH bpf-next 7/7] bpf: hashtab memory usage
  2023-02-02  1:41 [PATCH bpf-next 0/7] bpf, mm: bpf memory usage Yafang Shao
                   ` (5 preceding siblings ...)
  2023-02-02  1:41 ` [PATCH bpf-next 6/7] bpf: introduce bpf_mem_alloc_size() Yafang Shao
@ 2023-02-02  1:41 ` Yafang Shao
  2023-02-04  2:01   ` John Fastabend
  2023-02-05 22:14   ` Cong Wang
  2023-02-04  2:15 ` [PATCH bpf-next 0/7] bpf, mm: bpf " John Fastabend
  7 siblings, 2 replies; 27+ messages in thread
From: Yafang Shao @ 2023-02-02  1:41 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, tj, dennis, cl, akpm, penberg,
	rientjes, iamjoonsoo.kim, roman.gushchin, 42.hyeyoo, vbabka,
	urezki
  Cc: linux-mm, bpf, Yafang Shao

Get the htab memory usage from the htab pointers we have allocated. Some
small pointers are ignored as their size is quite small compared with the
total size.

The results are as follows,
- before this change
1: hash  name count_map  flags 0x0  <<<< prealloc
        key 16B  value 24B  max_entries 1048576  memlock 41943040B
2: hash  name count_map  flags 0x1  <<<< non prealloc, fully set
        key 16B  value 24B  max_entries 1048576  memlock 41943040B
3: hash  name count_map  flags 0x1  <<<< non prealloc, non set
        key 16B  value 24B  max_entries 1048576  memlock 41943040B

The memlock is always a fixed number, regardless of whether the map is
preallocated and regardless of how many elements are allocated.

- after this change
1: hash  name count_map  flags 0x0  <<<< prealloc
        key 16B  value 24B  max_entries 1048576  memlock 109064464B
2: hash  name count_map  flags 0x1  <<<< non prealloc, fully set
        key 16B  value 24B  max_entries 1048576  memlock 117464320B
3: hash  name count_map  flags 0x1  <<<< non prealloc, non set
        key 16B  value 24B  max_entries 1048576  memlock 16797952B

The memlock now reflects what the hashtab has actually allocated.

At worst, the difference can be 10x, for example,
- before this change
4: hash  name count_map  flags 0x0
        key 4B  value 4B  max_entries 1048576  memlock 8388608B

- after this change
4: hash  name count_map  flags 0x0
        key 4B  value 4B  max_entries 1048576  memlock 83898640B

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/hashtab.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 79 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 66bded1..cba540b 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -273,6 +273,25 @@ static void htab_free_elems(struct bpf_htab *htab)
 	bpf_map_area_free(htab->elems);
 }
 
+static unsigned long htab_prealloc_elems_size(struct bpf_htab *htab)
+{
+	unsigned long size = 0;
+	int i;
+
+	if (!htab_is_percpu(htab))
+		return kvsize(htab->elems);
+
+	for (i = 0; i < htab->map.max_entries; i++) {
+		void __percpu *pptr;
+
+		pptr = htab_elem_get_ptr(get_htab_elem(htab, i),
+					htab->map.key_size);
+		size += percpu_size(pptr);
+	}
+	size += kvsize(htab->elems);
+	return size;
+}
+
 /* The LRU list has a lock (lru_lock). Each htab bucket has a lock
  * (bucket_lock). If both locks need to be acquired together, the lock
  * order is always lru_lock -> bucket_lock and this only happens in
@@ -864,6 +883,16 @@ static void htab_elem_free(struct bpf_htab *htab, struct htab_elem *l)
 	bpf_mem_cache_free(&htab->ma, l);
 }
 
+static unsigned long htab_elem_size(struct bpf_htab *htab, struct htab_elem *l)
+{
+	unsigned long size = 0;
+
+	if (htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH)
+		size += bpf_mem_cache_elem_size(&htab->pcpu_ma, l->ptr_to_pptr);
+
+	return size + bpf_mem_cache_elem_size(&htab->ma, l);
+}
+
 static void htab_put_fd_value(struct bpf_htab *htab, struct htab_elem *l)
 {
 	struct bpf_map *map = &htab->map;
@@ -899,7 +928,6 @@ static void dec_elem_count(struct bpf_htab *htab)
 		atomic_dec(&htab->count);
 }
 
-
 static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
 {
 	htab_put_fd_value(htab, l);
@@ -1457,6 +1485,31 @@ static void delete_all_elements(struct bpf_htab *htab)
 	migrate_enable();
 }
 
+static unsigned long htab_non_prealloc_elems_size(struct bpf_htab *htab)
+{
+	unsigned long size = 0;
+	unsigned long count;
+	int i;
+
+	rcu_read_lock();
+	for (i = 0; i < htab->n_buckets; i++) {
+		struct hlist_nulls_head *head = select_bucket(htab, i);
+		struct hlist_nulls_node *n;
+		struct htab_elem *l;
+
+		hlist_nulls_for_each_entry(l, n, head, hash_node) {
+			size = htab_elem_size(htab, l);
+			goto out;
+		}
+	}
+out:
+	rcu_read_unlock();
+	count = htab->use_percpu_counter ? percpu_counter_sum(&htab->pcount) :
+			atomic_read(&htab->count);
+
+	return size * count;
+}
+
 static void htab_free_malloced_timers(struct bpf_htab *htab)
 {
 	int i;
@@ -1523,6 +1576,26 @@ static void htab_map_free(struct bpf_map *map)
 	bpf_map_area_free(htab);
 }
 
+/* Get the htab memory usage from pointers we have already allocated.
+ * Some minor pointers are ignored as their size is quite small compared
+ * with the total size.
+ */
+static unsigned long htab_mem_usage(const struct bpf_map *map)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	unsigned long size = 0;
+
+	if (!htab_is_prealloc(htab))
+		size += htab_non_prealloc_elems_size(htab);
+	else
+		size += htab_prealloc_elems_size(htab);
+	size += percpu_size(htab->extra_elems);
+	size += kvsize(htab->buckets);
+	size += bpf_mem_alloc_size(&htab->pcpu_ma);
+	size += bpf_mem_alloc_size(&htab->ma);
+	return size;
+}
+
 static void htab_map_seq_show_elem(struct bpf_map *map, void *key,
 				   struct seq_file *m)
 {
@@ -2191,6 +2264,7 @@ static int bpf_for_each_hash_elem(struct bpf_map *map, bpf_callback_t callback_f
 	.map_seq_show_elem = htab_map_seq_show_elem,
 	.map_set_for_each_callback_args = map_set_for_each_callback_args,
 	.map_for_each_callback = bpf_for_each_hash_elem,
+	.map_mem_usage = htab_mem_usage,
 	BATCH_OPS(htab),
 	.map_btf_id = &htab_map_btf_ids[0],
 	.iter_seq_info = &iter_seq_info,
@@ -2212,6 +2286,7 @@ static int bpf_for_each_hash_elem(struct bpf_map *map, bpf_callback_t callback_f
 	.map_seq_show_elem = htab_map_seq_show_elem,
 	.map_set_for_each_callback_args = map_set_for_each_callback_args,
 	.map_for_each_callback = bpf_for_each_hash_elem,
+	.map_mem_usage = htab_mem_usage,
 	BATCH_OPS(htab_lru),
 	.map_btf_id = &htab_map_btf_ids[0],
 	.iter_seq_info = &iter_seq_info,
@@ -2363,6 +2438,7 @@ static void htab_percpu_map_seq_show_elem(struct bpf_map *map, void *key,
 	.map_seq_show_elem = htab_percpu_map_seq_show_elem,
 	.map_set_for_each_callback_args = map_set_for_each_callback_args,
 	.map_for_each_callback = bpf_for_each_hash_elem,
+	.map_mem_usage = htab_mem_usage,
 	BATCH_OPS(htab_percpu),
 	.map_btf_id = &htab_map_btf_ids[0],
 	.iter_seq_info = &iter_seq_info,
@@ -2382,6 +2458,7 @@ static void htab_percpu_map_seq_show_elem(struct bpf_map *map, void *key,
 	.map_seq_show_elem = htab_percpu_map_seq_show_elem,
 	.map_set_for_each_callback_args = map_set_for_each_callback_args,
 	.map_for_each_callback = bpf_for_each_hash_elem,
+	.map_mem_usage = htab_mem_usage,
 	BATCH_OPS(htab_lru_percpu),
 	.map_btf_id = &htab_map_btf_ids[0],
 	.iter_seq_info = &iter_seq_info,
@@ -2519,6 +2596,7 @@ static void htab_of_map_free(struct bpf_map *map)
 	.map_fd_sys_lookup_elem = bpf_map_fd_sys_lookup_elem,
 	.map_gen_lookup = htab_of_map_gen_lookup,
 	.map_check_btf = map_check_no_btf,
+	.map_mem_usage = htab_mem_usage,
 	BATCH_OPS(htab),
 	.map_btf_id = &htab_map_btf_ids[0],
 };
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 6/7] bpf: introduce bpf_mem_alloc_size()
  2023-02-02  1:41 ` [PATCH bpf-next 6/7] bpf: introduce bpf_mem_alloc_size() Yafang Shao
@ 2023-02-02  4:53   ` kernel test robot
  2023-02-02 14:11     ` Yafang Shao
  0 siblings, 1 reply; 27+ messages in thread
From: kernel test robot @ 2023-02-02  4:53 UTC (permalink / raw)
  To: Yafang Shao, ast, daniel, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, dennis, cl,
	akpm, penberg, rientjes, iamjoonsoo.kim, roman.gushchin,
	42.hyeyoo, vbabka, urezki
  Cc: oe-kbuild-all, linux-mm, bpf, Yafang Shao

Hi Yafang,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Yafang-Shao/mm-percpu-fix-incorrect-size-in-pcpu_obj_full_size/20230202-094352
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20230202014158.19616-7-laoar.shao%40gmail.com
patch subject: [PATCH bpf-next 6/7] bpf: introduce bpf_mem_alloc_size()
config: m68k-allyesconfig (https://download.01.org/0day-ci/archive/20230202/202302021258.By6JZK71-lkp@intel.com/config)
compiler: m68k-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/996f3e2ac4dca054396d0f37ffe8ddb97fc4212f
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Yafang-Shao/mm-percpu-fix-incorrect-size-in-pcpu_obj_full_size/20230202-094352
        git checkout 996f3e2ac4dca054396d0f37ffe8ddb97fc4212f
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=m68k olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=m68k SHELL=/bin/bash kernel/bpf/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> kernel/bpf/memalloc.c:227:15: warning: no previous prototype for 'bpf_mem_cache_size' [-Wmissing-prototypes]
     227 | unsigned long bpf_mem_cache_size(struct bpf_mem_cache *c, void *obj)
         |               ^~~~~~~~~~~~~~~~~~


vim +/bpf_mem_cache_size +227 kernel/bpf/memalloc.c

   226	
 > 227	unsigned long bpf_mem_cache_size(struct bpf_mem_cache *c, void *obj)
   228	{
   229		unsigned long size;
   230	
   231		if (!obj)
   232			return 0;
   233	
   234		if (c->percpu_size) {
   235			size = percpu_size(((void **)obj)[1]);
   236			size += ksize(obj);
   237			return size;
   238		}
   239	
   240		return ksize(obj);
   241	}
   242	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 3/7] mm: vmalloc: introduce vsize()
  2023-02-02  1:41 ` [PATCH bpf-next 3/7] mm: vmalloc: introduce vsize() Yafang Shao
@ 2023-02-02 10:23   ` Christoph Hellwig
  2023-02-02 14:10     ` Yafang Shao
  0 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2023-02-02 10:23 UTC (permalink / raw)
  To: Yafang Shao
  Cc: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, tj, dennis, cl, akpm, penberg,
	rientjes, iamjoonsoo.kim, roman.gushchin, 42.hyeyoo, vbabka,
	urezki, linux-mm, bpf, Christoph Hellwig

On Thu, Feb 02, 2023 at 01:41:54AM +0000, Yafang Shao wrote:
> Introduce a helper to report full size of underlying allocation of a
> vmalloc'ed address.

What is the use case for it?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 3/7] mm: vmalloc: introduce vsize()
  2023-02-02 10:23   ` Christoph Hellwig
@ 2023-02-02 14:10     ` Yafang Shao
  0 siblings, 0 replies; 27+ messages in thread
From: Yafang Shao @ 2023-02-02 14:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, tj, dennis, cl, akpm, penberg,
	rientjes, iamjoonsoo.kim, roman.gushchin, 42.hyeyoo, vbabka,
	urezki, linux-mm, bpf

On Thu, Feb 2, 2023 at 6:23 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Feb 02, 2023 at 01:41:54AM +0000, Yafang Shao wrote:
> > Introduce a helper to report full size of underlying allocation of a
> > vmalloc'ed address.
>
> What is the use case for it?

The use case is in patch #4 and patch #7, to get the bpf memory usage
from the pointers.
#4: https://lore.kernel.org/bpf/20230202014158.19616-5-laoar.shao@gmail.com/T/#u
#7: https://lore.kernel.org/bpf/20230202014158.19616-8-laoar.shao@gmail.com/T/#u

I forgot to Cc you the full patchset. Sorry about that.
The full patchset is at
https://lore.kernel.org/bpf/20230202014158.19616-1-laoar.shao@gmail.com/

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 6/7] bpf: introduce bpf_mem_alloc_size()
  2023-02-02  4:53   ` kernel test robot
@ 2023-02-02 14:11     ` Yafang Shao
  0 siblings, 0 replies; 27+ messages in thread
From: Yafang Shao @ 2023-02-02 14:11 UTC (permalink / raw)
  To: kernel test robot
  Cc: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, tj, dennis, cl, akpm, penberg,
	rientjes, iamjoonsoo.kim, roman.gushchin, 42.hyeyoo, vbabka,
	urezki, oe-kbuild-all, linux-mm, bpf

On Thu, Feb 2, 2023 at 12:54 PM kernel test robot <lkp@intel.com> wrote:
>
> Hi Yafang,
>
> Thank you for the patch! Perhaps something to improve:
>
> [auto build test WARNING on bpf-next/master]
>
> url:    https://github.com/intel-lab-lkp/linux/commits/Yafang-Shao/mm-percpu-fix-incorrect-size-in-pcpu_obj_full_size/20230202-094352
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
> patch link:    https://lore.kernel.org/r/20230202014158.19616-7-laoar.shao%40gmail.com
> patch subject: [PATCH bpf-next 6/7] bpf: introduce bpf_mem_alloc_size()
> config: m68k-allyesconfig (https://download.01.org/0day-ci/archive/20230202/202302021258.By6JZK71-lkp@intel.com/config)
> compiler: m68k-linux-gcc (GCC) 12.1.0
> reproduce (this is a W=1 build):
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # https://github.com/intel-lab-lkp/linux/commit/996f3e2ac4dca054396d0f37ffe8ddb97fc4212f
>         git remote add linux-review https://github.com/intel-lab-lkp/linux
>         git fetch --no-tags linux-review Yafang-Shao/mm-percpu-fix-incorrect-size-in-pcpu_obj_full_size/20230202-094352
>         git checkout 996f3e2ac4dca054396d0f37ffe8ddb97fc4212f
>         # save the config file
>         mkdir build_dir && cp config build_dir/.config
>         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=m68k olddefconfig
>         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=m68k SHELL=/bin/bash kernel/bpf/
>
> If you fix the issue, kindly add following tag where applicable
> | Reported-by: kernel test robot <lkp@intel.com>
>
> All warnings (new ones prefixed by >>):
>
> >> kernel/bpf/memalloc.c:227:15: warning: no previous prototype for 'bpf_mem_cache_size' [-Wmissing-prototypes]
>      227 | unsigned long bpf_mem_cache_size(struct bpf_mem_cache *c, void *obj)
>          |               ^~~~~~~~~~~~~~~~~~

Should be defined with 'static'.
Thanks for the report. Will update it in the next version.

>
>
> vim +/bpf_mem_cache_size +227 kernel/bpf/memalloc.c
>
>    226
>  > 227  unsigned long bpf_mem_cache_size(struct bpf_mem_cache *c, void *obj)
>    228  {
>    229          unsigned long size;
>    230
>    231          if (!obj)
>    232                  return 0;
>    233
>    234          if (c->percpu_size) {
>    235                  size = percpu_size(((void **)obj)[1]);
>    236                  size += ksize(obj);
>    237                  return size;
>    238          }
>    239
>    240          return ksize(obj);
>    241  }
>    242
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests



-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 2/7] mm: percpu: introduce percpu_size()
  2023-02-02  1:41 ` [PATCH bpf-next 2/7] mm: percpu: introduce percpu_size() Yafang Shao
@ 2023-02-02 14:32   ` Christoph Lameter
  2023-02-02 15:01     ` Yafang Shao
  0 siblings, 1 reply; 27+ messages in thread
From: Christoph Lameter @ 2023-02-02 14:32 UTC (permalink / raw)
  To: Yafang Shao
  Cc: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, tj, dennis, akpm, penberg, rientjes,
	iamjoonsoo.kim, roman.gushchin, 42.hyeyoo, vbabka, urezki,
	linux-mm, bpf

On Thu, 2 Feb 2023, Yafang Shao wrote:

> +	bits = end - bit_off;
> +	size = bits * PCPU_MIN_ALLOC_SIZE;
> +
> +	return pcpu_obj_full_size(size);

Don't you have to multiply by the number of online cpus? The per cpu
areas are duplicated for those.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 2/7] mm: percpu: introduce percpu_size()
  2023-02-02 14:32   ` Christoph Lameter
@ 2023-02-02 15:01     ` Yafang Shao
  0 siblings, 0 replies; 27+ messages in thread
From: Yafang Shao @ 2023-02-02 15:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, tj, dennis, akpm, penberg, rientjes,
	iamjoonsoo.kim, roman.gushchin, 42.hyeyoo, vbabka, urezki,
	linux-mm, bpf

On Thu, Feb 2, 2023 at 10:32 PM Christoph Lameter <cl@gentwo.de> wrote:
>
> On Thu, 2 Feb 2023, Yafang Shao wrote:
>
> > +     bits = end - bit_off;
> > +     size = bits * PCPU_MIN_ALLOC_SIZE;
> > +
> > +     return pcpu_obj_full_size(size);
>
> Dont you have to multiply by the number of online cpus? The per cpu area
> are duplicated for those.

It is multiplied in pcpu_obj_full_size().

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH bpf-next 7/7] bpf: hashtab memory usage
  2023-02-02  1:41 ` [PATCH bpf-next 7/7] bpf: hashtab memory usage Yafang Shao
@ 2023-02-04  2:01   ` John Fastabend
  2023-02-05  3:55     ` Yafang Shao
  2023-02-05 22:14   ` Cong Wang
  1 sibling, 1 reply; 27+ messages in thread
From: John Fastabend @ 2023-02-04  2:01 UTC (permalink / raw)
  To: Yafang Shao, ast, daniel, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, dennis, cl,
	akpm, penberg, rientjes, iamjoonsoo.kim, roman.gushchin,
	42.hyeyoo, vbabka, urezki
  Cc: linux-mm, bpf, Yafang Shao

Yafang Shao wrote:
> Get htab memory usage from the htab pointers we have allocated. Some
> small pointers are ignored as their size are quite small compared with
> the total size.
> 
> The result as follows,
> - before this change
> 1: hash  name count_map  flags 0x0  <<<< prealloc
>         key 16B  value 24B  max_entries 1048576  memlock 41943040B
> 2: hash  name count_map  flags 0x1  <<<< non prealloc, fully set
>         key 16B  value 24B  max_entries 1048576  memlock 41943040B
> 3: hash  name count_map  flags 0x1  <<<< non prealloc, non set
>         key 16B  value 24B  max_entries 1048576  memlock 41943040B
> 
> The memlock is always a fixed number whatever it is preallocated or
> not, and whatever the allocated elements number is.
> 
> - after this change
> 1: hash  name count_map  flags 0x0  <<<< prealloc
>         key 16B  value 24B  max_entries 1048576  memlock 109064464B
> 2: hash  name count_map  flags 0x1  <<<< non prealloc, fully set
>         key 16B  value 24B  max_entries 1048576  memlock 117464320B
> 3: hash  name count_map  flags 0x1  <<<< non prealloc, non set
>         key 16B  value 24B  max_entries 1048576  memlock 16797952B
> 
> The memlock now is hashtab actually allocated.
> 
> At worst, the difference can be 10x, for example,
> - before this change
> 4: hash  name count_map  flags 0x0
>         key 4B  value 4B  max_entries 1048576  memlock 8388608B
> 
> - after this change
> 4: hash  name count_map  flags 0x0
>         key 4B  value 4B  max_entries 1048576  memlock 83898640B
> 

This walks the entire map and buckets to get the size? Inside an rcu
critical section as well, it seems :/

What am I missing? If you know how many elements are added (which
you can track on map updates), why can't we just calculate the
memory size directly from this?

Thanks,
John

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH bpf-next 0/7] bpf, mm: bpf memory usage
  2023-02-02  1:41 [PATCH bpf-next 0/7] bpf, mm: bpf memory usage Yafang Shao
                   ` (6 preceding siblings ...)
  2023-02-02  1:41 ` [PATCH bpf-next 7/7] bpf: hashtab memory usage Yafang Shao
@ 2023-02-04  2:15 ` John Fastabend
  2023-02-05  4:03   ` Yafang Shao
  7 siblings, 1 reply; 27+ messages in thread
From: John Fastabend @ 2023-02-04  2:15 UTC (permalink / raw)
  To: Yafang Shao, ast, daniel, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, dennis, cl,
	akpm, penberg, rientjes, iamjoonsoo.kim, roman.gushchin,
	42.hyeyoo, vbabka, urezki
  Cc: linux-mm, bpf, Yafang Shao

Yafang Shao wrote:
> Currently we can't get bpf memory usage reliably. bpftool now shows the
> bpf memory footprint, which is difference with bpf memory usage. The
> difference can be quite great between the footprint showed in bpftool
> and the memory actually allocated by bpf in some cases, for example,
> 
> - non-preallocated bpf map
>   The non-preallocated bpf map memory usage is dynamically changed. The
>   allocated elements count can be from 0 to the max entries. But the
>   memory footprint in bpftool only shows a fixed number.
> - bpf metadata consumes more memory than bpf element 
>   In some corner cases, the bpf metadata can consumes a lot more memory
>   than bpf element consumes. For example, it can happen when the element
>   size is quite small.

Just following up slightly on previous comment.

The metadata should be fixed and knowable, correct? What I'm getting at
is whether this can be calculated directly instead of through a BPF helper
and walking the entire map.

> 
> We need a way to get the bpf memory usage especially there will be more
> and more bpf programs running on the production environment and thus the
> bpf memory usage is not trivial.

In our environments we track map usage so we always know how many entries
are in a map. I don't think we use this to calculate memory footprint
at the moment, just for map usage. It seems, though, that once you have
this, calculating the memory footprint can be done out of band because
the element and overhead costs are fixed.

> 
> This patchset introduces a new map ops ->map_mem_usage to get the memory
> usage. In this ops, the memory usage is got from the pointers which is
> already allocated by a bpf map. To make the code simple, we igore some
> small pointers as their size are quite small compared with the total
> usage.
> 
> In order to get the memory size from the pointers, some generic mm helpers
> are introduced firstly, for example, percpu_size(), vsize() and kvsize(). 
> 
> This patchset only implements the bpf memory usage for hashtab. I will
> extend it to other maps and bpf progs (bpf progs can dynamically allocate
> memory via bpf_obj_new()) in the future.

My preference would be to calculate this out of band. Walking a
large map and doing it in a critical section to get the memory
usage seems suboptimal.

> 
> The detailed result can be found in patch #7.
> 
> Patch #1~#4: Generic mm helpers
> Patch #5   : Introduce new ops
> Patch #6   : Helpers for bpf_mem_alloc
> Patch #7   : hashtab memory usage
> 
> Future works:
> - extend it to other maps
> - extend it to bpf prog
> - per-container bpf memory usage 
> 
> Historical discussions,
> - RFC PATCH v1 mm, bpf: Add BPF into /proc/meminfo
>   https://lwn.net/Articles/917647/  
> - RFC PATCH v2 mm, bpf: Add BPF into /proc/meminfo
>   https://lwn.net/Articles/919848/
> 
> Yafang Shao (7):
>   mm: percpu: fix incorrect size in pcpu_obj_full_size()
>   mm: percpu: introduce percpu_size()
>   mm: vmalloc: introduce vsize()
>   mm: util: introduce kvsize()
>   bpf: add new map ops ->map_mem_usage
>   bpf: introduce bpf_mem_alloc_size()
>   bpf: hashtab memory usage
> 
>  include/linux/bpf.h           |  2 ++
>  include/linux/bpf_mem_alloc.h |  2 ++
>  include/linux/percpu.h        |  1 +
>  include/linux/slab.h          |  1 +
>  include/linux/vmalloc.h       |  1 +
>  kernel/bpf/hashtab.c          | 80 ++++++++++++++++++++++++++++++++++++++++++-
>  kernel/bpf/memalloc.c         | 70 +++++++++++++++++++++++++++++++++++++
>  kernel/bpf/syscall.c          | 18 ++++++----
>  mm/percpu-internal.h          |  4 ++-
>  mm/percpu.c                   | 35 +++++++++++++++++++
>  mm/util.c                     | 15 ++++++++
>  mm/vmalloc.c                  | 17 +++++++++
>  12 files changed, 237 insertions(+), 9 deletions(-)
> 
> -- 
> 1.8.3.1
> 



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 7/7] bpf: hashtab memory usage
  2023-02-04  2:01   ` John Fastabend
@ 2023-02-05  3:55     ` Yafang Shao
  2023-02-08  1:56       ` Alexei Starovoitov
  0 siblings, 1 reply; 27+ messages in thread
From: Yafang Shao @ 2023-02-05  3:55 UTC (permalink / raw)
  To: John Fastabend
  Cc: ast, daniel, andrii, kafai, songliubraving, yhs, kpsingh, sdf,
	haoluo, jolsa, tj, dennis, cl, akpm, penberg, rientjes,
	iamjoonsoo.kim, roman.gushchin, 42.hyeyoo, vbabka, urezki,
	linux-mm, bpf

On Sat, Feb 4, 2023 at 10:01 AM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Yafang Shao wrote:
> > Get htab memory usage from the htab pointers we have allocated. Some
> > small pointers are ignored as their size are quite small compared with
> > the total size.
> >
> > The result as follows,
> > - before this change
> > 1: hash  name count_map  flags 0x0  <<<< prealloc
> >         key 16B  value 24B  max_entries 1048576  memlock 41943040B
> > 2: hash  name count_map  flags 0x1  <<<< non prealloc, fully set
> >         key 16B  value 24B  max_entries 1048576  memlock 41943040B
> > 3: hash  name count_map  flags 0x1  <<<< non prealloc, non set
> >         key 16B  value 24B  max_entries 1048576  memlock 41943040B
> >
> > The memlock is always a fixed number whatever it is preallocated or
> > not, and whatever the allocated elements number is.
> >
> > - after this change
> > 1: hash  name count_map  flags 0x0  <<<< prealloc
> >         key 16B  value 24B  max_entries 1048576  memlock 109064464B
> > 2: hash  name count_map  flags 0x1  <<<< non prealloc, fully set
> >         key 16B  value 24B  max_entries 1048576  memlock 117464320B
> > 3: hash  name count_map  flags 0x1  <<<< non prealloc, non set
> >         key 16B  value 24B  max_entries 1048576  memlock 16797952B
> >
> > The memlock now is hashtab actually allocated.
> >
> > At worst, the difference can be 10x, for example,
> > - before this change
> > 4: hash  name count_map  flags 0x0
> >         key 4B  value 4B  max_entries 1048576  memlock 8388608B
> >
> > - after this change
> > 4: hash  name count_map  flags 0x0
> >         key 4B  value 4B  max_entries 1048576  memlock 83898640B
> >
>
> This walks the entire map and buckets to get the size? Inside a
> rcu critical section as well :/ it seems.
>

No, it doesn't walk the entire map and buckets, but just gets one
random element.
So it won't be a problem to do it inside an rcu critical section.

> What am I missing, if you know how many elements are added (which
> you can track on map updates) how come we can't just calculate the
> memory size directly from this?
>

It is less accurate and hard to understand. Take a non-preallocated
percpu hashtab for example. The size can be calculated as follows,
    key_size = round_up(htab->map.key_size, 8);
    value_size = round_up(htab->map.value_size, 8);
    pcpu_meta_size = sizeof(struct llist_node) + sizeof(void *);
    usage = ((value_size * num_possible_cpus() +\
                    pcpu_meta_size + key_size) * max_entries

That is quite unfriendly to newbies, and may be error-prone.
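
For instance, assuming 8 possible CPUs and 64-bit pointers (so
pcpu_meta_size is 16B), the count_map above (key 16B, value 24B,
max_entries 1048576) works out to (24 * 8 + 16 + 16) * 1048576, roughly
224 MiB, and getting any of those per-map-type constants wrong silently
skews the result.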

Furthermore, it is less accurate because of the underlying memory
allocation in the MM subsystem.
Now we can get a more accurate usage with little overhead. Why not do it?

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 0/7] bpf, mm: bpf memory usage
  2023-02-04  2:15 ` [PATCH bpf-next 0/7] bpf, mm: bpf " John Fastabend
@ 2023-02-05  4:03   ` Yafang Shao
  2023-02-07  0:48     ` Ho-Ren Chuang
  2023-02-07  0:53     ` Ho-Ren Chuang
  0 siblings, 2 replies; 27+ messages in thread
From: Yafang Shao @ 2023-02-05  4:03 UTC (permalink / raw)
  To: John Fastabend
  Cc: ast, daniel, andrii, kafai, songliubraving, yhs, kpsingh, sdf,
	haoluo, jolsa, tj, dennis, cl, akpm, penberg, rientjes,
	iamjoonsoo.kim, roman.gushchin, 42.hyeyoo, vbabka, urezki,
	linux-mm, bpf

On Sat, Feb 4, 2023 at 10:15 AM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Yafang Shao wrote:
> > Currently we can't get bpf memory usage reliably. bpftool now shows the
> > bpf memory footprint, which is difference with bpf memory usage. The
> > difference can be quite great between the footprint showed in bpftool
> > and the memory actually allocated by bpf in some cases, for example,
> >
> > - non-preallocated bpf map
> >   The non-preallocated bpf map memory usage is dynamically changed. The
> >   allocated elements count can be from 0 to the max entries. But the
> >   memory footprint in bpftool only shows a fixed number.
> > - bpf metadata consumes more memory than bpf element
> >   In some corner cases, the bpf metadata can consumes a lot more memory
> >   than bpf element consumes. For example, it can happen when the element
> >   size is quite small.
>
> Just following up slightly on previous comment.
>
> The metadata should be fixed and knowable correct?

The metadata of BPF itself is fixed, but the metadata of the MM
allocation depends on the kernel configuration.

> What I'm getting at
> is if this can be calculated directly instead of through a BPF helper
> and walking the entire map.
>

As I explained in another thread, it doesn't walk the entire map.

> >
> > We need a way to get the bpf memory usage especially there will be more
> > and more bpf programs running on the production environment and thus the
> > bpf memory usage is not trivial.
>
> In our environments we track map usage so we always know how many entries
> are in a map. I don't think we use this to calculate memory footprint
> at the moment, but just for map usage. Seems though once you have this
> calculating memory footprint can be done out of band because element
> and overheads costs are fixed.
>
> >
> > This patchset introduces a new map ops ->map_mem_usage to get the memory
> > usage. In this ops, the memory usage is got from the pointers which is
> > already allocated by a bpf map. To make the code simple, we igore some
> > small pointers as their size are quite small compared with the total
> > usage.
> >
> > In order to get the memory size from the pointers, some generic mm helpers
> > are introduced firstly, for example, percpu_size(), vsize() and kvsize().
> >
> > This patchset only implements the bpf memory usage for hashtab. I will
> > extend it to other maps and bpf progs (bpf progs can dynamically allocate
> > memory via bpf_obj_new()) in the future.
>
> My preference would be to calculate this out of band. Walking a
> large map and doing it in a critical section to get the memory
> usage seems not optimal
>

I don't quite understand what you mean by calculating it out of band.
This patchset introduces a BPF helper which is used by bpftool, so it
is already out of band, right?
We should do it in bpftool, because sysadmins want a generic way
to get the system-wide bpf memory usage.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 7/7] bpf: hashtab memory usage
  2023-02-02  1:41 ` [PATCH bpf-next 7/7] bpf: hashtab memory usage Yafang Shao
  2023-02-04  2:01   ` John Fastabend
@ 2023-02-05 22:14   ` Cong Wang
  2023-02-06 11:52     ` Yafang Shao
  1 sibling, 1 reply; 27+ messages in thread
From: Cong Wang @ 2023-02-05 22:14 UTC (permalink / raw)
  To: Yafang Shao
  Cc: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, tj, dennis, cl, akpm, penberg,
	rientjes, iamjoonsoo.kim, roman.gushchin, 42.hyeyoo, vbabka,
	urezki, linux-mm, bpf

On Thu, Feb 02, 2023 at 01:41:58AM +0000, Yafang Shao wrote:
> Get htab memory usage from the htab pointers we have allocated. Some
> small pointers are ignored as their size are quite small compared with
> the total size.
> 
> The result as follows,
> - before this change
> 1: hash  name count_map  flags 0x0  <<<< prealloc
>         key 16B  value 24B  max_entries 1048576  memlock 41943040B
> 2: hash  name count_map  flags 0x1  <<<< non prealloc, fully set
>         key 16B  value 24B  max_entries 1048576  memlock 41943040B
> 3: hash  name count_map  flags 0x1  <<<< non prealloc, non set
>         key 16B  value 24B  max_entries 1048576  memlock 41943040B
> 
> The memlock is always a fixed number whatever it is preallocated or
> not, and whatever the allocated elements number is.
> 
> - after this change
> 1: hash  name count_map  flags 0x0  <<<< prealloc
>         key 16B  value 24B  max_entries 1048576  memlock 109064464B
> 2: hash  name count_map  flags 0x1  <<<< non prealloc, fully set
>         key 16B  value 24B  max_entries 1048576  memlock 117464320B
> 3: hash  name count_map  flags 0x1  <<<< non prealloc, non set
>         key 16B  value 24B  max_entries 1048576  memlock 16797952B
> 
> The memlock now is hashtab actually allocated.
> 
> At worst, the difference can be 10x, for example,
> - before this change
> 4: hash  name count_map  flags 0x0
>         key 4B  value 4B  max_entries 1048576  memlock 8388608B
> 
> - after this change
> 4: hash  name count_map  flags 0x0
>         key 4B  value 4B  max_entries 1048576  memlock 83898640B
> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  kernel/bpf/hashtab.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++++++-

What about other maps like regular array map?

Thanks.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 7/7] bpf: hashtab memory usage
  2023-02-05 22:14   ` Cong Wang
@ 2023-02-06 11:52     ` Yafang Shao
  0 siblings, 0 replies; 27+ messages in thread
From: Yafang Shao @ 2023-02-06 11:52 UTC (permalink / raw)
  To: Cong Wang
  Cc: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, tj, dennis, cl, akpm, penberg,
	rientjes, iamjoonsoo.kim, roman.gushchin, 42.hyeyoo, vbabka,
	urezki, linux-mm, bpf

On Mon, Feb 6, 2023 at 6:14 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Thu, Feb 02, 2023 at 01:41:58AM +0000, Yafang Shao wrote:
> > Get htab memory usage from the htab pointers we have allocated. Some
> > small pointers are ignored as their size are quite small compared with
> > the total size.
> >
> > The result as follows,
> > - before this change
> > 1: hash  name count_map  flags 0x0  <<<< prealloc
> >         key 16B  value 24B  max_entries 1048576  memlock 41943040B
> > 2: hash  name count_map  flags 0x1  <<<< non prealloc, fully set
> >         key 16B  value 24B  max_entries 1048576  memlock 41943040B
> > 3: hash  name count_map  flags 0x1  <<<< non prealloc, non set
> >         key 16B  value 24B  max_entries 1048576  memlock 41943040B
> >
> > The memlock is always a fixed number whatever it is preallocated or
> > not, and whatever the allocated elements number is.
> >
> > - after this change
> > 1: hash  name count_map  flags 0x0  <<<< prealloc
> >         key 16B  value 24B  max_entries 1048576  memlock 109064464B
> > 2: hash  name count_map  flags 0x1  <<<< non prealloc, fully set
> >         key 16B  value 24B  max_entries 1048576  memlock 117464320B
> > 3: hash  name count_map  flags 0x1  <<<< non prealloc, non set
> >         key 16B  value 24B  max_entries 1048576  memlock 16797952B
> >
> > The memlock now is hashtab actually allocated.
> >
> > At worst, the difference can be 10x, for example,
> > - before this change
> > 4: hash  name count_map  flags 0x0
> >         key 4B  value 4B  max_entries 1048576  memlock 8388608B
> >
> > - after this change
> > 4: hash  name count_map  flags 0x0
> >         key 4B  value 4B  max_entries 1048576  memlock 83898640B
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  kernel/bpf/hashtab.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>
> What about other maps like regular array map?
>

I haven't finished the work to support all bpf maps.

Most maps have a fixed number of entries, so we can get the usage directly
from a map pointer, or get the element size and then calculate it, for
example,
- arraymap usage
  size = kvsize(array);
- percpu arraymap usage
  size = kvsize(array);
  size += percpu_size(array->pptrs[0]) * array->map.max_entries;

But there are special cases like cgroup_storage, whose max_entries
is 0, for example,
122: cgroup_storage  flags 0x0
        key 16B  value 8B  max_entries 0  memlock 0B

For this case, we should calculate the count of entries first.
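
For illustration, a rough sketch of the array map case above, assuming the
kvsize()/percpu_size() helpers introduced earlier in this series (the
function name and exact signature here are hypothetical):

/* Hypothetical sketch, not part of the posted patches. */
static u64 array_map_mem_usage(const struct bpf_map *map)
{
        struct bpf_array *array = container_of(map, struct bpf_array, map);
        u64 usage = kvsize(array);      /* struct bpf_array plus the value area */

        if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY)
                usage += (u64)percpu_size(array->pptrs[0]) * map->max_entries;

        return usage;
}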

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 0/7] bpf, mm: bpf memory usage
  2023-02-05  4:03   ` Yafang Shao
@ 2023-02-07  0:48     ` Ho-Ren Chuang
  2023-02-07  7:02       ` Yafang Shao
  2023-02-07  0:53     ` Ho-Ren Chuang
  1 sibling, 1 reply; 27+ messages in thread
From: Ho-Ren Chuang @ 2023-02-07  0:48 UTC (permalink / raw)
  To: Yafang Shao
  Cc: John Fastabend, ast, daniel, andrii, kafai, songliubraving, yhs,
	kpsingh, sdf, haoluo, jolsa, tj, dennis, cl, akpm, penberg,
	rientjes, iamjoonsoo.kim, roman.gushchin, 42.hyeyoo, vbabka,
	urezki, linux-mm, bpf, hao.xiang, yifeima, Xiaoning Ding,
	horenchuang

Hi Yafang and everyone,

We've proposed very similar features at
https://lore.kernel.org/bpf/CAAYibXgiCOOEY9NvLXbY4ve7pH8xWrZjnczrj6SHy3x_TtOU1g@mail.gmail.com/#t

We are very excited to see that we are not the only ones eager to have this
feature upstream to monitor eBPF maps' actual usage. This shows the need
for having such an ability in eBPF.


Regarding the use cases, please also check
https://lore.kernel.org/all/CAADnVQLBt0snxv4bKwg1WKQ9wDFbaDCtZ03v1-LjOTYtsKPckQ@mail.gmail.com/#t
We are developing an app to monitor the memory footprint of eBPF
programs/maps, similar to the Linux `top` command.

Thank you,

On Sat, Feb 4, 2023 at 8:03 PM Yafang Shao <laoar.shao@gmail.com> wrote:

> On Sat, Feb 4, 2023 at 10:15 AM John Fastabend <john.fastabend@gmail.com>
> wrote:
> >
> > Yafang Shao wrote:
> > > Currently we can't get bpf memory usage reliably. bpftool now shows the
> > > bpf memory footprint, which is difference with bpf memory usage. The
> > > difference can be quite great between the footprint showed in bpftool
> > > and the memory actually allocated by bpf in some cases, for example,
> > >
> > > - non-preallocated bpf map
> > >   The non-preallocated bpf map memory usage is dynamically changed. The
> > >   allocated elements count can be from 0 to the max entries. But the
> > >   memory footprint in bpftool only shows a fixed number.
> > > - bpf metadata consumes more memory than bpf element
> > >   In some corner cases, the bpf metadata can consumes a lot more memory
> > >   than bpf element consumes. For example, it can happen when the
> element
> > >   size is quite small.
> >
> > Just following up slightly on previous comment.
> >
> > The metadata should be fixed and knowable correct?
>
> The metadata of BPF itself is fixed, but the metadata of the MM allocation
> depends on the kernel configuration.
>
> > What I'm getting at
> > is if this can be calculated directly instead of through a BPF helper
> > and walking the entire map.
> >
>
> As I explained in another thread, it doesn't walk the entire map.
>
> > >
> > > We need a way to get the bpf memory usage especially there will be more
> > > and more bpf programs running on the production environment and thus
> the
> > > bpf memory usage is not trivial.
> >
> > In our environments we track map usage so we always know how many entries
> > are in a map. I don't think we use this to calculate memory footprint
> > at the moment, but just for map usage. Seems though once you have this
> > calculating memory footprint can be done out of band because element
> > and overheads costs are fixed.
> >
> > >
> > > This patchset introduces a new map ops ->map_mem_usage to get the
> memory
> > > usage. In this ops, the memory usage is got from the pointers which is
> > > already allocated by a bpf map. To make the code simple, we igore some
> > > small pointers as their size are quite small compared with the total
> > > usage.
> > >
> > > In order to get the memory size from the pointers, some generic mm
> helpers
> > > are introduced firstly, for example, percpu_size(), vsize() and
> kvsize().
> > >
> > > This patchset only implements the bpf memory usage for hashtab. I will
> > > extend it to other maps and bpf progs (bpf progs can dynamically
> allocate
> > > memory via bpf_obj_new()) in the future.
> >
> > My preference would be to calculate this out of band. Walking a
> > large map and doing it in a critical section to get the memory
> > usage seems not optimal
> >
>
> I don't quite understand what you mean by calculating it out of band.
> This patchset introduces a BPF helper which is used in bpftool, so it
> is already out of band, right ?
> We should do it in bpftool, because the sys admin wants a generic way
> to get the system-wide bpf memory usage.
>
> --
> Regards
> Yafang
>


-- 
Best regards,
Ho-Ren (Jack) Chuang
莊賀任
1(540)449-9833

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 0/7] bpf, mm: bpf memory usage
  2023-02-05  4:03   ` Yafang Shao
  2023-02-07  0:48     ` Ho-Ren Chuang
@ 2023-02-07  0:53     ` Ho-Ren Chuang
  1 sibling, 0 replies; 27+ messages in thread
From: Ho-Ren Chuang @ 2023-02-07  0:53 UTC (permalink / raw)
  To: Yafang Shao
  Cc: John Fastabend, ast, daniel, andrii, kafai, songliubraving, yhs,
	kpsingh, sdf, haoluo, jolsa, tj, dennis, cl, akpm, penberg,
	rientjes, iamjoonsoo.kim, roman.gushchin, 42.hyeyoo, vbabka,
	urezki, linux-mm, bpf, hao.xiang, yifeima, Xiaoning Ding,
	horenchuang

Hi Yafang and everyone,

We've proposed very similar features at
https://lore.kernel.org/bpf/CAAYibXgiCOOEY9NvLXbY4ve7pH8xWrZjnczrj6SHy3x_TtOU1g@mail.gmail.com/#t

We are very excited to see that we are not the only ones eager to have this
feature upstream to monitor eBPF maps' actual usage. This shows the
need for having such an ability in eBPF.

Regarding the use cases, please also check
https://lore.kernel.org/all/CAADnVQLBt0snxv4bKwg1WKQ9wDFbaDCtZ03v1-LjOTYtsKPckQ@mail.gmail.com/#t
We are developing an app to monitor the memory footprint of eBPF
programs/maps, similar to the Linux `top` command.

Thank you,


On Sat, Feb 4, 2023 at 8:03 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Sat, Feb 4, 2023 at 10:15 AM John Fastabend <john.fastabend@gmail.com> wrote:
> >
> > Yafang Shao wrote:
> > > Currently we can't get bpf memory usage reliably. bpftool now shows the
> > > bpf memory footprint, which is difference with bpf memory usage. The
> > > difference can be quite great between the footprint showed in bpftool
> > > and the memory actually allocated by bpf in some cases, for example,
> > >
> > > - non-preallocated bpf map
> > >   The non-preallocated bpf map memory usage is dynamically changed. The
> > >   allocated elements count can be from 0 to the max entries. But the
> > >   memory footprint in bpftool only shows a fixed number.
> > > - bpf metadata consumes more memory than bpf element
> > >   In some corner cases, the bpf metadata can consumes a lot more memory
> > >   than bpf element consumes. For example, it can happen when the element
> > >   size is quite small.
> >
> > Just following up slightly on previous comment.
> >
> > The metadata should be fixed and knowable correct?
>
> The metadata of BPF itself is fixed, but the metadata of the MM allocation
> depends on the kernel configuration.
>
> > What I'm getting at
> > is if this can be calculated directly instead of through a BPF helper
> > and walking the entire map.
> >
>
> As I explained in another thread, it doesn't walk the entire map.
>
> > >
> > > We need a way to get the bpf memory usage especially there will be more
> > > and more bpf programs running on the production environment and thus the
> > > bpf memory usage is not trivial.
> >
> > In our environments we track map usage so we always know how many entries
> > are in a map. I don't think we use this to calculate memory footprint
> > at the moment, but just for map usage. Seems though once you have this
> > calculating memory footprint can be done out of band because element
> > and overheads costs are fixed.
> >
> > >
> > > This patchset introduces a new map ops ->map_mem_usage to get the memory
> > > usage. In this ops, the memory usage is got from the pointers which is
> > > already allocated by a bpf map. To make the code simple, we igore some
> > > small pointers as their size are quite small compared with the total
> > > usage.
> > >
> > > In order to get the memory size from the pointers, some generic mm helpers
> > > are introduced firstly, for example, percpu_size(), vsize() and kvsize().
> > >
> > > This patchset only implements the bpf memory usage for hashtab. I will
> > > extend it to other maps and bpf progs (bpf progs can dynamically allocate
> > > memory via bpf_obj_new()) in the future.
> >
> > My preference would be to calculate this out of band. Walking a
> > large map and doing it in a critical section to get the memory
> > usage seems not optimal
> >
>
> I don't quite understand what you mean by calculating it out of band.
> This patchset introduces a BPF helper which is used in bpftool, so it
> is already out of band, right ?
> We should do it in bpftool, because the sys admin wants a generic way
> to get the system-wide bpf memory usage.
>
> --
> Regards
> Yafang



-- 
Best regards,
Ho-Ren (Jack) Chuang
莊賀任

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 0/7] bpf, mm: bpf memory usage
  2023-02-07  0:48     ` Ho-Ren Chuang
@ 2023-02-07  7:02       ` Yafang Shao
  0 siblings, 0 replies; 27+ messages in thread
From: Yafang Shao @ 2023-02-07  7:02 UTC (permalink / raw)
  To: Ho-Ren Chuang
  Cc: John Fastabend, ast, daniel, andrii, kafai, songliubraving, yhs,
	kpsingh, sdf, haoluo, jolsa, tj, dennis, cl, akpm, penberg,
	rientjes, iamjoonsoo.kim, roman.gushchin, 42.hyeyoo, vbabka,
	urezki, linux-mm, bpf, hao.xiang, yifeima, Xiaoning Ding,
	horenchuang

On Tue, Feb 7, 2023 at 8:49 AM Ho-Ren Chuang <horenc@vt.edu> wrote:
>
> Hi Yafang and everyone,
>
> We've proposed very similar features at https://lore.kernel.org/bpf/CAAYibXgiCOOEY9NvLXbY4ve7pH8xWrZjnczrj6SHy3x_TtOU1g@mail.gmail.com/#t
>

I have looked through your patchset. Maybe we can use max_entries to
show the used_entries for a preallocated hashtab? Because for a
preallocated hashtab the memory is already allocated, it doesn't
matter how many entries are in use now. Then we can avoid the runtime
overhead that Alexei is worried about.
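
A minimal sketch of that idea, assuming the element counter that
non-preallocated hashtabs already keep (the helper itself is hypothetical):

/* Hypothetical: report "used" entries without walking the table. */
static u32 htab_used_entries(const struct bpf_htab *htab)
{
        /* A preallocated htab allocates every element up front, so
         * reporting max_entries avoids any runtime counting overhead.
         */
        if (htab_is_prealloc(htab))
                return htab->map.max_entries;

        return atomic_read(&htab->count);
}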

>
> We are very excited to see that we are not the only ones eager to have this feature upstream to monitor eBPF maps' actual usage. This shows the need for having such an ability in eBPF.
>

Happy to hear that this feature could help you.
I think over time there will be more users who want to monitor the bpf
memory usage :)

>
> Regarding the use cases, please also check https://lore.kernel.org/all/CAADnVQLBt0snxv4bKwg1WKQ9wDFbaDCtZ03v1-LjOTYtsKPckQ@mail.gmail.com/#t . We are developing an app to monitor the memory footprint of eBPF programs/maps, similar to the Linux `top` command.
>
>
> Thank you,
>

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 7/7] bpf: hashtab memory usage
  2023-02-05  3:55     ` Yafang Shao
@ 2023-02-08  1:56       ` Alexei Starovoitov
  2023-02-08  3:33         ` Yafang Shao
  0 siblings, 1 reply; 27+ messages in thread
From: Alexei Starovoitov @ 2023-02-08  1:56 UTC (permalink / raw)
  To: Yafang Shao
  Cc: John Fastabend, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Tejun Heo,
	dennis, Chris Lameter, Andrew Morton, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, Hyeonggon Yoo,
	Vlastimil Babka, urezki, linux-mm, bpf

On Sat, Feb 4, 2023 at 7:56 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Sat, Feb 4, 2023 at 10:01 AM John Fastabend <john.fastabend@gmail.com> wrote:
> >
> > Yafang Shao wrote:
> > > Get htab memory usage from the htab pointers we have allocated. Some
> > > small pointers are ignored as their size are quite small compared with
> > > the total size.
> > >
> > > The result as follows,
> > > - before this change
> > > 1: hash  name count_map  flags 0x0  <<<< prealloc
> > >         key 16B  value 24B  max_entries 1048576  memlock 41943040B
> > > 2: hash  name count_map  flags 0x1  <<<< non prealloc, fully set
> > >         key 16B  value 24B  max_entries 1048576  memlock 41943040B
> > > 3: hash  name count_map  flags 0x1  <<<< non prealloc, non set
> > >         key 16B  value 24B  max_entries 1048576  memlock 41943040B
> > >
> > > The memlock is always a fixed number whatever it is preallocated or
> > > not, and whatever the allocated elements number is.
> > >
> > > - after this change
> > > 1: hash  name count_map  flags 0x0  <<<< prealloc
> > >         key 16B  value 24B  max_entries 1048576  memlock 109064464B
> > > 2: hash  name count_map  flags 0x1  <<<< non prealloc, fully set
> > >         key 16B  value 24B  max_entries 1048576  memlock 117464320B
> > > 3: hash  name count_map  flags 0x1  <<<< non prealloc, non set
> > >         key 16B  value 24B  max_entries 1048576  memlock 16797952B
> > >
> > > The memlock now is hashtab actually allocated.
> > >
> > > At worst, the difference can be 10x, for example,
> > > - before this change
> > > 4: hash  name count_map  flags 0x0
> > >         key 4B  value 4B  max_entries 1048576  memlock 8388608B
> > >
> > > - after this change
> > > 4: hash  name count_map  flags 0x0
> > >         key 4B  value 4B  max_entries 1048576  memlock 83898640B
> > >
> >
> > This walks the entire map and buckets to get the size? Inside a
> > rcu critical section as well :/ it seems.
> >
>
> No, it doesn't walk the entire map and buckets, but just gets one
> random element.
> So it won't be a problem to do it inside a rcu critical section.
>
> > What am I missing, if you know how many elements are added (which
> > you can track on map updates) how come we can't just calculate the
> > memory size directly from this?
> >
>
> It is less accurate and hard to understand. Take non-preallocated
> percpu hashtab for example,
> The size can be calculated as follows,
>     key_size = round_up(htab->map.key_size, 8);
>     value_size = round_up(htab->map.value_size, 8);
>     pcpu_meta_size = sizeof(struct llist_node) + sizeof(void *);
>     usage = (value_size * num_possible_cpus() +
>              pcpu_meta_size + key_size) * max_entries;
>
> That is quite unfriendly to the newbies, and may be error-prone.

Please do that instead.
map_mem_usage callback is a no go as I mentioned earlier.

> Furthermore, it is less accurate because there is underlying memory
> allocation in the MM subsystem.
> Now we can get a more accurate usage with little overhead. Why not do it?

because htab_mem_usage() is not maintainable long term.
100% accuracy is a non-goal.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 7/7] bpf: hashtab memory usage
  2023-02-08  1:56       ` Alexei Starovoitov
@ 2023-02-08  3:33         ` Yafang Shao
  2023-02-08  4:29           ` Alexei Starovoitov
  0 siblings, 1 reply; 27+ messages in thread
From: Yafang Shao @ 2023-02-08  3:33 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: John Fastabend, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Tejun Heo,
	dennis, Chris Lameter, Andrew Morton, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, Hyeonggon Yoo,
	Vlastimil Babka, urezki, linux-mm, bpf

On Wed, Feb 8, 2023 at 9:56 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Sat, Feb 4, 2023 at 7:56 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Sat, Feb 4, 2023 at 10:01 AM John Fastabend <john.fastabend@gmail.com> wrote:
> > >
> > > Yafang Shao wrote:
> > > > Get htab memory usage from the htab pointers we have allocated. Some
> > > > small pointers are ignored as their size are quite small compared with
> > > > the total size.
> > > >
> > > > The result as follows,
> > > > - before this change
> > > > 1: hash  name count_map  flags 0x0  <<<< prealloc
> > > >         key 16B  value 24B  max_entries 1048576  memlock 41943040B
> > > > 2: hash  name count_map  flags 0x1  <<<< non prealloc, fully set
> > > >         key 16B  value 24B  max_entries 1048576  memlock 41943040B
> > > > 3: hash  name count_map  flags 0x1  <<<< non prealloc, non set
> > > >         key 16B  value 24B  max_entries 1048576  memlock 41943040B
> > > >
> > > > The memlock is always a fixed number whatever it is preallocated or
> > > > not, and whatever the allocated elements number is.
> > > >
> > > > - after this change
> > > > 1: hash  name count_map  flags 0x0  <<<< prealloc
> > > >         key 16B  value 24B  max_entries 1048576  memlock 109064464B
> > > > 2: hash  name count_map  flags 0x1  <<<< non prealloc, fully set
> > > >         key 16B  value 24B  max_entries 1048576  memlock 117464320B
> > > > 3: hash  name count_map  flags 0x1  <<<< non prealloc, non set
> > > >         key 16B  value 24B  max_entries 1048576  memlock 16797952B
> > > >
> > > > The memlock now is hashtab actually allocated.
> > > >
> > > > At worst, the difference can be 10x, for example,
> > > > - before this change
> > > > 4: hash  name count_map  flags 0x0
> > > >         key 4B  value 4B  max_entries 1048576  memlock 8388608B
> > > >
> > > > - after this change
> > > > 4: hash  name count_map  flags 0x0
> > > >         key 4B  value 4B  max_entries 1048576  memlock 83898640B
> > > >
> > >
> > > This walks the entire map and buckets to get the size? Inside a
> > > rcu critical section as well :/ it seems.
> > >
> >
> > No, it doesn't walk the entire map and buckets, but just gets one
> > random element.
> > So it won't be a problem to do it inside a rcu critical section.
> >
> > > What am I missing, if you know how many elements are added (which
> > > you can track on map updates) how come we can't just calculate the
> > > memory size directly from this?
> > >
> >
> > It is less accurate and hard to understand. Take non-preallocated
> > percpu hashtab for example,
> > The size can be calculated as follows,
> >     key_size = round_up(htab->map.key_size, 8);
> >     value_size = round_up(htab->map.value_size, 8);
> >     pcpu_meta_size = sizeof(struct llist_node) + sizeof(void *);
> >     usage = (value_size * num_possible_cpus() +
> >              pcpu_meta_size + key_size) * max_entries;
> >
> > That is quite unfriendly to the newbies, and may be error-prone.
>
> Please do that instead.

I can do it as you suggested, but it seems we shouldn't keep all
estimates in one place. Because ...

> map_mem_usage callback is a no go as I mentioned earlier.

...we have to introduce the map_mem_usage callback. Take the lpm_trie
for example, its usage is
usage = (sizeof(struct lpm_trie_node) + trie->data_size) * trie->n_entries;

I don't think we want  to declare struct lpm_trie_node in kernel/bpf/syscall.c.
WDYT ?
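
For illustration, a rough sketch of what keeping that estimate next to the
map implementation could look like (the callback name and signature are
assumptions here; the real hook is the ->map_mem_usage op from patch #5):

/* kernel/bpf/lpm_trie.c -- hypothetical sketch, not the posted patch */
static u64 trie_mem_usage(const struct bpf_map *map)
{
        struct lpm_trie *trie = container_of(map, struct lpm_trie, map);

        /* struct lpm_trie_node stays private to lpm_trie.c */
        return (u64)(sizeof(struct lpm_trie_node) + trie->data_size) *
               trie->n_entries;
}

It would then be wired up via .map_mem_usage = trie_mem_usage in
trie_map_ops, so syscall.c only calls the op and never sees the node layout.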

>
> > Furthermore, it is less accurate because there is underlying memory
> > allocation in the MM subsystem.
> > Now we can get a more accurate usage with little overhead. Why not do it?
>
> because htab_mem_usage() is not maintainable long term.
> 100% accuracy is a non-goal.

htab_mem_usage() gives us the option to continue making it more
accurate, albeit with considerable overhead.
But I won't insist on it if you think we don't need to make it more accurate.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 7/7] bpf: hashtab memory usage
  2023-02-08  3:33         ` Yafang Shao
@ 2023-02-08  4:29           ` Alexei Starovoitov
  2023-02-08 14:22             ` Yafang Shao
  0 siblings, 1 reply; 27+ messages in thread
From: Alexei Starovoitov @ 2023-02-08  4:29 UTC (permalink / raw)
  To: Yafang Shao
  Cc: John Fastabend, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Tejun Heo,
	dennis, Chris Lameter, Andrew Morton, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, Hyeonggon Yoo,
	Vlastimil Babka, urezki, linux-mm, bpf

On Tue, Feb 7, 2023 at 7:34 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Wed, Feb 8, 2023 at 9:56 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Sat, Feb 4, 2023 at 7:56 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Sat, Feb 4, 2023 at 10:01 AM John Fastabend <john.fastabend@gmail.com> wrote:
> > > >
> > > > Yafang Shao wrote:
> > > > > Get htab memory usage from the htab pointers we have allocated. Some
> > > > > small pointers are ignored as their size are quite small compared with
> > > > > the total size.
> > > > >
> > > > > The result as follows,
> > > > > - before this change
> > > > > 1: hash  name count_map  flags 0x0  <<<< prealloc
> > > > >         key 16B  value 24B  max_entries 1048576  memlock 41943040B
> > > > > 2: hash  name count_map  flags 0x1  <<<< non prealloc, fully set
> > > > >         key 16B  value 24B  max_entries 1048576  memlock 41943040B
> > > > > 3: hash  name count_map  flags 0x1  <<<< non prealloc, non set
> > > > >         key 16B  value 24B  max_entries 1048576  memlock 41943040B
> > > > >
> > > > > The memlock is always a fixed number whatever it is preallocated or
> > > > > not, and whatever the allocated elements number is.
> > > > >
> > > > > - after this change
> > > > > 1: hash  name count_map  flags 0x0  <<<< prealloc
> > > > >         key 16B  value 24B  max_entries 1048576  memlock 109064464B
> > > > > 2: hash  name count_map  flags 0x1  <<<< non prealloc, fully set
> > > > >         key 16B  value 24B  max_entries 1048576  memlock 117464320B
> > > > > 3: hash  name count_map  flags 0x1  <<<< non prealloc, non set
> > > > >         key 16B  value 24B  max_entries 1048576  memlock 16797952B
> > > > >
> > > > > The memlock now is hashtab actually allocated.
> > > > >
> > > > > At worst, the difference can be 10x, for example,
> > > > > - before this change
> > > > > 4: hash  name count_map  flags 0x0
> > > > >         key 4B  value 4B  max_entries 1048576  memlock 8388608B
> > > > >
> > > > > - after this change
> > > > > 4: hash  name count_map  flags 0x0
> > > > >         key 4B  value 4B  max_entries 1048576  memlock 83898640B
> > > > >
> > > >
> > > > This walks the entire map and buckets to get the size? Inside a
> > > > rcu critical section as well :/ it seems.
> > > >
> > >
> > > No, it doesn't walk the entire map and buckets, but just gets one
> > > random element.
> > > So it won't be a problem to do it inside a rcu critical section.
> > >
> > > > What am I missing, if you know how many elements are added (which
> > > > you can track on map updates) how come we can't just calculate the
> > > > memory size directly from this?
> > > >
> > >
> > > It is less accurate and hard to understand. Take non-preallocated
> > > percpu hashtab for example,
> > > The size can be calculated as follows,
> > >     key_size = round_up(htab->map.key_size, 8);
> > >     value_size = round_up(htab->map.value_size, 8);
> > >     pcpu_meta_size = sizeof(struct llist_node) + sizeof(void *);
> > >     usage = (value_size * num_possible_cpus() +
> > >              pcpu_meta_size + key_size) * max_entries;
> > >
> > > That is quite unfriendly to the newbies, and may be error-prone.
> >
> > Please do that instead.
>
> I can do it as you suggested, but it seems we shouldn't keep all
> estimates in one place. Because ...
>
> > map_mem_usage callback is a no go as I mentioned earlier.
>
> ...we have to introduce the map_mem_usage callback. Take the lpm_trie
> for example, its usage is
> usage = (sizeof(struct lpm_trie_node) + trie->data_size) * trie->n_entries;

sizeof(struct lpm_trie_node) + trie->data_size + trie->map.value_size.

and it won't count the inner nodes, but _it is ok_.

> I don't think we want  to declare struct lpm_trie_node in kernel/bpf/syscall.c.
> WDYT ?

Good point. Fine. Let's go with callback, but please keep it
to a single function without loops like htab_non_prealloc_elems_size()
and htab_prealloc_elems_size().

Also please implement it for all maps.
Doing it just for hash and arguing that every byte of accuracy matters
while not addressing lpm and other maps doesn't give credibility
to the accuracy argument.
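
For reference, a single, loop-free estimate along those lines might look
roughly like this (a sketch only; it leans on fields such as elem_size and
the element counter that hashtab.c already keeps, and it ignores the percpu
value areas):

/* Hypothetical sketch of a single, loop-free estimate. */
static u64 htab_map_mem_usage(const struct bpf_map *map)
{
        struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
        u32 nr_elems;
        u64 usage;

        usage = sizeof(*htab);
        usage += (u64)sizeof(struct bucket) * htab->n_buckets;

        nr_elems = htab_is_prealloc(htab) ? map->max_entries :
                   atomic_read(&htab->count);
        usage += (u64)htab->elem_size * nr_elems;

        return usage;
}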

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 7/7] bpf: hashtab memory usage
  2023-02-08  4:29           ` Alexei Starovoitov
@ 2023-02-08 14:22             ` Yafang Shao
  0 siblings, 0 replies; 27+ messages in thread
From: Yafang Shao @ 2023-02-08 14:22 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: John Fastabend, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Tejun Heo,
	dennis, Chris Lameter, Andrew Morton, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, Hyeonggon Yoo,
	Vlastimil Babka, urezki, linux-mm, bpf

On Wed, Feb 8, 2023 at 12:29 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Feb 7, 2023 at 7:34 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Wed, Feb 8, 2023 at 9:56 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Sat, Feb 4, 2023 at 7:56 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > On Sat, Feb 4, 2023 at 10:01 AM John Fastabend <john.fastabend@gmail.com> wrote:
> > > > >
> > > > > Yafang Shao wrote:
> > > > > > Get htab memory usage from the htab pointers we have allocated. Some
> > > > > > small pointers are ignored as their size are quite small compared with
> > > > > > the total size.
> > > > > >
> > > > > > The result as follows,
> > > > > > - before this change
> > > > > > 1: hash  name count_map  flags 0x0  <<<< prealloc
> > > > > >         key 16B  value 24B  max_entries 1048576  memlock 41943040B
> > > > > > 2: hash  name count_map  flags 0x1  <<<< non prealloc, fully set
> > > > > >         key 16B  value 24B  max_entries 1048576  memlock 41943040B
> > > > > > 3: hash  name count_map  flags 0x1  <<<< non prealloc, non set
> > > > > >         key 16B  value 24B  max_entries 1048576  memlock 41943040B
> > > > > >
> > > > > > The memlock is always a fixed number whatever it is preallocated or
> > > > > > not, and whatever the allocated elements number is.
> > > > > >
> > > > > > - after this change
> > > > > > 1: hash  name count_map  flags 0x0  <<<< prealloc
> > > > > >         key 16B  value 24B  max_entries 1048576  memlock 109064464B
> > > > > > 2: hash  name count_map  flags 0x1  <<<< non prealloc, fully set
> > > > > >         key 16B  value 24B  max_entries 1048576  memlock 117464320B
> > > > > > 3: hash  name count_map  flags 0x1  <<<< non prealloc, non set
> > > > > >         key 16B  value 24B  max_entries 1048576  memlock 16797952B
> > > > > >
> > > > > > The memlock now is hashtab actually allocated.
> > > > > >
> > > > > > At worst, the difference can be 10x, for example,
> > > > > > - before this change
> > > > > > 4: hash  name count_map  flags 0x0
> > > > > >         key 4B  value 4B  max_entries 1048576  memlock 8388608B
> > > > > >
> > > > > > - after this change
> > > > > > 4: hash  name count_map  flags 0x0
> > > > > >         key 4B  value 4B  max_entries 1048576  memlock 83898640B
> > > > > >
> > > > >
> > > > > This walks the entire map and buckets to get the size? Inside a
> > > > > rcu critical section as well :/ it seems.
> > > > >
> > > >
> > > > No, it doesn't walk the entire map and buckets, but just gets one
> > > > random element.
> > > > So it won't be a problem to do it inside a rcu critical section.
> > > >
> > > > > What am I missing, if you know how many elements are added (which
> > > > > you can track on map updates) how come we can't just calculate the
> > > > > memory size directly from this?
> > > > >
> > > >
> > > > It is less accurate and hard to understand. Take non-preallocated
> > > > percpu hashtab for example,
> > > > The size can be calculated as follows,
> > > >     key_size = round_up(htab->map.key_size, 8);
> > > >     value_size = round_up(htab->map.value_size, 8);
> > > >     pcpu_meta_size = sizeof(struct llist_node) + sizeof(void *);
> > > >     usage = (value_size * num_possible_cpus() +
> > > >              pcpu_meta_size + key_size) * max_entries;
> > > >
> > > > That is quite unfriendly to the newbies, and may be error-prone.
> > >
> > > Please do that instead.
> >
> > I can do it as you suggested, but it seems we shouldn't keep all
> > estimates in one place. Because ...
> >
> > > map_mem_usage callback is a no go as I mentioned earlier.
> >
> > ...we have to introduce the map_mem_usage callback. Take the lpm_trie
> > for example, its usage is
> > usage = (sizeof(struct lpm_trie_node) + trie->data_size) * trie->n_entries;
>
> sizeof(struct lpm_trie_node) + trie->data_size + trie->map.value_size.
>

Thanks for correcting it.

> and it won't count the inner nodes, but _it is ok_.
>
> > I don't think we want  to declare struct lpm_trie_node in kernel/bpf/syscall.c.
> > WDYT ?
>
> Good point. Fine. Let's go with callback, but please keep it
> to a single function without loops like htab_non_prealloc_elems_size()
> and htab_prealloc_elems_size().
>
> Also please implement it for all maps.

Sure, I will do it.

> Doing it just for hash and arguing that every byte of accuracy matters
> while not addressing lpm and other maps doesn't give credibility
> to the accuracy argument.



-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2023-02-08 14:22 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-02  1:41 [PATCH bpf-next 0/7] bpf, mm: bpf memory usage Yafang Shao
2023-02-02  1:41 ` [PATCH bpf-next 1/7] mm: percpu: fix incorrect size in pcpu_obj_full_size() Yafang Shao
2023-02-02  1:41 ` [PATCH bpf-next 2/7] mm: percpu: introduce percpu_size() Yafang Shao
2023-02-02 14:32   ` Christoph Lameter
2023-02-02 15:01     ` Yafang Shao
2023-02-02  1:41 ` [PATCH bpf-next 3/7] mm: vmalloc: introduce vsize() Yafang Shao
2023-02-02 10:23   ` Christoph Hellwig
2023-02-02 14:10     ` Yafang Shao
2023-02-02  1:41 ` [PATCH bpf-next 4/7] mm: util: introduce kvsize() Yafang Shao
2023-02-02  1:41 ` [PATCH bpf-next 5/7] bpf: add new map ops ->map_mem_usage Yafang Shao
2023-02-02  1:41 ` [PATCH bpf-next 6/7] bpf: introduce bpf_mem_alloc_size() Yafang Shao
2023-02-02  4:53   ` kernel test robot
2023-02-02 14:11     ` Yafang Shao
2023-02-02  1:41 ` [PATCH bpf-next 7/7] bpf: hashtab memory usage Yafang Shao
2023-02-04  2:01   ` John Fastabend
2023-02-05  3:55     ` Yafang Shao
2023-02-08  1:56       ` Alexei Starovoitov
2023-02-08  3:33         ` Yafang Shao
2023-02-08  4:29           ` Alexei Starovoitov
2023-02-08 14:22             ` Yafang Shao
2023-02-05 22:14   ` Cong Wang
2023-02-06 11:52     ` Yafang Shao
2023-02-04  2:15 ` [PATCH bpf-next 0/7] bpf, mm: bpf " John Fastabend
2023-02-05  4:03   ` Yafang Shao
2023-02-07  0:48     ` Ho-Ren Chuang
2023-02-07  7:02       ` Yafang Shao
2023-02-07  0:53     ` Ho-Ren Chuang
