linux-mm.kvack.org archive mirror
* [RFC PATCH bpf-next v2 00/11] mm, bpf: Add BPF into /proc/meminfo
@ 2023-01-12 15:53 Yafang Shao
  2023-01-12 15:53 ` [RFC PATCH bpf-next v2 01/11] mm: percpu: count memcg relevant memory only when kmemcg is enabled Yafang Shao
                   ` (11 more replies)
  0 siblings, 12 replies; 22+ messages in thread
From: Yafang Shao @ 2023-01-12 15:53 UTC (permalink / raw)
  To: 42.hyeyoo, vbabka, ast, daniel, andrii, kafai, songliubraving,
	yhs, john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, dennis, cl,
	akpm, penberg, rientjes, iamjoonsoo.kim, roman.gushchin
  Cc: linux-mm, bpf, Yafang Shao

Currently there's no reliable way to get the BPF memory usage. We can
only estimate it with bpftool or memcg, and neither is accurate.

- bpftool
  `bpftool {map,prog} show` can show us the memlock of each map and
  prog, but the memlock value differs from the real memory size. The
  memlock of a bpf object is approximately
  `round_up(key_size + value_size, 8) * max_entries`,
  so 1) it can't apply to non-preallocated bpf maps, whose real memory
  size may increase or decrease dynamically, and 2) the element size of
  some bpf maps is not `key_size + value_size`; for example the element
  size of htab is
  `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`
  The difference between these two values may therefore be very large
  when key_size and value_size are small (see the sketch after this
  list). For example, in my verification, the memlock and the real
  memory size of a preallocated hash map are,

  $ grep BPF /proc/meminfo
  BPF:                 350 kB  <<< the size of preallocated memalloc pool

  (create hash map)

  $ bpftool map show
  41549: hash  name count_map  flags 0x0
        key 4B  value 4B  max_entries 1048576  memlock 8388608B

  $ grep BPF /proc/meminfo
  BPF:               82284 kB
 
  So the real memory size is $((82284 - 350)) = 81934 kB, while the
  memlock is only 8192 kB.

- memcg
  With memcg we only know that the BPF memory usage is less than
  memory.kmem.usage_in_bytes (or memory.current in v2). Worse, if the
  BPF objects are charged into the root memcg, we only know that the
  BPF memory usage is less than $MemTotal :)
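
To illustrate the gap, below is a small userspace sketch of the two
formulas; the struct htab_elem size is an assumption (it depends on the
kernel config) and the bucket array is not counted, so it only roughly
matches the numbers above.

  /*
   * Userspace sketch comparing the memlock estimate with the real
   * per-element memory of the preallocated hash map above (key 4B,
   * value 4B, 1048576 entries). HTAB_ELEM_SIZE is an assumed value;
   * the real sizeof(struct htab_elem) depends on the kernel config,
   * and the bucket array is not counted here.
   */
  #include <stdio.h>

  #define ROUND_UP(x, a)  (((x) + (a) - 1) / (a) * (a))
  #define HTAB_ELEM_SIZE  48UL    /* assumption, config dependent */

  int main(void)
  {
          unsigned long key_size = 4, value_size = 4, max_entries = 1048576;
          unsigned long memlock, elem_mem;

          memlock = ROUND_UP(key_size + value_size, 8) * max_entries;
          elem_mem = (HTAB_ELEM_SIZE + ROUND_UP(key_size, 8) +
                      ROUND_UP(value_size, 8)) * max_entries;

          printf("memlock estimate: %lu kB\n", memlock >> 10); /* 8192 kB */
          printf("element memory:   %lu kB\n", elem_mem >> 10); /* 65536 kB */
          return 0;
  }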

So we need a way to get the BPF memory usage, especially as more and
more bpf programs are running in production environments. The BPF
memory usage is not trivial, and it deserves a new item in
/proc/meminfo.

There are several ways to calculate the BPF memory usage, each with
pros and cons.

- Option 1: Annotate BPF memory allocation only
  This is how I implemented it in RFC v1. You can find the details and
  the discussion via the link below[1].
  - pros
    We only need to annotate the BPF memory allocation, and then the
    allocated memory can be found automatically in the free path. So it
    is very easy to use, and we don't need to worry about stat leaks.
  - cons
    We must store information about the allocated memory, in
    particular the allocated slab objects, so it takes extra memory. If
    we introduce a new member into struct page or add one to page_ext,
    it will take at least 0.2% of the total memory on a 64-bit system,
    which is not acceptable.
    One way to reduce this memory overhead is to introduce dynamic page
    extension, but that would take significant effort and may not be
    worth it.

- Option 2: Annotate both allocation and free
  This is similar to how I implemented it in an earlier version[2].
  - pros
    There's almost no memory overhead.
  - cons
    All the memory allocations and frees must use the BPF helpers; they
    can't use the generic helpers like kfree/vfree/free_percpu. So if
    the user forgets to use the helpers we introduce to allocate or
    free BPF memory, there will be a stat leak.
    It is also not easy to annotate some deferred freeing, in particular
    kfree_rcu(), so the user has to use call_rcu() instead of
    kfree_rcu(). Another risk is that if we introduce other deferred
    free helpers in the future, this BPF statistic may break easily.
    
- Option 3: Calculate the memory size via the pointer
  This is how I implement it in this patchset.
  After allocating some BPF memory, we get the full size from the
  pointer and add it to the statistic; before freeing the BPF memory,
  we get the full size from the pointer and subtract it.
  - pros
    No memory overhead.
    No code churn in the MM core allocation and free paths.
    The implementation is quite clear and easy to maintain.
  - cons
    The calculation is not embedded in the MM allocation/free path, so
    some redundant code has to run to get the size from the pointer.
    BPF memory allocation and free must use the helpers we introduce,
    otherwise there will be a stat leak.

I prefer option 3. Its cons can be justified; a minimal sketch of the
pattern is shown below.
- bpf_map_free should be paired with bpf_map_alloc, which is reasonable.
- Regarding the possible extra cpu cycles it may take, the user should
  not allocate and free memory in the critical path if it is latency
  sensitive.
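
To make the pattern concrete, below is a minimal sketch of the
accounting wrappers, reusing the helper names introduced later in this
series (ksize_full(), bpf_mem_stat_add(), bpf_mem_stat_sub()); it is
not the exact implementation.

  /*
   * Minimal sketch of the option-3 accounting pattern, reusing the
   * helper names introduced later in this series. Not the exact
   * implementation.
   */
  #include <linux/slab.h>
  #include <linux/bpf.h>

  static void *bpf_alloc_tracked(size_t size, gfp_t flags)
  {
          void *ptr = kzalloc(size, flags);

          /* derive the accounted size from the pointer itself */
          if (ptr)
                  bpf_mem_stat_add(ksize_full(ptr));
          return ptr;
  }

  static void bpf_free_tracked(const void *ptr)
  {
          /* look the size up again right before freeing */
          if (ptr)
                  bpf_mem_stat_sub(ksize_full(ptr));
          kfree(ptr);
  }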

[1]. https://lwn.net/Articles/917647/
[2]. https://lore.kernel.org/linux-mm/20220921170002.29557-1-laoar.shao@gmail.com/

v1->v2: don't use page_ext (Vlastimil, Hyeonggon)

Yafang Shao (11):
  mm: percpu: count memcg relevant memory only when kmemcg is enabled
  mm: percpu: introduce percpu_size()
  mm: slab: rename obj_full_size()
  mm: slab: introduce ksize_full()
  mm: vmalloc: introduce vsize()
  mm: util: introduce kvsize()
  bpf: introduce new helpers bpf_ringbuf_pages_{alloc,free}
  bpf: use bpf_map_kzalloc in arraymap
  bpf: use bpf_map_kvcalloc in bpf_local_storage
  bpf: add and use bpf map free helpers
  bpf: introduce bpf memory statistics

 fs/proc/meminfo.c              |   4 ++
 include/linux/bpf.h            | 115 +++++++++++++++++++++++++++++++++++++++--
 include/linux/percpu.h         |   1 +
 include/linux/slab.h           |  10 ++++
 include/linux/vmalloc.h        |  15 ++++++
 kernel/bpf/arraymap.c          |  20 +++----
 kernel/bpf/bpf_cgrp_storage.c  |   2 +-
 kernel/bpf/bpf_inode_storage.c |   2 +-
 kernel/bpf/bpf_local_storage.c |  24 ++++-----
 kernel/bpf/bpf_task_storage.c  |   2 +-
 kernel/bpf/cpumap.c            |  13 +++--
 kernel/bpf/devmap.c            |  10 ++--
 kernel/bpf/hashtab.c           |   8 +--
 kernel/bpf/helpers.c           |   2 +-
 kernel/bpf/local_storage.c     |  12 ++---
 kernel/bpf/lpm_trie.c          |  14 ++---
 kernel/bpf/memalloc.c          |  19 ++++++-
 kernel/bpf/ringbuf.c           |  75 ++++++++++++++++++---------
 kernel/bpf/syscall.c           |  54 ++++++++++++++++++-
 mm/percpu-internal.h           |   4 +-
 mm/percpu.c                    |  35 +++++++++++++
 mm/slab.h                      |  19 ++++---
 mm/slab_common.c               |  52 +++++++++++++------
 mm/slob.c                      |   2 +-
 mm/util.c                      |  15 ++++++
 net/core/bpf_sk_storage.c      |   4 +-
 net/core/sock_map.c            |   2 +-
 net/xdp/xskmap.c               |   2 +-
 28 files changed, 422 insertions(+), 115 deletions(-)

-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC PATCH bpf-next v2 01/11] mm: percpu: count memcg relevant memory only when kmemcg is enabled
  2023-01-12 15:53 [RFC PATCH bpf-next v2 00/11] mm, bpf: Add BPF into /proc/meminfo Yafang Shao
@ 2023-01-12 15:53 ` Yafang Shao
  2023-01-12 15:53 ` [RFC PATCH bpf-next v2 02/11] mm: percpu: introduce percpu_size() Yafang Shao
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Yafang Shao @ 2023-01-12 15:53 UTC (permalink / raw)
  To: 42.hyeyoo, vbabka, ast, daniel, andrii, kafai, songliubraving,
	yhs, john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, dennis, cl,
	akpm, penberg, rientjes, iamjoonsoo.kim, roman.gushchin
  Cc: linux-mm, bpf, Yafang Shao, Vasily Averin

The extra space used to store the obj_cgroup membership is only valid
when kmemcg is enabled. kmemcg can be disabled via the kernel boot
parameter "cgroup.memory=nokmem".
The helper pcpu_obj_full_size() is also used in non-memcg code, for
example in the tracepoint, so we should fix it.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Vasily Averin <vvs@openvz.org>
---
 mm/percpu-internal.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 70b1ea2..2a95b1f 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -4,6 +4,7 @@
 
 #include <linux/types.h>
 #include <linux/percpu.h>
+#include <linux/memcontrol.h>
 
 /*
  * pcpu_block_md is the metadata block struct.
@@ -125,7 +126,8 @@ static inline size_t pcpu_obj_full_size(size_t size)
 	size_t extra_size = 0;
 
 #ifdef CONFIG_MEMCG_KMEM
-	extra_size += size / PCPU_MIN_ALLOC_SIZE * sizeof(struct obj_cgroup *);
+	if (!mem_cgroup_kmem_disabled())
+		extra_size += size / PCPU_MIN_ALLOC_SIZE * sizeof(struct obj_cgroup *);
 #endif
 
 	return size * num_possible_cpus() + extra_size;
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH bpf-next v2 02/11] mm: percpu: introduce percpu_size()
  2023-01-12 15:53 [RFC PATCH bpf-next v2 00/11] mm, bpf: Add BPF into /proc/meminfo Yafang Shao
  2023-01-12 15:53 ` [RFC PATCH bpf-next v2 01/11] mm: percpu: count memcg relevant memory only when kmemcg is enabled Yafang Shao
@ 2023-01-12 15:53 ` Yafang Shao
  2023-01-12 15:53 ` [RFC PATCH bpf-next v2 03/11] mm: slab: rename obj_full_size() Yafang Shao
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Yafang Shao @ 2023-01-12 15:53 UTC (permalink / raw)
  To: 42.hyeyoo, vbabka, ast, daniel, andrii, kafai, songliubraving,
	yhs, john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, dennis, cl,
	akpm, penberg, rientjes, iamjoonsoo.kim, roman.gushchin
  Cc: linux-mm, bpf, Yafang Shao

Introduce a new helper, percpu_size(), to report the full size of the
underlying allocation of a percpu address.
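
For illustration, a possible usage sketch (kernel context; the size and
alignment below are arbitrary and not part of this patch):

  /*
   * Illustrative usage of percpu_size(); the size/align values are
   * arbitrary. The reported size covers all per-CPU copies and, with
   * kmemcg enabled, the obj_cgroup bookkeeping space.
   */
  #include <linux/percpu.h>
  #include <linux/gfp.h>
  #include <linux/printk.h>

  static void percpu_size_example(void)
  {
          void __percpu *p = __alloc_percpu_gfp(16, 8, GFP_KERNEL);

          if (p)
                  pr_info("percpu_size=%zu\n", percpu_size(p));
          free_percpu(p);
  }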

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/percpu.h |  1 +
 mm/percpu.c            | 35 +++++++++++++++++++++++++++++++++++
 2 files changed, 36 insertions(+)

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 1338ea2..7be4234 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -137,5 +137,6 @@ extern int __init pcpu_page_first_chunk(size_t reserved_size,
 						__alignof__(type))
 
 extern unsigned long pcpu_nr_pages(void);
+extern size_t percpu_size(void __percpu *ptr);
 
 #endif /* __LINUX_PERCPU_H */
diff --git a/mm/percpu.c b/mm/percpu.c
index acd78da..5580688 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -2302,6 +2302,41 @@ void free_percpu(void __percpu *ptr)
 }
 EXPORT_SYMBOL_GPL(free_percpu);
 
+/**
+ * percpu_size - report full size of underlying allocation of percpu addr
+ * @ptr: pointer to percpu area
+ *
+ * CONTEXT:
+ * Can be called from atomic context.
+ */
+size_t percpu_size(void __percpu *ptr)
+{
+	int bit_off, bits, end, off, size;
+	struct pcpu_chunk *chunk;
+	unsigned long flags;
+	void *addr;
+
+	if (!ptr)
+		return 0;
+
+	addr = __pcpu_ptr_to_addr(ptr);
+
+	spin_lock_irqsave(&pcpu_lock, flags);
+	chunk = pcpu_chunk_addr_search(addr);
+	off = addr - chunk->base_addr;
+	bit_off = off / PCPU_MIN_ALLOC_SIZE;
+
+	/* find end index */
+	end = find_next_bit(chunk->bound_map, pcpu_chunk_map_bits(chunk),
+			    bit_off + 1);
+	spin_unlock_irqrestore(&pcpu_lock, flags);
+
+	bits = end - bit_off;
+	size = bits * PCPU_MIN_ALLOC_SIZE;
+
+	return pcpu_obj_full_size(size);
+}
+
 bool __is_kernel_percpu_address(unsigned long addr, unsigned long *can_addr)
 {
 #ifdef CONFIG_SMP
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH bpf-next v2 03/11] mm: slab: rename obj_full_size()
  2023-01-12 15:53 [RFC PATCH bpf-next v2 00/11] mm, bpf: Add BPF into /proc/meminfo Yafang Shao
  2023-01-12 15:53 ` [RFC PATCH bpf-next v2 01/11] mm: percpu: count memcg relevant memory only when kmemcg is enabled Yafang Shao
  2023-01-12 15:53 ` [RFC PATCH bpf-next v2 02/11] mm: percpu: introduce percpu_size() Yafang Shao
@ 2023-01-12 15:53 ` Yafang Shao
  2023-01-12 15:53 ` [RFC PATCH bpf-next v2 04/11] mm: slab: introduce ksize_full() Yafang Shao
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Yafang Shao @ 2023-01-12 15:53 UTC (permalink / raw)
  To: 42.hyeyoo, vbabka, ast, daniel, andrii, kafai, songliubraving,
	yhs, john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, dennis, cl,
	akpm, penberg, rientjes, iamjoonsoo.kim, roman.gushchin
  Cc: linux-mm, bpf, Yafang Shao

The name obj_full_size() is a little misleading, because the size it
reports is only valid when kmemcg is enabled. Since it is currently
only used when kmemcg is enabled, we just need to rename it to a more
meaningful name.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
---
 mm/slab.h | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/mm/slab.h b/mm/slab.h
index 7cc4329..35e0b3b 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -467,7 +467,10 @@ static inline void memcg_free_slab_cgroups(struct slab *slab)
 	slab->memcg_data = 0;
 }
 
-static inline size_t obj_full_size(struct kmem_cache *s)
+/*
+ * This helper is only valid when kmemcg isn't disabled.
+ */
+static inline size_t obj_kmemcg_size(struct kmem_cache *s)
 {
 	/*
 	 * For each accounted object there is an extra space which is used
@@ -508,7 +511,7 @@ static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s,
 			goto out;
 	}
 
-	if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s)))
+	if (obj_cgroup_charge(objcg, flags, objects * obj_kmemcg_size(s)))
 		goto out;
 
 	*objcgp = objcg;
@@ -537,7 +540,7 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 			if (!slab_objcgs(slab) &&
 			    memcg_alloc_slab_cgroups(slab, s, flags,
 							 false)) {
-				obj_cgroup_uncharge(objcg, obj_full_size(s));
+				obj_cgroup_uncharge(objcg, obj_kmemcg_size(s));
 				continue;
 			}
 
@@ -545,9 +548,9 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 			obj_cgroup_get(objcg);
 			slab_objcgs(slab)[off] = objcg;
 			mod_objcg_state(objcg, slab_pgdat(slab),
-					cache_vmstat_idx(s), obj_full_size(s));
+					cache_vmstat_idx(s), obj_kmemcg_size(s));
 		} else {
-			obj_cgroup_uncharge(objcg, obj_full_size(s));
+			obj_cgroup_uncharge(objcg, obj_kmemcg_size(s));
 		}
 	}
 	obj_cgroup_put(objcg);
@@ -576,9 +579,9 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
 			continue;
 
 		objcgs[off] = NULL;
-		obj_cgroup_uncharge(objcg, obj_full_size(s));
+		obj_cgroup_uncharge(objcg, obj_kmemcg_size(s));
 		mod_objcg_state(objcg, slab_pgdat(slab), cache_vmstat_idx(s),
-				-obj_full_size(s));
+				-obj_kmemcg_size(s));
 		obj_cgroup_put(objcg);
 	}
 }
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH bpf-next v2 04/11] mm: slab: introduce ksize_full()
  2023-01-12 15:53 [RFC PATCH bpf-next v2 00/11] mm, bpf: Add BPF into /proc/meminfo Yafang Shao
                   ` (2 preceding siblings ...)
  2023-01-12 15:53 ` [RFC PATCH bpf-next v2 03/11] mm: slab: rename obj_full_size() Yafang Shao
@ 2023-01-12 15:53 ` Yafang Shao
  2023-01-12 15:53 ` [RFC PATCH bpf-next v2 05/11] mm: vmalloc: introduce vsize() Yafang Shao
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Yafang Shao @ 2023-01-12 15:53 UTC (permalink / raw)
  To: 42.hyeyoo, vbabka, ast, daniel, andrii, kafai, songliubraving,
	yhs, john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, dennis, cl,
	akpm, penberg, rientjes, iamjoonsoo.kim, roman.gushchin
  Cc: linux-mm, bpf, Yafang Shao

When an object is charged to kmemcg, extra memory is allocated to
store the kmemcg ownership, so introduce a new helper that includes
this memory size.

The reason we introduce a new helper rather than changing the current
helper ksize() is that the allocation of the kmemcg ownership is a
nested allocation, independent of the original allocation. Some users
may rely on ksize() to get the layout of the slab object, so we'd
better not change it.
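
For illustration, a hedged usage sketch (kernel context; not part of
this patch):

  /*
   * Illustrative only: compare ksize() with ksize_full() for a charged
   * slab object. With kmemcg enabled and the object charged, the
   * difference is expected to be sizeof(struct obj_cgroup *); with
   * kmemcg disabled the two values match.
   */
  #include <linux/slab.h>
  #include <linux/printk.h>

  static void ksize_full_example(void)
  {
          void *p = kzalloc(32, GFP_KERNEL_ACCOUNT);

          if (p)
                  pr_info("ksize=%zu ksize_full=%zu\n", ksize(p), ksize_full(p));
          kfree(p);
  }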

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 mm/slab.h        |  2 +-
 mm/slab_common.c | 52 ++++++++++++++++++++++++++++++++++++----------------
 mm/slob.c        |  2 +-
 3 files changed, 38 insertions(+), 18 deletions(-)

diff --git a/mm/slab.h b/mm/slab.h
index 35e0b3b..e07ae90 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -681,7 +681,7 @@ static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
 
 #endif /* CONFIG_SLOB */
 
-size_t __ksize(const void *objp);
+size_t ___ksize(const void *objp, bool full);
 
 static inline size_t slab_ksize(const struct kmem_cache *s)
 {
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 1cba98a..4f1e2bc 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1021,21 +1021,11 @@ void kfree(const void *object)
 }
 EXPORT_SYMBOL(kfree);
 
-/**
- * __ksize -- Report full size of underlying allocation
- * @object: pointer to the object
- *
- * This should only be used internally to query the true size of allocations.
- * It is not meant to be a way to discover the usable size of an allocation
- * after the fact. Instead, use kmalloc_size_roundup(). Using memory beyond
- * the originally requested allocation size may trigger KASAN, UBSAN_BOUNDS,
- * and/or FORTIFY_SOURCE.
- *
- * Return: size of the actual memory used by @object in bytes
- */
-size_t __ksize(const void *object)
+size_t ___ksize(const void *object, bool full)
 {
+	size_t kmemcg_size = 0;
 	struct folio *folio;
+	struct slab *slab;
 
 	if (unlikely(object == ZERO_SIZE_PTR))
 		return 0;
@@ -1054,7 +1044,27 @@ size_t __ksize(const void *object)
 	skip_orig_size_check(folio_slab(folio)->slab_cache, object);
 #endif
 
-	return slab_ksize(folio_slab(folio)->slab_cache);
+	slab = folio_slab(folio);
+	if (memcg_kmem_enabled() && full && slab_objcgs(slab))
+		kmemcg_size = sizeof(struct obj_cgroup *);
+	return slab_ksize(slab->slab_cache) + kmemcg_size;
+}
+
+/**
+ * __ksize -- Report full size of underlying allocation
+ * @object: pointer to the object
+ *
+ * This should only be used internally to query the true size of allocations.
+ * It is not meant to be a way to discover the usable size of an allocation
+ * after the fact. Instead, use kmalloc_size_roundup(). Using memory beyond
+ * the originally requested allocation size may trigger KASAN, UBSAN_BOUNDS,
+ * and/or FORTIFY_SOURCE.
+ *
+ * Return: size of the actual memory used by @object in bytes
+ */
+size_t __ksize(const void *object)
+{
+	return ___ksize(object, false);
 }
 
 void *kmalloc_trace(struct kmem_cache *s, gfp_t gfpflags, size_t size)
@@ -1428,7 +1438,7 @@ void kfree_sensitive(const void *p)
 }
 EXPORT_SYMBOL(kfree_sensitive);
 
-size_t ksize(const void *objp)
+size_t _ksize(const void *objp, bool full)
 {
 	/*
 	 * We need to first check that the pointer to the object is valid.
@@ -1448,10 +1458,20 @@ size_t ksize(const void *objp)
 	if (unlikely(ZERO_OR_NULL_PTR(objp)) || !kasan_check_byte(objp))
 		return 0;
 
-	return kfence_ksize(objp) ?: __ksize(objp);
+	return kfence_ksize(objp) ?: ___ksize(objp, full);
 }
 EXPORT_SYMBOL(ksize);
 
+size_t ksize(const void *objp)
+{
+	return _ksize(objp, false);
+}
+
+size_t ksize_full(const void *objp)
+{
+	return _ksize(objp, true);
+}
+
 /* Tracepoints definitions. */
 EXPORT_TRACEPOINT_SYMBOL(kmalloc);
 EXPORT_TRACEPOINT_SYMBOL(kmem_cache_alloc);
diff --git a/mm/slob.c b/mm/slob.c
index fe567fcf..8c46bdc 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -579,7 +579,7 @@ size_t kmalloc_size_roundup(size_t size)
 EXPORT_SYMBOL(kmalloc_size_roundup);
 
 /* can't use ksize for kmem_cache_alloc memory, only kmalloc */
-size_t __ksize(const void *block)
+size_t ___ksize(const void *block, bool full)
 {
 	struct folio *folio;
 	unsigned int align;
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH bpf-next v2 05/11] mm: vmalloc: introduce vsize()
  2023-01-12 15:53 [RFC PATCH bpf-next v2 00/11] mm, bpf: Add BPF into /proc/meminfo Yafang Shao
                   ` (3 preceding siblings ...)
  2023-01-12 15:53 ` [RFC PATCH bpf-next v2 04/11] mm: slab: introduce ksize_full() Yafang Shao
@ 2023-01-12 15:53 ` Yafang Shao
  2023-01-12 15:53 ` [RFC PATCH bpf-next v2 06/11] mm: util: introduce kvsize() Yafang Shao
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Yafang Shao @ 2023-01-12 15:53 UTC (permalink / raw)
  To: 42.hyeyoo, vbabka, ast, daniel, andrii, kafai, songliubraving,
	yhs, john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, dennis, cl,
	akpm, penberg, rientjes, iamjoonsoo.kim, roman.gushchin
  Cc: linux-mm, bpf, Yafang Shao

Introduce a helper to report the full size of the underlying
allocation of a vmalloc'ed address.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/vmalloc.h | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 096d48a..52f925f 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -297,4 +297,19 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 static inline bool vmalloc_dump_obj(void *object) { return false; }
 #endif
 
+/* Report full size of underlying allocation of a vmalloc'ed addr */
+static inline size_t vsize(const void *addr)
+{
+	struct vm_struct *area;
+
+	if (!addr)
+		return 0;
+
+	area = find_vm_area(addr);
+	if (unlikely(!area))
+		return 0;
+
+	return area->size;
+}
+
 #endif /* _LINUX_VMALLOC_H */
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH bpf-next v2 06/11] mm: util: introduce kvsize()
  2023-01-12 15:53 [RFC PATCH bpf-next v2 00/11] mm, bpf: Add BPF into /proc/meminfo Yafang Shao
                   ` (4 preceding siblings ...)
  2023-01-12 15:53 ` [RFC PATCH bpf-next v2 05/11] mm: vmalloc: introduce vsize() Yafang Shao
@ 2023-01-12 15:53 ` Yafang Shao
  2023-01-12 15:53 ` [RFC PATCH bpf-next v2 07/11] bpf: introduce new helpers bpf_ringbuf_pages_{alloc,free} Yafang Shao
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Yafang Shao @ 2023-01-12 15:53 UTC (permalink / raw)
  To: 42.hyeyoo, vbabka, ast, daniel, andrii, kafai, songliubraving,
	yhs, john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, dennis, cl,
	akpm, penberg, rientjes, iamjoonsoo.kim, roman.gushchin
  Cc: linux-mm, bpf, Yafang Shao

Introduce a new helper, kvsize(), to report the full size of the
underlying allocation of a kmalloc'ed or vmalloc'ed address.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/slab.h | 10 ++++++++++
 mm/util.c            | 15 +++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 45af703..740ddf7 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -226,6 +226,15 @@ struct kmem_cache *kmem_cache_create_usercopy(const char *name,
  */
 size_t ksize(const void *objp);
 
+/**
+ * ksize_full - Report full size of each accounted objp
+ * @objp: pointer to the object
+ *
+ * The difference between ksize() and ksize_full() is that ksize_full()
+ * includes the extra space which is used to store obj_cgroup membership.
+ */
+size_t ksize_full(const void *objp);
+
 #ifdef CONFIG_PRINTK
 bool kmem_valid_obj(void *object);
 void kmem_dump_obj(void *object);
@@ -762,6 +771,7 @@ static inline __alloc_size(1, 2) void *kvcalloc(size_t n, size_t size, gfp_t fla
 
 extern void *kvrealloc(const void *p, size_t oldsize, size_t newsize, gfp_t flags)
 		      __realloc_size(3);
+extern size_t kvsize(const void *addr);
 extern void kvfree(const void *addr);
 extern void kvfree_sensitive(const void *addr, size_t len);
 
diff --git a/mm/util.c b/mm/util.c
index b56c92f..d6be4f94 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -610,6 +610,21 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
 EXPORT_SYMBOL(kvmalloc_node);
 
 /**
+ * kvsize() - Report full size of underlying allocation of addr
+ * @addr: Pointer to kmalloc'ed or vmalloc'ed memory
+ *
+ * kvsize reports full size of underlying allocation of a kmalloc'ed addr
+ * or a vmalloc'ed addr.
+ */
+size_t kvsize(const void *addr)
+{
+	if (is_vmalloc_addr(addr))
+		return vsize(addr);
+
+	return ksize_full(addr);
+}
+
+/**
  * kvfree() - Free memory.
  * @addr: Pointer to allocated memory.
  *
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH bpf-next v2 07/11] bpf: introduce new helpers bpf_ringbuf_pages_{alloc,free}
  2023-01-12 15:53 [RFC PATCH bpf-next v2 00/11] mm, bpf: Add BPF into /proc/meminfo Yafang Shao
                   ` (5 preceding siblings ...)
  2023-01-12 15:53 ` [RFC PATCH bpf-next v2 06/11] mm: util: introduce kvsize() Yafang Shao
@ 2023-01-12 15:53 ` Yafang Shao
  2023-01-12 15:53 ` [RFC PATCH bpf-next v2 08/11] bpf: use bpf_map_kzalloc in arraymap Yafang Shao
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Yafang Shao @ 2023-01-12 15:53 UTC (permalink / raw)
  To: 42.hyeyoo, vbabka, ast, daniel, andrii, kafai, songliubraving,
	yhs, john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, dennis, cl,
	akpm, penberg, rientjes, iamjoonsoo.kim, roman.gushchin
  Cc: linux-mm, bpf, Yafang Shao

Move the page-related allocations into a new helper,
bpf_ringbuf_pages_alloc(), so they can be handled as a single unit.

Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/bpf/ringbuf.c | 71 ++++++++++++++++++++++++++++++++++------------------
 1 file changed, 47 insertions(+), 24 deletions(-)

diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
index 80f4b4d..3264bf5 100644
--- a/kernel/bpf/ringbuf.c
+++ b/kernel/bpf/ringbuf.c
@@ -92,6 +92,48 @@ struct bpf_ringbuf_hdr {
 	u32 pg_off;
 };
 
+static void bpf_ringbuf_pages_free(struct page **pages, int nr_pages)
+{
+	int i;
+
+	for (i = 0; i < nr_pages; i++)
+		__free_page(pages[i]);
+	bpf_map_area_free(pages);
+}
+
+static struct page **bpf_ringbuf_pages_alloc(int nr_meta_pages,
+						int nr_data_pages, int numa_node,
+						const gfp_t flags)
+{
+	int nr_pages = nr_meta_pages + nr_data_pages;
+	struct page **pages, *page;
+	int array_size;
+	int i;
+
+	array_size = (nr_meta_pages + 2 * nr_data_pages) * sizeof(*pages);
+	pages = bpf_map_area_alloc(array_size, numa_node);
+	if (!pages)
+		goto err;
+
+	for (i = 0; i < nr_pages; i++) {
+		page = alloc_pages_node(numa_node, flags, 0);
+		if (!page) {
+			nr_pages = i;
+			goto err_free_pages;
+		}
+		pages[i] = page;
+		if (i >= nr_meta_pages)
+			pages[nr_data_pages + i] = page;
+	}
+
+	return pages;
+
+err_free_pages:
+	bpf_ringbuf_pages_free(pages, nr_pages);
+err:
+	return NULL;
+}
+
 static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node)
 {
 	const gfp_t flags = GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL |
@@ -99,10 +141,8 @@ static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node)
 	int nr_meta_pages = RINGBUF_PGOFF + RINGBUF_POS_PAGES;
 	int nr_data_pages = data_sz >> PAGE_SHIFT;
 	int nr_pages = nr_meta_pages + nr_data_pages;
-	struct page **pages, *page;
 	struct bpf_ringbuf *rb;
-	size_t array_size;
-	int i;
+	struct page **pages;
 
 	/* Each data page is mapped twice to allow "virtual"
 	 * continuous read of samples wrapping around the end of ring
@@ -121,22 +161,11 @@ static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node)
 	 * when mmap()'ed in user-space, simplifying both kernel and
 	 * user-space implementations significantly.
 	 */
-	array_size = (nr_meta_pages + 2 * nr_data_pages) * sizeof(*pages);
-	pages = bpf_map_area_alloc(array_size, numa_node);
+	pages = bpf_ringbuf_pages_alloc(nr_meta_pages, nr_data_pages,
+									numa_node, flags);
 	if (!pages)
 		return NULL;
 
-	for (i = 0; i < nr_pages; i++) {
-		page = alloc_pages_node(numa_node, flags, 0);
-		if (!page) {
-			nr_pages = i;
-			goto err_free_pages;
-		}
-		pages[i] = page;
-		if (i >= nr_meta_pages)
-			pages[nr_data_pages + i] = page;
-	}
-
 	rb = vmap(pages, nr_meta_pages + 2 * nr_data_pages,
 		  VM_MAP | VM_USERMAP, PAGE_KERNEL);
 	if (rb) {
@@ -146,10 +175,6 @@ static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node)
 		return rb;
 	}
 
-err_free_pages:
-	for (i = 0; i < nr_pages; i++)
-		__free_page(pages[i]);
-	bpf_map_area_free(pages);
 	return NULL;
 }
 
@@ -219,12 +244,10 @@ static void bpf_ringbuf_free(struct bpf_ringbuf *rb)
 	 * to unmap rb itself with vunmap() below
 	 */
 	struct page **pages = rb->pages;
-	int i, nr_pages = rb->nr_pages;
+	int nr_pages = rb->nr_pages;
 
 	vunmap(rb);
-	for (i = 0; i < nr_pages; i++)
-		__free_page(pages[i]);
-	bpf_map_area_free(pages);
+	bpf_ringbuf_pages_free(pages, nr_pages);
 }
 
 static void ringbuf_map_free(struct bpf_map *map)
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH bpf-next v2 08/11] bpf: use bpf_map_kzalloc in arraymap
  2023-01-12 15:53 [RFC PATCH bpf-next v2 00/11] mm, bpf: Add BPF into /proc/meminfo Yafang Shao
                   ` (6 preceding siblings ...)
  2023-01-12 15:53 ` [RFC PATCH bpf-next v2 07/11] bpf: introduce new helpers bpf_ringbuf_pages_{alloc,free} Yafang Shao
@ 2023-01-12 15:53 ` Yafang Shao
  2023-01-12 15:53 ` [RFC PATCH bpf-next v2 09/11] bpf: use bpf_map_kvcalloc in bpf_local_storage Yafang Shao
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Yafang Shao @ 2023-01-12 15:53 UTC (permalink / raw)
  To: 42.hyeyoo, vbabka, ast, daniel, andrii, kafai, songliubraving,
	yhs, john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, dennis, cl,
	akpm, penberg, rientjes, iamjoonsoo.kim, roman.gushchin
  Cc: linux-mm, bpf, Yafang Shao

Allocate the aux memory after map creation, so that we can use the
generic helper bpf_map_kzalloc() instead of the open-coded kzalloc().

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/arraymap.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 4847069..e64a417 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -1102,20 +1102,20 @@ static struct bpf_map *prog_array_map_alloc(union bpf_attr *attr)
 	struct bpf_array_aux *aux;
 	struct bpf_map *map;
 
-	aux = kzalloc(sizeof(*aux), GFP_KERNEL_ACCOUNT);
-	if (!aux)
+	map = array_map_alloc(attr);
+	if (IS_ERR(map))
 		return ERR_PTR(-ENOMEM);
 
+	aux = bpf_map_kzalloc(map, sizeof(*aux), GFP_KERNEL);
+	if (!aux) {
+		array_map_free(map);
+		return ERR_PTR(-ENOMEM);
+	}
+
 	INIT_WORK(&aux->work, prog_array_map_clear_deferred);
 	INIT_LIST_HEAD(&aux->poke_progs);
 	mutex_init(&aux->poke_mutex);
 
-	map = array_map_alloc(attr);
-	if (IS_ERR(map)) {
-		kfree(aux);
-		return map;
-	}
-
 	container_of(map, struct bpf_array, map)->aux = aux;
 	aux->map = map;
 
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH bpf-next v2 09/11] bpf: use bpf_map_kvcalloc in bpf_local_storage
  2023-01-12 15:53 [RFC PATCH bpf-next v2 00/11] mm, bpf: Add BPF into /proc/meminfo Yafang Shao
                   ` (7 preceding siblings ...)
  2023-01-12 15:53 ` [RFC PATCH bpf-next v2 08/11] bpf: use bpf_map_kzalloc in arraymap Yafang Shao
@ 2023-01-12 15:53 ` Yafang Shao
  2023-01-12 15:53 ` [RFC PATCH bpf-next v2 10/11] bpf: add and use bpf map free helpers Yafang Shao
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Yafang Shao @ 2023-01-12 15:53 UTC (permalink / raw)
  To: 42.hyeyoo, vbabka, ast, daniel, andrii, kafai, songliubraving,
	yhs, john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, dennis, cl,
	akpm, penberg, rientjes, iamjoonsoo.kim, roman.gushchin
  Cc: linux-mm, bpf, Yafang Shao

Introduce a new helper, bpf_map_kvcalloc(), and use it for the bucket
allocation in bpf_local_storage.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/bpf.h            |  8 ++++++++
 kernel/bpf/bpf_local_storage.c |  4 ++--
 kernel/bpf/syscall.c           | 15 +++++++++++++++
 3 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index ae7771c..fb14cc6 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1873,6 +1873,8 @@ int  generic_map_delete_batch(struct bpf_map *map,
 void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
 			   int node);
 void *bpf_map_kzalloc(const struct bpf_map *map, size_t size, gfp_t flags);
+void *bpf_map_kvcalloc(struct bpf_map *map, size_t n, size_t size,
+		       gfp_t flags);
 void __percpu *bpf_map_alloc_percpu(const struct bpf_map *map, size_t size,
 				    size_t align, gfp_t flags);
 #else
@@ -1889,6 +1891,12 @@ void __percpu *bpf_map_alloc_percpu(const struct bpf_map *map, size_t size,
 	return kzalloc(size, flags);
 }
 
+static inline void *
+bpf_map_kvcalloc(struct bpf_map *map, size_t n, size_t size, gfp_t flags)
+{
+	return kvcalloc(n, size, flags);
+}
+
 static inline void __percpu *
 bpf_map_alloc_percpu(const struct bpf_map *map, size_t size, size_t align,
 		     gfp_t flags)
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index 373c3c2..35f4138 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -568,8 +568,8 @@ static struct bpf_local_storage_map *__bpf_local_storage_map_alloc(union bpf_att
 	nbuckets = max_t(u32, 2, nbuckets);
 	smap->bucket_log = ilog2(nbuckets);
 
-	smap->buckets = kvcalloc(sizeof(*smap->buckets), nbuckets,
-				 GFP_USER | __GFP_NOWARN | __GFP_ACCOUNT);
+	smap->buckets = bpf_map_kvcalloc(&smap->map, sizeof(*smap->buckets),
+					 nbuckets, GFP_USER | __GFP_NOWARN);
 	if (!smap->buckets) {
 		bpf_map_area_free(smap);
 		return ERR_PTR(-ENOMEM);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 35ffd80..9e266e8 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -470,6 +470,21 @@ void *bpf_map_kzalloc(const struct bpf_map *map, size_t size, gfp_t flags)
 	return ptr;
 }
 
+void *bpf_map_kvcalloc(struct bpf_map *map, size_t n, size_t size,
+		       gfp_t flags)
+{
+	struct mem_cgroup *memcg, *old_memcg;
+	void *ptr;
+
+	memcg = bpf_map_get_memcg(map);
+	old_memcg = set_active_memcg(memcg);
+	ptr = kvcalloc(n, size, flags | __GFP_ACCOUNT);
+	set_active_memcg(old_memcg);
+	mem_cgroup_put(memcg);
+
+	return ptr;
+}
+
 void __percpu *bpf_map_alloc_percpu(const struct bpf_map *map, size_t size,
 				    size_t align, gfp_t flags)
 {
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH bpf-next v2 10/11] bpf: add and use bpf map free helpers
  2023-01-12 15:53 [RFC PATCH bpf-next v2 00/11] mm, bpf: Add BPF into /proc/meminfo Yafang Shao
                   ` (8 preceding siblings ...)
  2023-01-12 15:53 ` [RFC PATCH bpf-next v2 09/11] bpf: use bpf_map_kvcalloc in bpf_local_storage Yafang Shao
@ 2023-01-12 15:53 ` Yafang Shao
  2023-01-12 15:53 ` [RFC PATCH bpf-next v2 11/11] bpf: introduce bpf memory statistics Yafang Shao
  2023-01-12 21:05 ` [RFC PATCH bpf-next v2 00/11] mm, bpf: Add BPF into /proc/meminfo Alexei Starovoitov
  11 siblings, 0 replies; 22+ messages in thread
From: Yafang Shao @ 2023-01-12 15:53 UTC (permalink / raw)
  To: 42.hyeyoo, vbabka, ast, daniel, andrii, kafai, songliubraving,
	yhs, john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, dennis, cl,
	akpm, penberg, rientjes, iamjoonsoo.kim, roman.gushchin
  Cc: linux-mm, bpf, Yafang Shao

Introduce some new helpers to free bpf memory, instead of using the
generic free helpers directly. Then we can hook into these new helpers
to track the freeing of bpf memory in the future.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/bpf.h            | 19 +++++++++++++++++++
 kernel/bpf/arraymap.c          |  4 ++--
 kernel/bpf/bpf_cgrp_storage.c  |  2 +-
 kernel/bpf/bpf_inode_storage.c |  2 +-
 kernel/bpf/bpf_local_storage.c | 20 ++++++++++----------
 kernel/bpf/bpf_task_storage.c  |  2 +-
 kernel/bpf/cpumap.c            | 13 ++++++-------
 kernel/bpf/devmap.c            | 10 ++++++----
 kernel/bpf/hashtab.c           |  8 ++++----
 kernel/bpf/helpers.c           |  2 +-
 kernel/bpf/local_storage.c     | 12 ++++++------
 kernel/bpf/lpm_trie.c          | 14 +++++++-------
 net/core/bpf_sk_storage.c      |  4 ++--
 net/core/sock_map.c            |  2 +-
 net/xdp/xskmap.c               |  2 +-
 15 files changed, 68 insertions(+), 48 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index fb14cc6..17c218e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1869,6 +1869,24 @@ int  generic_map_delete_batch(struct bpf_map *map,
 struct bpf_map *bpf_map_get_curr_or_next(u32 *id);
 struct bpf_prog *bpf_prog_get_curr_or_next(u32 *id);
 
+
+static inline void bpf_map_kfree(const void *ptr)
+{
+	kfree(ptr);
+}
+
+static inline void bpf_map_kvfree(const void *ptr)
+{
+	kvfree(ptr);
+}
+
+static inline void bpf_map_free_percpu(void __percpu *ptr)
+{
+	free_percpu(ptr);
+}
+
+#define bpf_map_kfree_rcu(ptr, rhf...) kvfree_rcu(ptr, ## rhf)
+
 #ifdef CONFIG_MEMCG_KMEM
 void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
 			   int node);
@@ -1877,6 +1895,7 @@ void *bpf_map_kvcalloc(struct bpf_map *map, size_t n, size_t size,
 		       gfp_t flags);
 void __percpu *bpf_map_alloc_percpu(const struct bpf_map *map, size_t size,
 				    size_t align, gfp_t flags);
+
 #else
 static inline void *
 bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index e64a417..d218bf0 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -24,7 +24,7 @@ static void bpf_array_free_percpu(struct bpf_array *array)
 	int i;
 
 	for (i = 0; i < array->map.max_entries; i++) {
-		free_percpu(array->pptrs[i]);
+		bpf_map_free_percpu(array->pptrs[i]);
 		cond_resched();
 	}
 }
@@ -1132,7 +1132,7 @@ static void prog_array_map_free(struct bpf_map *map)
 		list_del_init(&elem->list);
 		kfree(elem);
 	}
-	kfree(aux);
+	bpf_map_kfree(aux);
 	fd_array_map_free(map);
 }
 
diff --git a/kernel/bpf/bpf_cgrp_storage.c b/kernel/bpf/bpf_cgrp_storage.c
index 6cdf6d9..d0ac4eb 100644
--- a/kernel/bpf/bpf_cgrp_storage.c
+++ b/kernel/bpf/bpf_cgrp_storage.c
@@ -64,7 +64,7 @@ void bpf_cgrp_storage_free(struct cgroup *cgroup)
 	rcu_read_unlock();
 
 	if (free_cgroup_storage)
-		kfree_rcu(local_storage, rcu);
+		bpf_map_kfree_rcu(local_storage, rcu);
 }
 
 static struct bpf_local_storage_data *
diff --git a/kernel/bpf/bpf_inode_storage.c b/kernel/bpf/bpf_inode_storage.c
index 05f4c66..0297f36 100644
--- a/kernel/bpf/bpf_inode_storage.c
+++ b/kernel/bpf/bpf_inode_storage.c
@@ -78,7 +78,7 @@ void bpf_inode_storage_free(struct inode *inode)
 	rcu_read_unlock();
 
 	if (free_inode_storage)
-		kfree_rcu(local_storage, rcu);
+		bpf_map_kfree_rcu(local_storage, rcu);
 }
 
 static void *bpf_fd_inode_storage_lookup_elem(struct bpf_map *map, void *key)
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index 35f4138..2fd79a1 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -93,9 +93,9 @@ void bpf_local_storage_free_rcu(struct rcu_head *rcu)
 	 */
 	local_storage = container_of(rcu, struct bpf_local_storage, rcu);
 	if (rcu_trace_implies_rcu_gp())
-		kfree(local_storage);
+		bpf_map_kfree(local_storage);
 	else
-		kfree_rcu(local_storage, rcu);
+		bpf_map_kfree_rcu(local_storage, rcu);
 }
 
 static void bpf_selem_free_rcu(struct rcu_head *rcu)
@@ -104,9 +104,9 @@ static void bpf_selem_free_rcu(struct rcu_head *rcu)
 
 	selem = container_of(rcu, struct bpf_local_storage_elem, rcu);
 	if (rcu_trace_implies_rcu_gp())
-		kfree(selem);
+		bpf_map_kfree(selem);
 	else
-		kfree_rcu(selem, rcu);
+		bpf_map_kfree_rcu(selem, rcu);
 }
 
 /* local_storage->lock must be held and selem->local_storage == local_storage.
@@ -162,7 +162,7 @@ static bool bpf_selem_unlink_storage_nolock(struct bpf_local_storage *local_stor
 	if (use_trace_rcu)
 		call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_rcu);
 	else
-		kfree_rcu(selem, rcu);
+		bpf_map_kfree_rcu(selem, rcu);
 
 	return free_local_storage;
 }
@@ -191,7 +191,7 @@ static void __bpf_selem_unlink_storage(struct bpf_local_storage_elem *selem,
 			call_rcu_tasks_trace(&local_storage->rcu,
 				     bpf_local_storage_free_rcu);
 		else
-			kfree_rcu(local_storage, rcu);
+			bpf_map_kfree_rcu(local_storage, rcu);
 	}
 }
 
@@ -358,7 +358,7 @@ int bpf_local_storage_alloc(void *owner,
 	return 0;
 
 uncharge:
-	kfree(storage);
+	bpf_map_kfree(storage);
 	mem_uncharge(smap, owner, sizeof(*storage));
 	return err;
 }
@@ -402,7 +402,7 @@ struct bpf_local_storage_data *
 
 		err = bpf_local_storage_alloc(owner, smap, selem, gfp_flags);
 		if (err) {
-			kfree(selem);
+			bpf_map_kfree(selem);
 			mem_uncharge(smap, owner, smap->elem_size);
 			return ERR_PTR(err);
 		}
@@ -496,7 +496,7 @@ struct bpf_local_storage_data *
 	raw_spin_unlock_irqrestore(&local_storage->lock, flags);
 	if (selem) {
 		mem_uncharge(smap, owner, smap->elem_size);
-		kfree(selem);
+		bpf_map_kfree(selem);
 	}
 	return ERR_PTR(err);
 }
@@ -713,6 +713,6 @@ void bpf_local_storage_map_free(struct bpf_map *map,
 	 */
 	synchronize_rcu();
 
-	kvfree(smap->buckets);
+	bpf_map_kvfree(smap->buckets);
 	bpf_map_area_free(smap);
 }
diff --git a/kernel/bpf/bpf_task_storage.c b/kernel/bpf/bpf_task_storage.c
index 1e48605..7287b02 100644
--- a/kernel/bpf/bpf_task_storage.c
+++ b/kernel/bpf/bpf_task_storage.c
@@ -91,7 +91,7 @@ void bpf_task_storage_free(struct task_struct *task)
 	rcu_read_unlock();
 
 	if (free_task_storage)
-		kfree_rcu(local_storage, rcu);
+		bpf_map_kfree_rcu(local_storage, rcu);
 }
 
 static void *bpf_pid_task_storage_lookup_elem(struct bpf_map *map, void *key)
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index e0b2d01..3470c13 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -164,8 +164,8 @@ static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
 		/* The queue should be empty at this point */
 		__cpu_map_ring_cleanup(rcpu->queue);
 		ptr_ring_cleanup(rcpu->queue, NULL);
-		kfree(rcpu->queue);
-		kfree(rcpu);
+		bpf_map_kfree(rcpu->queue);
+		bpf_map_kfree(rcpu);
 	}
 }
 
@@ -484,11 +484,11 @@ static int __cpu_map_load_bpf_program(struct bpf_cpu_map_entry *rcpu,
 free_ptr_ring:
 	ptr_ring_cleanup(rcpu->queue, NULL);
 free_queue:
-	kfree(rcpu->queue);
+	bpf_map_kfree(rcpu->queue);
 free_bulkq:
-	free_percpu(rcpu->bulkq);
+	bpf_map_free_percpu(rcpu->bulkq);
 free_rcu:
-	kfree(rcpu);
+	bpf_map_kfree(rcpu);
 	return NULL;
 }
 
@@ -502,8 +502,7 @@ static void __cpu_map_entry_free(struct rcu_head *rcu)
 	 * find this entry.
 	 */
 	rcpu = container_of(rcu, struct bpf_cpu_map_entry, rcu);
-
-	free_percpu(rcpu->bulkq);
+	bpf_map_free_percpu(rcpu->bulkq);
 	/* Cannot kthread_stop() here, last put free rcpu resources */
 	put_cpu_map_entry(rcpu);
 }
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index d01e4c5..be10c47 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -218,7 +218,7 @@ static void dev_map_free(struct bpf_map *map)
 				if (dev->xdp_prog)
 					bpf_prog_put(dev->xdp_prog);
 				dev_put(dev->dev);
-				kfree(dev);
+				bpf_map_kfree(dev);
 			}
 		}
 
@@ -234,7 +234,7 @@ static void dev_map_free(struct bpf_map *map)
 			if (dev->xdp_prog)
 				bpf_prog_put(dev->xdp_prog);
 			dev_put(dev->dev);
-			kfree(dev);
+			bpf_map_kfree(dev);
 		}
 
 		bpf_map_area_free(dtab->netdev_map);
@@ -791,12 +791,14 @@ static void *dev_map_hash_lookup_elem(struct bpf_map *map, void *key)
 static void __dev_map_entry_free(struct rcu_head *rcu)
 {
 	struct bpf_dtab_netdev *dev;
+	struct bpf_dtab *dtab;
 
 	dev = container_of(rcu, struct bpf_dtab_netdev, rcu);
 	if (dev->xdp_prog)
 		bpf_prog_put(dev->xdp_prog);
 	dev_put(dev->dev);
-	kfree(dev);
+	dtab = dev->dtab;
+	bpf_map_kfree(dev);
 }
 
 static int dev_map_delete_elem(struct bpf_map *map, void *key)
@@ -881,7 +883,7 @@ static struct bpf_dtab_netdev *__dev_map_alloc_node(struct net *net,
 err_put_dev:
 	dev_put(dev->dev);
 err_out:
-	kfree(dev);
+	bpf_map_kfree(dev);
 	return ERR_PTR(-EINVAL);
 }
 
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 5aa2b55..1047788 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -266,7 +266,7 @@ static void htab_free_elems(struct bpf_htab *htab)
 
 		pptr = htab_elem_get_ptr(get_htab_elem(htab, i),
 					 htab->map.key_size);
-		free_percpu(pptr);
+		bpf_map_free_percpu(pptr);
 		cond_resched();
 	}
 free_elems:
@@ -584,7 +584,7 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 	if (htab->use_percpu_counter)
 		percpu_counter_destroy(&htab->pcount);
 	for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
-		free_percpu(htab->map_locked[i]);
+		bpf_map_free_percpu(htab->map_locked[i]);
 	bpf_map_area_free(htab->buckets);
 	bpf_mem_alloc_destroy(&htab->pcpu_ma);
 	bpf_mem_alloc_destroy(&htab->ma);
@@ -1511,14 +1511,14 @@ static void htab_map_free(struct bpf_map *map)
 		prealloc_destroy(htab);
 	}
 
-	free_percpu(htab->extra_elems);
+	bpf_map_free_percpu(htab->extra_elems);
 	bpf_map_area_free(htab->buckets);
 	bpf_mem_alloc_destroy(&htab->pcpu_ma);
 	bpf_mem_alloc_destroy(&htab->ma);
 	if (htab->use_percpu_counter)
 		percpu_counter_destroy(&htab->pcount);
 	for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
-		free_percpu(htab->map_locked[i]);
+		bpf_map_free_percpu(htab->map_locked[i]);
 	lockdep_unregister_key(&htab->lockdep_key);
 	bpf_map_area_free(htab);
 }
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 458db2d..49b0040 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1383,7 +1383,7 @@ void bpf_timer_cancel_and_free(void *val)
 	 */
 	if (this_cpu_read(hrtimer_running) != t)
 		hrtimer_cancel(&t->timer);
-	kfree(t);
+	bpf_map_kfree(t);
 }
 
 BPF_CALL_2(bpf_kptr_xchg, void *, map_value, void *, ptr)
diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
index e90d9f6..ed5cf5b 100644
--- a/kernel/bpf/local_storage.c
+++ b/kernel/bpf/local_storage.c
@@ -174,7 +174,7 @@ static int cgroup_storage_update_elem(struct bpf_map *map, void *key,
 	check_and_init_map_value(map, new->data);
 
 	new = xchg(&storage->buf, new);
-	kfree_rcu(new, rcu);
+	bpf_map_kfree_rcu(new, rcu);
 
 	return 0;
 }
@@ -526,7 +526,7 @@ struct bpf_cgroup_storage *bpf_cgroup_storage_alloc(struct bpf_prog *prog,
 	return storage;
 
 enomem:
-	kfree(storage);
+	bpf_map_kfree(storage);
 	return ERR_PTR(-ENOMEM);
 }
 
@@ -535,8 +535,8 @@ static void free_shared_cgroup_storage_rcu(struct rcu_head *rcu)
 	struct bpf_cgroup_storage *storage =
 		container_of(rcu, struct bpf_cgroup_storage, rcu);
 
-	kfree(storage->buf);
-	kfree(storage);
+	bpf_map_kfree(storage->buf);
+	bpf_map_kfree(storage);
 }
 
 static void free_percpu_cgroup_storage_rcu(struct rcu_head *rcu)
@@ -544,8 +544,8 @@ static void free_percpu_cgroup_storage_rcu(struct rcu_head *rcu)
 	struct bpf_cgroup_storage *storage =
 		container_of(rcu, struct bpf_cgroup_storage, rcu);
 
-	free_percpu(storage->percpu_buf);
-	kfree(storage);
+	bpf_map_free_percpu(storage->percpu_buf);
+	bpf_map_kfree(storage);
 }
 
 void bpf_cgroup_storage_free(struct bpf_cgroup_storage *storage)
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index d833496..b2bee07 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -379,7 +379,7 @@ static int trie_update_elem(struct bpf_map *map,
 			trie->n_entries--;
 
 		rcu_assign_pointer(*slot, new_node);
-		kfree_rcu(node, rcu);
+		bpf_map_kfree_rcu(node, rcu);
 
 		goto out;
 	}
@@ -421,8 +421,8 @@ static int trie_update_elem(struct bpf_map *map,
 		if (new_node)
 			trie->n_entries--;
 
-		kfree(new_node);
-		kfree(im_node);
+		bpf_map_kfree(new_node);
+		bpf_map_kfree(im_node);
 	}
 
 	spin_unlock_irqrestore(&trie->lock, irq_flags);
@@ -503,8 +503,8 @@ static int trie_delete_elem(struct bpf_map *map, void *_key)
 		else
 			rcu_assign_pointer(
 				*trim2, rcu_access_pointer(parent->child[0]));
-		kfree_rcu(parent, rcu);
-		kfree_rcu(node, rcu);
+		bpf_map_kfree_rcu(parent, rcu);
+		bpf_map_kfree_rcu(node, rcu);
 		goto out;
 	}
 
@@ -518,7 +518,7 @@ static int trie_delete_elem(struct bpf_map *map, void *_key)
 		rcu_assign_pointer(*trim, rcu_access_pointer(node->child[1]));
 	else
 		RCU_INIT_POINTER(*trim, NULL);
-	kfree_rcu(node, rcu);
+	bpf_map_kfree_rcu(node, rcu);
 
 out:
 	spin_unlock_irqrestore(&trie->lock, irq_flags);
@@ -602,7 +602,7 @@ static void trie_free(struct bpf_map *map)
 				continue;
 			}
 
-			kfree(node);
+			bpf_map_kfree(node);
 			RCU_INIT_POINTER(*slot, NULL);
 			break;
 		}
diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
index bb378c3..7b6d7fd 100644
--- a/net/core/bpf_sk_storage.c
+++ b/net/core/bpf_sk_storage.c
@@ -64,7 +64,7 @@ void bpf_sk_storage_free(struct sock *sk)
 	rcu_read_unlock();
 
 	if (free_sk_storage)
-		kfree_rcu(sk_storage, rcu);
+		bpf_map_kfree_rcu(sk_storage, rcu);
 }
 
 static void bpf_sk_storage_map_free(struct bpf_map *map)
@@ -203,7 +203,7 @@ int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk)
 		} else {
 			ret = bpf_local_storage_alloc(newsk, smap, copy_selem, GFP_ATOMIC);
 			if (ret) {
-				kfree(copy_selem);
+				bpf_map_kfree(copy_selem);
 				atomic_sub(smap->elem_size,
 					   &newsk->sk_omem_alloc);
 				bpf_map_put(map);
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 22fa2c5..059e55c 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -888,7 +888,7 @@ static void sock_hash_free_elem(struct bpf_shtab *htab,
 				struct bpf_shtab_elem *elem)
 {
 	atomic_dec(&htab->count);
-	kfree_rcu(elem, rcu);
+	bpf_map_kfree_rcu(elem, rcu);
 }
 
 static void sock_hash_delete_from_link(struct bpf_map *map, struct sock *sk,
diff --git a/net/xdp/xskmap.c b/net/xdp/xskmap.c
index 771d0fa..1cb24b1 100644
--- a/net/xdp/xskmap.c
+++ b/net/xdp/xskmap.c
@@ -33,7 +33,7 @@ static struct xsk_map_node *xsk_map_node_alloc(struct xsk_map *map,
 static void xsk_map_node_free(struct xsk_map_node *node)
 {
 	bpf_map_put(&node->map->map);
-	kfree(node);
+	bpf_map_kfree(node);
 }
 
 static void xsk_map_sock_add(struct xdp_sock *xs, struct xsk_map_node *node)
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH bpf-next v2 11/11] bpf: introduce bpf memory statistics
  2023-01-12 15:53 [RFC PATCH bpf-next v2 00/11] mm, bpf: Add BPF into /proc/meminfo Yafang Shao
                   ` (9 preceding siblings ...)
  2023-01-12 15:53 ` [RFC PATCH bpf-next v2 10/11] bpf: add and use bpf map free helpers Yafang Shao
@ 2023-01-12 15:53 ` Yafang Shao
  2023-01-12 21:05 ` [RFC PATCH bpf-next v2 00/11] mm, bpf: Add BPF into /proc/meminfo Alexei Starovoitov
  11 siblings, 0 replies; 22+ messages in thread
From: Yafang Shao @ 2023-01-12 15:53 UTC (permalink / raw)
  To: 42.hyeyoo, vbabka, ast, daniel, andrii, kafai, songliubraving,
	yhs, john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, dennis, cl,
	akpm, penberg, rientjes, iamjoonsoo.kim, roman.gushchin
  Cc: linux-mm, bpf, Yafang Shao

Introduce a new percpu global variable to store the bpf memory
statistic. It is incremented at bpf memory allocation and decremented
at bpf memory freeing. A new item "BPF" is added to /proc/meminfo to
show the bpf memory statistic.

Please note that there are some deferred freeing paths, for example
kfree_rcu() and vfree_deferred(). For these, the in-flight memory may
not be freed immediately after we subtract it from the bpf memory
statistic. But it won't take long to free it, so this behavior is
acceptable.

Below is the output,
$ grep BPF /proc/meminfo
BPF:              358 kB

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 fs/proc/meminfo.c     |  4 +++
 include/linux/bpf.h   | 92 +++++++++++++++++++++++++++++++++++++++++++++++----
 kernel/bpf/memalloc.c | 19 ++++++++++-
 kernel/bpf/ringbuf.c  |  4 +++
 kernel/bpf/syscall.c  | 39 ++++++++++++++++++++--
 5 files changed, 149 insertions(+), 9 deletions(-)

diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 4409601..5b67331 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -16,6 +16,7 @@
 #ifdef CONFIG_CMA
 #include <linux/cma.h>
 #endif
+#include <linux/bpf.h>
 #include <asm/page.h>
 #include "internal.h"
 
@@ -159,6 +160,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 
 	arch_report_meminfo(m);
 
+	seq_printf(m,  "BPF:            %8lu kB\n",
+			bpf_mem_stat_sum() >> 10);
+
 	return 0;
 }
 
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 17c218e..add307e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1869,23 +1869,71 @@ int  generic_map_delete_batch(struct bpf_map *map,
 struct bpf_map *bpf_map_get_curr_or_next(u32 *id);
 struct bpf_prog *bpf_prog_get_curr_or_next(u32 *id);
 
+struct bpf_mem_stat {
+	long stat;
+};
+
+DECLARE_PER_CPU(struct bpf_mem_stat, bpfmm);
+
+static inline void bpf_mem_stat_add(size_t cnt)
+{
+	this_cpu_add(bpfmm.stat, cnt);
+}
+
+static inline void bpf_mem_stat_sub(size_t cnt)
+{
+	this_cpu_sub(bpfmm.stat, cnt);
+}
+
+static inline long bpf_mem_stat_sum(void)
+{
+	struct bpf_mem_stat *this;
+	long sum = 0;
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		this = &per_cpu(bpfmm, cpu);
+		sum += this->stat;
+	}
+
+	return sum;
+}
 
 static inline void bpf_map_kfree(const void *ptr)
 {
+	size_t sz = ksize_full(ptr);
+
+	if (sz)
+		bpf_mem_stat_sub(sz);
 	kfree(ptr);
 }
 
 static inline void bpf_map_kvfree(const void *ptr)
 {
+	size_t sz = kvsize(ptr);
+
+	if (sz)
+		bpf_mem_stat_sub(sz);
 	kvfree(ptr);
 }
 
 static inline void bpf_map_free_percpu(void __percpu *ptr)
 {
+	size_t sz = percpu_size(ptr);
+
+	if (sz)
+		bpf_mem_stat_sub(sz);
 	free_percpu(ptr);
 }
 
-#define bpf_map_kfree_rcu(ptr, rhf...) kvfree_rcu(ptr, ## rhf)
+#define bpf_map_kfree_rcu(ptr, rhf...)			\
+do {											\
+	size_t sz = kvsize(ptr);					\
+												\
+	if (sz)										\
+		bpf_mem_stat_sub(sz);					\
+	kvfree_rcu(ptr, ## rhf);					\
+} while (0)
 
 #ifdef CONFIG_MEMCG_KMEM
 void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
@@ -1901,26 +1949,54 @@ void __percpu *bpf_map_alloc_percpu(const struct bpf_map *map, size_t size,
 bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
 		     int node)
 {
-	return kmalloc_node(size, flags, node);
+	void *ptr;
+	size_t sz;
+
+	ptr = kmalloc_node(size, flags, node);
+	sz = ksize_full(ptr);
+	if (sz)
+		bpf_mem_stat_add(sz);
+	return ptr;
 }
 
 static inline void *
 bpf_map_kzalloc(const struct bpf_map *map, size_t size, gfp_t flags)
 {
-	return kzalloc(size, flags);
+	void *ptr;
+	size_t sz;
+
+	ptr = kzalloc(size, flags);
+	sz = ksize_full(ptr);
+	if (sz)
+		bpf_mem_stat_add(sz);
+	return ptr;
 }
 
 static inline void *
 bpf_map_kvcalloc(struct bpf_map *map, size_t n, size_t size, gfp_t flags)
 {
-	return kvcalloc(n, size, flags);
+	void *ptr;
+	size_t sz;
+
+	ptr = kvcalloc(n, size, flags);
+	sz = kvsize(ptr);
+	if (sz)
+		bpf_mem_stat_add(sz);
+	return ptr;
 }
 
 static inline void __percpu *
 bpf_map_alloc_percpu(const struct bpf_map *map, size_t size, size_t align,
 		     gfp_t flags)
 {
-	return __alloc_percpu_gfp(size, align, flags);
+	void *ptr;
+	size_t sz;
+
+	ptr = __alloc_percpu_gfp(size, align, flags);
+	sz = percpu_size(ptr);
+	if (sz)
+		bpf_mem_stat_add(sz);
+	return ptr;
 }
 #endif
 
@@ -2461,6 +2537,11 @@ static inline void bpf_prog_inc_misses_counter(struct bpf_prog *prog)
 static inline void bpf_cgrp_storage_free(struct cgroup *cgroup)
 {
 }
+
+static inline long bpf_mem_stat_sum(void)
+{
+	return 0;
+}
 #endif /* CONFIG_BPF_SYSCALL */
 
 void __bpf_free_used_btfs(struct bpf_prog_aux *aux,
@@ -2886,5 +2967,4 @@ static inline bool type_is_alloc(u32 type)
 {
 	return type & MEM_ALLOC;
 }
-
 #endif /* _LINUX_BPF_H */
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index ebcc3dd..4e35f287 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -129,6 +129,8 @@ static void *__alloc(struct bpf_mem_cache *c, int node)
 	 * want here.
 	 */
 	gfp_t flags = GFP_NOWAIT | __GFP_NOWARN | __GFP_ACCOUNT;
+	void *ptr;
+	size_t sz;
 
 	if (c->percpu_size) {
 		void **obj = kmalloc_node(c->percpu_size, flags, node);
@@ -140,10 +142,18 @@ static void *__alloc(struct bpf_mem_cache *c, int node)
 			return NULL;
 		}
 		obj[1] = pptr;
+		sz = ksize_full(obj);
+		sz += percpu_size(pptr);
+		if (sz)
+			bpf_mem_stat_add(sz);
 		return obj;
 	}
 
-	return kmalloc_node(c->unit_size, flags, node);
+	ptr = kmalloc_node(c->unit_size, flags, node);
+	sz = ksize_full(ptr);
+	if (sz)
+		bpf_mem_stat_add(sz);
+	return ptr;
 }
 
 static struct mem_cgroup *get_memcg(const struct bpf_mem_cache *c)
@@ -215,12 +225,19 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
 
 static void free_one(struct bpf_mem_cache *c, void *obj)
 {
+	size_t sz = ksize_full(obj);
+
 	if (c->percpu_size) {
+		sz += percpu_size(((void **)obj)[1]);
+		if (sz)
+			bpf_mem_stat_sub(sz);
 		free_percpu(((void **)obj)[1]);
 		kfree(obj);
 		return;
 	}
 
+	if (sz)
+		bpf_mem_stat_sub(sz);
 	kfree(obj);
 }
 
diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
index 3264bf5..766c2f1 100644
--- a/kernel/bpf/ringbuf.c
+++ b/kernel/bpf/ringbuf.c
@@ -96,6 +96,7 @@ static void bpf_ringbuf_pages_free(struct page **pages, int nr_pages)
 {
 	int i;
 
+	bpf_mem_stat_sub(nr_pages * PAGE_SIZE);
 	for (i = 0; i < nr_pages; i++)
 		__free_page(pages[i]);
 	bpf_map_area_free(pages);
@@ -126,9 +127,12 @@ static struct page **bpf_ringbuf_pages_alloc(int nr_meta_pages,
 			pages[nr_data_pages + i] = page;
 	}
 
+	bpf_mem_stat_add(nr_pages * PAGE_SIZE);
 	return pages;
 
 err_free_pages:
+	if (nr_pages)
+		bpf_mem_stat_add(nr_pages * PAGE_SIZE);
 	bpf_ringbuf_pages_free(pages, nr_pages);
 err:
 	return NULL;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 9e266e8..6ca2ceb 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -46,6 +46,7 @@
 
 #define BPF_OBJ_FLAG_MASK   (BPF_F_RDONLY | BPF_F_WRONLY)
 
+DEFINE_PER_CPU(struct bpf_mem_stat, bpfmm);
 DEFINE_PER_CPU(int, bpf_prog_active);
 static DEFINE_IDR(prog_idr);
 static DEFINE_SPINLOCK(prog_idr_lock);
@@ -336,16 +337,34 @@ static void *__bpf_map_area_alloc(u64 size, int numa_node, bool mmapable)
 
 void *bpf_map_area_alloc(u64 size, int numa_node)
 {
-	return __bpf_map_area_alloc(size, numa_node, false);
+	size_t sz;
+	void *ptr;
+
+	ptr = __bpf_map_area_alloc(size, numa_node, false);
+	sz = kvsize(ptr);
+	if (sz)
+		bpf_mem_stat_add(sz);
+	return ptr;
 }
 
 void *bpf_map_area_mmapable_alloc(u64 size, int numa_node)
 {
-	return __bpf_map_area_alloc(size, numa_node, true);
+	size_t sz;
+	void *ptr;
+
+	ptr =  __bpf_map_area_alloc(size, numa_node, true);
+	sz = kvsize(ptr);
+	if (sz)
+		bpf_mem_stat_add(sz);
+	return ptr;
 }
 
 void bpf_map_area_free(void *area)
 {
+	size_t sz = kvsize(area);
+
+	if (sz)
+		bpf_mem_stat_sub(sz);
 	kvfree(area);
 }
 
@@ -446,12 +465,16 @@ void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
 {
 	struct mem_cgroup *memcg, *old_memcg;
 	void *ptr;
+	size_t sz;
 
 	memcg = bpf_map_get_memcg(map);
 	old_memcg = set_active_memcg(memcg);
 	ptr = kmalloc_node(size, flags | __GFP_ACCOUNT, node);
 	set_active_memcg(old_memcg);
 	mem_cgroup_put(memcg);
+	sz = ksize_full(ptr);
+	if (sz)
+		bpf_mem_stat_add(sz);
 
 	return ptr;
 }
@@ -460,12 +483,16 @@ void *bpf_map_kzalloc(const struct bpf_map *map, size_t size, gfp_t flags)
 {
 	struct mem_cgroup *memcg, *old_memcg;
 	void *ptr;
+	size_t sz;
 
 	memcg = bpf_map_get_memcg(map);
 	old_memcg = set_active_memcg(memcg);
 	ptr = kzalloc(size, flags | __GFP_ACCOUNT);
 	set_active_memcg(old_memcg);
 	mem_cgroup_put(memcg);
+	sz = ksize_full(ptr);
+	if (sz)
+		bpf_mem_stat_add(sz);
 
 	return ptr;
 }
@@ -475,12 +502,16 @@ void *bpf_map_kvcalloc(struct bpf_map *map, size_t n, size_t size,
 {
 	struct mem_cgroup *memcg, *old_memcg;
 	void *ptr;
+	size_t sz;
 
 	memcg = bpf_map_get_memcg(map);
 	old_memcg = set_active_memcg(memcg);
 	ptr = kvcalloc(n, size, flags | __GFP_ACCOUNT);
 	set_active_memcg(old_memcg);
 	mem_cgroup_put(memcg);
+	sz = kvsize(ptr);
+	if (sz)
+		bpf_mem_stat_add(sz);
 
 	return ptr;
 }
@@ -490,12 +521,16 @@ void __percpu *bpf_map_alloc_percpu(const struct bpf_map *map, size_t size,
 {
 	struct mem_cgroup *memcg, *old_memcg;
 	void __percpu *ptr;
+	size_t sz;
 
 	memcg = bpf_map_get_memcg(map);
 	old_memcg = set_active_memcg(memcg);
 	ptr = __alloc_percpu_gfp(size, align, flags | __GFP_ACCOUNT);
 	set_active_memcg(old_memcg);
 	mem_cgroup_put(memcg);
+	sz = percpu_size(ptr);
+	if (sz)
+		bpf_mem_stat_add(sz);
 
 	return ptr;
 }
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH bpf-next v2 00/11] mm, bpf: Add BPF into /proc/meminfo
  2023-01-12 15:53 [RFC PATCH bpf-next v2 00/11] mm, bpf: Add BPF into /proc/meminfo Yafang Shao
                   ` (10 preceding siblings ...)
  2023-01-12 15:53 ` [RFC PATCH bpf-next v2 11/11] bpf: introduce bpf memory statistics Yafang Shao
@ 2023-01-12 21:05 ` Alexei Starovoitov
  2023-01-13 11:53   ` Yafang Shao
  11 siblings, 1 reply; 22+ messages in thread
From: Alexei Starovoitov @ 2023-01-12 21:05 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Hyeonggon Yoo, Vlastimil Babka, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Tejun Heo, dennis, Chris Lameter,
	Andrew Morton, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Roman Gushchin, linux-mm, bpf

On Thu, Jan 12, 2023 at 7:53 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> Currently there's no way to get BPF memory usage, while we can only
> estimate the usage by bpftool or memcg, both of which are not reliable.
>
> - bpftool
>   `bpftool {map,prog} show` can show us the memlock of each map and
>   prog, but the memlock is vary from the real memory size. The memlock
>   of a bpf object is approximately
>   `round_up(key_size + value_size, 8) * max_entries`,
>   so 1) it can't apply to the non-preallocated bpf map which may
>   increase or decrease the real memory size dynamically. 2) the element
>   size of some bpf map is not `key_size + value_size`, for example the
>   element size of htab is
>   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`
>   That said the differece between these two values may be very great if
>   the key_size and value_size is small. For example in my verifaction,
>   the size of memlock and real memory of a preallocated hash map are,
>
>   $ grep BPF /proc/meminfo
>   BPF:                 350 kB  <<< the size of preallocated memalloc pool
>
>   (create hash map)
>
>   $ bpftool map show
>   41549: hash  name count_map  flags 0x0
>         key 4B  value 4B  max_entries 1048576  memlock 8388608B
>
>   $ grep BPF /proc/meminfo
>   BPF:               82284 kB
>
>   So the real memory size is $((82284 - 350)) which is 81934 kB
>   while the memlock is only 8192 kB.

hashmap with key 4b and value 4b looks artificial to me,
but since you're concerned with accuracy of bpftool reporting,
please fix the estimation in bpf_map_memory_footprint().
You're correct that:

> size of some bpf map is not `key_size + value_size`, for example the
>   element size of htab is
>   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`

So just teach bpf_map_memory_footprint() to do this more accurately.
Add bucket size to it as well.
Make it even more accurate with prealloc vs not.
Much simpler change than adding run-time overhead to every alloc/free
on bpf side.
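
For the plain hash map that could be shaped roughly like the sketch
below. It is only a sketch: struct htab_elem and struct bucket are
private to kernel/bpf/hashtab.c today, so their sizes would have to be
exposed to syscall.c somehow, and it still treats prealloc and
non-prealloc maps the same.

static unsigned long bpf_map_memory_footprint(const struct bpf_map *map)
{
    unsigned long size;

    if (map->map_type == BPF_MAP_TYPE_HASH ||
        map->map_type == BPF_MAP_TYPE_LRU_HASH) {
        /* per-element cost as allocated by hashtab.c */
        size = sizeof(struct htab_elem) +
               round_up(map->key_size, 8) +
               round_up(bpf_map_value_size(map), 8);
        /* plus one struct bucket per hash bucket */
        return round_up((u64)map->max_entries * size +
                        (u64)roundup_pow_of_two(map->max_entries) *
                        sizeof(struct bucket), PAGE_SIZE);
    }

    size = round_up(map->key_size + bpf_map_value_size(map), 8);
    return round_up(map->max_entries * size, PAGE_SIZE);
}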

Higher level point:
bpf side tracks all of its allocation. There is no need to do that
in generic mm side.
Exposing an aggregated single number in /proc/meminfo also looks wrong.
People should be able to "bpftool map show|awk sum of fields"
and get the same number.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH bpf-next v2 00/11] mm, bpf: Add BPF into /proc/meminfo
  2023-01-12 21:05 ` [RFC PATCH bpf-next v2 00/11] mm, bpf: Add BPF into /proc/meminfo Alexei Starovoitov
@ 2023-01-13 11:53   ` Yafang Shao
  2023-01-17 17:25     ` Alexei Starovoitov
  0 siblings, 1 reply; 22+ messages in thread
From: Yafang Shao @ 2023-01-13 11:53 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Hyeonggon Yoo, Vlastimil Babka, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Tejun Heo, dennis, Chris Lameter,
	Andrew Morton, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Roman Gushchin, linux-mm, bpf

On Fri, Jan 13, 2023 at 5:05 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Jan 12, 2023 at 7:53 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > Currently there's no way to get BPF memory usage, while we can only
> > estimate the usage by bpftool or memcg, both of which are not reliable.
> >
> > - bpftool
> >   `bpftool {map,prog} show` can show us the memlock of each map and
> >   prog, but the memlock is vary from the real memory size. The memlock
> >   of a bpf object is approximately
> >   `round_up(key_size + value_size, 8) * max_entries`,
> >   so 1) it can't apply to the non-preallocated bpf map which may
> >   increase or decrease the real memory size dynamically. 2) the element
> >   size of some bpf map is not `key_size + value_size`, for example the
> >   element size of htab is
> >   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`
> >   That said the differece between these two values may be very great if
> >   the key_size and value_size is small. For example in my verifaction,
> >   the size of memlock and real memory of a preallocated hash map are,
> >
> >   $ grep BPF /proc/meminfo
> >   BPF:                 350 kB  <<< the size of preallocated memalloc pool
> >
> >   (create hash map)
> >
> >   $ bpftool map show
> >   41549: hash  name count_map  flags 0x0
> >         key 4B  value 4B  max_entries 1048576  memlock 8388608B
> >
> >   $ grep BPF /proc/meminfo
> >   BPF:               82284 kB
> >
> >   So the real memory size is $((82284 - 350)) which is 81934 kB
> >   while the memlock is only 8192 kB.
>
> hashmap with key 4b and value 4b looks artificial to me,
> but since you're concerned with accuracy of bpftool reporting,
> please fix the estimation in bpf_map_memory_footprint().

I thought bpf_map_memory_footprint() was deprecated, so I didn't try
to fix it before.

> You're correct that:
>
> > size of some bpf map is not `key_size + value_size`, for example the
> >   element size of htab is
> >   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`
>
> So just teach bpf_map_memory_footprint() to do this more accurately.
> Add bucket size to it as well.
> Make it even more accurate with prealloc vs not.
> Much simpler change than adding run-time overhead to every alloc/free
> on bpf side.
>

It seems that we'd better introduce ->memory_footprint for some
specific bpf maps. I will think about it.

> Higher level point:

Thanks for your thoughts.

> bpf side tracks all of its allocation. There is no need to do that
> in generic mm side.
> Exposing an aggregated single number if /proc/meminfo also looks wrong.

Do you mean that we shouldn't expose it in /proc/meminfo?

> People should be able to "bpftool map show|awk sum of fields"
> and get the same number.


-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH bpf-next v2 00/11] mm, bpf: Add BPF into /proc/meminfo
  2023-01-13 11:53   ` Yafang Shao
@ 2023-01-17 17:25     ` Alexei Starovoitov
  2023-01-18  3:07       ` Yafang Shao
  0 siblings, 1 reply; 22+ messages in thread
From: Alexei Starovoitov @ 2023-01-17 17:25 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Hyeonggon Yoo, Vlastimil Babka, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Tejun Heo, dennis, Chris Lameter,
	Andrew Morton, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Roman Gushchin, linux-mm, bpf

On Fri, Jan 13, 2023 at 3:53 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Fri, Jan 13, 2023 at 5:05 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Jan 12, 2023 at 7:53 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > Currently there's no way to get BPF memory usage, while we can only
> > > estimate the usage by bpftool or memcg, both of which are not reliable.
> > >
> > > - bpftool
> > >   `bpftool {map,prog} show` can show us the memlock of each map and
> > >   prog, but the memlock is vary from the real memory size. The memlock
> > >   of a bpf object is approximately
> > >   `round_up(key_size + value_size, 8) * max_entries`,
> > >   so 1) it can't apply to the non-preallocated bpf map which may
> > >   increase or decrease the real memory size dynamically. 2) the element
> > >   size of some bpf map is not `key_size + value_size`, for example the
> > >   element size of htab is
> > >   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`
> > >   That said the differece between these two values may be very great if
> > >   the key_size and value_size is small. For example in my verifaction,
> > >   the size of memlock and real memory of a preallocated hash map are,
> > >
> > >   $ grep BPF /proc/meminfo
> > >   BPF:                 350 kB  <<< the size of preallocated memalloc pool
> > >
> > >   (create hash map)
> > >
> > >   $ bpftool map show
> > >   41549: hash  name count_map  flags 0x0
> > >         key 4B  value 4B  max_entries 1048576  memlock 8388608B
> > >
> > >   $ grep BPF /proc/meminfo
> > >   BPF:               82284 kB
> > >
> > >   So the real memory size is $((82284 - 350)) which is 81934 kB
> > >   while the memlock is only 8192 kB.
> >
> > hashmap with key 4b and value 4b looks artificial to me,
> > but since you're concerned with accuracy of bpftool reporting,
> > please fix the estimation in bpf_map_memory_footprint().
>
> I thought bpf_map_memory_footprint() was deprecated, so I didn't try
> to fix it before.

It's not deprecated. It's trying to be accurate.
See bpf_map_value_size().
In the past we had to be precise when we calculated the required memory
before we allocated and that was causing ongoing maintenance issues.
Now bpf_map_memory_footprint() is an estimate for show_fdinfo.
It can be made more accurate for this map with corner case key/value sizes.

> > You're correct that:
> >
> > > size of some bpf map is not `key_size + value_size`, for example the
> > >   element size of htab is
> > >   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`
> >
> > So just teach bpf_map_memory_footprint() to do this more accurately.
> > Add bucket size to it as well.
> > Make it even more accurate with prealloc vs not.
> > Much simpler change than adding run-time overhead to every alloc/free
> > on bpf side.
> >
>
> It seems that we'd better introduce ->memory_footprint for some
> specific bpf maps. I will think about it.

No. Don't build it into a replica of what we had before.
Making existing bpf_map_memory_footprint() more accurate.

> > bpf side tracks all of its allocation. There is no need to do that
> > in generic mm side.
> > Exposing an aggregated single number if /proc/meminfo also looks wrong.
>
> Do you mean that we shouldn't expose it in /proc/meminfo ?

We should not because it helps one particular use case only.
Somebody else might want map mem info per container,
then somebody would need it per user, etc.
bpftool map show | awk
solves all those cases without adding new uapi-s.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH bpf-next v2 00/11] mm, bpf: Add BPF into /proc/meminfo
  2023-01-17 17:25     ` Alexei Starovoitov
@ 2023-01-18  3:07       ` Yafang Shao
  2023-01-18  5:39         ` Alexei Starovoitov
  0 siblings, 1 reply; 22+ messages in thread
From: Yafang Shao @ 2023-01-18  3:07 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Hyeonggon Yoo, Vlastimil Babka, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Tejun Heo, dennis, Chris Lameter,
	Andrew Morton, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Roman Gushchin, linux-mm, bpf

On Wed, Jan 18, 2023 at 1:25 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Jan 13, 2023 at 3:53 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Fri, Jan 13, 2023 at 5:05 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Thu, Jan 12, 2023 at 7:53 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > Currently there's no way to get BPF memory usage, while we can only
> > > > estimate the usage by bpftool or memcg, both of which are not reliable.
> > > >
> > > > - bpftool
> > > >   `bpftool {map,prog} show` can show us the memlock of each map and
> > > >   prog, but the memlock is vary from the real memory size. The memlock
> > > >   of a bpf object is approximately
> > > >   `round_up(key_size + value_size, 8) * max_entries`,
> > > >   so 1) it can't apply to the non-preallocated bpf map which may
> > > >   increase or decrease the real memory size dynamically. 2) the element
> > > >   size of some bpf map is not `key_size + value_size`, for example the
> > > >   element size of htab is
> > > >   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`
> > > >   That said the differece between these two values may be very great if
> > > >   the key_size and value_size is small. For example in my verifaction,
> > > >   the size of memlock and real memory of a preallocated hash map are,
> > > >
> > > >   $ grep BPF /proc/meminfo
> > > >   BPF:                 350 kB  <<< the size of preallocated memalloc pool
> > > >
> > > >   (create hash map)
> > > >
> > > >   $ bpftool map show
> > > >   41549: hash  name count_map  flags 0x0
> > > >         key 4B  value 4B  max_entries 1048576  memlock 8388608B
> > > >
> > > >   $ grep BPF /proc/meminfo
> > > >   BPF:               82284 kB
> > > >
> > > >   So the real memory size is $((82284 - 350)) which is 81934 kB
> > > >   while the memlock is only 8192 kB.
> > >
> > > hashmap with key 4b and value 4b looks artificial to me,
> > > but since you're concerned with accuracy of bpftool reporting,
> > > please fix the estimation in bpf_map_memory_footprint().
> >
> > I thought bpf_map_memory_footprint() was deprecated, so I didn't try
> > to fix it before.
>
> It's not deprecated. It's trying to be accurate.
> See bpf_map_value_size().
> In the past we had to be precise when we calculated the required memory
> before we allocated and that was causing ongoing maintenance issues.
> Now bpf_map_memory_footprint() is an estimate for show_fdinfo.
> It can be made more accurate for this map with corner case key/value sizes.
>

Thanks for the clarification.

> > > You're correct that:
> > >
> > > > size of some bpf map is not `key_size + value_size`, for example the
> > > >   element size of htab is
> > > >   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`
> > >
> > > So just teach bpf_map_memory_footprint() to do this more accurately.
> > > Add bucket size to it as well.
> > > Make it even more accurate with prealloc vs not.
> > > Much simpler change than adding run-time overhead to every alloc/free
> > > on bpf side.
> > >
> >
> > It seems that we'd better introduce ->memory_footprint for some
> > specific bpf maps. I will think about it.
>
> No. Don't build it into a replica of what we had before.
> Making existing bpf_map_memory_footprint() more accurate.
>

I just don't want to add many if-elses or switch-cases into
bpf_map_memory_footprint(), because I think it is a little ugly.
Introducing a new map op could make it clearer. For example,
static unsigned long bpf_map_memory_footprint(const struct bpf_map *map)
{
    unsigned long size;

    if (map->ops->map_mem_footprint)
        return map->ops->map_mem_footprint(map);

    size = round_up(map->key_size + bpf_map_value_size(map), 8);
    return round_up(map->max_entries * size, PAGE_SIZE);
}

> > > bpf side tracks all of its allocation. There is no need to do that
> > > in generic mm side.
> > > Exposing an aggregated single number if /proc/meminfo also looks wrong.
> >
> > Do you mean that we shouldn't expose it in /proc/meminfo ?
>
> We should not because it helps one particular use case only.
> Somebody else might want map mem info per container,
> then somebody would need it per user, etc.

It seems we should show memcg info and user info in bpftool map show.

> bpftool map show | awk
> solves all those cases without adding new uapi-s.

Makes sense to me.

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH bpf-next v2 00/11] mm, bpf: Add BPF into /proc/meminfo
  2023-01-18  3:07       ` Yafang Shao
@ 2023-01-18  5:39         ` Alexei Starovoitov
  2023-01-18  6:49           ` Yafang Shao
  0 siblings, 1 reply; 22+ messages in thread
From: Alexei Starovoitov @ 2023-01-18  5:39 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Hyeonggon Yoo, Vlastimil Babka, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Tejun Heo, dennis, Chris Lameter,
	Andrew Morton, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Roman Gushchin, linux-mm, bpf

On Tue, Jan 17, 2023 at 7:08 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Wed, Jan 18, 2023 at 1:25 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Fri, Jan 13, 2023 at 3:53 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Fri, Jan 13, 2023 at 5:05 AM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Thu, Jan 12, 2023 at 7:53 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > >
> > > > > Currently there's no way to get BPF memory usage, while we can only
> > > > > estimate the usage by bpftool or memcg, both of which are not reliable.
> > > > >
> > > > > - bpftool
> > > > >   `bpftool {map,prog} show` can show us the memlock of each map and
> > > > >   prog, but the memlock is vary from the real memory size. The memlock
> > > > >   of a bpf object is approximately
> > > > >   `round_up(key_size + value_size, 8) * max_entries`,
> > > > >   so 1) it can't apply to the non-preallocated bpf map which may
> > > > >   increase or decrease the real memory size dynamically. 2) the element
> > > > >   size of some bpf map is not `key_size + value_size`, for example the
> > > > >   element size of htab is
> > > > >   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`
> > > > >   That said the differece between these two values may be very great if
> > > > >   the key_size and value_size is small. For example in my verifaction,
> > > > >   the size of memlock and real memory of a preallocated hash map are,
> > > > >
> > > > >   $ grep BPF /proc/meminfo
> > > > >   BPF:                 350 kB  <<< the size of preallocated memalloc pool
> > > > >
> > > > >   (create hash map)
> > > > >
> > > > >   $ bpftool map show
> > > > >   41549: hash  name count_map  flags 0x0
> > > > >         key 4B  value 4B  max_entries 1048576  memlock 8388608B
> > > > >
> > > > >   $ grep BPF /proc/meminfo
> > > > >   BPF:               82284 kB
> > > > >
> > > > >   So the real memory size is $((82284 - 350)) which is 81934 kB
> > > > >   while the memlock is only 8192 kB.
> > > >
> > > > hashmap with key 4b and value 4b looks artificial to me,
> > > > but since you're concerned with accuracy of bpftool reporting,
> > > > please fix the estimation in bpf_map_memory_footprint().
> > >
> > > I thought bpf_map_memory_footprint() was deprecated, so I didn't try
> > > to fix it before.
> >
> > It's not deprecated. It's trying to be accurate.
> > See bpf_map_value_size().
> > In the past we had to be precise when we calculated the required memory
> > before we allocated and that was causing ongoing maintenance issues.
> > Now bpf_map_memory_footprint() is an estimate for show_fdinfo.
> > It can be made more accurate for this map with corner case key/value sizes.
> >
>
> Thanks for the clarification.
>
> > > > You're correct that:
> > > >
> > > > > size of some bpf map is not `key_size + value_size`, for example the
> > > > >   element size of htab is
> > > > >   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`
> > > >
> > > > So just teach bpf_map_memory_footprint() to do this more accurately.
> > > > Add bucket size to it as well.
> > > > Make it even more accurate with prealloc vs not.
> > > > Much simpler change than adding run-time overhead to every alloc/free
> > > > on bpf side.
> > > >
> > >
> > > It seems that we'd better introduce ->memory_footprint for some
> > > specific bpf maps. I will think about it.
> >
> > No. Don't build it into a replica of what we had before.
> > Making existing bpf_map_memory_footprint() more accurate.
> >
>
> I just don't want to add many if-elses or switch-cases into
> bpf_map_memory_footprint(), because I think it is a little ugly.
> Introducing a new map ops could make it more clear.  For example,
> static unsigned long bpf_map_memory_footprint(const struct bpf_map *map)
> {
>     unsigned long size;
>
>     if (map->ops->map_mem_footprint)
>         return map->ops->map_mem_footprint(map);
>
>     size = round_up(map->key_size + bpf_map_value_size(map), 8);
>     return round_up(map->max_entries * size, PAGE_SIZE);
> }

It is also ugly, because bpf_map_value_size() already has an if-stmt.
I prefer to keep all estimates in one place.
There is no need to be 100% accurate.
With a callback devs will start thinking that this is somehow
a requirement to report precise memory.

> > > > bpf side tracks all of its allocation. There is no need to do that
> > > > in generic mm side.
> > > > Exposing an aggregated single number if /proc/meminfo also looks wrong.
> > >
> > > Do you mean that we shouldn't expose it in /proc/meminfo ?
> >
> > We should not because it helps one particular use case only.
> > Somebody else might want map mem info per container,
> > then somebody would need it per user, etc.
>
> It seems we should show memcg info and user info in bpftool map show.

Show memcg info? What do you have in mind?

The user info is often useless. We're printing it in bpftool prog show
and some folks suggested removing it because it always prints 'uid 0'.
Notice we use bpf iterators in both bpftool prog/map show
to print the process that created the map.
That is much more useful than 'user id'.
In bpftool we can add 'verbosity' flag and print more things.
There is also json output.
And, of course, nothing stops you from having your own prog/map stats
collectors.
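A minimal collector can be an iter/bpf_map program similar to the one
in selftests. Sketch below, assuming a bpftool-generated vmlinux.h and
libbpf's BPF_SEQ_PRINTF:

/* sketch of a standalone map stats collector, not part of this series */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

SEC("iter/bpf_map")
int dump_bpf_map(struct bpf_iter__bpf_map *ctx)
{
    struct seq_file *seq = ctx->meta->seq;
    struct bpf_map *map = ctx->map;

    if (!map)
        return 0;

    /* one line per map: id, key/value size, max_entries */
    BPF_SEQ_PRINTF(seq, "%u %u %u %u\n", map->id, map->key_size,
                   map->value_size, map->max_entries);
    return 0;
}

Attach it, pin the link in bpffs and cat it, and you get one line per
map that can be aggregated per container, per user, or whatever else
is needed.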


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH bpf-next v2 00/11] mm, bpf: Add BPF into /proc/meminfo
  2023-01-18  5:39         ` Alexei Starovoitov
@ 2023-01-18  6:49           ` Yafang Shao
  2023-01-26  5:45             ` Alexei Starovoitov
  0 siblings, 1 reply; 22+ messages in thread
From: Yafang Shao @ 2023-01-18  6:49 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Hyeonggon Yoo, Vlastimil Babka, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Tejun Heo, dennis, Chris Lameter,
	Andrew Morton, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Roman Gushchin, linux-mm, bpf

On Wed, Jan 18, 2023 at 1:39 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Jan 17, 2023 at 7:08 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Wed, Jan 18, 2023 at 1:25 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Fri, Jan 13, 2023 at 3:53 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > On Fri, Jan 13, 2023 at 5:05 AM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Thu, Jan 12, 2023 at 7:53 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > >
> > > > > > Currently there's no way to get BPF memory usage, while we can only
> > > > > > estimate the usage by bpftool or memcg, both of which are not reliable.
> > > > > >
> > > > > > - bpftool
> > > > > >   `bpftool {map,prog} show` can show us the memlock of each map and
> > > > > >   prog, but the memlock is vary from the real memory size. The memlock
> > > > > >   of a bpf object is approximately
> > > > > >   `round_up(key_size + value_size, 8) * max_entries`,
> > > > > >   so 1) it can't apply to the non-preallocated bpf map which may
> > > > > >   increase or decrease the real memory size dynamically. 2) the element
> > > > > >   size of some bpf map is not `key_size + value_size`, for example the
> > > > > >   element size of htab is
> > > > > >   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`
> > > > > >   That said the differece between these two values may be very great if
> > > > > >   the key_size and value_size is small. For example in my verifaction,
> > > > > >   the size of memlock and real memory of a preallocated hash map are,
> > > > > >
> > > > > >   $ grep BPF /proc/meminfo
> > > > > >   BPF:                 350 kB  <<< the size of preallocated memalloc pool
> > > > > >
> > > > > >   (create hash map)
> > > > > >
> > > > > >   $ bpftool map show
> > > > > >   41549: hash  name count_map  flags 0x0
> > > > > >         key 4B  value 4B  max_entries 1048576  memlock 8388608B
> > > > > >
> > > > > >   $ grep BPF /proc/meminfo
> > > > > >   BPF:               82284 kB
> > > > > >
> > > > > >   So the real memory size is $((82284 - 350)) which is 81934 kB
> > > > > >   while the memlock is only 8192 kB.
> > > > >
> > > > > hashmap with key 4b and value 4b looks artificial to me,
> > > > > but since you're concerned with accuracy of bpftool reporting,
> > > > > please fix the estimation in bpf_map_memory_footprint().
> > > >
> > > > I thought bpf_map_memory_footprint() was deprecated, so I didn't try
> > > > to fix it before.
> > >
> > > It's not deprecated. It's trying to be accurate.
> > > See bpf_map_value_size().
> > > In the past we had to be precise when we calculated the required memory
> > > before we allocated and that was causing ongoing maintenance issues.
> > > Now bpf_map_memory_footprint() is an estimate for show_fdinfo.
> > > It can be made more accurate for this map with corner case key/value sizes.
> > >
> >
> > Thanks for the clarification.
> >
> > > > > You're correct that:
> > > > >
> > > > > > size of some bpf map is not `key_size + value_size`, for example the
> > > > > >   element size of htab is
> > > > > >   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`
> > > > >
> > > > > So just teach bpf_map_memory_footprint() to do this more accurately.
> > > > > Add bucket size to it as well.
> > > > > Make it even more accurate with prealloc vs not.
> > > > > Much simpler change than adding run-time overhead to every alloc/free
> > > > > on bpf side.
> > > > >
> > > >
> > > > It seems that we'd better introduce ->memory_footprint for some
> > > > specific bpf maps. I will think about it.
> > >
> > > No. Don't build it into a replica of what we had before.
> > > Making existing bpf_map_memory_footprint() more accurate.
> > >
> >
> > I just don't want to add many if-elses or switch-cases into
> > bpf_map_memory_footprint(), because I think it is a little ugly.
> > Introducing a new map ops could make it more clear.  For example,
> > static unsigned long bpf_map_memory_footprint(const struct bpf_map *map)
> > {
> >     unsigned long size;
> >
> >     if (map->ops->map_mem_footprint)
> >         return map->ops->map_mem_footprint(map);
> >
> >     size = round_up(map->key_size + bpf_map_value_size(map), 8);
> >     return round_up(map->max_entries * size, PAGE_SIZE);
> > }
>
> It is also ugly, because bpf_map_value_size() already has if-stmt.
> I prefer to keep all estimates in one place.
> There is no need to be 100% accurate.

Per my investigation, it can be almost accurate with little effort.
Take the htab for example,
static unsigned long htab_mem_footprint(const struct bpf_map *map)
{
    struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
    unsigned long size = 0;

    if (!htab_is_prealloc(htab)) {
        size += htab_elements_size(htab);
    }
    size += kvsize(htab->elems);
    size += percpu_size(htab->extra_elems);
    size += kvsize(htab->buckets);
    size += bpf_mem_alloc_size(&htab->pcpu_ma);
    size += bpf_mem_alloc_size(&htab->ma);
    if (htab->use_percpu_counter)
        size += percpu_size(htab->pcount.counters);
    size += percpu_size(htab->map_locked[0]) * HASHTAB_MAP_LOCK_COUNT;
    size += kvsize(htab);
    return size;
}

We just need to get the real memory size from the pointer instead of
calculating the size again.
For a non-preallocated htab, it is a bit of trouble to get the element
size (not the unit_size), but it won't be a big deal.

> With a callback devs will start thinking that this is somehow
> a requirement to report precise memory.
>
> > > > > bpf side tracks all of its allocation. There is no need to do that
> > > > > in generic mm side.
> > > > > Exposing an aggregated single number if /proc/meminfo also looks wrong.
> > > >
> > > > Do you mean that we shouldn't expose it in /proc/meminfo ?
> > >
> > > We should not because it helps one particular use case only.
> > > Somebody else might want map mem info per container,
> > > then somebody would need it per user, etc.
> >
> > It seems we should show memcg info and user info in bpftool map show.
>
> Show memcg info? What do you have in mind?
>

Each bpf map is charged to a memcg. If we know which memcg a bpf map
belongs to, we can get the map mem info per container.
Currently we can get the memcg info from the process which loads the
map, but that doesn't apply to pinned bpf maps.
So it would be better if we could show it in bpftool-map-show.

> The user info is often useless. We're printing it in bpftool prog show
> and some folks suggested to remove it because it always prints 'uid 0'
> Notice we use bpf iterators in both bpftool prog/map show
> that prints the process that created the map.
> That is much more useful than 'user id'.
> In bpftool we can add 'verbosity' flag and print more things.
> There is also json output.
> And, of course, nothing stops you from having your own prog/map stats
> collectors.

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH bpf-next v2 00/11] mm, bpf: Add BPF into /proc/meminfo
  2023-01-18  6:49           ` Yafang Shao
@ 2023-01-26  5:45             ` Alexei Starovoitov
  2023-01-28 11:49               ` Yafang Shao
  0 siblings, 1 reply; 22+ messages in thread
From: Alexei Starovoitov @ 2023-01-26  5:45 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Hyeonggon Yoo, Vlastimil Babka, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Tejun Heo, dennis, Chris Lameter,
	Andrew Morton, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Roman Gushchin, linux-mm, bpf

On Tue, Jan 17, 2023 at 10:49 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > I just don't want to add many if-elses or switch-cases into
> > > bpf_map_memory_footprint(), because I think it is a little ugly.
> > > Introducing a new map ops could make it more clear.  For example,
> > > static unsigned long bpf_map_memory_footprint(const struct bpf_map *map)
> > > {
> > >     unsigned long size;
> > >
> > >     if (map->ops->map_mem_footprint)
> > >         return map->ops->map_mem_footprint(map);
> > >
> > >     size = round_up(map->key_size + bpf_map_value_size(map), 8);
> > >     return round_up(map->max_entries * size, PAGE_SIZE);
> > > }
> >
> > It is also ugly, because bpf_map_value_size() already has if-stmt.
> > I prefer to keep all estimates in one place.
> > There is no need to be 100% accurate.
>
> Per my investigation, it can be almost accurate with little effort.
> Take the htab for example,
> static unsigned long htab_mem_footprint(const struct bpf_map *map)
> {
>     struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
>     unsigned long size = 0;
>
>     if (!htab_is_prealloc(htab)) {
>         size += htab_elements_size(htab);
>     }
>     size += kvsize(htab->elems);
>     size += percpu_size(htab->extra_elems);
>     size += kvsize(htab->buckets);
>     size += bpf_mem_alloc_size(&htab->pcpu_ma);
>     size += bpf_mem_alloc_size(&htab->ma);
>     if (htab->use_percpu_counter)
>         size += percpu_size(htab->pcount.counters);
>     size += percpu_size(htab->map_locked[i]) * HASHTAB_MAP_LOCK_COUNT;
>     size += kvsize(htab);
>     return size;
> }

Please don't.
The above doesn't look maintainable.
Look at kvsize(htab). Do you really care about a hundred bytes?
Just accept that there will be a small constant difference
between what show_fdinfo reports and the real memory.
You cannot make it 100% accurate.
There is kfence, which may allocate 4k even though you asked for kmalloc(8).

> We just need to get the real memory size from the pointer instead of
> calculating the size again.
> For non-preallocated htab, it is a little trouble to get the element
> size (not the unit_size), but it won't be a big deal.

You'd have to convince mm folks that kvsize() is worth doing.
I don't think it will be easy.

> > With a callback devs will start thinking that this is somehow
> > a requirement to report precise memory.
> >
> > > > > > bpf side tracks all of its allocation. There is no need to do that
> > > > > > in generic mm side.
> > > > > > Exposing an aggregated single number if /proc/meminfo also looks wrong.
> > > > >
> > > > > Do you mean that we shouldn't expose it in /proc/meminfo ?
> > > >
> > > > We should not because it helps one particular use case only.
> > > > Somebody else might want map mem info per container,
> > > > then somebody would need it per user, etc.
> > >
> > > It seems we should show memcg info and user info in bpftool map show.
> >
> > Show memcg info? What do you have in mind?
> >
>
> Each bpf map is charged to a memcg. If we know a bpf map belongs to
> which memcg, we can know the map mem info per container.
> Currently we can get the memcg info from the process which loads it,
> but it can't apply to pinned-bpf-map.
> So it would be better if we can show it in bpftool-map-show.

That sounds useful.
Have you looked at bpf iterators and how bpftool is using
them to figure out which process loaded bpf prog and created particular map?


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH bpf-next v2 00/11] mm, bpf: Add BPF into /proc/meminfo
  2023-01-26  5:45             ` Alexei Starovoitov
@ 2023-01-28 11:49               ` Yafang Shao
  2023-01-30 13:14                 ` Uladzislau Rezki
  0 siblings, 1 reply; 22+ messages in thread
From: Yafang Shao @ 2023-01-28 11:49 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrew Morton, urezki, Christoph Hellwig
  Cc: Hyeonggon Yoo, Vlastimil Babka, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Tejun Heo, dennis, Chris Lameter,
	Pekka Enberg, David Rientjes, Joonsoo Kim, Roman Gushchin,
	linux-mm, bpf

On Thu, Jan 26, 2023 at 1:45 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Jan 17, 2023 at 10:49 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > I just don't want to add many if-elses or switch-cases into
> > > > bpf_map_memory_footprint(), because I think it is a little ugly.
> > > > Introducing a new map ops could make it more clear.  For example,
> > > > static unsigned long bpf_map_memory_footprint(const struct bpf_map *map)
> > > > {
> > > >     unsigned long size;
> > > >
> > > >     if (map->ops->map_mem_footprint)
> > > >         return map->ops->map_mem_footprint(map);
> > > >
> > > >     size = round_up(map->key_size + bpf_map_value_size(map), 8);
> > > >     return round_up(map->max_entries * size, PAGE_SIZE);
> > > > }
> > >
> > > It is also ugly, because bpf_map_value_size() already has if-stmt.
> > > I prefer to keep all estimates in one place.
> > > There is no need to be 100% accurate.
> >
> > Per my investigation, it can be almost accurate with little effort.
> > Take the htab for example,
> > static unsigned long htab_mem_footprint(const struct bpf_map *map)
> > {
> >     struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> >     unsigned long size = 0;
> >
> >     if (!htab_is_prealloc(htab)) {
> >         size += htab_elements_size(htab);
> >     }
> >     size += kvsize(htab->elems);
> >     size += percpu_size(htab->extra_elems);
> >     size += kvsize(htab->buckets);
> >     size += bpf_mem_alloc_size(&htab->pcpu_ma);
> >     size += bpf_mem_alloc_size(&htab->ma);
> >     if (htab->use_percpu_counter)
> >         size += percpu_size(htab->pcount.counters);
> >     size += percpu_size(htab->map_locked[i]) * HASHTAB_MAP_LOCK_COUNT;
> >     size += kvsize(htab);
> >     return size;
> > }
>
> Please don't.
> Above doesn't look maintainable.

It is similar to htab_map_free(). These are exactly the pointers
that will be freed in map_free().
We just need to keep map_mem_footprint() in sync with map_free(), so it
won't be a maintenance problem.

> Look at kvsize(htab). Do you really care about hundred bytes?
> Just accept that there will be a small constant difference
> between what show_fdinfo reports and the real memory.

The point is we don't have a clear idea what the margin is.

> You cannot make it 100%.
> There is kfence that will allocate 4k though you asked kmalloc(8).
>

We already have ksize()[1], which covers kfence.

[1]. https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/mm/slab_common.c#n1431

> > We just need to get the real memory size from the pointer instead of
> > calculating the size again.
> > For non-preallocated htab, it is a little trouble to get the element
> > size (not the unit_size), but it won't be a big deal.
>
> You'd have to convince mm folks that kvsize() is worth doing.
> I don't think it will be easy.
>

As I mentioned above, we already have ksize(), so we only need to
introduce vsize(). Per my understanding, we can simply use
vm_struct->size to get the vmalloc size; see also patch #5 in this
patchset[2].

Andrew, Uladzislau, Christoph, do you have any comments on this newly
introduced vsize()[2]?

[2]. https://lore.kernel.org/bpf/20230112155326.26902-6-laoar.shao@gmail.com/

> > > With a callback devs will start thinking that this is somehow
> > > a requirement to report precise memory.
> > >
> > > > > > > bpf side tracks all of its allocation. There is no need to do that
> > > > > > > in generic mm side.
> > > > > > > Exposing an aggregated single number if /proc/meminfo also looks wrong.
> > > > > >
> > > > > > Do you mean that we shouldn't expose it in /proc/meminfo ?
> > > > >
> > > > > We should not because it helps one particular use case only.
> > > > > Somebody else might want map mem info per container,
> > > > > then somebody would need it per user, etc.
> > > >
> > > > It seems we should show memcg info and user info in bpftool map show.
> > >
> > > Show memcg info? What do you have in mind?
> > >
> >
> > Each bpf map is charged to a memcg. If we know a bpf map belongs to
> > which memcg, we can know the map mem info per container.
> > Currently we can get the memcg info from the process which loads it,
> > but it can't apply to pinned-bpf-map.
> > So it would be better if we can show it in bpftool-map-show.
>
> That sounds useful.
> Have you looked at bpf iterators and how bpftool is using
> them to figure out which process loaded bpf prog and created particular map?

Yes, I have looked at it.  I know what you mean. It seems we can
introduce a memcg_iter or something else to implement it.

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH bpf-next v2 00/11] mm, bpf: Add BPF into /proc/meminfo
  2023-01-28 11:49               ` Yafang Shao
@ 2023-01-30 13:14                 ` Uladzislau Rezki
  2023-01-31  6:28                   ` Yafang Shao
  0 siblings, 1 reply; 22+ messages in thread
From: Uladzislau Rezki @ 2023-01-30 13:14 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Alexei Starovoitov, Andrew Morton, urezki, Christoph Hellwig,
	Hyeonggon Yoo, Vlastimil Babka, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Tejun Heo, dennis, Chris Lameter,
	Pekka Enberg, David Rientjes, Joonsoo Kim, Roman Gushchin,
	linux-mm, bpf

On Sat, Jan 28, 2023 at 07:49:08PM +0800, Yafang Shao wrote:
> On Thu, Jan 26, 2023 at 1:45 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Jan 17, 2023 at 10:49 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > I just don't want to add many if-elses or switch-cases into
> > > > > bpf_map_memory_footprint(), because I think it is a little ugly.
> > > > > Introducing a new map ops could make it more clear.  For example,
> > > > > static unsigned long bpf_map_memory_footprint(const struct bpf_map *map)
> > > > > {
> > > > >     unsigned long size;
> > > > >
> > > > >     if (map->ops->map_mem_footprint)
> > > > >         return map->ops->map_mem_footprint(map);
> > > > >
> > > > >     size = round_up(map->key_size + bpf_map_value_size(map), 8);
> > > > >     return round_up(map->max_entries * size, PAGE_SIZE);
> > > > > }
> > > >
> > > > It is also ugly, because bpf_map_value_size() already has if-stmt.
> > > > I prefer to keep all estimates in one place.
> > > > There is no need to be 100% accurate.
> > >
> > > Per my investigation, it can be almost accurate with little effort.
> > > Take the htab for example,
> > > static unsigned long htab_mem_footprint(const struct bpf_map *map)
> > > {
> > >     struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> > >     unsigned long size = 0;
> > >
> > >     if (!htab_is_prealloc(htab)) {
> > >         size += htab_elements_size(htab);
> > >     }
> > >     size += kvsize(htab->elems);
> > >     size += percpu_size(htab->extra_elems);
> > >     size += kvsize(htab->buckets);
> > >     size += bpf_mem_alloc_size(&htab->pcpu_ma);
> > >     size += bpf_mem_alloc_size(&htab->ma);
> > >     if (htab->use_percpu_counter)
> > >         size += percpu_size(htab->pcount.counters);
> > >     size += percpu_size(htab->map_locked[i]) * HASHTAB_MAP_LOCK_COUNT;
> > >     size += kvsize(htab);
> > >     return size;
> > > }
> >
> > Please don't.
> > Above doesn't look maintainable.
> 
> It is similar to htab_map_free(). These pointers are the pointers
> which will be freed in map_free().
> We just need to keep map_mem_footprint() in sync with map_free(). It
> won't be a problem for maintenance.
> 
> > Look at kvsize(htab). Do you really care about hundred bytes?
> > Just accept that there will be a small constant difference
> > between what show_fdinfo reports and the real memory.
> 
> The point is we don't have a clear idea what the margin is.
> 
> > You cannot make it 100%.
> > There is kfence that will allocate 4k though you asked kmalloc(8).
> >
> 
> We already have ksize()[1], which covers the kfence.
> 
> [1]. https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/mm/slab_common.c#n1431
> 
> > > We just need to get the real memory size from the pointer instead of
> > > calculating the size again.
> > > For non-preallocated htab, it is a little trouble to get the element
> > > size (not the unit_size), but it won't be a big deal.
> >
> > You'd have to convince mm folks that kvsize() is worth doing.
> > I don't think it will be easy.
> >
> 
> As I mentioned above, we already have ksize(), so we only need to
> introduce vsize().  Per my understanding, we can simply use
> vm_struct->size to get the vmalloc size, see also the patch #5 in this
> patchset[2].
> 
> Andrew, Uladzislau, Christoph,  do you have any comments on this newly
> introduced vsize()[2] ?
> 
> [2]. https://lore.kernel.org/bpf/20230112155326.26902-6-laoar.shao@gmail.com/
> 
<snip>
+/* Report full size of underlying allocation of a vmalloc'ed addr */
+static inline size_t vsize(const void *addr)
+{
+	struct vm_struct *area;
+
+	if (!addr)
+		return 0;
+
+	area = find_vm_area(addr);
+	if (unlikely(!area))
+		return 0;
+
+	return area->size;
+}
<snip>

You cannot access the area after the lock is dropped. We do not have any
ref counters for VA objects. Therefore it should be done like below:


<snip>
  spin_lock(&vmap_area_lock);
  va = __find_vmap_area(addr, &vmap_area_root);
  if (va && va->vm)
    va_size = va->vm->size;
  spin_unlock(&vmap_area_lock);

  return va_size;
<snip>

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH bpf-next v2 00/11] mm, bpf: Add BPF into /proc/meminfo
  2023-01-30 13:14                 ` Uladzislau Rezki
@ 2023-01-31  6:28                   ` Yafang Shao
  0 siblings, 0 replies; 22+ messages in thread
From: Yafang Shao @ 2023-01-31  6:28 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Alexei Starovoitov, Andrew Morton, Christoph Hellwig,
	Hyeonggon Yoo, Vlastimil Babka, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Tejun Heo, dennis, Chris Lameter,
	Pekka Enberg, David Rientjes, Joonsoo Kim, Roman Gushchin,
	linux-mm, bpf

On Mon, Jan 30, 2023 at 9:14 PM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> On Sat, Jan 28, 2023 at 07:49:08PM +0800, Yafang Shao wrote:
> > On Thu, Jan 26, 2023 at 1:45 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Tue, Jan 17, 2023 at 10:49 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > > I just don't want to add many if-elses or switch-cases into
> > > > > > bpf_map_memory_footprint(), because I think it is a little ugly.
> > > > > > Introducing a new map ops could make it more clear.  For example,
> > > > > > static unsigned long bpf_map_memory_footprint(const struct bpf_map *map)
> > > > > > {
> > > > > >     unsigned long size;
> > > > > >
> > > > > >     if (map->ops->map_mem_footprint)
> > > > > >         return map->ops->map_mem_footprint(map);
> > > > > >
> > > > > >     size = round_up(map->key_size + bpf_map_value_size(map), 8);
> > > > > >     return round_up(map->max_entries * size, PAGE_SIZE);
> > > > > > }
> > > > >
> > > > > It is also ugly, because bpf_map_value_size() already has if-stmt.
> > > > > I prefer to keep all estimates in one place.
> > > > > There is no need to be 100% accurate.
> > > >
> > > > Per my investigation, it can be almost accurate with little effort.
> > > > Take the htab for example,
> > > > static unsigned long htab_mem_footprint(const struct bpf_map *map)
> > > > {
> > > >     struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> > > >     unsigned long size = 0;
> > > >
> > > >     if (!htab_is_prealloc(htab)) {
> > > >         size += htab_elements_size(htab);
> > > >     }
> > > >     size += kvsize(htab->elems);
> > > >     size += percpu_size(htab->extra_elems);
> > > >     size += kvsize(htab->buckets);
> > > >     size += bpf_mem_alloc_size(&htab->pcpu_ma);
> > > >     size += bpf_mem_alloc_size(&htab->ma);
> > > >     if (htab->use_percpu_counter)
> > > >         size += percpu_size(htab->pcount.counters);
> > > >     size += percpu_size(htab->map_locked[i]) * HASHTAB_MAP_LOCK_COUNT;
> > > >     size += kvsize(htab);
> > > >     return size;
> > > > }
> > >
> > > Please don't.
> > > Above doesn't look maintainable.
> >
> > It is similar to htab_map_free(). These pointers are the pointers
> > which will be freed in map_free().
> > We just need to keep map_mem_footprint() in sync with map_free(). It
> > won't be a problem for maintenance.
> >
> > > Look at kvsize(htab). Do you really care about hundred bytes?
> > > Just accept that there will be a small constant difference
> > > between what show_fdinfo reports and the real memory.
> >
> > The point is we don't have a clear idea what the margin is.
> >
> > > You cannot make it 100%.
> > > There is kfence that will allocate 4k though you asked kmalloc(8).
> > >
> >
> > We already have ksize()[1], which covers the kfence.
> >
> > [1]. https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/mm/slab_common.c#n1431
> >
> > > > We just need to get the real memory size from the pointer instead of
> > > > calculating the size again.
> > > > For non-preallocated htab, it is a little trouble to get the element
> > > > size (not the unit_size), but it won't be a big deal.
> > >
> > > You'd have to convince mm folks that kvsize() is worth doing.
> > > I don't think it will be easy.
> > >
> >
> > As I mentioned above, we already have ksize(), so we only need to
> > introduce vsize().  Per my understanding, we can simply use
> > vm_struct->size to get the vmalloc size, see also the patch #5 in this
> > patchset[2].
> >
> > Andrew, Uladzislau, Christoph,  do you have any comments on this newly
> > introduced vsize()[2] ?
> >
> > [2]. https://lore.kernel.org/bpf/20230112155326.26902-6-laoar.shao@gmail.com/
> >
> <snip>
> +/* Report full size of underlying allocation of a vmalloc'ed addr */
> +static inline size_t vsize(const void *addr)
> +{
> +       struct vm_struct *area;
> +
> +       if (!addr)
> +               return 0;
> +
> +       area = find_vm_area(addr);
> +       if (unlikely(!area))
> +               return 0;
> +
> +       return area->size;
> +}
> <snip>
>
> You can not access area after the lock is dropped. We do not have any
> ref counters for VA objects. Therefore it should be done like below:
>
>
> <snip>
>   spin_lock(&vmap_area_lock);
>   va = __find_vmap_area(addr, &vmap_area_root);
>   if (va && va->vm)
>     va_size = va->vm->size;
>   spin_unlock(&vmap_area_lock);
>
>   return va_size;
> <snip>
>

Ah, it should take this global lock.  I missed that.
Many thanks for the detailed explanation.
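
For the record, folding that into the helper would give something like
the sketch below; since vmap_area_lock, vmap_area_root and
__find_vmap_area() are internal to mm/vmalloc.c, vsize() would have to
live there rather than be a static inline in a header:

/* report full size of the underlying allocation of a vmalloc'ed addr */
size_t vsize(const void *addr)
{
    struct vmap_area *va;
    size_t va_size = 0;

    if (!addr)
        return 0;

    spin_lock(&vmap_area_lock);
    va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
    if (va && va->vm)
        va_size = va->vm->size;
    spin_unlock(&vmap_area_lock);

    return va_size;
}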

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 22+ messages in thread

Thread overview: 22+ messages
2023-01-12 15:53 [RFC PATCH bpf-next v2 00/11] mm, bpf: Add BPF into /proc/meminfo Yafang Shao
2023-01-12 15:53 ` [RFC PATCH bpf-next v2 01/11] mm: percpu: count memcg relevant memory only when kmemcg is enabled Yafang Shao
2023-01-12 15:53 ` [RFC PATCH bpf-next v2 02/11] mm: percpu: introduce percpu_size() Yafang Shao
2023-01-12 15:53 ` [RFC PATCH bpf-next v2 03/11] mm: slab: rename obj_full_size() Yafang Shao
2023-01-12 15:53 ` [RFC PATCH bpf-next v2 04/11] mm: slab: introduce ksize_full() Yafang Shao
2023-01-12 15:53 ` [RFC PATCH bpf-next v2 05/11] mm: vmalloc: introduce vsize() Yafang Shao
2023-01-12 15:53 ` [RFC PATCH bpf-next v2 06/11] mm: util: introduce kvsize() Yafang Shao
2023-01-12 15:53 ` [RFC PATCH bpf-next v2 07/11] bpf: introduce new helpers bpf_ringbuf_pages_{alloc,free} Yafang Shao
2023-01-12 15:53 ` [RFC PATCH bpf-next v2 08/11] bpf: use bpf_map_kzalloc in arraymap Yafang Shao
2023-01-12 15:53 ` [RFC PATCH bpf-next v2 09/11] bpf: use bpf_map_kvcalloc in bpf_local_storage Yafang Shao
2023-01-12 15:53 ` [RFC PATCH bpf-next v2 10/11] bpf: add and use bpf map free helpers Yafang Shao
2023-01-12 15:53 ` [RFC PATCH bpf-next v2 11/11] bpf: introduce bpf memory statistics Yafang Shao
2023-01-12 21:05 ` [RFC PATCH bpf-next v2 00/11] mm, bpf: Add BPF into /proc/meminfo Alexei Starovoitov
2023-01-13 11:53   ` Yafang Shao
2023-01-17 17:25     ` Alexei Starovoitov
2023-01-18  3:07       ` Yafang Shao
2023-01-18  5:39         ` Alexei Starovoitov
2023-01-18  6:49           ` Yafang Shao
2023-01-26  5:45             ` Alexei Starovoitov
2023-01-28 11:49               ` Yafang Shao
2023-01-30 13:14                 ` Uladzislau Rezki
2023-01-31  6:28                   ` Yafang Shao
