* [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map
@ 2022-06-19 15:50 Yafang Shao
  2022-06-19 15:50 ` [RFC PATCH bpf-next 01/10] mm, memcg: Add a new helper memcg_should_recharge() Yafang Shao
                   ` (11 more replies)
  0 siblings, 12 replies; 30+ messages in thread
From: Yafang Shao @ 2022-06-19 15:50 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, quentin, hannes, mhocko, roman.gushchin, shakeelb,
	songmuchun, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka
  Cc: linux-mm, bpf, Yafang Shao

After switching to memcg-based bpf memory accounting, bpf memory is
charged to the loader's memcg by default, which causes unexpected issues
for us. For instance, the container of the loader may be restarted after
pinning progs and maps, but the bpf memcg will be left behind, pinned on
the system. Once the loader's new-generation container is started, the
leftover pages won't be charged to it. That inconsistent behavior makes
memory resource management troublesome for this container.

In the past few days, I have proposed two patchsets[1][2] to try to
resolve this issue, but in both of those proposals the user code has to be
changed to adapt, which is a pain for us. This patchset relieves the pain
by triggering the recharge in libbpf. It also addresses Roman's critical
comments.

The key reason we can avoid changing the user code is that there's a reuse
path in libbpf. Once the bpf container is restarted, it will try to re-run
the required bpf programs; if the bpf programs are the same as the already
pinned ones, it will reuse them.

To make sure we either recharge all of them successfully or don't recharge
any of them, the recharge process is divided into three steps:
  - Pre charge to the new generation
    To make sure that once we uncharge from the old generation, we can
    always charge to the new generation successfully. If we can't pre
    charge to the new generation, we won't allow it to be uncharged from
    the old generation.
  - Uncharge from the old generation
    After pre charging to the new generation, we can uncharge from the old
    generation.
  - Post charge to the new generation
    Finally we can set the pages' memcg_data to the new generation.
In the pre charge step, we may succeed in charging some addresses but fail
to charge a new address; in that case we should uncharge the already
charged addresses, so another recharge-err step is introduced (see the
caller-side sketch below).
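
As an illustration only, a caller driving this protocol over several
addresses would look roughly like the sketch below (it uses the krecharge()
helper and step macros added later in this series; recharge_all() and
addrs[] are hypothetical):

	static bool recharge_all(void **addrs, int n)
	{
		int i;

		/* Step 1: pre charge every address to the new memcg. */
		for (i = 0; i < n; i++) {
			if (!krecharge(addrs[i], MEMCG_KMEM_PRE_CHARGE))
				goto err;
		}
		/* Step 2: uncharge every address from the old memcg. */
		for (i = 0; i < n; i++)
			krecharge(addrs[i], MEMCG_KMEM_UNCHARGE);
		/* Step 3: point the pages/objects at the new memcg. */
		for (i = 0; i < n; i++)
			krecharge(addrs[i], MEMCG_KMEM_POST_CHARGE);
		return true;

	err:
		/* recharge-err: roll back the pre charges that succeeded. */
		while (--i >= 0)
			krecharge(addrs[i], MEMCG_KMEM_CHARGE_ERR);
		return false;
	}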
 
This patchset implements recharging for the bpf hash map, which is the map
type mostly used by our bpf services. The other map types haven't been
implemented yet, nor have bpf progs.

The prev generation and the new generation may have the same parent; that
case can be optimized in the future.

In the discussion with Roman on the previous two proposals, he also
mentioned that leftover page caches have a similar issue.  There are key
differences between leftover page caches and leftover bpf programs:
  - The leftover page caches may not be reused again
    Because once a container exits, it may be deployed on another host
    next time for better resource management. That's why we fix leftover
    page caches by _trying_ to drop all of the container's page caches
    when it is exiting. But the bpf container will always be deployed on
    the same host next time; that's why bpf programs are pinned.
  - The leftover page caches can be reclaimed, but bpf memory can't.
    That means the leftover page caches can be tolerated while the leftover
    bpf memory can't.
Regardless of these differences, we can also extend this method to
recharge leftover page caches if we need it; for example, when we 'reuse'
a leftover inode, we recharge all its page caches to the new generation.
But unfortunately there's no such clear reuse path in the page cache
layer, so we would have to build a reuse path for it first:

      page cache's reuse path(X)           bpf's reuse path 
          |                                    |
   ------------------                   -------------
   | page cache layer|                  | bpf layer |
   ------------------                   -------------
      \                                     /
    page cache's recharge handler(X)     bpf's recharge handler
       \                                   /
       ------------------------------------
       |        Memcg layer               |
       |----------------------------------|

[1] https://lwn.net/Articles/887180/
[2] https://lwn.net/Articles/888549/


Yafang Shao (10):
  mm, memcg: Add a new helper memcg_should_recharge()
  bpftool: Show memcg info of bpf map
  mm, memcg: Add new helper obj_cgroup_from_current()
  mm, memcg: Make obj_cgroup_{charge, uncharge}_pages public
  mm: Add helper to recharge kmalloc'ed address
  mm: Add helper to recharge vmalloc'ed address
  mm: Add helper to recharge percpu address
  bpf: Recharge memory when reuse bpf map
  bpf: Make bpf_map_{save, release}_memcg public
  bpf: Support recharge for hash map

 include/linux/bpf.h            |  23 ++++++
 include/linux/memcontrol.h     |  22 ++++++
 include/linux/percpu.h         |   1 +
 include/linux/slab.h           |  18 +++++
 include/linux/vmalloc.h        |   2 +
 include/uapi/linux/bpf.h       |   4 +-
 kernel/bpf/hashtab.c           |  74 +++++++++++++++++++
 kernel/bpf/syscall.c           |  40 ++++++-----
 mm/memcontrol.c                |  35 +++++++--
 mm/percpu.c                    |  98 ++++++++++++++++++++++++++
 mm/slab.c                      |  85 ++++++++++++++++++++++
 mm/slob.c                      |   7 ++
 mm/slub.c                      | 125 +++++++++++++++++++++++++++++++++
 mm/util.c                      |   9 +++
 mm/vmalloc.c                   |  87 +++++++++++++++++++++++
 tools/bpf/bpftool/map.c        |   2 +
 tools/include/uapi/linux/bpf.h |   4 +-
 tools/lib/bpf/libbpf.c         |   2 +-
 18 files changed, 609 insertions(+), 29 deletions(-)

-- 
2.17.1



* [RFC PATCH bpf-next 01/10] mm, memcg: Add a new helper memcg_should_recharge()
  2022-06-19 15:50 [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Yafang Shao
@ 2022-06-19 15:50 ` Yafang Shao
  2022-06-19 15:50 ` [RFC PATCH bpf-next 02/10] bpftool: Show memcg info of bpf map Yafang Shao
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Yafang Shao @ 2022-06-19 15:50 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, quentin, hannes, mhocko, roman.gushchin, shakeelb,
	songmuchun, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka
  Cc: linux-mm, bpf, Yafang Shao

Currently we only recharge pages from an offline or reparented memcg. In
the future we may explicitly introduce a need_recharge state for memcgs
which should be recharged.
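
As a usage sketch only (the real callers arrive in later patches of this
series, and recharge_map() here is a hypothetical helper), a recharge path
is expected to gate on this helper like:

	if (memcg_need_recharge(map->memcg))
		recharge_map(map);	/* move charges to the current memcg */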

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/memcontrol.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9ecead1042b9..cf074156c6ac 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1755,6 +1755,18 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
 	rcu_read_unlock();
 }
 
+static inline bool memcg_need_recharge(struct mem_cgroup *memcg)
+{
+	if (!memcg || memcg == root_mem_cgroup)
+		return false;
+	/*
+	 * Currently we only recharge pages from an offline memcg,
+	 * in the future we may explicitly introduce a need_recharge
+	 * state for the memcg which should be recharged.
+	 */
+	return memcg->kmemcg_id == memcg->id.id ? false : true;
+}
+
 #else
 static inline bool mem_cgroup_kmem_disabled(void)
 {
@@ -1806,6 +1818,11 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
 {
 }
 
+static inline bool memcg_need_recharge(struct mem_cgroup *memcg)
+{
+	return false;
+}
+
 #endif /* CONFIG_MEMCG_KMEM */
 
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
-- 
2.17.1



* [RFC PATCH bpf-next 02/10] bpftool: Show memcg info of bpf map
  2022-06-19 15:50 [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Yafang Shao
  2022-06-19 15:50 ` [RFC PATCH bpf-next 01/10] mm, memcg: Add a new helper memcg_should_recharge() Yafang Shao
@ 2022-06-19 15:50 ` Yafang Shao
  2022-06-19 15:50 ` [RFC PATCH bpf-next 03/10] mm, memcg: Add new helper obj_cgroup_from_current() Yafang Shao
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Yafang Shao @ 2022-06-19 15:50 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, quentin, hannes, mhocko, roman.gushchin, shakeelb,
	songmuchun, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka
  Cc: linux-mm, bpf, Yafang Shao

A bpf map can be charged to a memcg, so we'd better show the memcg info to
help with bpf memory management. This patch adds a new field
"memcg_state" to show whether a bpf map is charged to a memcg and
whether that memcg is offline. Currently it has three values:
   0 : belongs to root memcg
  -1 : its original memcg is not online now
   1 : its original memcg is online

For instance,

$ bpftool map show
2: array  name iterator.rodata  flags 0x480
        key 4B  value 98B  max_entries 1  memlock 4096B
        btf_id 240  frozen
        memcg_state 0
3: hash  name calico_failsafe  flags 0x1
        key 4B  value 1B  max_entries 65535  memlock 524288B
        memcg_state 1
6: lru_hash  name access_record  flags 0x0
        key 8B  value 24B  max_entries 102400  memlock 3276800B
        btf_id 256
        memcg_state -1
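
For illustration, the new field can also be read programmatically; a
minimal sketch, assuming uapi headers from this series (map_fd is an
already-opened bpf map fd):

	#include <stdio.h>
	#include <bpf/bpf.h>

	static void show_memcg_state(int map_fd)
	{
		struct bpf_map_info info = {};
		__u32 len = sizeof(info);

		/* BPF_OBJ_GET_INFO_BY_FD fills in the new memcg_state field */
		if (!bpf_obj_get_info_by_fd(map_fd, &info, &len))
			printf("memcg_state %d\n", info.memcg_state);
	}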

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/uapi/linux/bpf.h       |  4 +++-
 kernel/bpf/syscall.c           | 11 +++++++++++
 tools/bpf/bpftool/map.c        |  2 ++
 tools/include/uapi/linux/bpf.h |  4 +++-
 4 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e81362891596..f2f658e224a7 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -6092,7 +6092,9 @@ struct bpf_map_info {
 	__u32 btf_id;
 	__u32 btf_key_type_id;
 	__u32 btf_value_type_id;
-	__u32 :32;	/* alignment pad */
+	__s8  memcg_state;
+	__s8  :8;	/* alignment pad */
+	__u16 :16;	/* alignment pad */
 	__u64 map_extra;
 } __attribute__((aligned(8)));
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 7d5af5b99f0d..d4659d58d288 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -4168,6 +4168,17 @@ static int bpf_map_get_info_by_fd(struct file *file,
 	}
 	info.btf_vmlinux_value_type_id = map->btf_vmlinux_value_type_id;
 
+#ifdef CONFIG_MEMCG_KMEM
+	if (map->memcg) {
+		struct mem_cgroup *memcg = map->memcg;
+
+		if (memcg == root_mem_cgroup)
+			info.memcg_state = 0;
+		else
+			info.memcg_state = memcg_need_recharge(memcg) ? -1 : 1;
+	}
+#endif
+
 	if (bpf_map_is_dev_bound(map)) {
 		err = bpf_map_offload_info_fill(&info, map);
 		if (err)
diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index 38b6bc9c26c3..ba68512e83aa 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -525,6 +525,7 @@ static int show_map_close_json(int fd, struct bpf_map_info *info)
 		jsonw_end_array(json_wtr);
 	}
 
+	jsonw_int_field(json_wtr, "memcg_state", info->memcg_state);
 	emit_obj_refs_json(refs_table, info->id, json_wtr);
 
 	jsonw_end_object(json_wtr);
@@ -615,6 +616,7 @@ static int show_map_close_plain(int fd, struct bpf_map_info *info)
 	if (frozen)
 		printf("%sfrozen", info->btf_id ? "  " : "");
 
+	printf("\n\tmemcg_state %d", info->memcg_state);
 	emit_obj_refs_plain(refs_table, info->id, "\n\tpids ");
 
 	printf("\n");
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index e81362891596..f2f658e224a7 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -6092,7 +6092,9 @@ struct bpf_map_info {
 	__u32 btf_id;
 	__u32 btf_key_type_id;
 	__u32 btf_value_type_id;
-	__u32 :32;	/* alignment pad */
+	__s8  memcg_state;
+	__s8  :8;	/* alignment pad */
+	__u16 :16;	/* alignment pad */
 	__u64 map_extra;
 } __attribute__((aligned(8)));
 
-- 
2.17.1



* [RFC PATCH bpf-next 03/10] mm, memcg: Add new helper obj_cgroup_from_current()
  2022-06-19 15:50 [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Yafang Shao
  2022-06-19 15:50 ` [RFC PATCH bpf-next 01/10] mm, memcg: Add a new helper memcg_should_recharge() Yafang Shao
  2022-06-19 15:50 ` [RFC PATCH bpf-next 02/10] bpftool: Show memcg info of bpf map Yafang Shao
@ 2022-06-19 15:50 ` Yafang Shao
  2022-06-23  3:01   ` Roman Gushchin
  2022-06-19 15:50 ` [RFC PATCH bpf-next 04/10] mm, memcg: Make obj_cgroup_{charge, uncharge}_pages public Yafang Shao
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 30+ messages in thread
From: Yafang Shao @ 2022-06-19 15:50 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, quentin, hannes, mhocko, roman.gushchin, shakeelb,
	songmuchun, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka
  Cc: linux-mm, bpf, Yafang Shao

The difference between get_obj_cgroup_from_current() and obj_cgroup_from_current()
is that the latter doesn't take a reference on the objcg.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/memcontrol.h |  1 +
 mm/memcontrol.c            | 24 ++++++++++++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index cf074156c6ac..402b42670bcd 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1703,6 +1703,7 @@ bool mem_cgroup_kmem_disabled(void);
 int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order);
 void __memcg_kmem_uncharge_page(struct page *page, int order);
 
+struct obj_cgroup *obj_cgroup_from_current(void);
 struct obj_cgroup *get_obj_cgroup_from_current(void);
 struct obj_cgroup *get_obj_cgroup_from_page(struct page *page);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index abec50f31fe6..350a7849dac3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2950,6 +2950,30 @@ struct obj_cgroup *get_obj_cgroup_from_page(struct page *page)
 	return objcg;
 }
 
+__always_inline struct obj_cgroup *obj_cgroup_from_current(void)
+{
+	struct obj_cgroup *objcg = NULL;
+	struct mem_cgroup *memcg;
+
+	if (memcg_kmem_bypass())
+		return NULL;
+
+	rcu_read_lock();
+	if (unlikely(active_memcg()))
+		memcg = active_memcg();
+	else
+		memcg = mem_cgroup_from_task(current);
+
+	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
+		objcg = rcu_dereference(memcg->objcg);
+		if (objcg)
+			break;
+	}
+	rcu_read_unlock();
+
+	return objcg;
+}
+
 static void memcg_account_kmem(struct mem_cgroup *memcg, int nr_pages)
 {
 	mod_memcg_state(memcg, MEMCG_KMEM, nr_pages);
-- 
2.17.1



* [RFC PATCH bpf-next 04/10] mm, memcg: Make obj_cgroup_{charge, uncharge}_pages public
  2022-06-19 15:50 [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Yafang Shao
                   ` (2 preceding siblings ...)
  2022-06-19 15:50 ` [RFC PATCH bpf-next 03/10] mm, memcg: Add new helper obj_cgroup_from_current() Yafang Shao
@ 2022-06-19 15:50 ` Yafang Shao
  2022-06-19 15:50 ` [RFC PATCH bpf-next 05/10] mm: Add helper to recharge kmalloc'ed address Yafang Shao
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Yafang Shao @ 2022-06-19 15:50 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, quentin, hannes, mhocko, roman.gushchin, shakeelb,
	songmuchun, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka
  Cc: linux-mm, bpf, Yafang Shao

Make these two helpers public for later use.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/memcontrol.h |  4 ++++
 mm/memcontrol.c            | 11 ++++-------
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 402b42670bcd..ec4637687d6a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1709,6 +1709,10 @@ struct obj_cgroup *get_obj_cgroup_from_page(struct page *page);
 
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
 void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
+int obj_cgroup_charge_pages(struct obj_cgroup *objcg, gfp_t gfp,
+			    unsigned int nr_pages);
+void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
+			       unsigned int nr_pages);
 
 extern struct static_key_false memcg_kmem_enabled_key;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 350a7849dac3..0ba321afba3b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -260,9 +260,6 @@ bool mem_cgroup_kmem_disabled(void)
 	return cgroup_memory_nokmem;
 }
 
-static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
-				      unsigned int nr_pages);
-
 static void obj_cgroup_release(struct percpu_ref *ref)
 {
 	struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
@@ -2991,8 +2988,8 @@ static void memcg_account_kmem(struct mem_cgroup *memcg, int nr_pages)
  * @objcg: object cgroup to uncharge
  * @nr_pages: number of pages to uncharge
  */
-static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
-				      unsigned int nr_pages)
+void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
+				unsigned int nr_pages)
 {
 	struct mem_cgroup *memcg;
 
@@ -3012,8 +3009,8 @@ static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
  *
  * Returns 0 on success, an error code on failure.
  */
-static int obj_cgroup_charge_pages(struct obj_cgroup *objcg, gfp_t gfp,
-				   unsigned int nr_pages)
+int obj_cgroup_charge_pages(struct obj_cgroup *objcg, gfp_t gfp,
+			    unsigned int nr_pages)
 {
 	struct mem_cgroup *memcg;
 	int ret;
-- 
2.17.1



* [RFC PATCH bpf-next 05/10] mm: Add helper to recharge kmalloc'ed address
  2022-06-19 15:50 [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Yafang Shao
                   ` (3 preceding siblings ...)
  2022-06-19 15:50 ` [RFC PATCH bpf-next 04/10] mm, memcg: Make obj_cgroup_{charge, uncharge}_pages public Yafang Shao
@ 2022-06-19 15:50 ` Yafang Shao
  2022-06-19 15:50 ` [RFC PATCH bpf-next 06/10] mm: Add helper to recharge vmalloc'ed address Yafang Shao
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Yafang Shao @ 2022-06-19 15:50 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, quentin, hannes, mhocko, roman.gushchin, shakeelb,
	songmuchun, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka
  Cc: linux-mm, bpf, Yafang Shao

This patch introduces a helper to recharge the corresponding pages of a
given kmalloc'ed address. The recharge is divided into three steps,
  - pre charge to the new memcg
    To make sure that once we uncharge from the old memcg, we can always
    charge to the new memcg successfully. If we can't pre charge to the
    new memcg, we won't allow it to be uncharged from the old memcg.
  - uncharge from the old memcg
    After pre charging to the new memcg, we can uncharge from the old memcg.
  - post charge to the new memcg
    Modify the counters of the new memcg.

Sometimes we may want to recharge many kmalloc'ed addresses to the same
memcg; in that case we should pre charge all these addresses first, then
do the uncharge and finally do the post charge. But it may happen that
after successfully pre charging some addresses we fail to pre charge a new
address; then we have to cancel the already finished pre charges, so a
charge-err step is introduced for this purpose.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/slab.h |  17 ++++++
 mm/slab.c            |  85 +++++++++++++++++++++++++++++
 mm/slob.c            |   7 +++
 mm/slub.c            | 125 +++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 234 insertions(+)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 0fefdf528e0d..18ab30aa8fe8 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -194,6 +194,23 @@ bool kmem_valid_obj(void *object);
 void kmem_dump_obj(void *object);
 #endif
 
+/*
+ * The recharge will be separated into three steps:
+ *	MEMCG_KMEM_PRE_CHARGE  : pre charge to the new memcg
+ *	MEMCG_KMEM_UNCHARGE    : uncharge from the old memcg
+ *	MEMCG_KMEM_POST_CHARGE : post charge to the new memcg
+ * and an error handler:
+ *	MEMCG_KMEM_CHARGE_ERR  : in pre charge state, we may succeed to
+ *	                         charge some objp's but fail to charge
+ *	                         a new one, then in this case we should
+ *	                         uncharge the already charged objp's.
+ */
+#define MEMCG_KMEM_PRE_CHARGE	0
+#define MEMCG_KMEM_UNCHARGE	1
+#define MEMCG_KMEM_POST_CHARGE	2
+#define MEMCG_KMEM_CHARGE_ERR	3
+bool krecharge(const void *objp, int step);
+
 /*
  * Some archs want to perform DMA into kmalloc caches and need a guaranteed
  * alignment larger than the alignment of a 64-bit integer.
diff --git a/mm/slab.c b/mm/slab.c
index f8cd00f4ba13..4795014edd30 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3798,6 +3798,91 @@ void kfree(const void *objp)
 }
 EXPORT_SYMBOL(kfree);
 
+bool krecharge(const void *objp, int step)
+{
+	void *object = (void *)objp;
+	struct obj_cgroup *objcg_old;
+	struct obj_cgroup *objcg_new;
+	struct obj_cgroup **objcgs;
+	struct kmem_cache *s;
+	struct slab *slab;
+	unsigned long flags;
+	unsigned int off;
+
+	WARN_ON(!in_task());
+
+	if (unlikely(ZERO_OR_NULL_PTR(objp)))
+		return true;
+
+	if (!memcg_kmem_enabled())
+		return true;
+
+	local_irq_save(flags);
+	s = virt_to_cache(objp);
+	if (!s)
+		goto out;
+
+	if (!(s->flags & SLAB_ACCOUNT))
+		goto out;
+
+	slab = virt_to_slab(object);
+	if (!slab)
+		goto out;
+
+	objcgs = slab_objcgs(slab);
+	if (!objcgs)
+		goto out;
+
+	off = obj_to_index(s, slab, object);
+	objcg_old = objcgs[off];
+	if (!objcg_old && step != MEMCG_KMEM_POST_CHARGE)
+		goto out;
+
+	/*
+	 *  The recharge can be separated into three steps,
+	 *  1. Pre charge to the new memcg
+	 *  2. Uncharge from the old memcg
+	 *  3. Charge to the new memcg
+	 */
+	switch (step) {
+	case MEMCG_KMEM_PRE_CHARGE:
+		/* Pre recharge */
+		objcg_new = get_obj_cgroup_from_current();
+		WARN_ON(!objcg_new);
+		if (obj_cgroup_charge(objcg_new, GFP_KERNEL, obj_full_size(s))) {
+			obj_cgroup_put(objcg_new);
+			local_irq_restore(flags);
+			return false;
+		}
+		break;
+	case MEMCG_KMEM_UNCHARGE:
+		/* Uncharge from the old memcg */
+		obj_cgroup_uncharge(objcg_old, obj_full_size(s));
+		objcgs[off] = NULL;
+		mod_objcg_state(objcg_old, slab_pgdat(slab), cache_vmstat_idx(s),
+				-obj_full_size(s));
+		obj_cgroup_put(objcg_old);
+		break;
+	case MEMCG_KMEM_POST_CHARGE:
+		/* Charge to the new memcg */
+		objcg_new = obj_cgroup_from_current();
+		objcgs[off] = objcg_new;
+		mod_objcg_state(objcg_new, slab_pgdat(slab), cache_vmstat_idx(s), obj_full_size(s));
+		break;
+	case MEMCG_KMEM_CHARGE_ERR:
+		objcg_new = obj_cgroup_from_current();
+		obj_cgroup_uncharge(objcg_new, obj_full_size(s));
+		obj_cgroup_put(objcg_new);
+		break;
+	}
+
+out:
+	local_irq_restore(flags);
+
+	return true;
+}
+EXPORT_SYMBOL(krecharge);
+
 /*
  * This initializes kmem_cache_node or resizes various caches for all nodes.
  */
diff --git a/mm/slob.c b/mm/slob.c
index f47811f09aca..6d68ad57b4a2 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -574,6 +574,13 @@ void kfree(const void *block)
 }
 EXPORT_SYMBOL(kfree);
 
+/* kmemcg is not supported for SLOB */
+bool krecharge(const void *block, int step)
+{
+	return true;
+}
+EXPORT_SYMBOL(krecharge);
+
 /* can't use ksize for kmem_cache_alloc memory, only kmalloc */
 size_t __ksize(const void *block)
 {
diff --git a/mm/slub.c b/mm/slub.c
index e5535020e0fd..ef6475ed6407 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4556,6 +4556,131 @@ void kfree(const void *x)
 }
 EXPORT_SYMBOL(kfree);
 
+bool krecharge(const void *x, int step)
+{
+	void *object = (void *)x;
+	struct obj_cgroup *objcg_old;
+	struct obj_cgroup *objcg_new;
+	struct obj_cgroup **objcgs;
+	struct kmem_cache *s;
+	struct folio *folio;
+	struct slab *slab;
+	unsigned int off;
+
+	WARN_ON(!in_task());
+
+	if (!memcg_kmem_enabled())
+		return true;
+
+	if (unlikely(ZERO_OR_NULL_PTR(x)))
+		return true;
+
+	folio = virt_to_folio(x);
+	if (unlikely(!folio_test_slab(folio))) {
+		unsigned int order = folio_order(folio);
+		struct page *page;
+
+		switch (step) {
+		case MEMCG_KMEM_PRE_CHARGE:
+			objcg_new = get_obj_cgroup_from_current();
+			WARN_ON(!objcg_new);
+			/* Try charge current memcg */
+			if (obj_cgroup_charge_pages(objcg_new, GFP_KERNEL,
+						    1 << order)) {
+				obj_cgroup_put(objcg_new);
+				return false;
+			}
+			break;
+		case MEMCG_KMEM_UNCHARGE:
+			/* Uncharge folio memcg */
+			objcg_old = __folio_objcg(folio);
+			page = folio_page(folio, 0);
+			WARN_ON(!objcg_old);
+			obj_cgroup_uncharge_pages(objcg_old, 1 << order);
+			mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE_B,
+						-(PAGE_SIZE << order));
+			page->memcg_data = 0;
+			obj_cgroup_put(objcg_old);
+			break;
+		case MEMCG_KMEM_POST_CHARGE:
+			/* Set current memcg to folio page */
+			objcg_new = obj_cgroup_from_current();
+			page = folio_page(folio, 0);
+			page->memcg_data = (unsigned long)objcg_new | MEMCG_DATA_KMEM;
+			mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE_B,
+						-(PAGE_SIZE << order));
+			break;
+		case MEMCG_KMEM_CHARGE_ERR:
+			objcg_new = obj_cgroup_from_current();
+			obj_cgroup_uncharge_pages(objcg_new, 1 << order);
+			obj_cgroup_put(objcg_new);
+			break;
+		}
+		return true;
+	}
+
+	slab = folio_slab(folio);
+	if (!slab)
+		return true;
+
+	s = slab->slab_cache;
+	if (!(s->flags & SLAB_ACCOUNT))
+		return true;
+
+	objcgs = slab_objcgs(slab);
+	if (!objcgs)
+		return true;
+	off = obj_to_index(s, slab, object);
+	objcg_old = objcgs[off];
+	/* In step MEMCG_KMEM_UNCHARGE, the objcg will be set to NULL. */
+	if (!objcg_old && step != MEMCG_KMEM_POST_CHARGE)
+		return true;
+
+	/*
+	 *  The recharge can be separated into three steps,
+	 *  1. Pre charge to the new memcg
+	 *  2. Uncharge from the old memcg
+	 *  3. Charge to the new memcg
+	 */
+	switch (step) {
+	case MEMCG_KMEM_PRE_CHARGE:
+		/*
+		 * Before uncharge from the old memcg, we must pre charge the new memcg
+		 * first, to make sure it always succeed to recharge to the new memcg
+		 * after uncharge from the old memcg.
+		 */
+		objcg_new = get_obj_cgroup_from_current();
+		WARN_ON(!objcg_new);
+		if (obj_cgroup_charge(objcg_new, GFP_KERNEL, obj_full_size(s))) {
+			obj_cgroup_put(objcg_new);
+			return false;
+		}
+		break;
+	case MEMCG_KMEM_UNCHARGE:
+		/* Uncharge from old memcg */
+		obj_cgroup_uncharge(objcg_old, obj_full_size(s));
+		objcgs[off] = NULL;
+		mod_objcg_state(objcg_old, slab_pgdat(slab), cache_vmstat_idx(s),
+				-obj_full_size(s));
+		obj_cgroup_put(objcg_old);
+		break;
+	case MEMCG_KMEM_POST_CHARGE:
+		/* Charge to the new memcg */
+		objcg_new = obj_cgroup_from_current();
+		objcgs[off] = objcg_new;
+		mod_objcg_state(objcg_new, slab_pgdat(slab), cache_vmstat_idx(s), obj_full_size(s));
+		break;
+	case MEMCG_KMEM_CHARGE_ERR:
+		objcg_new = obj_cgroup_from_current();
+		obj_cgroup_uncharge(objcg_new, obj_full_size(s));
+		obj_cgroup_put(objcg_new);
+		break;
+	}
+
+	return true;
+}
+EXPORT_SYMBOL(krecharge);
+
 #define SHRINK_PROMOTE_MAX 32
 
 /*
-- 
2.17.1



* [RFC PATCH bpf-next 06/10] mm: Add helper to recharge vmalloc'ed address
  2022-06-19 15:50 [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Yafang Shao
                   ` (4 preceding siblings ...)
  2022-06-19 15:50 ` [RFC PATCH bpf-next 05/10] mm: Add helper to recharge kmalloc'ed address Yafang Shao
@ 2022-06-19 15:50 ` Yafang Shao
  2022-06-19 15:50 ` [RFC PATCH bpf-next 07/10] mm: Add helper to recharge percpu address Yafang Shao
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Yafang Shao @ 2022-06-19 15:50 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, quentin, hannes, mhocko, roman.gushchin, shakeelb,
	songmuchun, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka
  Cc: linux-mm, bpf, Yafang Shao

This patch introduces a helper to recharge the corresponding pages
of a given vmalloc'ed address. It is similar to how a kmalloc'ed
address is recharged.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/slab.h    |  1 +
 include/linux/vmalloc.h |  2 +
 mm/util.c               |  9 +++++
 mm/vmalloc.c            | 87 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 99 insertions(+)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 18ab30aa8fe8..e8fb0f6a3660 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -794,6 +794,7 @@ extern void *kvrealloc(const void *p, size_t oldsize, size_t newsize, gfp_t flag
 		      __alloc_size(3);
 extern void kvfree(const void *addr);
 extern void kvfree_sensitive(const void *addr, size_t len);
+bool kvrecharge(const void *addr, int step);
 
 unsigned int kmem_cache_size(struct kmem_cache *s);
 void __init kmem_cache_init_late(void);
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 096d48aa3437..37c6d0e7b8d5 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -162,6 +162,8 @@ extern void *vcalloc(size_t n, size_t size) __alloc_size(1, 2);
 
 extern void vfree(const void *addr);
 extern void vfree_atomic(const void *addr);
+bool vrecharge(const void *addr, int step);
+void vuncharge(const void *addr);
 
 extern void *vmap(struct page **pages, unsigned int count,
 			unsigned long flags, pgprot_t prot);
diff --git a/mm/util.c b/mm/util.c
index 0837570c9225..312c05e83132 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -656,6 +656,15 @@ void kvfree(const void *addr)
 }
 EXPORT_SYMBOL(kvfree);
 
+bool kvrecharge(const void *addr, int step)
+{
+	if (is_vmalloc_addr(addr))
+		return vrecharge(addr, step);
+
+	return krecharge(addr, step);
+}
+EXPORT_SYMBOL(kvrecharge);
+
 /**
  * kvfree_sensitive - Free a data object containing sensitive information.
  * @addr: address of the data object to be freed.
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index effd1ff6a4b4..7da6e429a45f 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2745,6 +2745,93 @@ void vfree(const void *addr)
 }
 EXPORT_SYMBOL(vfree);
 
+bool vrecharge(const void *addr, int step)
+{
+	struct obj_cgroup *objcg_new;
+	unsigned int page_order;
+	struct vm_struct *area;
+	struct folio *folio;
+	int i;
+
+	WARN_ON(!in_task());
+
+	if (!addr)
+		return true;
+
+	area = find_vm_area(addr);
+	if (unlikely(!area))
+		return true;
+
+	page_order = vm_area_page_order(area);
+
+	switch (step) {
+	case MEMCG_KMEM_PRE_CHARGE:
+		for (i = 0; i < area->nr_pages; i += 1U << page_order) {
+			struct page *page = area->pages[i];
+
+			WARN_ON(!page);
+			objcg_new = get_obj_cgroup_from_current();
+			WARN_ON(!objcg_new);
+			if (obj_cgroup_charge_pages(objcg_new, GFP_KERNEL,
+						    1 << page_order))
+				goto out_pre;
+			cond_resched();
+		}
+		break;
+	case MEMCG_KMEM_UNCHARGE:
+		for (i = 0; i < area->nr_pages; i += 1U << page_order) {
+			struct page *page = area->pages[i];
+			struct obj_cgroup *objcg_old;
+
+			WARN_ON(!page);
+			folio = page_folio(page);
+			WARN_ON(!folio_memcg_kmem(folio));
+			objcg_old = __folio_objcg(folio);
+
+			obj_cgroup_uncharge_pages(objcg_old, 1 << page_order);
+			/* mod memcg from page */
+			mod_memcg_state(page_memcg(page), MEMCG_VMALLOC,
+					-(1U << page_order));
+			page->memcg_data = 0;
+			obj_cgroup_put(objcg_old);
+			cond_resched();
+		}
+		break;
+	case MEMCG_KMEM_POST_CHARGE:
+		objcg_new = obj_cgroup_from_current();
+		for (i = 0; i < area->nr_pages; i += 1U << page_order) {
+			struct page *page = area->pages[i];
+
+			page->memcg_data = (unsigned long)objcg_new | MEMCG_DATA_KMEM;
+			/* mod memcg from current */
+			mod_memcg_state(page_memcg(page), MEMCG_VMALLOC,
+					1U << page_order);
+
+		}
+		break;
+	case MEMCG_KMEM_CHARGE_ERR:
+		objcg_new = obj_cgroup_from_current();
+		for (i = 0; i < area->nr_pages; i += 1U << page_order) {
+			obj_cgroup_uncharge_pages(objcg_new, 1 << page_order);
+			obj_cgroup_put(objcg_new);
+			cond_resched();
+		}
+		break;
+	}
+
+	return true;
+
+out_pre:
+	for (; i > 0; i -= 1U << page_order) {
+		obj_cgroup_uncharge_pages(objcg_new, 1 << page_order);
+		obj_cgroup_put(objcg_new);
+		cond_resched();
+	}
+
+	return false;
+}
+EXPORT_SYMBOL(vrecharge);
+
 /**
  * vunmap - release virtual mapping obtained by vmap()
  * @addr:   memory base address
-- 
2.17.1



* [RFC PATCH bpf-next 07/10] mm: Add helper to recharge percpu address
  2022-06-19 15:50 [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Yafang Shao
                   ` (5 preceding siblings ...)
  2022-06-19 15:50 ` [RFC PATCH bpf-next 06/10] mm: Add helper to recharge vmalloc'ed address Yafang Shao
@ 2022-06-19 15:50 ` Yafang Shao
  2022-06-23  5:25   ` Dennis Zhou
  2022-06-19 15:50 ` [RFC PATCH bpf-next 08/10] bpf: Recharge memory when reuse bpf map Yafang Shao
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 30+ messages in thread
From: Yafang Shao @ 2022-06-19 15:50 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, quentin, hannes, mhocko, roman.gushchin, shakeelb,
	songmuchun, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka
  Cc: linux-mm, bpf, Yafang Shao

This patch introduces a helper to recharge the corresponding pages of
a given percpu address. It is similar to how a kmalloc'ed address is
recharged.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/percpu.h |  1 +
 mm/percpu.c            | 98 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 99 insertions(+)

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index f1ec5ad1351c..e88429410179 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -128,6 +128,7 @@ extern void __init setup_per_cpu_areas(void);
 extern void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp) __alloc_size(1);
 extern void __percpu *__alloc_percpu(size_t size, size_t align) __alloc_size(1);
 extern void free_percpu(void __percpu *__pdata);
+bool recharge_percpu(void __percpu *__pdata, int step);
 extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
 
 #define alloc_percpu_gfp(type, gfp)					\
diff --git a/mm/percpu.c b/mm/percpu.c
index 3633eeefaa0d..fd81f4d79f2f 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -2310,6 +2310,104 @@ void free_percpu(void __percpu *ptr)
 }
 EXPORT_SYMBOL_GPL(free_percpu);
 
+#ifdef CONFIG_MEMCG_KMEM
+bool recharge_percpu(void __percpu *ptr, int step)
+{
+	int bit_off, off, bits, size, end;
+	struct obj_cgroup *objcg_old;
+	struct obj_cgroup *objcg_new;
+	struct pcpu_chunk *chunk;
+	unsigned long flags;
+	void *addr;
+
+	WARN_ON(!in_task());
+
+	if (!ptr)
+		return true;
+
+	addr = __pcpu_ptr_to_addr(ptr);
+	spin_lock_irqsave(&pcpu_lock, flags);
+	chunk = pcpu_chunk_addr_search(addr);
+	off = addr - chunk->base_addr;
+	objcg_old = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
+	if (!objcg_old && step != MEMCG_KMEM_POST_CHARGE) {
+		spin_unlock_irqrestore(&pcpu_lock, flags);
+		return true;
+	}
+
+	bit_off = off / PCPU_MIN_ALLOC_SIZE;
+	/* find end index */
+	end = find_next_bit(chunk->bound_map, pcpu_chunk_map_bits(chunk),
+			bit_off + 1);
+	bits = end - bit_off;
+	size = bits * PCPU_MIN_ALLOC_SIZE;
+
+	switch (step) {
+	case MEMCG_KMEM_PRE_CHARGE:
+		objcg_new = get_obj_cgroup_from_current();
+		WARN_ON(!objcg_new);
+		if (obj_cgroup_charge(objcg_new, GFP_KERNEL,
+				      size * num_possible_cpus())) {
+			obj_cgroup_put(objcg_new);
+			spin_unlock_irqrestore(&pcpu_lock, flags);
+			return false;
+		}
+		break;
+	case MEMCG_KMEM_UNCHARGE:
+		obj_cgroup_uncharge(objcg_old, size * num_possible_cpus());
+		rcu_read_lock();
+		mod_memcg_state(obj_cgroup_memcg(objcg_old), MEMCG_PERCPU_B,
+			-(size * num_possible_cpus()));
+		rcu_read_unlock();
+		chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL;
+		obj_cgroup_put(objcg_old);
+		break;
+	case MEMCG_KMEM_POST_CHARGE:
+		rcu_read_lock();
+		chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = obj_cgroup_from_current();
+		mod_memcg_state(mem_cgroup_from_task(current), MEMCG_PERCPU_B,
+			(size * num_possible_cpus()));
+		rcu_read_unlock();
+		break;
+	case MEMCG_KMEM_CHARGE_ERR:
+		/*
+		 * In case we fail to charge the new memcg in the pre charge step,
+		 * for example, we have pre-charged one address successfully but fail
+		 * to pre-charge the second address, then we should uncharge the first
+		 * one.
+		 */
+		objcg_new = obj_cgroup_from_current();
+		obj_cgroup_uncharge(objcg_new, size * num_possible_cpus());
+		obj_cgroup_put(objcg_new);
+		rcu_read_lock();
+		mod_memcg_state(obj_cgroup_memcg(objcg_new), MEMCG_PERCPU_B,
+			-(size * num_possible_cpus()));
+		rcu_read_unlock();
+
+		break;
+	}
+
+	spin_unlock_irqrestore(&pcpu_lock, flags);
+
+	return true;
+}
+EXPORT_SYMBOL(recharge_percpu);
+
+#else /* CONFIG_MEMCG_KMEM */
+
+bool charge_percpu(void __percpu *ptr, bool charge)
+{
+	return true;
+}
+EXPORT_SYMBOL(charge_percpu);
+
+void uncharge_percpu(void __percpu *ptr)
+{
+}
+EXPORT_SYMBOL(uncharge_percpu);
+
+#endif /* CONFIG_MEMCG_KMEM */
+
 bool __is_kernel_percpu_address(unsigned long addr, unsigned long *can_addr)
 {
 #ifdef CONFIG_SMP
-- 
2.17.1



* [RFC PATCH bpf-next 08/10] bpf: Recharge memory when reuse bpf map
  2022-06-19 15:50 [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Yafang Shao
                   ` (6 preceding siblings ...)
  2022-06-19 15:50 ` [RFC PATCH bpf-next 07/10] mm: Add helper to recharge percpu address Yafang Shao
@ 2022-06-19 15:50 ` Yafang Shao
  2022-06-19 15:50 ` [RFC PATCH bpf-next 09/10] bpf: Make bpf_map_{save, release}_memcg public Yafang Shao
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Yafang Shao @ 2022-06-19 15:50 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, quentin, hannes, mhocko, roman.gushchin, shakeelb,
	songmuchun, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka
  Cc: linux-mm, bpf, Yafang Shao

When we reuse a pinned bpf map, if it belongs to a memcg which needs to
be recharged, we uncharge the pages of this bpf map from its original
memcg and then charge them to the current memcg.

We have to explicitly tell the kernel that it is a reuse path, as the
kernel can't detect that on its own. That can be done in libbpf, so the
user code doesn't need to be changed; a rough loader-side sketch follows
below.
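
For illustration only (the object, map and pin path names are made up), a
loader that reuses a pinned map goes through roughly the following path;
bpf_map__reuse_fd() is where libbpf now asks for the recharge:

	struct bpf_object *obj = bpf_object__open_file("prog.bpf.o", NULL);
	struct bpf_map *map = bpf_object__find_map_by_name(obj, "my_map");
	int pinned_fd = bpf_obj_get("/sys/fs/bpf/my_map");

	/* libbpf passes memcg_recharge = 1 to BPF_OBJ_GET_INFO_BY_FD, so the
	 * kernel recharges the pinned map to the current container's memcg. */
	if (pinned_fd >= 0)
		bpf_map__reuse_fd(map, pinned_fd);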

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/bpf.h            |  2 ++
 include/uapi/linux/bpf.h       |  2 +-
 kernel/bpf/syscall.c           | 10 ++++++++++
 tools/include/uapi/linux/bpf.h |  2 +-
 tools/lib/bpf/libbpf.c         |  2 +-
 5 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 0edd7d2c0064..b18a30e70507 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -152,6 +152,8 @@ struct bpf_map_ops {
 				     bpf_callback_t callback_fn,
 				     void *callback_ctx, u64 flags);
 
+	bool (*map_memcg_recharge)(struct bpf_map *map);
+
 	/* BTF id of struct allocated by map_alloc */
 	int *map_btf_id;
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f2f658e224a7..ffbe15c1c8c6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -6093,7 +6093,7 @@ struct bpf_map_info {
 	__u32 btf_key_type_id;
 	__u32 btf_value_type_id;
 	__s8  memcg_state;
-	__s8  :8;	/* alignment pad */
+	__s8  memcg_recharge;
 	__u16 :16;	/* alignment pad */
 	__u64 map_extra;
 } __attribute__((aligned(8)));
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index d4659d58d288..8817c40275f3 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -4170,12 +4170,22 @@ static int bpf_map_get_info_by_fd(struct file *file,
 
 #ifdef CONFIG_MEMCG_KMEM
 	if (map->memcg) {
+		size_t offset = offsetof(struct bpf_map_info, memcg_recharge);
 		struct mem_cgroup *memcg = map->memcg;
+		char recharge;
 
 		if (memcg == root_mem_cgroup)
 			info.memcg_state = 0;
 		else
 			info.memcg_state = memcg_need_recharge(memcg) ? -1 : 1;
+
+		if (copy_from_user(&recharge, (char __user *)uinfo + offset, sizeof(char)))
+			return -EFAULT;
+
+		if (recharge && memcg_need_recharge(memcg)) {
+			if (map->ops->map_memcg_recharge)
+				map->ops->map_memcg_recharge(map);
+		}
 	}
 #endif
 
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index f2f658e224a7..ffbe15c1c8c6 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -6093,7 +6093,7 @@ struct bpf_map_info {
 	__u32 btf_key_type_id;
 	__u32 btf_value_type_id;
 	__s8  memcg_state;
-	__s8  :8;	/* alignment pad */
+	__s8  memcg_recharge;
 	__u16 :16;	/* alignment pad */
 	__u64 map_extra;
 } __attribute__((aligned(8)));
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 49e359cd34df..f0eb67c983d8 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -4488,7 +4488,7 @@ int bpf_map__set_autocreate(struct bpf_map *map, bool autocreate)
 
 int bpf_map__reuse_fd(struct bpf_map *map, int fd)
 {
-	struct bpf_map_info info = {};
+	struct bpf_map_info info = {.memcg_recharge = 1};
 	__u32 len = sizeof(info);
 	int new_fd, err;
 	char *new_name;
-- 
2.17.1



* [RFC PATCH bpf-next 09/10] bpf: Make bpf_map_{save, release}_memcg public
  2022-06-19 15:50 [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Yafang Shao
                   ` (7 preceding siblings ...)
  2022-06-19 15:50 ` [RFC PATCH bpf-next 08/10] bpf: Recharge memory when reuse bpf map Yafang Shao
@ 2022-06-19 15:50 ` Yafang Shao
  2022-06-19 15:50 ` [RFC PATCH bpf-next 10/10] bpf: Support recharge for hash map Yafang Shao
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Yafang Shao @ 2022-06-19 15:50 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, quentin, hannes, mhocko, roman.gushchin, shakeelb,
	songmuchun, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka
  Cc: linux-mm, bpf, Yafang Shao

These two helpers will be used in map-specific files later.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/bpf.h  | 21 +++++++++++++++++++++
 kernel/bpf/syscall.c | 19 -------------------
 2 files changed, 21 insertions(+), 19 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index b18a30e70507..a0f21d4382ff 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -27,6 +27,7 @@
 #include <linux/bpfptr.h>
 #include <linux/btf.h>
 #include <linux/rcupdate_trace.h>
+#include <linux/memcontrol.h>
 
 struct bpf_verifier_env;
 struct bpf_verifier_log;
@@ -248,6 +249,26 @@ struct bpf_map {
 	bool frozen; /* write-once; write-protected by freeze_mutex */
 };
 
+#ifdef CONFIG_MEMCG_KMEM
+static inline void bpf_map_save_memcg(struct bpf_map *map)
+{
+	map->memcg = get_mem_cgroup_from_mm(current->mm);
+}
+
+static inline void bpf_map_release_memcg(struct bpf_map *map)
+{
+	mem_cgroup_put(map->memcg);
+}
+#else
+static inline void bpf_map_save_memcg(struct bpf_map *map)
+{
+}
+
+static inline void bpf_map_release_memcg(struct bpf_map *map)
+{
+}
+#endif
+
 static inline bool map_value_has_spin_lock(const struct bpf_map *map)
 {
 	return map->spin_lock_off >= 0;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8817c40275f3..5159b97d1064 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -417,16 +417,6 @@ void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock)
 }
 
 #ifdef CONFIG_MEMCG_KMEM
-static void bpf_map_save_memcg(struct bpf_map *map)
-{
-	map->memcg = get_mem_cgroup_from_mm(current->mm);
-}
-
-static void bpf_map_release_memcg(struct bpf_map *map)
-{
-	mem_cgroup_put(map->memcg);
-}
-
 void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
 			   int node)
 {
@@ -464,15 +454,6 @@ void __percpu *bpf_map_alloc_percpu(const struct bpf_map *map, size_t size,
 
 	return ptr;
 }
-
-#else
-static void bpf_map_save_memcg(struct bpf_map *map)
-{
-}
-
-static void bpf_map_release_memcg(struct bpf_map *map)
-{
-}
 #endif
 
 static int bpf_map_kptr_off_cmp(const void *a, const void *b)
-- 
2.17.1



* [RFC PATCH bpf-next 10/10] bpf: Support recharge for hash map
  2022-06-19 15:50 [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Yafang Shao
                   ` (8 preceding siblings ...)
  2022-06-19 15:50 ` [RFC PATCH bpf-next 09/10] bpf: Make bpf_map_{save, release}_memcg public Yafang Shao
@ 2022-06-19 15:50 ` Yafang Shao
  2022-06-21 23:28 ` [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Alexei Starovoitov
  2022-06-23  3:29 ` Roman Gushchin
  11 siblings, 0 replies; 30+ messages in thread
From: Yafang Shao @ 2022-06-19 15:50 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, quentin, hannes, mhocko, roman.gushchin, shakeelb,
	songmuchun, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka
  Cc: linux-mm, bpf, Yafang Shao

This patch introduces a helper to recharge the pages of a hash map. Since
we already know how the hash map is allocated and freed, we also know how
to charge and uncharge it.

First, we pre charge to the new memcg; if the pre charge succeeds, we
uncharge from the old memcg. Finally we do the post charge to the new
memcg, in which we modify the counters of the memcgs.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/hashtab.c | 74 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 74 insertions(+)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 17fb69c0e0dc..fe61976262ee 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -11,6 +11,7 @@
 #include <uapi/linux/btf.h>
 #include <linux/rcupdate_trace.h>
 #include <linux/btf_ids.h>
+#include <linux/memcontrol.h>
 #include "percpu_freelist.h"
 #include "bpf_lru_list.h"
 #include "map_in_map.h"
@@ -1499,6 +1500,75 @@ static void htab_map_free(struct bpf_map *map)
 	kfree(htab);
 }
 
+#ifdef CONFIG_MEMCG_KMEM
+static bool htab_map_memcg_recharge(struct bpf_map *map)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	struct mem_cgroup *old = map->memcg;
+	int i;
+
+	/*
+	 * Although the bpf map's offline memcg has been reparented, there
+	 * is still a reference on it, so it is safe to access.
+	 */
+	if (!old)
+		return false;
+
+	/* Pre charge to the new memcg */
+	if (!krecharge(htab, MEMCG_KMEM_PRE_CHARGE))
+		return false;
+
+	if (!kvrecharge(htab->buckets, MEMCG_KMEM_PRE_CHARGE))
+		goto out_k;
+
+	if (!recharge_percpu(htab->extra_elems, MEMCG_KMEM_PRE_CHARGE))
+		goto out_kv;
+
+	for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++) {
+		if (!recharge_percpu(htab->map_locked[i], MEMCG_KMEM_PRE_CHARGE))
+			goto out_p;
+	}
+
+	/* Uncharge from the old memcg. */
+	krecharge(htab, MEMCG_KMEM_UNCHARGE);
+	kvrecharge(htab->buckets, MEMCG_KMEM_UNCHARGE);
+	recharge_percpu(htab->extra_elems, MEMCG_KMEM_UNCHARGE);
+	for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
+		recharge_percpu(htab->map_locked[i], MEMCG_KMEM_UNCHARGE);
+
+	/* Release the old memcg */
+	bpf_map_release_memcg(map);
+
+	/* Post charge to the new memcg */
+	krecharge(htab, MEMCG_KMEM_POST_CHARGE);
+	kvrecharge(htab->buckets, MEMCG_KMEM_POST_CHARGE);
+	recharge_percpu(htab->extra_elems, MEMCG_KMEM_POST_CHARGE);
+	for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
+		recharge_percpu(htab->map_locked[i], MEMCG_KMEM_POST_CHARGE);
+
+	/* Save the new memcg */
+	bpf_map_save_memcg(map);
+
+	return true;
+
+out_p:
+	for (; i > 0; i--)
+		recharge_percpu(htab->map_locked[i], MEMCG_KMEM_CHARGE_ERR);
+	recharge_percpu(htab->extra_elems, MEMCG_KMEM_CHARGE_ERR);
+out_kv:
+	kvrecharge(htab->buckets, MEMCG_KMEM_CHARGE_ERR);
+out_k:
+	krecharge(htab, MEMCG_KMEM_CHARGE_ERR);
+
+	return false;
+}
+#else
+static bool htab_map_memcg_recharge(struct bpf_map *map)
+{
+	return true;
+}
+#endif
+
 static void htab_map_seq_show_elem(struct bpf_map *map, void *key,
 				   struct seq_file *m)
 {
@@ -2152,6 +2222,7 @@ const struct bpf_map_ops htab_map_ops = {
 	.map_alloc_check = htab_map_alloc_check,
 	.map_alloc = htab_map_alloc,
 	.map_free = htab_map_free,
+	.map_memcg_recharge = htab_map_memcg_recharge,
 	.map_get_next_key = htab_map_get_next_key,
 	.map_release_uref = htab_map_free_timers,
 	.map_lookup_elem = htab_map_lookup_elem,
@@ -2172,6 +2243,7 @@ const struct bpf_map_ops htab_lru_map_ops = {
 	.map_alloc_check = htab_map_alloc_check,
 	.map_alloc = htab_map_alloc,
 	.map_free = htab_map_free,
+	.map_memcg_recharge = htab_map_memcg_recharge,
 	.map_get_next_key = htab_map_get_next_key,
 	.map_release_uref = htab_map_free_timers,
 	.map_lookup_elem = htab_lru_map_lookup_elem,
@@ -2325,6 +2397,7 @@ const struct bpf_map_ops htab_percpu_map_ops = {
 	.map_alloc_check = htab_map_alloc_check,
 	.map_alloc = htab_map_alloc,
 	.map_free = htab_map_free,
+	.map_memcg_recharge = htab_map_memcg_recharge,
 	.map_get_next_key = htab_map_get_next_key,
 	.map_lookup_elem = htab_percpu_map_lookup_elem,
 	.map_lookup_and_delete_elem = htab_percpu_map_lookup_and_delete_elem,
@@ -2344,6 +2417,7 @@ const struct bpf_map_ops htab_lru_percpu_map_ops = {
 	.map_alloc_check = htab_map_alloc_check,
 	.map_alloc = htab_map_alloc,
 	.map_free = htab_map_free,
+	.map_memcg_recharge = htab_map_memcg_recharge,
 	.map_get_next_key = htab_map_get_next_key,
 	.map_lookup_elem = htab_lru_percpu_map_lookup_elem,
 	.map_lookup_and_delete_elem = htab_lru_percpu_map_lookup_and_delete_elem,
-- 
2.17.1



* Re: [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map
  2022-06-19 15:50 [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Yafang Shao
                   ` (9 preceding siblings ...)
  2022-06-19 15:50 ` [RFC PATCH bpf-next 10/10] bpf: Support recharge for hash map Yafang Shao
@ 2022-06-21 23:28 ` Alexei Starovoitov
  2022-06-22 14:03   ` Yafang Shao
  2022-06-23  3:29 ` Roman Gushchin
  11 siblings, 1 reply; 30+ messages in thread
From: Alexei Starovoitov @ 2022-06-21 23:28 UTC (permalink / raw)
  To: Yafang Shao
  Cc: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, quentin, hannes, mhocko, roman.gushchin, shakeelb,
	songmuchun, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka,
	linux-mm, bpf

On Sun, Jun 19, 2022 at 03:50:22PM +0000, Yafang Shao wrote:
> After switching to memcg-based bpf memory accounting, the bpf memory is
> charged to the loader's memcg by default, that causes unexpected issues for
> us. For instance, the container of the loader may be restarted after
> pinning progs and maps, but the bpf memcg will be left and pinned on the
> system. Once the loader's new generation container is started, the leftover
> pages won't be charged to it. That inconsistent behavior will make trouble
> for the memory resource management for this container.
> 
> In the past few days, I have proposed two patchsets[1][2] to try to resolve
> this issue, but in both of these two proposals the user code has to be
> changed to adapt to it, that is a pain for us. This patchset relieves the
> pain by triggering the recharge in libbpf. It also addresses Roman's
> critical comments.
> 
> The key point we can avoid changing the user code is that there's a resue
> path in libbpf. Once the bpf container is restarted again, it will try
> to re-run the required bpf programs, if the bpf programs are the same with
> the already pinned one, it will reuse them.
> 
> To make sure we either recharge all of them successfully or don't recharge
> any of them. The recharge prograss is divided into three steps:
>   - Pre charge to the new generation 
>     To make sure once we uncharge from the old generation, we can always
>     charge to the new generation succeesfully. If we can't pre charge to
>     the new generation, we won't allow it to be uncharged from the old
>     generation.
>   - Uncharge from the old generation
>     After pre charge to the new generation, we can uncharge from the old
>     generation.
>   - Post charge to the new generation
>     Finnaly we can set pages' memcg_data to the new generation. 
> In the pre charge step, we may succeed to charge some addresses, but fail
> to charge a new address, then we should uncharge the already charged
> addresses, so another recharge-err step is instroduced.
>  
> This pachset has finished recharging bpf hash map. which is mostly used
> by our bpf services. The other maps hasn't been implemented yet. The bpf
> progs hasn't been implemented neither.

... and the implementation in patch 10 is missing recharge of htab->elems
which consumes the most memory.
That begs the question how the whole set was tested.

Even if that bug is fixed this recharge approach works only with preallocated
maps. Their use will be reduced in the future.
Maps with kmalloc won't work with this multi step approach.
There is no place where bpf_map_release_memcg can be done without racing
with concurrent kmallocs from bpf program side.

Despite being painful for user space the user space has to deal with it.
It created a container, charged its memcg, then destroyed the container,
but didn't free the bpf map. It's a user bug. It has to free the map.
The user space can use map-in-map solution. In the new container the new bpf map
can be allocated (and charged to new memcg), the data copied from old map,
and then inner map replaced. At this point old map can be freed and memcg
uncharged.
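
A rough user-space sketch of that migrate-and-swap (assuming a u32 -> u64
hash map; old_fd, outer_fd and max_entries are placeholders):

	__u32 key, next_key, slot = 0, *cur = NULL;
	__u64 val;
	int new_fd = bpf_map_create(BPF_MAP_TYPE_HASH, "new_inner",
				    sizeof(key), sizeof(val), max_entries, NULL);

	/* Copy every element from the old map into the new (new-memcg) map. */
	while (bpf_map_get_next_key(old_fd, cur, &next_key) == 0) {
		if (!bpf_map_lookup_elem(old_fd, &next_key, &val))
			bpf_map_update_elem(new_fd, &next_key, &val, BPF_ANY);
		key = next_key;
		cur = &key;
	}
	/* Swap the inner map in the outer map-in-map, then free the old map. */
	bpf_map_update_elem(outer_fd, &slot, &new_fd, BPF_ANY);
	close(old_fd);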
To make things easier we can consider introducing an anon FD that points to a memcg.
Then user can pick a memcg at map creation time instead of get_mem_cgroup_from_current.


* Re: [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map
  2022-06-21 23:28 ` [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Alexei Starovoitov
@ 2022-06-22 14:03   ` Yafang Shao
  0 siblings, 0 replies; 30+ messages in thread
From: Yafang Shao @ 2022-06-22 14:03 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Quentin Monnet, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, songmuchun, Andrew Morton, Christoph Lameter,
	penberg, David Rientjes, iamjoonsoo.kim, Vlastimil Babka,
	Linux MM, bpf

On Wed, Jun 22, 2022 at 7:28 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Sun, Jun 19, 2022 at 03:50:22PM +0000, Yafang Shao wrote:
> > After switching to memcg-based bpf memory accounting, the bpf memory is
> > charged to the loader's memcg by default, that causes unexpected issues for
> > us. For instance, the container of the loader may be restarted after
> > pinning progs and maps, but the bpf memcg will be left and pinned on the
> > system. Once the loader's new generation container is started, the leftover
> > pages won't be charged to it. That inconsistent behavior will make trouble
> > for the memory resource management for this container.
> >
> > In the past few days, I have proposed two patchsets[1][2] to try to resolve
> > this issue, but in both of these two proposals the user code has to be
> > changed to adapt to it, that is a pain for us. This patchset relieves the
> > pain by triggering the recharge in libbpf. It also addresses Roman's
> > critical comments.
> >
> > The key point we can avoid changing the user code is that there's a resue
> > path in libbpf. Once the bpf container is restarted again, it will try
> > to re-run the required bpf programs, if the bpf programs are the same with
> > the already pinned one, it will reuse them.
> >
> > To make sure we either recharge all of them successfully or don't recharge
> > any of them. The recharge prograss is divided into three steps:
> >   - Pre charge to the new generation
> >     To make sure once we uncharge from the old generation, we can always
> >     charge to the new generation succeesfully. If we can't pre charge to
> >     the new generation, we won't allow it to be uncharged from the old
> >     generation.
> >   - Uncharge from the old generation
> >     After pre charge to the new generation, we can uncharge from the old
> >     generation.
> >   - Post charge to the new generation
> >     Finnaly we can set pages' memcg_data to the new generation.
> > In the pre charge step, we may succeed to charge some addresses, but fail
> > to charge a new address, then we should uncharge the already charged
> > addresses, so another recharge-err step is instroduced.
> >
> > This pachset has finished recharging bpf hash map. which is mostly used
> > by our bpf services. The other maps hasn't been implemented yet. The bpf
> > progs hasn't been implemented neither.
>
> ... and the implementation in patch 10 is missing recharge of htab->elems
> which consumes the most memory.

You got to the point. I skipped htab->elems intentionally, for a couple of
reasons. You have pointed out one of them: it is complex for
non-preallocated memory.
Another reason is memcg reparenting. For example, once the memcg goes
offline, its pages and other related memory objects like obj_cgroup
will be reparented, but, unlike them, map->memcg still points to the offline
memcg. There _must_ be issues lurking there, but I haven't figured out what
they may cause for the user. One issue I can imagine is that the memcg limit
may not work well in this case.  That is a separate issue, though, and I'm
still working on it.

> That begs the question how the whole set was tested.

In my test case, htab->buckets is around 120MB, which is large enough for
the comparison.  That amount is recharged without any kernel
warnings, so I posted this incomplete patchset to get feedback and
check whether I'm heading in the right direction.

>
> Even if that bug is fixed this recharge approach works only with preallocated
> maps. Their use will be reduced in the future.
> Maps with kmalloc won't work with this multi step approach.
> There is no place where bpf_map_release_memcg can be done without racing
> with concurrent kmallocs from bpf program side.

Right, that is really an issue.

>
> Despite being painful for user space the user space has to deal with it.
> It created a container, charged its memcg, then destroyed the container,
> but didn't free the bpf map. It's a user bug. It has to free the map.

It seems that I didn't describe this case clearly.
It is not a user bug but a use case in the k8s environment. For
example, after the user has deployed its container, it may need to
upgrade its user application or bpf prog; the user then upgrades
its container, meaning the old container is destroyed and a
new container is created.  During this upgrade process, the bpf
programs can continue to work without interruption because they are
pinned.

> The user space can use map-in-map solution. In the new container the new bpf map
> can be allocated (and charged to new memcg), the data copied from old map,
> and then inner map replaced. At this point old map can be freed and memcg
> uncharged.

The map-in-map solution may work for some cases, but it will allocate
additional memory during the switch, which is not acceptable if the bpf map
already holds a lot of memory.

> To make things easier we can consider introducing an anon FD that points to a memcg.
> Then user can pick a memcg at map creation time instead of get_mem_cgroup_from_current.

This may be a workable solution.  The life cycle of a pinned bpf
program is independent of the application which loads and pins it, so
it is reasonable to introduce an independent memcg to manage the bpf
memory, for example a memcg like /sys/fs/cgroup/memory/pinned_bpf
which is independent of the k8s pod.
I will look into it. Thanks for the suggestion.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 03/10] mm, memcg: Add new helper obj_cgroup_from_current()
  2022-06-19 15:50 ` [RFC PATCH bpf-next 03/10] mm, memcg: Add new helper obj_cgroup_from_current() Yafang Shao
@ 2022-06-23  3:01   ` Roman Gushchin
  2022-06-25 13:54     ` Yafang Shao
  0 siblings, 1 reply; 30+ messages in thread
From: Roman Gushchin @ 2022-06-23  3:01 UTC (permalink / raw)
  To: Yafang Shao
  Cc: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, quentin, hannes, mhocko, shakeelb, songmuchun, akpm, cl,
	penberg, rientjes, iamjoonsoo.kim, vbabka, linux-mm, bpf

On Sun, Jun 19, 2022 at 03:50:25PM +0000, Yafang Shao wrote:
> The difference between get_obj_cgroup_from_current() and obj_cgroup_from_current()
> is that the later one doesn't add objcg's refcnt.
> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  include/linux/memcontrol.h |  1 +
>  mm/memcontrol.c            | 24 ++++++++++++++++++++++++
>  2 files changed, 25 insertions(+)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index cf074156c6ac..402b42670bcd 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1703,6 +1703,7 @@ bool mem_cgroup_kmem_disabled(void);
>  int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order);
>  void __memcg_kmem_uncharge_page(struct page *page, int order);
>  
> +struct obj_cgroup *obj_cgroup_from_current(void);
>  struct obj_cgroup *get_obj_cgroup_from_current(void);
>  struct obj_cgroup *get_obj_cgroup_from_page(struct page *page);
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index abec50f31fe6..350a7849dac3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2950,6 +2950,30 @@ struct obj_cgroup *get_obj_cgroup_from_page(struct page *page)
>  	return objcg;
>  }
>  
> +__always_inline struct obj_cgroup *obj_cgroup_from_current(void)
> +{
> +	struct obj_cgroup *objcg = NULL;
> +	struct mem_cgroup *memcg;
> +
> +	if (memcg_kmem_bypass())
> +		return NULL;
> +
> +	rcu_read_lock();
> +	if (unlikely(active_memcg()))
> +		memcg = active_memcg();
> +	else
> +		memcg = mem_cgroup_from_task(current);
> +
> +	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
> +		objcg = rcu_dereference(memcg->objcg);
> +		if (objcg)
> +			break;
> +	}
> +	rcu_read_unlock();

Hm, what prevents the objcg from being released here? Under which conditions
is it safe to call this?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map
  2022-06-19 15:50 [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Yafang Shao
                   ` (10 preceding siblings ...)
  2022-06-21 23:28 ` [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Alexei Starovoitov
@ 2022-06-23  3:29 ` Roman Gushchin
  2022-06-25  3:26   ` Yafang Shao
  11 siblings, 1 reply; 30+ messages in thread
From: Roman Gushchin @ 2022-06-23  3:29 UTC (permalink / raw)
  To: Yafang Shao
  Cc: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, quentin, hannes, mhocko, shakeelb, songmuchun, akpm, cl,
	penberg, rientjes, iamjoonsoo.kim, vbabka, linux-mm, bpf

On Sun, Jun 19, 2022 at 03:50:22PM +0000, Yafang Shao wrote:
> After switching to memcg-based bpf memory accounting, the bpf memory is
> charged to the loader's memcg by default, that causes unexpected issues for
> us. For instance, the container of the loader may be restarted after
> pinning progs and maps, but the bpf memcg will be left and pinned on the
> system. Once the loader's new generation container is started, the leftover
> pages won't be charged to it. That inconsistent behavior will make trouble
> for the memory resource management for this container.
> 
> In the past few days, I have proposed two patchsets[1][2] to try to resolve
> this issue, but in both of these two proposals the user code has to be
> changed to adapt to it, that is a pain for us. This patchset relieves the
> pain by triggering the recharge in libbpf. It also addresses Roman's
> critical comments.
> 
> The key point we can avoid changing the user code is that there's a resue
> path in libbpf. Once the bpf container is restarted again, it will try
> to re-run the required bpf programs, if the bpf programs are the same with
> the already pinned one, it will reuse them.
> 
> To make sure we either recharge all of them successfully or don't recharge
> any of them. The recharge prograss is divided into three steps:
>   - Pre charge to the new generation 
>     To make sure once we uncharge from the old generation, we can always
>     charge to the new generation succeesfully. If we can't pre charge to
>     the new generation, we won't allow it to be uncharged from the old
>     generation.
>   - Uncharge from the old generation
>     After pre charge to the new generation, we can uncharge from the old
>     generation.
>   - Post charge to the new generation
>     Finnaly we can set pages' memcg_data to the new generation. 
> In the pre charge step, we may succeed to charge some addresses, but fail
> to charge a new address, then we should uncharge the already charged
> addresses, so another recharge-err step is instroduced.
>  
> This pachset has finished recharging bpf hash map. which is mostly used
> by our bpf services. The other maps hasn't been implemented yet. The bpf
> progs hasn't been implemented neither.

Without going into the implementation details, the overall approach looks
ok to me. But it adds complexity and code to several different subsystems,
and I'm 100% sure it's not worth it if we are talking about partial support
of a single map type. Are you committed to implementing the recharging
for all/most map types and progs and supporting this code in the future?

I still feel you're trying to solve a userspace problem in the kernel.
Not saying it can't be solved this way, but it seems like there are
easier options.

Thanks!

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 07/10] mm: Add helper to recharge percpu address
  2022-06-19 15:50 ` [RFC PATCH bpf-next 07/10] mm: Add helper to recharge percpu address Yafang Shao
@ 2022-06-23  5:25   ` Dennis Zhou
  2022-06-27  3:04       ` Yafang Shao
  0 siblings, 1 reply; 30+ messages in thread
From: Dennis Zhou @ 2022-06-23  5:25 UTC (permalink / raw)
  To: Yafang Shao
  Cc: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, quentin, hannes, mhocko, roman.gushchin, shakeelb,
	songmuchun, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka,
	linux-mm, bpf

Hello,

On Sun, Jun 19, 2022 at 03:50:29PM +0000, Yafang Shao wrote:
> This patch introduces a helper to recharge the corresponding pages of
> a given percpu address. It is similar with how to recharge a kmalloc'ed
> address.
> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  include/linux/percpu.h |  1 +
>  mm/percpu.c            | 98 ++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 99 insertions(+)
> 
> diff --git a/include/linux/percpu.h b/include/linux/percpu.h
> index f1ec5ad1351c..e88429410179 100644
> --- a/include/linux/percpu.h
> +++ b/include/linux/percpu.h
> @@ -128,6 +128,7 @@ extern void __init setup_per_cpu_areas(void);
>  extern void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp) __alloc_size(1);
>  extern void __percpu *__alloc_percpu(size_t size, size_t align) __alloc_size(1);
>  extern void free_percpu(void __percpu *__pdata);
> +bool recharge_percpu(void __percpu *__pdata, int step);

Nit: can you add extern to keep the file consistent?

>  extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
>  
>  #define alloc_percpu_gfp(type, gfp)					\
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 3633eeefaa0d..fd81f4d79f2f 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -2310,6 +2310,104 @@ void free_percpu(void __percpu *ptr)
>  }
>  EXPORT_SYMBOL_GPL(free_percpu);
>  
> +#ifdef CONFIG_MEMCG_KMEM
> +bool recharge_percpu(void __percpu *ptr, int step)
> +{
> +	int bit_off, off, bits, size, end;
> +	struct obj_cgroup *objcg_old;
> +	struct obj_cgroup *objcg_new;
> +	struct pcpu_chunk *chunk;
> +	unsigned long flags;
> +	void *addr;
> +
> +	WARN_ON(!in_task());
> +
> +	if (!ptr)
> +		return true;
> +
> +	addr = __pcpu_ptr_to_addr(ptr);
> +	spin_lock_irqsave(&pcpu_lock, flags);
> +	chunk = pcpu_chunk_addr_search(addr);
> +	off = addr - chunk->base_addr;
> +	objcg_old = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
> +	if (!objcg_old && step != MEMCG_KMEM_POST_CHARGE) {
> +		spin_unlock_irqrestore(&pcpu_lock, flags);
> +		return true;
> +	}
> +
> +	bit_off = off / PCPU_MIN_ALLOC_SIZE;
> +	/* find end index */
> +	end = find_next_bit(chunk->bound_map, pcpu_chunk_map_bits(chunk),
> +			bit_off + 1);
> +	bits = end - bit_off;
> +	size = bits * PCPU_MIN_ALLOC_SIZE;
> +
> +	switch (step) {
> +	case MEMCG_KMEM_PRE_CHARGE:
> +		objcg_new = get_obj_cgroup_from_current();
> +		WARN_ON(!objcg_new);
> +		if (obj_cgroup_charge(objcg_new, GFP_KERNEL,
> +				      size * num_possible_cpus())) {
> +			obj_cgroup_put(objcg_new);
> +			spin_unlock_irqrestore(&pcpu_lock, flags);
> +			return false;
> +		}
> +		break;
> +	case MEMCG_KMEM_UNCHARGE:
> +		obj_cgroup_uncharge(objcg_old, size * num_possible_cpus());
> +		rcu_read_lock();
> +		mod_memcg_state(obj_cgroup_memcg(objcg_old), MEMCG_PERCPU_B,
> +			-(size * num_possible_cpus()));
> +		rcu_read_unlock();
> +		chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL;
> +		obj_cgroup_put(objcg_old);
> +		break;
> +	case MEMCG_KMEM_POST_CHARGE:
> +		rcu_read_lock();
> +		chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = obj_cgroup_from_current();
> +		mod_memcg_state(mem_cgroup_from_task(current), MEMCG_PERCPU_B,
> +			(size * num_possible_cpus()));
> +		rcu_read_unlock();
> +		break;
> +	case MEMCG_KMEM_CHARGE_ERR:
> +		/*
> +		 * In case fail to charge to the new one in the pre charge state,
> +		 * for example, we have pre-charged one memcg successfully but fail
> +		 * to pre-charge the second memcg, then we should uncharge the first
> +		 * memcg.
> +		 */
> +		objcg_new = obj_cgroup_from_current();
> +		obj_cgroup_uncharge(objcg_new, size * num_possible_cpus());
> +		obj_cgroup_put(objcg_new);
> +		rcu_read_lock();
> +		mod_memcg_state(obj_cgroup_memcg(objcg_new), MEMCG_PERCPU_B,
> +			-(size * num_possible_cpus()));
> +		rcu_read_unlock();
> +
> +		break;
> +	}

I'm not really the biggest fan of this step stuff. I see why you're
doing it: you want all-or-nothing recharging of the percpu bpf
maps. Is there a way to have percpu own this logic and attempt the
all-or-nothing sequence itself? I realize bpf is likely the largest percpu
user, but the recharge_percpu() API seems generic enough that potential
future users shouldn't be forced to open-code it.

> +
> +	spin_unlock_irqrestore(&pcpu_lock, flags);
> +
> +	return true;
> +}
> +EXPORT_SYMBOL(recharge_percpu);
> +
> +#else /* CONFIG_MEMCG_KMEM */
> +
> +bool charge_percpu(void __percpu *ptr, bool charge)
> +{
> +	return true;
> +}
> +EXPORT_SYMBOL(charge_percpu);
> +
> +void uncharge_percpu(void __percpu *ptr)
> +{
> +}

I'm guessing this is supposed to be recharge_percpu() not
(un)charge_percpu().

> +EXPORT_SYMBOL(uncharge_percpu);
> +
> +#endif /* CONFIG_MEMCG_KMEM */
> +
>  bool __is_kernel_percpu_address(unsigned long addr, unsigned long *can_addr)
>  {
>  #ifdef CONFIG_SMP
> -- 
> 2.17.1
> 
> 

Thanks,
Dennis

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map
  2022-06-23  3:29 ` Roman Gushchin
@ 2022-06-25  3:26   ` Yafang Shao
  2022-06-26  3:28     ` Roman Gushchin
  2022-06-27  0:40     ` Alexei Starovoitov
  0 siblings, 2 replies; 30+ messages in thread
From: Yafang Shao @ 2022-06-25  3:26 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Quentin Monnet, Johannes Weiner, Michal Hocko, Shakeel Butt,
	songmuchun, Andrew Morton, Christoph Lameter, penberg,
	David Rientjes, iamjoonsoo.kim, Vlastimil Babka, Linux MM, bpf

On Thu, Jun 23, 2022 at 11:29 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> On Sun, Jun 19, 2022 at 03:50:22PM +0000, Yafang Shao wrote:
> > After switching to memcg-based bpf memory accounting, the bpf memory is
> > charged to the loader's memcg by default, that causes unexpected issues for
> > us. For instance, the container of the loader may be restarted after
> > pinning progs and maps, but the bpf memcg will be left and pinned on the
> > system. Once the loader's new generation container is started, the leftover
> > pages won't be charged to it. That inconsistent behavior will make trouble
> > for the memory resource management for this container.
> >
> > In the past few days, I have proposed two patchsets[1][2] to try to resolve
> > this issue, but in both of these two proposals the user code has to be
> > changed to adapt to it, that is a pain for us. This patchset relieves the
> > pain by triggering the recharge in libbpf. It also addresses Roman's
> > critical comments.
> >
> > The key point we can avoid changing the user code is that there's a resue
> > path in libbpf. Once the bpf container is restarted again, it will try
> > to re-run the required bpf programs, if the bpf programs are the same with
> > the already pinned one, it will reuse them.
> >
> > To make sure we either recharge all of them successfully or don't recharge
> > any of them. The recharge prograss is divided into three steps:
> >   - Pre charge to the new generation
> >     To make sure once we uncharge from the old generation, we can always
> >     charge to the new generation succeesfully. If we can't pre charge to
> >     the new generation, we won't allow it to be uncharged from the old
> >     generation.
> >   - Uncharge from the old generation
> >     After pre charge to the new generation, we can uncharge from the old
> >     generation.
> >   - Post charge to the new generation
> >     Finnaly we can set pages' memcg_data to the new generation.
> > In the pre charge step, we may succeed to charge some addresses, but fail
> > to charge a new address, then we should uncharge the already charged
> > addresses, so another recharge-err step is instroduced.
> >
> > This pachset has finished recharging bpf hash map. which is mostly used
> > by our bpf services. The other maps hasn't been implemented yet. The bpf
> > progs hasn't been implemented neither.
>
> Without going into the implementation details, the overall approach looks
> ok to me. But it adds complexity and code into several different subsystems,
> and I'm 100% sure it's not worth it if we talking about a partial support
> of a single map type. Are you committed to implement the recharging
> for all/most map types and progs and support this code in the future?
>

I'm planning to support it for all map types and progs. Regarding the
progs, it seems that we have to introduce a new UAPI for the user to
do the recharge, because there's no similar reuse path in libbpf.

Our company is a heavy bpf user. We have many bpf programs running in
our production environment, including networking bpf,
tracing/profiling bpf, and some other bpf programs which are not yet
supported in the upstream kernel; for example, we're even trying the
sched-bpf[1] proposed by you (and you may remember that I reviewed
your patchset).  Most of the networking bpf programs, e.g. gateway-bpf,
edt-bpf, loadbalance-bpf and veth-bpf, are pinned on the system.

It is a trend that bpf is being introduced into more and more subsystems,
and thus there is no doubt that a bpf patchset will involve many
subsystems.

That means I will stay continuously active in these areas for the near
future, several years at least.

[1]. https://lwn.net/Articles/869433/

> I'm still feeling you trying to solve a userspace problem in the kernel.

Your feeling can be converted to a simple question: is a process running
in a container allowed to pin a bpf program?  The answer to
this simple question can help us understand whether it is a user
bug or a kernel bug.

I think you will agree with me that there's definitely no reason to
refuse to let a containerized process pin a bpf program.  And then we
find that the pinned bpf program doesn't cooperate well with
memcg.  One kernel feature can't work together with another kernel
feature, and there's not even an interface provided for the user to
adjust it. The user either doesn't pin the bpf program or disables
kmemcg.   Isn't that a kernel bug?

You may wonder why these two features can't cooperate.  I will
explain it in detail.  It is a long story.

It should begin with why we introduced bpf pinning. We pin a program because
sometimes the lifecycle of the user application is different from that of the
bpf program, or there's no user agent at all.  To keep it
simple, I will take the no-user-agent (agent exits after pinning the bpf
program) case as an example.

Now think about what happens if the agent which pins the bpf
program has a memcg. Whether or not the agent destroys the memcg
when it exits, the memcg will not disappear, because it is pinned by
the bpf program. To make it easy, let's assume the memcg isn't being
destroyed, IOW, it stays online.

Such an online memcg is no longer populated, but it is still being remotely
charged (if the map is non-preallocated); it looks like a ghost. Now we
will look into the details to see what happens to this ghost
memcg.

If this ghost memcg is limited, it will introduce many issues AFAICS.
Firstly, the memcg will be force-charged[2], and I think I don't need
to explain the reason to you.
Even worse, it force-charges silently without any event,
because the allocation bails out at,
        if (!gfpflags_allow_blocking(gfp_mask))
            goto nomem;
And then all memcg events are skipped. So at the very least we should
introduce a force-charge event,
    force:
+      memcg_memory_event(mem_over_limit, MEMCG_FORCE_CHARGE);
        page_counter_charge(&memcg->memory, nr_pages);

And then we should allow alloc_htab_elem() to fail,
                l_new = bpf_map_kmalloc_node(&htab->map, htab->elem_size,
-                                            GFP_ATOMIC | __GFP_NOWARN,
+                                            __GFP_ATOMIC | __GFP_KSWAPD_RECLAIM | __GFP_NOWARN,
                                             htab->map.numa_node);
And then we'd better introduce an improvement for memcg,
+      /*
+       * We should wake up async memcg reclaim first,
+       * in case there will be no direct memcg reclaim for a long time.
+       * We can either introduce async memcg reclaim
+       * or modify kswapd to reclaim a specific memcg.
+       */
+       if (gfp_mask & __GFP_KSWAPD_RECLAIM)
+            wake_up_async_memcg_reclaim();
         if (!gfpflags_allow_blocking(gfp_mask))
                goto nomem;

And so on.

It's bad luck that there are so many issues in memcg, but it may
also be that I don't have a deep enough understanding of memcg.

I have to clarify that these issues are not caused by
memcg-based bpf accounting, but exposed by it.

[ Time for lunch here, so I have to stop. ]

[2]. https://elixir.bootlin.com/linux/v5.19-rc3/source/mm/memcontrol.c#L2685


> Not saying it can't be solved this way, but it seems like there are
> easier options.
>

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 03/10] mm, memcg: Add new helper obj_cgroup_from_current()
  2022-06-23  3:01   ` Roman Gushchin
@ 2022-06-25 13:54     ` Yafang Shao
  2022-06-26  1:52       ` Roman Gushchin
  0 siblings, 1 reply; 30+ messages in thread
From: Yafang Shao @ 2022-06-25 13:54 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Quentin Monnet, Johannes Weiner, Michal Hocko, Shakeel Butt,
	songmuchun, Andrew Morton, Christoph Lameter, penberg,
	David Rientjes, iamjoonsoo.kim, Vlastimil Babka, Linux MM, bpf

On Thu, Jun 23, 2022 at 11:01 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> On Sun, Jun 19, 2022 at 03:50:25PM +0000, Yafang Shao wrote:
> > The difference between get_obj_cgroup_from_current() and obj_cgroup_from_current()
> > is that the later one doesn't add objcg's refcnt.
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  include/linux/memcontrol.h |  1 +
> >  mm/memcontrol.c            | 24 ++++++++++++++++++++++++
> >  2 files changed, 25 insertions(+)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index cf074156c6ac..402b42670bcd 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -1703,6 +1703,7 @@ bool mem_cgroup_kmem_disabled(void);
> >  int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order);
> >  void __memcg_kmem_uncharge_page(struct page *page, int order);
> >
> > +struct obj_cgroup *obj_cgroup_from_current(void);
> >  struct obj_cgroup *get_obj_cgroup_from_current(void);
> >  struct obj_cgroup *get_obj_cgroup_from_page(struct page *page);
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index abec50f31fe6..350a7849dac3 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2950,6 +2950,30 @@ struct obj_cgroup *get_obj_cgroup_from_page(struct page *page)
> >       return objcg;
> >  }
> >
> > +__always_inline struct obj_cgroup *obj_cgroup_from_current(void)
> > +{
> > +     struct obj_cgroup *objcg = NULL;
> > +     struct mem_cgroup *memcg;
> > +
> > +     if (memcg_kmem_bypass())
> > +             return NULL;
> > +
> > +     rcu_read_lock();
> > +     if (unlikely(active_memcg()))
> > +             memcg = active_memcg();
> > +     else
> > +             memcg = mem_cgroup_from_task(current);
> > +
> > +     for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
> > +             objcg = rcu_dereference(memcg->objcg);
> > +             if (objcg)
> > +                     break;
> > +     }
> > +     rcu_read_unlock();
>
> Hm, what prevents the objcg from being released here? Under which conditions
> it's safe to call it?

obj_cgroup_from_current() is used when we know the objcg's refcnt has
already been incremented.
For example, in my case it is called after we have already called
get_parent_mem_cgroup().
I should add a comment or a WARN_ON() to this function.
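For instance, the comment could read something like the following (the exact
safety rule is what needs to be spelled out; this wording is only a sketch,
not part of the posted patch):

/*
 * obj_cgroup_from_current - return the current task's objcg without
 * taking a reference on it.
 *
 * The caller must already hold a reference that keeps the returned objcg
 * alive for as long as the pointer is used, e.g. a reference on the
 * corresponding memcg obtained via get_mem_cgroup_from_current() or a
 * prior get_obj_cgroup_from_current().  Otherwise the objcg may be
 * released as soon as the internal rcu_read_unlock() is reached.
 */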

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 07/10] mm: Add helper to recharge percpu address
  2022-06-23  5:25   ` Dennis Zhou
@ 2022-06-27  3:04       ` Yafang Shao
  0 siblings, 0 replies; 30+ messages in thread
From: Yafang Shao @ 2022-06-25 14:18 UTC (permalink / raw)
  To: Dennis Zhou, kbuild test robot, kbuild-all
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Quentin Monnet, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, songmuchun, Andrew Morton, Christoph Lameter,
	penberg, David Rientjes, iamjoonsoo.kim, Vlastimil Babka,
	Linux MM, bpf

On Thu, Jun 23, 2022 at 1:25 PM Dennis Zhou <dennisszhou@gmail.com> wrote:
>
> Hello,
>
> On Sun, Jun 19, 2022 at 03:50:29PM +0000, Yafang Shao wrote:
> > This patch introduces a helper to recharge the corresponding pages of
> > a given percpu address. It is similar with how to recharge a kmalloc'ed
> > address.
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  include/linux/percpu.h |  1 +
> >  mm/percpu.c            | 98 ++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 99 insertions(+)
> >
> > diff --git a/include/linux/percpu.h b/include/linux/percpu.h
> > index f1ec5ad1351c..e88429410179 100644
> > --- a/include/linux/percpu.h
> > +++ b/include/linux/percpu.h
> > @@ -128,6 +128,7 @@ extern void __init setup_per_cpu_areas(void);
> >  extern void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp) __alloc_size(1);
> >  extern void __percpu *__alloc_percpu(size_t size, size_t align) __alloc_size(1);
> >  extern void free_percpu(void __percpu *__pdata);
> > +bool recharge_percpu(void __percpu *__pdata, int step);
>
> Nit: can you add extern to keep the file consistent.
>

Sure, I will do it.

> >  extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
> >
> >  #define alloc_percpu_gfp(type, gfp)                                  \
> > diff --git a/mm/percpu.c b/mm/percpu.c
> > index 3633eeefaa0d..fd81f4d79f2f 100644
> > --- a/mm/percpu.c
> > +++ b/mm/percpu.c
> > @@ -2310,6 +2310,104 @@ void free_percpu(void __percpu *ptr)
> >  }
> >  EXPORT_SYMBOL_GPL(free_percpu);
> >
> > +#ifdef CONFIG_MEMCG_KMEM
> > +bool recharge_percpu(void __percpu *ptr, int step)
> > +{
> > +     int bit_off, off, bits, size, end;
> > +     struct obj_cgroup *objcg_old;
> > +     struct obj_cgroup *objcg_new;
> > +     struct pcpu_chunk *chunk;
> > +     unsigned long flags;
> > +     void *addr;
> > +
> > +     WARN_ON(!in_task());
> > +
> > +     if (!ptr)
> > +             return true;
> > +
> > +     addr = __pcpu_ptr_to_addr(ptr);
> > +     spin_lock_irqsave(&pcpu_lock, flags);
> > +     chunk = pcpu_chunk_addr_search(addr);
> > +     off = addr - chunk->base_addr;
> > +     objcg_old = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
> > +     if (!objcg_old && step != MEMCG_KMEM_POST_CHARGE) {
> > +             spin_unlock_irqrestore(&pcpu_lock, flags);
> > +             return true;
> > +     }
> > +
> > +     bit_off = off / PCPU_MIN_ALLOC_SIZE;
> > +     /* find end index */
> > +     end = find_next_bit(chunk->bound_map, pcpu_chunk_map_bits(chunk),
> > +                     bit_off + 1);
> > +     bits = end - bit_off;
> > +     size = bits * PCPU_MIN_ALLOC_SIZE;
> > +
> > +     switch (step) {
> > +     case MEMCG_KMEM_PRE_CHARGE:
> > +             objcg_new = get_obj_cgroup_from_current();
> > +             WARN_ON(!objcg_new);
> > +             if (obj_cgroup_charge(objcg_new, GFP_KERNEL,
> > +                                   size * num_possible_cpus())) {
> > +                     obj_cgroup_put(objcg_new);
> > +                     spin_unlock_irqrestore(&pcpu_lock, flags);
> > +                     return false;
> > +             }
> > +             break;
> > +     case MEMCG_KMEM_UNCHARGE:
> > +             obj_cgroup_uncharge(objcg_old, size * num_possible_cpus());
> > +             rcu_read_lock();
> > +             mod_memcg_state(obj_cgroup_memcg(objcg_old), MEMCG_PERCPU_B,
> > +                     -(size * num_possible_cpus()));
> > +             rcu_read_unlock();
> > +             chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL;
> > +             obj_cgroup_put(objcg_old);
> > +             break;
> > +     case MEMCG_KMEM_POST_CHARGE:
> > +             rcu_read_lock();
> > +             chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = obj_cgroup_from_current();
> > +             mod_memcg_state(mem_cgroup_from_task(current), MEMCG_PERCPU_B,
> > +                     (size * num_possible_cpus()));
> > +             rcu_read_unlock();
> > +             break;
> > +     case MEMCG_KMEM_CHARGE_ERR:
> > +             /*
> > +              * In case fail to charge to the new one in the pre charge state,
> > +              * for example, we have pre-charged one memcg successfully but fail
> > +              * to pre-charge the second memcg, then we should uncharge the first
> > +              * memcg.
> > +              */
> > +             objcg_new = obj_cgroup_from_current();
> > +             obj_cgroup_uncharge(objcg_new, size * num_possible_cpus());
> > +             obj_cgroup_put(objcg_new);
> > +             rcu_read_lock();
> > +             mod_memcg_state(obj_cgroup_memcg(objcg_new), MEMCG_PERCPU_B,
> > +                     -(size * num_possible_cpus()));
> > +             rcu_read_unlock();
> > +
> > +             break;
> > +     }
>
> I'm not really the biggest fan of this step stuff. I see why you're
> doing it because you want to do all or nothing recharging the percpu bpf
> maps. Is there a way to have percpu own this logic and attempt to do all
> or nothing instead? I realize bpf is likely the largest percpu user, but
> the recharge_percpu() api seems to be more generic than forcing
> potential other users in the future to open code it.
>

I agree with you that the recharge API may be used by other users; it
should be a more generic helper.
Maybe we can make percpu own this logic by introducing a new value for
the step parameter, for example,
    recharge_percpu(ptr, -1); /* -1 means the caller doesn't need to
                                 care about the individual steps */
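A minimal sketch of that idea, assuming the per-step body of the current
recharge_percpu() is split out into a helper (the helper name and the -1
convention are illustrative only):

bool recharge_percpu(void __percpu *ptr, int step)
{
        /* Callers that want to drive the steps themselves still can. */
        if (step >= 0)
                return recharge_percpu_step(ptr, step);

        /* step < 0: percpu owns the whole all-or-nothing sequence. */
        if (!recharge_percpu_step(ptr, MEMCG_KMEM_PRE_CHARGE))
                return false;
        recharge_percpu_step(ptr, MEMCG_KMEM_UNCHARGE);
        recharge_percpu_step(ptr, MEMCG_KMEM_POST_CHARGE);
        return true;
}

where recharge_percpu_step() would contain the existing switch statement from
this patch.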

> > +
> > +     spin_unlock_irqrestore(&pcpu_lock, flags);
> > +
> > +     return true;
> > +}
> > +EXPORT_SYMBOL(recharge_percpu);
> > +
> > +#else /* CONFIG_MEMCG_KMEM */
> > +
> > +bool charge_percpu(void __percpu *ptr, bool charge)
> > +{
> > +     return true;
> > +}
> > +EXPORT_SYMBOL(charge_percpu);
> > +
> > +void uncharge_percpu(void __percpu *ptr)
> > +{
> > +}
>
> I'm guessing this is supposed to be recharge_percpu() not
> (un)charge_percpu().

Thanks for pointing out this bug.  The lkp robot also reported this bug to me.
I will change it.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 03/10] mm, memcg: Add new helper obj_cgroup_from_current()
  2022-06-25 13:54     ` Yafang Shao
@ 2022-06-26  1:52       ` Roman Gushchin
  0 siblings, 0 replies; 30+ messages in thread
From: Roman Gushchin @ 2022-06-26  1:52 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Quentin Monnet, Johannes Weiner, Michal Hocko, Shakeel Butt,
	songmuchun, Andrew Morton, Christoph Lameter, penberg,
	David Rientjes, iamjoonsoo.kim, Vlastimil Babka, Linux MM, bpf

On Sat, Jun 25, 2022 at 09:54:17PM +0800, Yafang Shao wrote:
> On Thu, Jun 23, 2022 at 11:01 AM Roman Gushchin
> <roman.gushchin@linux.dev> wrote:
> >
> > On Sun, Jun 19, 2022 at 03:50:25PM +0000, Yafang Shao wrote:
> > > The difference between get_obj_cgroup_from_current() and obj_cgroup_from_current()
> > > is that the later one doesn't add objcg's refcnt.
> > >
> > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > > ---
> > >  include/linux/memcontrol.h |  1 +
> > >  mm/memcontrol.c            | 24 ++++++++++++++++++++++++
> > >  2 files changed, 25 insertions(+)
> > >
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > index cf074156c6ac..402b42670bcd 100644
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -1703,6 +1703,7 @@ bool mem_cgroup_kmem_disabled(void);
> > >  int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order);
> > >  void __memcg_kmem_uncharge_page(struct page *page, int order);
> > >
> > > +struct obj_cgroup *obj_cgroup_from_current(void);
> > >  struct obj_cgroup *get_obj_cgroup_from_current(void);
> > >  struct obj_cgroup *get_obj_cgroup_from_page(struct page *page);
> > >
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index abec50f31fe6..350a7849dac3 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -2950,6 +2950,30 @@ struct obj_cgroup *get_obj_cgroup_from_page(struct page *page)
> > >       return objcg;
> > >  }
> > >
> > > +__always_inline struct obj_cgroup *obj_cgroup_from_current(void)
> > > +{
> > > +     struct obj_cgroup *objcg = NULL;
> > > +     struct mem_cgroup *memcg;
> > > +
> > > +     if (memcg_kmem_bypass())
> > > +             return NULL;
> > > +
> > > +     rcu_read_lock();
> > > +     if (unlikely(active_memcg()))
> > > +             memcg = active_memcg();
> > > +     else
> > > +             memcg = mem_cgroup_from_task(current);
> > > +
> > > +     for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
> > > +             objcg = rcu_dereference(memcg->objcg);
> > > +             if (objcg)
> > > +                     break;
> > > +     }
> > > +     rcu_read_unlock();
> >
> > Hm, what prevents the objcg from being released here? Under which conditions
> > it's safe to call it?
> 
> obj_cgroup_from_current() is used when we know the objcg's refcnt has
> already been incremented.
> For example in my case, it is called after we have already call get_
> parent_mem_cgroup().
> I should add a comment or a WARN_ON() in this function.

Yes, it's very confusing, please add a big comment explaining under which
conditions it's safe to call it.

Thanks!

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map
  2022-06-25  3:26   ` Yafang Shao
@ 2022-06-26  3:28     ` Roman Gushchin
  2022-06-26  3:32       ` Roman Gushchin
  2022-06-26  6:25       ` Yafang Shao
  2022-06-27  0:40     ` Alexei Starovoitov
  1 sibling, 2 replies; 30+ messages in thread
From: Roman Gushchin @ 2022-06-26  3:28 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Quentin Monnet, Johannes Weiner, Michal Hocko, Shakeel Butt,
	songmuchun, Andrew Morton, Christoph Lameter, penberg,
	David Rientjes, iamjoonsoo.kim, Vlastimil Babka, Linux MM, bpf

On Sat, Jun 25, 2022 at 11:26:13AM +0800, Yafang Shao wrote:
> On Thu, Jun 23, 2022 at 11:29 AM Roman Gushchin
> <roman.gushchin@linux.dev> wrote:
> >
> > On Sun, Jun 19, 2022 at 03:50:22PM +0000, Yafang Shao wrote:
> > > After switching to memcg-based bpf memory accounting, the bpf memory is
> > > charged to the loader's memcg by default, that causes unexpected issues for
> > > us. For instance, the container of the loader may be restarted after
> > > pinning progs and maps, but the bpf memcg will be left and pinned on the
> > > system. Once the loader's new generation container is started, the leftover
> > > pages won't be charged to it. That inconsistent behavior will make trouble
> > > for the memory resource management for this container.
> > >
> > > In the past few days, I have proposed two patchsets[1][2] to try to resolve
> > > this issue, but in both of these two proposals the user code has to be
> > > changed to adapt to it, that is a pain for us. This patchset relieves the
> > > pain by triggering the recharge in libbpf. It also addresses Roman's
> > > critical comments.
> > >
> > > The key point we can avoid changing the user code is that there's a resue
> > > path in libbpf. Once the bpf container is restarted again, it will try
> > > to re-run the required bpf programs, if the bpf programs are the same with
> > > the already pinned one, it will reuse them.
> > >
> > > To make sure we either recharge all of them successfully or don't recharge
> > > any of them. The recharge prograss is divided into three steps:
> > >   - Pre charge to the new generation
> > >     To make sure once we uncharge from the old generation, we can always
> > >     charge to the new generation succeesfully. If we can't pre charge to
> > >     the new generation, we won't allow it to be uncharged from the old
> > >     generation.
> > >   - Uncharge from the old generation
> > >     After pre charge to the new generation, we can uncharge from the old
> > >     generation.
> > >   - Post charge to the new generation
> > >     Finnaly we can set pages' memcg_data to the new generation.
> > > In the pre charge step, we may succeed to charge some addresses, but fail
> > > to charge a new address, then we should uncharge the already charged
> > > addresses, so another recharge-err step is instroduced.
> > >
> > > This pachset has finished recharging bpf hash map. which is mostly used
> > > by our bpf services. The other maps hasn't been implemented yet. The bpf
> > > progs hasn't been implemented neither.
> >
> > Without going into the implementation details, the overall approach looks
> > ok to me. But it adds complexity and code into several different subsystems,
> > and I'm 100% sure it's not worth it if we talking about a partial support
> > of a single map type. Are you committed to implement the recharging
> > for all/most map types and progs and support this code in the future?
> >
> 
> I'm planning to support it for all map types and progs. Regarding the
> progs, it seems that we have to introduce a new UAPI for the user to
> do the recharge, because there's no similar reuse path in libbpf.
> 
> Our company is a heavy bpf user. We have many bpf programs running on
> our production environment, including networking bpf,
> tracing/profiling bpf, and some other bpf programs which are not
> supported in upstream kernel, for example we're even trying the
> sched-bpf[1] proposed by you (and you may remember that I reviewed
> your patchset).  Most of the networking bpf, e.g. gateway-bpf,
> edt-bpf, loadbalance-bpf, veth-bpf, are pinned on the system.
> 
> It is a trend that bpf will be introduced in more and more subsystems,
> and thus it is no doubt that a bpf patchset will involve many
> subsystems.
> 
> That means I will be continuously active in these areas in the near
> future,  several years at least.

Ok, I'm glad to hear that. I highly recommend covering more map types
and use cases in the next iterations of the patchset.

> 
> [1]. https://lwn.net/Articles/869433/
> 
> > I'm still feeling you trying to solve a userspace problem in the kernel.
> 
> Your feeling can be converted to a simple question: is it allowed to
> pin a bpf program by a process running in a container.  The answer to
> this simple question can help us to understand whether it is a user
> bug or a kernel bug.
> 
> I think you will agree with me that there's definitely no reason to
> refuse to pin a bpf program by a containerized process.  And then we
> will find that the pinned-bpf-program doesn't cooperate well with
> memcg.  A kernel feature can't work together with another kernel
> feature, and there's not even an interface provided to the user to
> adjust it. The user either doesn't pin the bpf program or disable
> kmemcg.   Isn't it a kernel bug ?
> 
> You may have a doubt why these two features can't cooperate.  I will
> explain it in detail.  That will be a long story.
> 
> It should begin with why we introduce bpf pinning. We pin it because
> sometimes the lifecycle of a user application is different with the
> bpf program, or there's no user agent at all.  In order to make it
> simple, I will take the no-user-agent (agent exits after pinning bpf
> program) case as an example.
> 
> Now thinking about what will happen if the agent which pins the bpf
> program has a memcg. No matter if the agent destroys the memcg or not
> once it exits, the memcg will not disappear because it is pinned by
> the bpf program. To make it easy, let's assume the memcg isn't being
> destroyed, IOW, it is online.
> 
> An online memcg is not populated, but it is still being remote charged
> (if it is a non-preallocate bpf map), that looks like a ghost. Now we
> will look into the details to find what will happen to this ghost
> memcg.
> 
> If this ghost memcg is limited, it will introduce many issues AFAICS.
> Firstly, the memcg will be force charged[2], and I think I don't need
> to explain the reason to you.
> Even worse is that it force-charges silently without any event,
> because it comes from,
>         if (!gfpflags_allow_blocking(gfp_mask))
>             goto nomem;
> And then all memcg events will be skipped. So at least we will
> introduce a force-charge event,
>     force:
> +      memcg_memory_event(mem_over_limit, MEMCG_FORCE_CHARGE);
>         page_counter_charge(&memcg->memory, nr_pages);

This is actually a good point, let me try to fix it.

> 
> And then we should allow alloc_htab_elem() to fail,
>                 l_new = bpf_map_kmalloc_node(&htab->map, htab->elem_size,
> -                                            GFP_ATOMIC | __GFP_NOWARN,
> +                                            __GFP_ATOMIC |
> __GFP_KSWAPD_RECLAIM | __GFP_NOWARN,

It's not a memcg thing, it was done this way previously. Probably Alexei
has a better explanation. Personally, I'm totally fine with removing
__GFP_NOWARN, but maybe I just don't know something.

>                                              htab->map.numa_node);
> And then we'd better introduce an improvement for memcg,
> +      /*
> +       *  Should wakeup async memcg reclaim first,
> +       *   in case there will be no direct memcg reclaim for a long time.
> +       *   We can either introduce async memcg reclaim
> +       *   or modify kswapd to reclaim a specific memcg
> +       */
> +       if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> +            wake_up_async_memcg_reclaim();
>          if (!gfpflags_allow_blocking(gfp_mask))
>                 goto nomem;

Hm, I see. It might be an issue if there is no global memory pressure, right?
Let me think about what I can do here too.

> 
> And .....
> 
> Really bad luck that there are so many issues in memcg, but it may
> also be because I don't have a deep understanding of memcg ......
> 
> I have to clarify that these issues are not caused by
> memcg-based-bpf-accounting, but exposed by it.
> 
> [ Time for lunch here, so I have to stop. ]

Thank you for writing this text, it was interesting to follow your thinking.
And thank you for bringing in these problems above.

Let me be clear: I'm not opposing the idea of recharging, I'm only against
introducing hacks for bpf-specific issues, which can't be nicely generalized
for other use cases and subsystems. That's the only reason why I'm a bit
defensive here.

In general, as of now memory cgroups do not provide an ability to recharge
accounted objects (with some exceptions from the v1 legacy). This applies
both to user and kernel memory. I agree that bpf maps are in some sense
unique, as they are potentially large kernel objects with a lot of control
from userspace. Is this a good reason to extend the memory cgroup API
with a recharging ability? Maybe, but if yes, let's do it well.

The big question is how to do it. Memcg accounting is done in a way
that requires few changes to the kernel code, right? You just
add __GFP_ACCOUNT to the gfp flags and that's it, you get a pointer to
an already accounted object. The same applies to uncharging.
It works transparently.
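For example (plain kernel code, nothing specific to this patchset):

        void *buf;

        /* Charged to the current task's memcg at allocation time and
         * uncharged automatically when it is freed.
         */
        buf = kmalloc(4096, GFP_KERNEL | __GFP_ACCOUNT);
        if (buf)
                kfree(buf);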

Recharging is different: a caller should have some sort of ownership
over the object (to make sure we are not racing against reclaim and/or
another recharging). And the rules are different for each type of object.
It's the caller's duty to make sure all parts of a complex object are properly
recharged and nothing is left behind. There is also the reparenting mechanism,
which can race against the recharging. So it's not an easy problem.
If an object is large, we probably don't want to recharge it all at once,
otherwise temporarily doubling the accounted memory (thanks to the
precharge-uncharge-commit approach) risks introducing spurious OOMs
on memory-limited systems.
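To illustrate the last point, a large object could be recharged in bounded
chunks instead of precharging its full size up front. A rough sketch, with a
made-up chunk size and helper name and simplified error handling:

#define RECHARGE_CHUNK  (1UL << 20)     /* move at most 1MB at a time */

static int recharge_in_chunks(struct obj_cgroup *old, struct obj_cgroup *new,
                              size_t total)
{
        size_t moved = 0;

        while (moved < total) {
                size_t step = min_t(size_t, total - moved, RECHARGE_CHUNK);

                /* Pre-charge only this chunk to the new memcg... */
                if (obj_cgroup_charge(new, GFP_KERNEL, step))
                        goto rollback;
                /* ...and immediately release it from the old one, so at most
                 * RECHARGE_CHUNK bytes are double-accounted at any time.
                 */
                obj_cgroup_uncharge(old, step);
                moved += step;
        }
        return 0;

rollback:
        /* Give the already-moved part back to the old memcg. */
        if (moved) {
                obj_cgroup_charge(old, GFP_KERNEL, moved);
                obj_cgroup_uncharge(new, moved);
        }
        return -ENOMEM;
}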

So yeah, if it doesn't sound too scary to you, I'm happy to help
with this. But it's a lot of work to do it properly; that's why I'm thinking
that maybe it's better to work around it in userspace, as Alexei suggested.

Thanks!

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map
  2022-06-26  3:28     ` Roman Gushchin
@ 2022-06-26  3:32       ` Roman Gushchin
  2022-06-26  6:38         ` Yafang Shao
  2022-06-26  6:25       ` Yafang Shao
  1 sibling, 1 reply; 30+ messages in thread
From: Roman Gushchin @ 2022-06-26  3:32 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Quentin Monnet, Johannes Weiner, Michal Hocko, Shakeel Butt,
	songmuchun, Andrew Morton, Christoph Lameter, penberg,
	David Rientjes, iamjoonsoo.kim, Vlastimil Babka, Linux MM, bpf

On Sat, Jun 25, 2022 at 08:28:37PM -0700, Roman Gushchin wrote:
> On Sat, Jun 25, 2022 at 11:26:13AM +0800, Yafang Shao wrote:
> > On Thu, Jun 23, 2022 at 11:29 AM Roman Gushchin
> > <roman.gushchin@linux.dev> wrote:
> > >
> > > On Sun, Jun 19, 2022 at 03:50:22PM +0000, Yafang Shao wrote:
> > > > After switching to memcg-based bpf memory accounting, the bpf memory is
> > > > charged to the loader's memcg by default, that causes unexpected issues for
> > > > us. For instance, the container of the loader may be restarted after
> > > > pinning progs and maps, but the bpf memcg will be left and pinned on the
> > > > system. Once the loader's new generation container is started, the leftover
> > > > pages won't be charged to it. That inconsistent behavior will make trouble
> > > > for the memory resource management for this container.
> > > >
> > > > In the past few days, I have proposed two patchsets[1][2] to try to resolve
> > > > this issue, but in both of these two proposals the user code has to be
> > > > changed to adapt to it, that is a pain for us. This patchset relieves the
> > > > pain by triggering the recharge in libbpf. It also addresses Roman's
> > > > critical comments.
> > > >
> > > > The key point we can avoid changing the user code is that there's a resue
> > > > path in libbpf. Once the bpf container is restarted again, it will try
> > > > to re-run the required bpf programs, if the bpf programs are the same with
> > > > the already pinned one, it will reuse them.
> > > >
> > > > To make sure we either recharge all of them successfully or don't recharge
> > > > any of them. The recharge prograss is divided into three steps:
> > > >   - Pre charge to the new generation
> > > >     To make sure once we uncharge from the old generation, we can always
> > > >     charge to the new generation succeesfully. If we can't pre charge to
> > > >     the new generation, we won't allow it to be uncharged from the old
> > > >     generation.
> > > >   - Uncharge from the old generation
> > > >     After pre charge to the new generation, we can uncharge from the old
> > > >     generation.
> > > >   - Post charge to the new generation
> > > >     Finnaly we can set pages' memcg_data to the new generation.
> > > > In the pre charge step, we may succeed to charge some addresses, but fail
> > > > to charge a new address, then we should uncharge the already charged
> > > > addresses, so another recharge-err step is instroduced.
> > > >
> > > > This pachset has finished recharging bpf hash map. which is mostly used
> > > > by our bpf services. The other maps hasn't been implemented yet. The bpf
> > > > progs hasn't been implemented neither.
> > >
> > > Without going into the implementation details, the overall approach looks
> > > ok to me. But it adds complexity and code into several different subsystems,
> > > and I'm 100% sure it's not worth it if we talking about a partial support
> > > of a single map type. Are you committed to implement the recharging
> > > for all/most map types and progs and support this code in the future?
> > >
> > 
> > I'm planning to support it for all map types and progs. Regarding the
> > progs, it seems that we have to introduce a new UAPI for the user to
> > do the recharge, because there's no similar reuse path in libbpf.
> > 
> > Our company is a heavy bpf user. We have many bpf programs running on
> > our production environment, including networking bpf,
> > tracing/profiling bpf, and some other bpf programs which are not
> > supported in upstream kernel, for example we're even trying the
> > sched-bpf[1] proposed by you (and you may remember that I reviewed
> > your patchset).  Most of the networking bpf, e.g. gateway-bpf,
> > edt-bpf, loadbalance-bpf, veth-bpf, are pinned on the system.
> > 
> > It is a trend that bpf will be introduced in more and more subsystems,
> > and thus it is no doubt that a bpf patchset will involve many
> > subsystems.
> > 
> > That means I will be continuously active in these areas in the near
> > future,  several years at least.
> 
> Ok, I'm glad to hear this. I highly recommend to cover more map types
> and use cases in next iterations of the patchset.
> 
> > 
> > [1]. https://lwn.net/Articles/869433/
> > 
> > > I'm still feeling you trying to solve a userspace problem in the kernel.
> > 
> > Your feeling can be converted to a simple question: is it allowed to
> > pin a bpf program by a process running in a container.  The answer to
> > this simple question can help us to understand whether it is a user
> > bug or a kernel bug.
> > 
> > I think you will agree with me that there's definitely no reason to
> > refuse to pin a bpf program by a containerized process.  And then we
> > will find that the pinned-bpf-program doesn't cooperate well with
> > memcg.  A kernel feature can't work together with another kernel
> > feature, and there's not even an interface provided to the user to
> > adjust it. The user either doesn't pin the bpf program or disable
> > kmemcg.   Isn't it a kernel bug ?
> > 
> > You may have a doubt why these two features can't cooperate.  I will
> > explain it in detail.  That will be a long story.
> > 
> > It should begin with why we introduce bpf pinning. We pin it because
> > sometimes the lifecycle of a user application is different with the
> > bpf program, or there's no user agent at all.  In order to make it
> > simple, I will take the no-user-agent (agent exits after pinning bpf
> > program) case as an example.
> > 
> > Now thinking about what will happen if the agent which pins the bpf
> > program has a memcg. No matter if the agent destroys the memcg or not
> > once it exits, the memcg will not disappear because it is pinned by
> > the bpf program. To make it easy, let's assume the memcg isn't being
> > destroyed, IOW, it is online.
> > 
> > An online memcg is not populated, but it is still being remote charged
> > (if it is a non-preallocate bpf map), that looks like a ghost. Now we
> > will look into the details to find what will happen to this ghost
> > memcg.
> > 
> > If this ghost memcg is limited, it will introduce many issues AFAICS.
> > Firstly, the memcg will be force charged[2], and I think I don't need
> > to explain the reason to you.
> > Even worse is that it force-charges silently without any event,
> > because it comes from,
> >         if (!gfpflags_allow_blocking(gfp_mask))
> >             goto nomem;
> > And then all memcg events will be skipped. So at least we will
> > introduce a force-charge event,
> >     force:
> > +      memcg_memory_event(mem_over_limit, MEMCG_FORCE_CHARGE);
> >         page_counter_charge(&memcg->memory, nr_pages);
> 
> This is actually a good point, let me try to fix it.
> 
> > 
> > And then we should allow alloc_htab_elem() to fail,
> >                 l_new = bpf_map_kmalloc_node(&htab->map, htab->elem_size,
> > -                                            GFP_ATOMIC | __GFP_NOWARN,
> > +                                            __GFP_ATOMIC |
> > __GFP_KSWAPD_RECLAIM | __GFP_NOWARN,
> 
> It's not a memcg thing, it was done this way previously. Probably Alexei
> has a better explanation. Personally, I'm totally fine with removing
> __GFP_NOWARN, but maybe I just don't know something.
> 
> >                                              htab->map.numa_node);
> > And then we'd better introduce an improvement for memcg,
> > +      /*
> > +       *  Should wakeup async memcg reclaim first,
> > +       *   in case there will be no direct memcg reclaim for a long time.
> > +       *   We can either introduce async memcg reclaim
> > +       *   or modify kswapd to reclaim a specific memcg
> > +       */
> > +       if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> > +            wake_up_async_memcg_reclaim();
> >          if (!gfpflags_allow_blocking(gfp_mask))
> >                 goto nomem;
> 
> Hm, I see. It might be an issue if there is no global memory pressure, right?
> Let me think what I can do here too.
> 
> > 
> > And .....
> > 
> > Really bad luck that there are so many issues in memcg, but it may
> > also be because I don't have a deep understanding of memcg ......
> > 
> > I have to clarify that these issues are not caused by
> > memcg-based-bpf-accounting, but exposed by it.
> > 
> > [ Time for lunch here, so I have to stop. ]
> 
> Thank you for writing this text, it was interesting to follow your thinking.
> And thank you for bringing in these problems above.
> 
> Let me be clear: I'm not opposing the idea of recharging, I'm only against
> introducing hacks for bpf-specific issues, which can't be nicely generalized
> for other use cases and subsystems. That's the only reason why I'm a bit
> defensive here.
> 
> In general, as now memory cgroups do not provide an ability to recharge
> accounted objects (with some exceptions from the v1 legacy). It applies
> both to user and kernel memory. I agree, that bpf maps are in some sense
> unique, as they are potentially large kernel objects with a lot of control
> from the userspace. Is this a good reason to extend memory cgroup API
> with the recharging ability? Maybe, but if yes, let's do it well.
> 
> The big question is how to do it? Memcg accounting is done in a way
> that requires little changes from the kernel code, right? You just
> add __GFP_ACCOUNT to gfp flags and that's it, you get a pointer to
> an already accounted object. The same applies for uncharging.
> It works transparently.
> 
> Recharging is different: a caller should have some sort of the ownership
> over the object (to make sure we are not racing against the reclaim and/or
> another recharging). And the rules are different for each type of objects.
> It's a caller duty to make sure all parts of the complex object are properly
> recharged and nothing is left behind. There is also the reparenting mechanism
> which can race against the recharging. So it's not an easy problem.
> If an object is large, we probably don't want to recharge it at once,
> otherwise temporarily doubling of the accounted memory (thanks to the
> precharge-uncharge-commit approach) risks introducing spurious OOMs
> on memory-limited systems.
> 
> So yeah, if it doesn't sound too scary for you, I'm happy to help
> with this. But it's a lot of work to do it properly, that's why I'm thinking
> that maybe it's better to workaround it in userspace, as Alexei suggested.

And as Alexei mentioned, there are some changes coming around the way
memory allocations will usually be performed by the bpf code, which might
make the whole problem and/or your solution obsolete. Please make sure it's
still relevant before sending the next version.

Thanks!

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map
  2022-06-26  3:28     ` Roman Gushchin
  2022-06-26  3:32       ` Roman Gushchin
@ 2022-06-26  6:25       ` Yafang Shao
  2022-07-02  4:23         ` Roman Gushchin
  1 sibling, 1 reply; 30+ messages in thread
From: Yafang Shao @ 2022-06-26  6:25 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Quentin Monnet, Johannes Weiner, Michal Hocko, Shakeel Butt,
	songmuchun, Andrew Morton, Christoph Lameter, penberg,
	David Rientjes, iamjoonsoo.kim, Vlastimil Babka, Linux MM, bpf

On Sun, Jun 26, 2022 at 11:28 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> On Sat, Jun 25, 2022 at 11:26:13AM +0800, Yafang Shao wrote:
> > On Thu, Jun 23, 2022 at 11:29 AM Roman Gushchin
> > <roman.gushchin@linux.dev> wrote:
> > >
> > > On Sun, Jun 19, 2022 at 03:50:22PM +0000, Yafang Shao wrote:
> > > > After switching to memcg-based bpf memory accounting, the bpf memory is
> > > > charged to the loader's memcg by default, that causes unexpected issues for
> > > > us. For instance, the container of the loader may be restarted after
> > > > pinning progs and maps, but the bpf memcg will be left and pinned on the
> > > > system. Once the loader's new generation container is started, the leftover
> > > > pages won't be charged to it. That inconsistent behavior will make trouble
> > > > for the memory resource management for this container.
> > > >
> > > > In the past few days, I have proposed two patchsets[1][2] to try to resolve
> > > > this issue, but in both of these two proposals the user code has to be
> > > > changed to adapt to it, that is a pain for us. This patchset relieves the
> > > > pain by triggering the recharge in libbpf. It also addresses Roman's
> > > > critical comments.
> > > >
> > > > The key point we can avoid changing the user code is that there's a resue
> > > > path in libbpf. Once the bpf container is restarted again, it will try
> > > > to re-run the required bpf programs, if the bpf programs are the same with
> > > > the already pinned one, it will reuse them.
> > > >
> > > > To make sure we either recharge all of them successfully or don't recharge
> > > > any of them. The recharge prograss is divided into three steps:
> > > >   - Pre charge to the new generation
> > > >     To make sure once we uncharge from the old generation, we can always
> > > >     charge to the new generation succeesfully. If we can't pre charge to
> > > >     the new generation, we won't allow it to be uncharged from the old
> > > >     generation.
> > > >   - Uncharge from the old generation
> > > >     After pre charge to the new generation, we can uncharge from the old
> > > >     generation.
> > > >   - Post charge to the new generation
> > > >     Finnaly we can set pages' memcg_data to the new generation.
> > > > In the pre charge step, we may succeed to charge some addresses, but fail
> > > > to charge a new address, then we should uncharge the already charged
> > > > addresses, so another recharge-err step is instroduced.
> > > >
> > > > This pachset has finished recharging bpf hash map. which is mostly used
> > > > by our bpf services. The other maps hasn't been implemented yet. The bpf
> > > > progs hasn't been implemented neither.
> > >
> > > Without going into the implementation details, the overall approach looks
> > > ok to me. But it adds complexity and code into several different subsystems,
> > > and I'm 100% sure it's not worth it if we talking about a partial support
> > > of a single map type. Are you committed to implement the recharging
> > > for all/most map types and progs and support this code in the future?
> > >
> >
> > I'm planning to support it for all map types and progs. Regarding the
> > progs, it seems that we have to introduce a new UAPI for the user to
> > do the recharge, because there's no similar reuse path in libbpf.
> >
> > Our company is a heavy bpf user. We have many bpf programs running on
> > our production environment, including networking bpf,
> > tracing/profiling bpf, and some other bpf programs which are not
> > supported in upstream kernel, for example we're even trying the
> > sched-bpf[1] proposed by you (and you may remember that I reviewed
> > your patchset).  Most of the networking bpf, e.g. gateway-bpf,
> > edt-bpf, loadbalance-bpf, veth-bpf, are pinned on the system.
> >
> > It is a trend that bpf will be introduced in more and more subsystems,
> > and thus it is no doubt that a bpf patchset will involve many
> > subsystems.
> >
> > That means I will be continuously active in these areas in the near
> > future,  several years at least.
>
> Ok, I'm glad to hear this. I highly recommend to cover more map types
> and use cases in next iterations of the patchset.
>
> >
> > [1]. https://lwn.net/Articles/869433/
> >
> > > I'm still feeling you trying to solve a userspace problem in the kernel.
> >
> > Your feeling can be converted to a simple question: is it allowed to
> > pin a bpf program by a process running in a container.  The answer to
> > this simple question can help us to understand whether it is a user
> > bug or a kernel bug.
> >
> > I think you will agree with me that there's definitely no reason to
> > refuse to pin a bpf program by a containerized process.  And then we
> > will find that the pinned-bpf-program doesn't cooperate well with
> > memcg.  A kernel feature can't work together with another kernel
> > feature, and there's not even an interface provided to the user to
> > adjust it. The user either doesn't pin the bpf program or disable
> > kmemcg.   Isn't it a kernel bug ?
> >
> > You may have a doubt why these two features can't cooperate.  I will
> > explain it in detail.  That will be a long story.
> >
> > It should begin with why we introduce bpf pinning. We pin it because
> > sometimes the lifecycle of a user application is different with the
> > bpf program, or there's no user agent at all.  In order to make it
> > simple, I will take the no-user-agent (agent exits after pinning bpf
> > program) case as an example.
> >
> > Now thinking about what will happen if the agent which pins the bpf
> > program has a memcg. No matter if the agent destroys the memcg or not
> > once it exits, the memcg will not disappear because it is pinned by
> > the bpf program. To make it easy, let's assume the memcg isn't being
> > destroyed, IOW, it is online.
> >
> > An online memcg is not populated, but it is still being remote charged
> > (if it is a non-preallocate bpf map), that looks like a ghost. Now we
> > will look into the details to find what will happen to this ghost
> > memcg.
> >
> > If this ghost memcg is limited, it will introduce many issues AFAICS.
> > Firstly, the memcg will be force charged[2], and I think I don't need
> > to explain the reason to you.
> > Even worse is that it force-charges silently without any event,
> > because it comes from,
> >         if (!gfpflags_allow_blocking(gfp_mask))
> >             goto nomem;
> > And then all memcg events will be skipped. So at least we will
> > introduce a force-charge event,
> >     force:
> > +      memcg_memory_event(mem_over_limit, MEMCG_FORCE_CHARGE);
> >         page_counter_charge(&memcg->memory, nr_pages);
>
> This is actually a good point, let me try to fix it.
>

Thanks for following up with it.

> >
> > And then we should allow alloc_htab_elem() to fail,
> >                 l_new = bpf_map_kmalloc_node(&htab->map, htab->elem_size,
> > -                                            GFP_ATOMIC | __GFP_NOWARN,
> > +                                            __GFP_ATOMIC |
> > __GFP_KSWAPD_RECLAIM | __GFP_NOWARN,
>
> It's not a memcg thing, it was done this way previously. Probably Alexei
> has a better explanation. Personally, I'm totally fine with removing
> __GFP_NOWARN, but maybe I just don't know something.
>

Ah, the automatic line wrapping misled you.
What I really want to remove is '__GFP_HIGH' from GFP_ATOMIC, so that
the allocation hits the 'nomem' check[1] when the memcg limit is
reached.
Since alloc_htab_elem() is allowed to fail the allocation, removing
this flag won't be an issue.
alloc_htab_elem() may allocate lots of memory, so it is also
reasonable to make it a low-priority allocation.
Alexei may give us more information on it.

[1]. https://elixir.bootlin.com/linux/v5.19-rc3/source/mm/memcontrol.c#L2683
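
To spell it out (this is just my reading of include/linux/gfp.h in
v5.19-rc3, not part of the patch), GFP_ATOMIC expands to
__GFP_HIGH | __GFP_ATOMIC | __GFP_KSWAPD_RECLAIM, so the flags in the
diff above keep everything from GFP_ATOMIC except __GFP_HIGH:

    /* include/linux/gfp.h, v5.19-rc3 */
    #define GFP_ATOMIC	(__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)

    /* proposed allocation in alloc_htab_elem(), i.e. GFP_ATOMIC minus __GFP_HIGH */
    l_new = bpf_map_kmalloc_node(&htab->map, htab->elem_size,
                                 __GFP_ATOMIC | __GFP_KSWAPD_RECLAIM |
                                 __GFP_NOWARN,
                                 htab->map.numa_node);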

> >                                              htab->map.numa_node);
> > And then we'd better introduce an improvement for memcg,
> > +      /*
> > +       *  Should wakeup async memcg reclaim first,
> > +       *   in case there will be no direct memcg reclaim for a long time.
> > +       *   We can either introduce async memcg reclaim
> > +       *   or modify kswapd to reclaim a specific memcg
> > +       */
> > +       if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> > +            wake_up_async_memcg_reclaim();
> >          if (!gfpflags_allow_blocking(gfp_mask))
> >                 goto nomem;
>
> Hm, I see. It might be an issue if there is no global memory pressure, right?
> Let me think what I can do here too.
>

Right. It is not a good idea to expect a global memory reclaimer to do it.
Thanks for following up with it again.

> >
> > And .....
> >
> > Really bad luck that there are so many issues in memcg, but it may
> > also be because I don't have a deep understanding of memcg ......
> >
> > I have to clarify that these issues are not caused by
> > memcg-based-bpf-accounting, but exposed by it.
> >
> > [ Time for lunch here, so I have to stop. ]
>
> Thank you for writing this text, it was interesting to follow your thinking.
> And thank you for bringing in these problems above.
>
> Let me be clear: I'm not opposing the idea of recharging, I'm only against
> introducing hacks for bpf-specific issues, which can't be nicely generalized
> for other use cases and subsystems. That's the only reason why I'm a bit
> defensive here.
>
> In general, as now memory cgroups do not provide an ability to recharge
> accounted objects (with some exceptions from the v1 legacy). It applies
> both to user and kernel memory. I agree, that bpf maps are in some sense
> unique, as they are potentially large kernel objects with a lot of control
> from the userspace. Is this a good reason to extend memory cgroup API
> with the recharging ability? Maybe, but if yes, let's do it well.
>
> The big question is how to do it? Memcg accounting is done in a way
> that requires little changes from the kernel code, right? You just
> add __GFP_ACCOUNT to gfp flags and that's it, you get a pointer to
> an already accounted object. The same applies for uncharging.
> It works transparently.
>
> Recharging is different: a caller should have some sort of the ownership
> over the object (to make sure we are not racing against the reclaim and/or
> another recharging). And the rules are different for each type of objects.
> It's a caller duty to make sure all parts of the complex object are properly
> recharged and nothing is left behind. There is also the reparenting mechanism
> which can race against the recharging. So it's not an easy problem.
> If an object is large, we probably don't want to recharge it at once,
> otherwise temporarily doubling of the accounted memory (thanks to the
> precharge-uncharge-commit approach) risks introducing spurious OOMs
> on memory-limited systems.
>

As I explained in the cover letter, the 'doubling of the accounted
memory' can be avoided.
The doubling only happens in the ancestors shared by the src and dst
memcgs, so we can stop the uncharge and recharge walk at their common
ancestor. For example,

                        memcg A
                       /       \
               memcg AB         memcg AC   <---- stop uncharge/recharge here
                               /        \
                       memcg ACA        memcg ACB
                           |                |
                       src memcg        dst memcg

Of course we must do it carefully.
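
To make the idea more concrete, here is a rough sketch (a hypothetical
helper, not part of this patchset) of how the common ancestor could be
found with the existing memcg helpers; the pre-charge would then stop
right below the returned memcg, and so would the uncharge:

    /*
     * Sketch only: walk up from dst until we reach an ancestor of src.
     * In the example above this returns memcg AC, so the counters of
     * AC and A are never touched and never see a double charge.
     */
    static struct mem_cgroup *recharge_common_ancestor(struct mem_cgroup *src,
                                                       struct mem_cgroup *dst)
    {
            struct mem_cgroup *iter;

            for (iter = dst; iter; iter = parent_mem_cgroup(iter))
                    if (mem_cgroup_is_descendant(src, iter))
                            return iter;
            return NULL;
    }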

> So yeah, if it doesn't sound too scary for you, I'm happy to help
> with this.

I'm glad to hear that you can help with it.

>  But it's a lot of work to do it properly, that's why I'm thinking
> that maybe it's better to workaround it in userspace, as Alexei suggested.
>

I don't mind working around it as Alexei suggested; that's why I
investigated whether it is possible to introduce a ghost memcg, and
then found so many issues with it.
Even if all the issues I mentioned above are fixed, we would still
need to change the kernel to adapt to it.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map
  2022-06-26  3:32       ` Roman Gushchin
@ 2022-06-26  6:38         ` Yafang Shao
  0 siblings, 0 replies; 30+ messages in thread
From: Yafang Shao @ 2022-06-26  6:38 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Quentin Monnet, Johannes Weiner, Michal Hocko, Shakeel Butt,
	songmuchun, Andrew Morton, Christoph Lameter, penberg,
	David Rientjes, iamjoonsoo.kim, Vlastimil Babka, Linux MM, bpf

On Sun, Jun 26, 2022 at 11:32 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> On Sat, Jun 25, 2022 at 08:28:37PM -0700, Roman Gushchin wrote:
> > On Sat, Jun 25, 2022 at 11:26:13AM +0800, Yafang Shao wrote:
> > > On Thu, Jun 23, 2022 at 11:29 AM Roman Gushchin
> > > <roman.gushchin@linux.dev> wrote:
> > > >
> > > > On Sun, Jun 19, 2022 at 03:50:22PM +0000, Yafang Shao wrote:
> > > > > After switching to memcg-based bpf memory accounting, the bpf memory is
> > > > > charged to the loader's memcg by default, that causes unexpected issues for
> > > > > us. For instance, the container of the loader may be restarted after
> > > > > pinning progs and maps, but the bpf memcg will be left and pinned on the
> > > > > system. Once the loader's new generation container is started, the leftover
> > > > > pages won't be charged to it. That inconsistent behavior will make trouble
> > > > > for the memory resource management for this container.
> > > > >
> > > > > In the past few days, I have proposed two patchsets[1][2] to try to resolve
> > > > > this issue, but in both of these two proposals the user code has to be
> > > > > changed to adapt to it, that is a pain for us. This patchset relieves the
> > > > > pain by triggering the recharge in libbpf. It also addresses Roman's
> > > > > critical comments.
> > > > >
> > > > > The key point we can avoid changing the user code is that there's a resue
> > > > > path in libbpf. Once the bpf container is restarted again, it will try
> > > > > to re-run the required bpf programs, if the bpf programs are the same with
> > > > > the already pinned one, it will reuse them.
> > > > >
> > > > > To make sure we either recharge all of them successfully or don't recharge
> > > > > any of them. The recharge prograss is divided into three steps:
> > > > >   - Pre charge to the new generation
> > > > >     To make sure once we uncharge from the old generation, we can always
> > > > >     charge to the new generation succeesfully. If we can't pre charge to
> > > > >     the new generation, we won't allow it to be uncharged from the old
> > > > >     generation.
> > > > >   - Uncharge from the old generation
> > > > >     After pre charge to the new generation, we can uncharge from the old
> > > > >     generation.
> > > > >   - Post charge to the new generation
> > > > >     Finnaly we can set pages' memcg_data to the new generation.
> > > > > In the pre charge step, we may succeed to charge some addresses, but fail
> > > > > to charge a new address, then we should uncharge the already charged
> > > > > addresses, so another recharge-err step is instroduced.
> > > > >
> > > > > This pachset has finished recharging bpf hash map. which is mostly used
> > > > > by our bpf services. The other maps hasn't been implemented yet. The bpf
> > > > > progs hasn't been implemented neither.
> > > >
> > > > Without going into the implementation details, the overall approach looks
> > > > ok to me. But it adds complexity and code into several different subsystems,
> > > > and I'm 100% sure it's not worth it if we talking about a partial support
> > > > of a single map type. Are you committed to implement the recharging
> > > > for all/most map types and progs and support this code in the future?
> > > >
> > >
> > > I'm planning to support it for all map types and progs. Regarding the
> > > progs, it seems that we have to introduce a new UAPI for the user to
> > > do the recharge, because there's no similar reuse path in libbpf.
> > >
> > > Our company is a heavy bpf user. We have many bpf programs running on
> > > our production environment, including networking bpf,
> > > tracing/profiling bpf, and some other bpf programs which are not
> > > supported in upstream kernel, for example we're even trying the
> > > sched-bpf[1] proposed by you (and you may remember that I reviewed
> > > your patchset).  Most of the networking bpf, e.g. gateway-bpf,
> > > edt-bpf, loadbalance-bpf, veth-bpf, are pinned on the system.
> > >
> > > It is a trend that bpf will be introduced in more and more subsystems,
> > > and thus it is no doubt that a bpf patchset will involve many
> > > subsystems.
> > >
> > > That means I will be continuously active in these areas in the near
> > > future,  several years at least.
> >
> > Ok, I'm glad to hear this. I highly recommend to cover more map types
> > and use cases in next iterations of the patchset.
> >
> > >
> > > [1]. https://lwn.net/Articles/869433/
> > >
> > > > I'm still feeling you trying to solve a userspace problem in the kernel.
> > >
> > > Your feeling can be converted to a simple question: is it allowed to
> > > pin a bpf program by a process running in a container.  The answer to
> > > this simple question can help us to understand whether it is a user
> > > bug or a kernel bug.
> > >
> > > I think you will agree with me that there's definitely no reason to
> > > refuse to pin a bpf program by a containerized process.  And then we
> > > will find that the pinned-bpf-program doesn't cooperate well with
> > > memcg.  A kernel feature can't work together with another kernel
> > > feature, and there's not even an interface provided to the user to
> > > adjust it. The user either doesn't pin the bpf program or disable
> > > kmemcg.   Isn't it a kernel bug ?
> > >
> > > You may have a doubt why these two features can't cooperate.  I will
> > > explain it in detail.  That will be a long story.
> > >
> > > It should begin with why we introduce bpf pinning. We pin it because
> > > sometimes the lifecycle of a user application is different with the
> > > bpf program, or there's no user agent at all.  In order to make it
> > > simple, I will take the no-user-agent (agent exits after pinning bpf
> > > program) case as an example.
> > >
> > > Now thinking about what will happen if the agent which pins the bpf
> > > program has a memcg. No matter if the agent destroys the memcg or not
> > > once it exits, the memcg will not disappear because it is pinned by
> > > the bpf program. To make it easy, let's assume the memcg isn't being
> > > destroyed, IOW, it is online.
> > >
> > > An online memcg is not populated, but it is still being remote charged
> > > (if it is a non-preallocate bpf map), that looks like a ghost. Now we
> > > will look into the details to find what will happen to this ghost
> > > memcg.
> > >
> > > If this ghost memcg is limited, it will introduce many issues AFAICS.
> > > Firstly, the memcg will be force charged[2], and I think I don't need
> > > to explain the reason to you.
> > > Even worse is that it force-charges silently without any event,
> > > because it comes from,
> > >         if (!gfpflags_allow_blocking(gfp_mask))
> > >             goto nomem;
> > > And then all memcg events will be skipped. So at least we will
> > > introduce a force-charge event,
> > >     force:
> > > +      memcg_memory_event(mem_over_limit, MEMCG_FORCE_CHARGE);
> > >         page_counter_charge(&memcg->memory, nr_pages);
> >
> > This is actually a good point, let me try to fix it.
> >
> > >
> > > And then we should allow alloc_htab_elem() to fail,
> > >                 l_new = bpf_map_kmalloc_node(&htab->map, htab->elem_size,
> > > -                                            GFP_ATOMIC | __GFP_NOWARN,
> > > +                                            __GFP_ATOMIC |
> > > __GFP_KSWAPD_RECLAIM | __GFP_NOWARN,
> >
> > It's not a memcg thing, it was done this way previously. Probably Alexei
> > has a better explanation. Personally, I'm totally fine with removing
> > __GFP_NOWARN, but maybe I just don't know something.
> >
> > >                                              htab->map.numa_node);
> > > And then we'd better introduce an improvement for memcg,
> > > +      /*
> > > +       *  Should wakeup async memcg reclaim first,
> > > +       *   in case there will be no direct memcg reclaim for a long time.
> > > +       *   We can either introduce async memcg reclaim
> > > +       *   or modify kswapd to reclaim a specific memcg
> > > +       */
> > > +       if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> > > +            wake_up_async_memcg_reclaim();
> > >          if (!gfpflags_allow_blocking(gfp_mask))
> > >                 goto nomem;
> >
> > Hm, I see. It might be an issue if there is no global memory pressure, right?
> > Let me think what I can do here too.
> >
> > >
> > > And .....
> > >
> > > Really bad luck that there are so many issues in memcg, but it may
> > > also be because I don't have a deep understanding of memcg ......
> > >
> > > I have to clarify that these issues are not caused by
> > > memcg-based-bpf-accounting, but exposed by it.
> > >
> > > [ Time for lunch here, so I have to stop. ]
> >
> > Thank you for writing this text, it was interesting to follow your thinking.
> > And thank you for bringing in these problems above.
> >
> > Let me be clear: I'm not opposing the idea of recharging, I'm only against
> > introducing hacks for bpf-specific issues, which can't be nicely generalized
> > for other use cases and subsystems. That's the only reason why I'm a bit
> > defensive here.
> >
> > In general, as now memory cgroups do not provide an ability to recharge
> > accounted objects (with some exceptions from the v1 legacy). It applies
> > both to user and kernel memory. I agree, that bpf maps are in some sense
> > unique, as they are potentially large kernel objects with a lot of control
> > from the userspace. Is this a good reason to extend memory cgroup API
> > with the recharging ability? Maybe, but if yes, let's do it well.
> >
> > The big question is how to do it? Memcg accounting is done in a way
> > that requires little changes from the kernel code, right? You just
> > add __GFP_ACCOUNT to gfp flags and that's it, you get a pointer to
> > an already accounted object. The same applies for uncharging.
> > It works transparently.
> >
> > Recharging is different: a caller should have some sort of the ownership
> > over the object (to make sure we are not racing against the reclaim and/or
> > another recharging). And the rules are different for each type of objects.
> > It's a caller duty to make sure all parts of the complex object are properly
> > recharged and nothing is left behind. There is also the reparenting mechanism
> > which can race against the recharging. So it's not an easy problem.
> > If an object is large, we probably don't want to recharge it at once,
> > otherwise temporarily doubling of the accounted memory (thanks to the
> > precharge-uncharge-commit approach) risks introducing spurious OOMs
> > on memory-limited systems.
> >
> > So yeah, if it doesn't sound too scary for you, I'm happy to help
> > with this. But it's a lot of work to do it properly, that's why I'm thinking
> > that maybe it's better to workaround it in userspace, as Alexei suggested.
>
> And as Alexei mentioned, there are some changes coming around the way
> memory allocations will usually be performed by the bpf code, which might
> make the whole problem and/or your solution obsolete. Please make sure it's
> still relevant before sending the next version.
>

I have taken a look at Alexei's patchset.
For the non-preallocated bpf memory, we still have to use map->memcg
to do the accounting, so this issue will still exist.
It also reminds us that when we reuse a map we must make sure
map->memcg is the expected one.
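
For reference, the remote charging in bpf_map_kmalloc_node() works
roughly like this today (simplified from kernel/bpf/syscall.c in v5.19,
from memory, so the details may differ slightly): the allocation is
charged to map->memcg instead of the current task's memcg, which is why
a stale map->memcg keeps the problem alive.

    void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size,
                               gfp_t flags, int node)
    {
            struct mem_cgroup *old_memcg;
            void *ptr;

            /* charge the map's memcg, not current's memcg */
            old_memcg = set_active_memcg(map->memcg);
            ptr = kmalloc_node(size, flags | __GFP_ACCOUNT, node);
            set_active_memcg(old_memcg);

            return ptr;
    }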

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map
  2022-06-25  3:26   ` Yafang Shao
  2022-06-26  3:28     ` Roman Gushchin
@ 2022-06-27  0:40     ` Alexei Starovoitov
  2022-06-27 15:02       ` Yafang Shao
  1 sibling, 1 reply; 30+ messages in thread
From: Alexei Starovoitov @ 2022-06-27  0:40 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Roman Gushchin, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin Lau, Song Liu, Yonghong Song,
	john fastabend, KP Singh, Quentin Monnet, Johannes Weiner,
	Michal Hocko, Shakeel Butt, Muchun Song, Andrew Morton,
	Christoph Lameter, penberg, David Rientjes, iamjoonsoo.kim,
	Vlastimil Babka, Linux MM, bpf

On Fri, Jun 24, 2022 at 8:26 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> I'm planning to support it for all map types and progs. Regarding the
> progs, it seems that we have to introduce a new UAPI for the user to
> do the recharge, because there's no similar reuse path in libbpf.
>
> Our company is a heavy bpf user.

What company is that?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 07/10] mm: Add helper to recharge percpu address
@ 2022-06-27  3:04       ` Yafang Shao
  0 siblings, 0 replies; 30+ messages in thread
From: Yafang Shao @ 2022-06-27  3:04 UTC (permalink / raw)
  To: kbuild-all

On Thu, Jun 23, 2022 at 1:25 PM Dennis Zhou <dennisszhou@gmail.com> wrote:
>
> Hello,
>
> On Sun, Jun 19, 2022 at 03:50:29PM +0000, Yafang Shao wrote:
> > This patch introduces a helper to recharge the corresponding pages of
> > a given percpu address. It is similar with how to recharge a kmalloc'ed
> > address.
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  include/linux/percpu.h |  1 +
> >  mm/percpu.c            | 98 ++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 99 insertions(+)
> >
> > diff --git a/include/linux/percpu.h b/include/linux/percpu.h
> > index f1ec5ad1351c..e88429410179 100644
> > --- a/include/linux/percpu.h
> > +++ b/include/linux/percpu.h
> > @@ -128,6 +128,7 @@ extern void __init setup_per_cpu_areas(void);
> >  extern void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp) __alloc_size(1);
> >  extern void __percpu *__alloc_percpu(size_t size, size_t align) __alloc_size(1);
> >  extern void free_percpu(void __percpu *__pdata);
> > +bool recharge_percpu(void __percpu *__pdata, int step);
>
> Nit: can you add extern to keep the file consistent.
>

Sure, I will do it.

> >  extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
> >
> >  #define alloc_percpu_gfp(type, gfp)                                  \
> > diff --git a/mm/percpu.c b/mm/percpu.c
> > index 3633eeefaa0d..fd81f4d79f2f 100644
> > --- a/mm/percpu.c
> > +++ b/mm/percpu.c
> > @@ -2310,6 +2310,104 @@ void free_percpu(void __percpu *ptr)
> >  }
> >  EXPORT_SYMBOL_GPL(free_percpu);
> >
> > +#ifdef CONFIG_MEMCG_KMEM
> > +bool recharge_percpu(void __percpu *ptr, int step)
> > +{
> > +     int bit_off, off, bits, size, end;
> > +     struct obj_cgroup *objcg_old;
> > +     struct obj_cgroup *objcg_new;
> > +     struct pcpu_chunk *chunk;
> > +     unsigned long flags;
> > +     void *addr;
> > +
> > +     WARN_ON(!in_task());
> > +
> > +     if (!ptr)
> > +             return true;
> > +
> > +     addr = __pcpu_ptr_to_addr(ptr);
> > +     spin_lock_irqsave(&pcpu_lock, flags);
> > +     chunk = pcpu_chunk_addr_search(addr);
> > +     off = addr - chunk->base_addr;
> > +     objcg_old = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
> > +     if (!objcg_old && step != MEMCG_KMEM_POST_CHARGE) {
> > +             spin_unlock_irqrestore(&pcpu_lock, flags);
> > +             return true;
> > +     }
> > +
> > +     bit_off = off / PCPU_MIN_ALLOC_SIZE;
> > +     /* find end index */
> > +     end = find_next_bit(chunk->bound_map, pcpu_chunk_map_bits(chunk),
> > +                     bit_off + 1);
> > +     bits = end - bit_off;
> > +     size = bits * PCPU_MIN_ALLOC_SIZE;
> > +
> > +     switch (step) {
> > +     case MEMCG_KMEM_PRE_CHARGE:
> > +             objcg_new = get_obj_cgroup_from_current();
> > +             WARN_ON(!objcg_new);
> > +             if (obj_cgroup_charge(objcg_new, GFP_KERNEL,
> > +                                   size * num_possible_cpus())) {
> > +                     obj_cgroup_put(objcg_new);
> > +                     spin_unlock_irqrestore(&pcpu_lock, flags);
> > +                     return false;
> > +             }
> > +             break;
> > +     case MEMCG_KMEM_UNCHARGE:
> > +             obj_cgroup_uncharge(objcg_old, size * num_possible_cpus());
> > +             rcu_read_lock();
> > +             mod_memcg_state(obj_cgroup_memcg(objcg_old), MEMCG_PERCPU_B,
> > +                     -(size * num_possible_cpus()));
> > +             rcu_read_unlock();
> > +             chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL;
> > +             obj_cgroup_put(objcg_old);
> > +             break;
> > +     case MEMCG_KMEM_POST_CHARGE:
> > +             rcu_read_lock();
> > +             chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = obj_cgroup_from_current();
> > +             mod_memcg_state(mem_cgroup_from_task(current), MEMCG_PERCPU_B,
> > +                     (size * num_possible_cpus()));
> > +             rcu_read_unlock();
> > +             break;
> > +     case MEMCG_KMEM_CHARGE_ERR:
> > +             /*
> > +              * In case fail to charge to the new one in the pre charge state,
> > +              * for example, we have pre-charged one memcg successfully but fail
> > +              * to pre-charge the second memcg, then we should uncharge the first
> > +              * memcg.
> > +              */
> > +             objcg_new = obj_cgroup_from_current();
> > +             obj_cgroup_uncharge(objcg_new, size * num_possible_cpus());
> > +             obj_cgroup_put(objcg_new);
> > +             rcu_read_lock();
> > +             mod_memcg_state(obj_cgroup_memcg(objcg_new), MEMCG_PERCPU_B,
> > +                     -(size * num_possible_cpus()));
> > +             rcu_read_unlock();
> > +
> > +             break;
> > +     }
>
> I'm not really the biggest fan of this step stuff. I see why you're
> doing it because you want to do all or nothing recharging the percpu bpf
> maps. Is there a way to have percpu own this logic and attempt to do all
> or nothing instead? I realize bpf is likely the largest percpu user, but
> the recharge_percpu() api seems to be more generic than forcing
> potential other users in the future to open code it.
>

Agreed that the recharge API may be used by other users, so it should
be a more generic helper.
Maybe we can make percpu own this logic by introducing a new value for
the step parameter, for example,
    recharge_percpu(ptr, -1); // -1: the caller doesn't care about the individual steps
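
Internally the '-1' mode could simply drive the existing steps itself,
something like this sketch (reusing the step values from this patch):

    /* Sketch only: all-in-one recharge driven by the percpu code. */
    bool recharge_percpu_all(void __percpu *ptr)
    {
            if (!recharge_percpu(ptr, MEMCG_KMEM_PRE_CHARGE))
                    return false;
            recharge_percpu(ptr, MEMCG_KMEM_UNCHARGE);
            recharge_percpu(ptr, MEMCG_KMEM_POST_CHARGE);
            return true;
    }

Callers that need all-or-nothing semantics across multiple objects
(like the bpf map case) would keep using the per-step interface.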

> > +
> > +     spin_unlock_irqrestore(&pcpu_lock, flags);
> > +
> > +     return true;
> > +}
> > +EXPORT_SYMBOL(recharge_percpu);
> > +
> > +#else /* CONFIG_MEMCG_KMEM */
> > +
> > +bool charge_percpu(void __percpu *ptr, bool charge)
> > +{
> > +     return true;
> > +}
> > +EXPORT_SYMBOL(charge_percpu);
> > +
> > +void uncharge_percpu(void __percpu *ptr)
> > +{
> > +}
>
> I'm guessing this is supposed to be recharge_percpu() not
> (un)charge_percpu().

Thanks for pointing out this bug.  The lkp robot also reported this bug to me.
I will change it.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map
  2022-06-27  0:40     ` Alexei Starovoitov
@ 2022-06-27 15:02       ` Yafang Shao
  0 siblings, 0 replies; 30+ messages in thread
From: Yafang Shao @ 2022-06-27 15:02 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Roman Gushchin, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin Lau, Song Liu, Yonghong Song,
	john fastabend, KP Singh, Quentin Monnet, Johannes Weiner,
	Michal Hocko, Shakeel Butt, Muchun Song, Andrew Morton,
	Christoph Lameter, penberg, David Rientjes, iamjoonsoo.kim,
	Vlastimil Babka, Linux MM, bpf

On Mon, Jun 27, 2022 at 8:40 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Jun 24, 2022 at 8:26 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > I'm planning to support it for all map types and progs. Regarding the
> > progs, it seems that we have to introduce a new UAPI for the user to
> > do the recharge, because there's no similar reuse path in libbpf.
> >
> > Our company is a heavy bpf user.
>
> What company is that?

Pinduoduo, aka PDD, a small company in China.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map
  2022-06-26  6:25       ` Yafang Shao
@ 2022-07-02  4:23         ` Roman Gushchin
  2022-07-02 15:24           ` Yafang Shao
  0 siblings, 1 reply; 30+ messages in thread
From: Roman Gushchin @ 2022-07-02  4:23 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Quentin Monnet, Johannes Weiner, Michal Hocko, Shakeel Butt,
	songmuchun, Andrew Morton, Christoph Lameter, penberg,
	David Rientjes, iamjoonsoo.kim, Vlastimil Babka, Linux MM, bpf

On Sun, Jun 26, 2022 at 02:25:51PM +0800, Yafang Shao wrote:
> > >                                              htab->map.numa_node);
> > > And then we'd better introduce an improvement for memcg,
> > > +      /*
> > > +       *  Should wakeup async memcg reclaim first,
> > > +       *   in case there will be no direct memcg reclaim for a long time.
> > > +       *   We can either introduce async memcg reclaim
> > > +       *   or modify kswapd to reclaim a specific memcg
> > > +       */
> > > +       if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> > > +            wake_up_async_memcg_reclaim();
> > >          if (!gfpflags_allow_blocking(gfp_mask))
> > >                 goto nomem;
> >
> > Hm, I see. It might be an issue if there is no global memory pressure, right?
> > Let me think what I can do here too.
> >
> 
> Right. It is not a good idea to expect a global memory reclaimer to do it.
> Thanks for following up with it again.

After thinking a bit more, I'm not sure it's actually a good idea:
there might not be much memory to reclaim except the memory consumed by the bpf
map itself, so waking kswapd might be useless (and just consume cpu and drain
batteries).

What we need to do instead is to prevent bpf maps from meaningfully exceeding
memory.max, which is btw guaranteed by the cgroup API: memory.max is documented
as a hard limit. Your recent patch is actually doing this for hash maps,
let's fix the rest of the bpf code.

Thanks!

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map
  2022-07-02  4:23         ` Roman Gushchin
@ 2022-07-02 15:24           ` Yafang Shao
  2022-07-02 15:33             ` Roman Gushchin
  0 siblings, 1 reply; 30+ messages in thread
From: Yafang Shao @ 2022-07-02 15:24 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Quentin Monnet, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, penberg,
	David Rientjes, iamjoonsoo.kim, Vlastimil Babka, Linux MM, bpf

On Sat, Jul 2, 2022 at 12:23 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Sun, Jun 26, 2022 at 02:25:51PM +0800, Yafang Shao wrote:
> > > >                                              htab->map.numa_node);
> > > > And then we'd better introduce an improvement for memcg,
> > > > +      /*
> > > > +       *  Should wakeup async memcg reclaim first,
> > > > +       *   in case there will be no direct memcg reclaim for a long time.
> > > > +       *   We can either introduce async memcg reclaim
> > > > +       *   or modify kswapd to reclaim a specific memcg
> > > > +       */
> > > > +       if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> > > > +            wake_up_async_memcg_reclaim();
> > > >          if (!gfpflags_allow_blocking(gfp_mask))
> > > >                 goto nomem;
> > >
> > > Hm, I see. It might be an issue if there is no global memory pressure, right?
> > > Let me think what I can do here too.
> > >
> >
> > Right. It is not a good idea to expect a global memory reclaimer to do it.
> > Thanks for following up with it again.
>
> After thinking a bit more, I'm not sure if it's actually a good idea:
> there might be not much memory to reclaim except the memory consumed by the bpf
> map itself, so waking kswapd might be useless (and just consume cpu and drain
> batteries).
>

I'm not sure whether that is a generic problem.
For example, a latency-sensitive process running in a container may
not set __GFP_DIRECT_RECLAIM, but there are page cache pages in
this container. If there's no global memory pressure and no other kinds
of memory allocation in this container, these page cache pages will
not be reclaimed for a long time.
Maybe we should also check the number of page cache pages in the
container before waking up kswapd, but I'm not quite sure about that.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map
  2022-07-02 15:24           ` Yafang Shao
@ 2022-07-02 15:33             ` Roman Gushchin
  0 siblings, 0 replies; 30+ messages in thread
From: Roman Gushchin @ 2022-07-02 15:33 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Quentin Monnet, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, penberg,
	David Rientjes, iamjoonsoo.kim, Vlastimil Babka, Linux MM, bpf


On Sat, Jul 02, 2022 at 11:24:10PM +0800, Yafang Shao wrote:
> On Sat, Jul 2, 2022 at 12:23 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > On Sun, Jun 26, 2022 at 02:25:51PM +0800, Yafang Shao wrote:
> > > > >                                              htab->map.numa_node);
> > > > > And then we'd better introduce an improvement for memcg,
> > > > > +      /*
> > > > > +       *  Should wakeup async memcg reclaim first,
> > > > > +       *   in case there will be no direct memcg reclaim for a long time.
> > > > > +       *   We can either introduce async memcg reclaim
> > > > > +       *   or modify kswapd to reclaim a specific memcg
> > > > > +       */
> > > > > +       if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> > > > > +            wake_up_async_memcg_reclaim();
> > > > >          if (!gfpflags_allow_blocking(gfp_mask))
> > > > >                 goto nomem;
> > > >
> > > > Hm, I see. It might be an issue if there is no global memory pressure, right?
> > > > Let me think what I can do here too.
> > > >
> > >
> > > Right. It is not a good idea to expect a global memory reclaimer to do it.
> > > Thanks for following up with it again.
> >
> > After thinking a bit more, I'm not sure if it's actually a good idea:
> > there might be not much memory to reclaim except the memory consumed by the bpf
> > map itself, so waking kswapd might be useless (and just consume cpu and drain
> > batteries).
> >
> 
> I'm not sure if it is a generic problem.
> For example, a latency-sensitive process running in a container
> doesn't set __GFP_DIRECT_RECLAIM, but there're page cache pages in
> this container. If there's no global memory pressure or no other kinds
> of memory allocation in this container, these page cache pages will
> not be reclaimed for a long time.
> Maybe we should also check the number of page cache pages in this
> container before waking up kswapd, but I'm not quite sure of it.

It's not a generic problem but it might be very specific.
Anyway, it doesn't really matter, we shouldn't exceed memory.max.

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2022-07-02 15:34 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-19 15:50 [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Yafang Shao
2022-06-19 15:50 ` [RFC PATCH bpf-next 01/10] mm, memcg: Add a new helper memcg_should_recharge() Yafang Shao
2022-06-19 15:50 ` [RFC PATCH bpf-next 02/10] bpftool: Show memcg info of bpf map Yafang Shao
2022-06-19 15:50 ` [RFC PATCH bpf-next 03/10] mm, memcg: Add new helper obj_cgroup_from_current() Yafang Shao
2022-06-23  3:01   ` Roman Gushchin
2022-06-25 13:54     ` Yafang Shao
2022-06-26  1:52       ` Roman Gushchin
2022-06-19 15:50 ` [RFC PATCH bpf-next 04/10] mm, memcg: Make obj_cgroup_{charge, uncharge}_pages public Yafang Shao
2022-06-19 15:50 ` [RFC PATCH bpf-next 05/10] mm: Add helper to recharge kmalloc'ed address Yafang Shao
2022-06-19 15:50 ` [RFC PATCH bpf-next 06/10] mm: Add helper to recharge vmalloc'ed address Yafang Shao
2022-06-19 15:50 ` [RFC PATCH bpf-next 07/10] mm: Add helper to recharge percpu address Yafang Shao
2022-06-23  5:25   ` Dennis Zhou
2022-06-25 14:18     ` Yafang Shao
2022-06-27  3:04       ` Yafang Shao
2022-06-19 15:50 ` [RFC PATCH bpf-next 08/10] bpf: Recharge memory when reuse bpf map Yafang Shao
2022-06-19 15:50 ` [RFC PATCH bpf-next 09/10] bpf: Make bpf_map_{save, release}_memcg public Yafang Shao
2022-06-19 15:50 ` [RFC PATCH bpf-next 10/10] bpf: Support recharge for hash map Yafang Shao
2022-06-21 23:28 ` [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Alexei Starovoitov
2022-06-22 14:03   ` Yafang Shao
2022-06-23  3:29 ` Roman Gushchin
2022-06-25  3:26   ` Yafang Shao
2022-06-26  3:28     ` Roman Gushchin
2022-06-26  3:32       ` Roman Gushchin
2022-06-26  6:38         ` Yafang Shao
2022-06-26  6:25       ` Yafang Shao
2022-07-02  4:23         ` Roman Gushchin
2022-07-02 15:24           ` Yafang Shao
2022-07-02 15:33             ` Roman Gushchin
2022-06-27  0:40     ` Alexei Starovoitov
2022-06-27 15:02       ` Yafang Shao
