* [PATCH RFC 0/9] bpf, mm: recharge bpf memory from offline memcg
From: Yafang Shao @ 2022-03-08 13:10 UTC
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka,
	hannes, mhocko, vdavydov.dev, guro
  Cc: linux-mm, netdev, bpf, Yafang Shao

When we use memcg to limit the containers which load bpf progs and maps,
we find there is an issue that the lifecycles of the container and the
bpf objects are not always the same, because we may pin the maps and
progs while updating only the container. So once a container which has
already pinned progs and maps is restarted, the pinned progs and maps
are no longer charged to it. In other words, this kind of container can
steal memory from the host, which is not what we expect. This patchset
aims to resolve this issue.

After the container is restarted, the old memcg charged by the pinned
progs and maps will be offline, but it won't be freed until all of the
related maps and progs are freed. If we want to charge this bpf memory
to the newly started memcg, we should uncharge it from the offline memcg
first and then charge it to the new one. As we already know how the bpf
memory is allocated and freed, we also know how to charge and uncharge
it. This patchset implements various charge and uncharge methods for
this memory.

Regarding how to do the recharge, we decided to implement a new bpf
syscall command to do it. With the newly implemented command, the agent
running in the container can do the recharge. As of now we only
implement it for the bpf hash maps. Below is a simple example of how to
do the recharge:

====
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

int main(int argc, char *argv[])
{
	union bpf_attr attr = {};
	int map_id;
	int ret;

	if (argc < 2) {
		printf("Please give a map id\n");
		exit(-1);
	}

	map_id = atoi(argv[1]);
	attr.map_id = map_id;
	/* Recharge the map's memory to the current memcg. */
	ret = syscall(SYS_bpf, BPF_MAP_RECHARGE, &attr, sizeof(attr));
	if (ret < 0)
		perror("BPF_MAP_RECHARGE");

	return 0;
}

====
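
The id of a pinned map can be obtained with "bpftool map show".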

Patches #1 and #2 are for observability, with which we can easily check
whether a bpf map is charged to a memcg and whether that memcg is offline.
Patches #3, #4 and #5 add the charge and uncharge methods for vmalloc-ed,
kmalloc-ed and percpu memory.
Patches #6~#9 implement the recharge of the bpf hash map, which is the
map type mostly used by our bpf services. The other map types haven't
been implemented yet, and neither have the bpf progs.

This patchset is still a POC, with limited testing. Any feedback is
welcome.

Yafang Shao (9):
  bpftool: fix print error when showing bpf map
  bpftool: show memcg info of bpf map
  mm: add method to charge kmalloc-ed address
  mm: add method to charge vmalloc-ed address
  mm: add method to charge percpu address
  bpf: add a helper to find map by id
  bpf: add BPF_MAP_RECHARGE syscall
  bpf: make bpf_map_{save, release}_memcg public
  bpf: support recharge for hash map

 include/linux/bpf.h            | 23 +++++++++++++
 include/linux/percpu.h         |  1 +
 include/linux/slab.h           |  2 ++
 include/linux/vmalloc.h        |  1 +
 include/uapi/linux/bpf.h       | 10 ++++++
 kernel/bpf/hashtab.c           | 35 ++++++++++++++++++++
 kernel/bpf/syscall.c           | 73 ++++++++++++++++++++++++++----------------
 mm/percpu.c                    | 50 +++++++++++++++++++++++++++++
 mm/slab.c                      |  6 ++++
 mm/slob.c                      |  6 ++++
 mm/slub.c                      | 32 ++++++++++++++++++
 mm/util.c                      |  9 ++++++
 mm/vmalloc.c                   | 29 +++++++++++++++++
 tools/bpf/bpftool/map.c        |  9 +++---
 tools/include/uapi/linux/bpf.h |  1 +
 15 files changed, 254 insertions(+), 33 deletions(-)

-- 
1.8.3.1



* [PATCH RFC 1/9] bpftool: fix print error when showing bpf map
From: Yafang Shao @ 2022-03-08 13:10 UTC
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka,
	hannes, mhocko, vdavydov.dev, guro
  Cc: linux-mm, netdev, bpf, Yafang Shao, Joanne Koong

If there is no btf_id and the map is not frozen, the pids will not be
shown, but showing the pids doesn't depend on either of them.

Fixes: 9330986c0300 ("bpf: Add bloom filter map implementation")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Joanne Koong <joannekoong@fb.com>
---
 tools/bpf/bpftool/map.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index e746642..0bba337 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -620,17 +620,14 @@ static int show_map_close_plain(int fd, struct bpf_map_info *info)
 					    u32_as_hash_field(info->id))
 			printf("\n\tpinned %s", (char *)entry->value);
 	}
-	printf("\n");
 
 	if (frozen_str) {
 		frozen = atoi(frozen_str);
 		free(frozen_str);
 	}
 
-	if (!info->btf_id && !frozen)
-		return 0;
-
-	printf("\t");
+	if (info->btf_id || frozen)
+		printf("\n\t");
 
 	if (info->btf_id)
 		printf("btf_id %d", info->btf_id);
-- 
1.8.3.1



* [PATCH RFC 2/9] bpftool: show memcg info of bpf map
From: Yafang Shao @ 2022-03-08 13:10 UTC
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka,
	hannes, mhocko, vdavydov.dev, guro
  Cc: linux-mm, netdev, bpf, Yafang Shao

A bpf map can be charged to a memcg, so we'd better show the memcg info
to allow better bpf memory management. This patch adds a new field
"memcg_state" to show whether a bpf map is charged to a memcg and
whether that memcg is offline. Currently it has three values:
   0 : not charged, or charged to the root memcg
  -1 : the charged memcg is offline
   1 : the charged memcg is online

For instance,

$ bpftool map show
2: array  name iterator.rodata  flags 0x480
        key 4B  value 98B  max_entries 1  memlock 4096B
        btf_id 240  frozen
        memcg_state 0
3: hash  name calico_failsafe  flags 0x1
        key 4B  value 1B  max_entries 65535  memlock 524288B
        memcg_state 1
6: lru_hash  name access_record  flags 0x0
        key 8B  value 24B  max_entries 102400  memlock 3276800B
        btf_id 256
        memcg_state -1
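
The new field can also be read directly by a userspace tool. A minimal
sketch, assuming libbpf and a uapi header that carries the new
memcg_state member:

	#include <stdio.h>
	#include <string.h>
	#include <bpf/bpf.h>

	/* Print the memcg_state of the map behind map_fd. */
	int print_memcg_state(int map_fd)
	{
		struct bpf_map_info info;
		__u32 len = sizeof(info);

		memset(&info, 0, sizeof(info));
		if (bpf_obj_get_info_by_fd(map_fd, &info, &len))
			return -1;

		printf("memcg_state %d\n", info.memcg_state);
		return 0;
	}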

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/uapi/linux/bpf.h       |  1 +
 kernel/bpf/syscall.c           | 11 +++++++++++
 tools/bpf/bpftool/map.c        |  2 ++
 tools/include/uapi/linux/bpf.h |  1 +
 4 files changed, 15 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 4eebea8..a448b06 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5864,6 +5864,7 @@ struct bpf_map_info {
 	__u32 btf_value_type_id;
 	__u32 :32;	/* alignment pad */
 	__u64 map_extra;
+	__s8  memcg_state;
 } __attribute__((aligned(8)));
 
 struct bpf_btf_info {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index db402eb..3b50fcb 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -3939,6 +3939,17 @@ static int bpf_map_get_info_by_fd(struct file *file,
 	}
 	info.btf_vmlinux_value_type_id = map->btf_vmlinux_value_type_id;
 
+#ifdef CONFIG_MEMCG_KMEM
+	if (map->memcg) {
+		struct mem_cgroup *memcg = map->memcg;
+
+		if (memcg == root_mem_cgroup)
+			info.memcg_state = 0;
+		else
+			info.memcg_state = memcg->kmemcg_id < 0 ? -1 : 1;
+	}
+#endif
+
 	if (bpf_map_is_dev_bound(map)) {
 		err = bpf_map_offload_info_fill(&info, map);
 		if (err)
diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index 0bba337..fe8322f 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -550,6 +550,7 @@ static int show_map_close_json(int fd, struct bpf_map_info *info)
 		jsonw_end_array(json_wtr);
 	}
 
+	jsonw_int_field(json_wtr, "memcg_state", info->memcg_state);
 	emit_obj_refs_json(refs_table, info->id, json_wtr);
 
 	jsonw_end_object(json_wtr);
@@ -635,6 +636,7 @@ static int show_map_close_plain(int fd, struct bpf_map_info *info)
 	if (frozen)
 		printf("%sfrozen", info->btf_id ? "  " : "");
 
+	printf("\n\tmemcg_state %d", info->memcg_state);
 	emit_obj_refs_plain(refs_table, info->id, "\n\tpids ");
 
 	printf("\n");
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4eebea8..41e65b3 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5864,6 +5864,7 @@ struct bpf_map_info {
 	__u32 btf_value_type_id;
 	__u32 :32;	/* alignment pad */
 	__u64 map_extra;
+	__s8 memcg_state;
 } __attribute__((aligned(8)));
 
 struct bpf_btf_info {
-- 
1.8.3.1



* [PATCH RFC 3/9] mm: add method to charge kmalloc-ed address
From: Yafang Shao @ 2022-03-08 13:10 UTC
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka,
	hannes, mhocko, vdavydov.dev, guro
  Cc: linux-mm, netdev, bpf, Yafang Shao

This patch implements a method to charge or uncharge the pages or
objects behind a given kmalloc-ed address. It is similar to kfree,
except that it doesn't touch the pages or objects themselves and only
does the accounting.
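
For instance, patch #9 of this series moves the accounting of a
kmalloc-ed object to the current memcg with a pair of calls like:

	kcharge(objp, false);	/* uncharge from the owning memcg */
	kcharge(objp, true);	/* charge to the current memcg */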

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/slab.h |  1 +
 mm/slab.c            |  6 ++++++
 mm/slob.c            |  6 ++++++
 mm/slub.c            | 32 ++++++++++++++++++++++++++++++++
 4 files changed, 45 insertions(+)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 5b6193f..ae82e23 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -182,6 +182,7 @@ struct kmem_cache *kmem_cache_create_usercopy(const char *name,
 void * __must_check krealloc(const void *objp, size_t new_size, gfp_t flags) __alloc_size(2);
 void kfree(const void *objp);
 void kfree_sensitive(const void *objp);
+void kcharge(const void *objp, bool charge);
 size_t __ksize(const void *objp);
 size_t ksize(const void *objp);
 #ifdef CONFIG_PRINTK
diff --git a/mm/slab.c b/mm/slab.c
index ddf5737..fbff613 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3796,6 +3796,12 @@ void kfree(const void *objp)
 }
 EXPORT_SYMBOL(kfree);
 
+void kcharge(const void *objp, bool charge)
+{
+	/* Not implemented yet */
+}
+EXPORT_SYMBOL(kcharge);
+
 /*
  * This initializes kmem_cache_node or resizes various caches for all nodes.
  */
diff --git a/mm/slob.c b/mm/slob.c
index 60c5842..d3a789f 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -569,6 +569,12 @@ void kfree(const void *block)
 }
 EXPORT_SYMBOL(kfree);
 
+void kcharge(const void *block, bool charge)
+{
+	/* not implemented yet. */
+}
+EXPORT_SYMBOL(kcharge);
+
 /* can't use ksize for kmem_cache_alloc memory, only kmalloc */
 size_t __ksize(const void *block)
 {
diff --git a/mm/slub.c b/mm/slub.c
index 2614740..e933d45 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4563,6 +4563,38 @@ void kfree(const void *x)
 }
 EXPORT_SYMBOL(kfree);
 
+void kcharge(const void *x, bool charge)
+{
+	void *object = (void *)x;
+	struct folio *folio;
+	struct slab *slab;
+
+	WARN_ON(!in_task());
+
+	if (unlikely(ZERO_OR_NULL_PTR(x)))
+		return;
+
+	folio = virt_to_folio(x);
+	if (unlikely(!folio_test_slab(folio))) {
+		unsigned int order = folio_order(folio);
+		int sign = charge ? 1 : -1;
+
+		mod_lruvec_page_state(folio_page(folio, 0), NR_SLAB_UNRECLAIMABLE_B,
+			sign * (PAGE_SIZE << order));
+
+		return;
+	}
+
+	slab = folio_slab(folio);
+	if (charge)
+		memcg_slab_post_alloc_hook(slab->slab_cache,
+			get_obj_cgroup_from_current(), GFP_KERNEL, 1, &object);
+	else
+		memcg_slab_free_hook(slab->slab_cache, &object, 1);
+
+}
+EXPORT_SYMBOL(kcharge);
+
 #define SHRINK_PROMOTE_MAX 32
 
 /*
-- 
1.8.3.1



* [PATCH RFC 4/9] mm: add method to charge vmalloc-ed address
From: Yafang Shao @ 2022-03-08 13:10 UTC
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka,
	hannes, mhocko, vdavydov.dev, guro
  Cc: linux-mm, netdev, bpf, Yafang Shao

This patch adds a method to charge or uncharge a given vmalloc-ed
address. It is similar to vfree, except that it doesn't touch the
related pages and only does the accounting.
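
kvcharge() mirrors kvfree(): it checks is_vmalloc_addr() and dispatches
to either vcharge() or kcharge() accordingly.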

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/slab.h    |  1 +
 include/linux/vmalloc.h |  1 +
 mm/util.c               |  9 +++++++++
 mm/vmalloc.c            | 29 +++++++++++++++++++++++++++++
 4 files changed, 40 insertions(+)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index ae82e23..7173354 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -759,6 +759,7 @@ extern void *kvrealloc(const void *p, size_t oldsize, size_t newsize, gfp_t flag
 		      __alloc_size(3);
 extern void kvfree(const void *addr);
 extern void kvfree_sensitive(const void *addr, size_t len);
+void kvcharge(const void *addr, bool charge);
 
 unsigned int kmem_cache_size(struct kmem_cache *s);
 void __init kmem_cache_init_late(void);
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 880227b..b48d941 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -161,6 +161,7 @@ void *__vmalloc_node(unsigned long size, unsigned long align, gfp_t gfp_mask,
 
 extern void vfree(const void *addr);
 extern void vfree_atomic(const void *addr);
+void vcharge(const void *addr, bool charge);
 
 extern void *vmap(struct page **pages, unsigned int count,
 			unsigned long flags, pgprot_t prot);
diff --git a/mm/util.c b/mm/util.c
index 7e433690..f5f5e05 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -614,6 +614,15 @@ void kvfree(const void *addr)
 }
 EXPORT_SYMBOL(kvfree);
 
+void kvcharge(const void *addr, bool charge)
+{
+	if (is_vmalloc_addr(addr))
+		vcharge(addr, charge);
+	else
+		kcharge(addr, charge);
+}
+EXPORT_SYMBOL(kvcharge);
+
 /**
  * kvfree_sensitive - Free a data object containing sensitive information.
  * @addr: address of the data object to be freed.
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 4165304..6fc2295 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2715,6 +2715,35 @@ void vfree(const void *addr)
 }
 EXPORT_SYMBOL(vfree);
 
+void vcharge(const void *addr, bool charge)
+{
+	unsigned int page_order;
+	struct vm_struct *area;
+	int i;
+
+	WARN_ON(!in_task());
+
+	if (!addr)
+		return;
+
+	area = find_vm_area(addr);
+	if (unlikely(!area))
+		return;
+
+	page_order = vm_area_page_order(area);
+	for (i = 0; i < area->nr_pages; i += 1U << page_order) {
+		struct page *page = area->pages[i];
+
+		WARN_ON(!page);
+		if (charge)
+			memcg_kmem_charge_page(page, GFP_KERNEL, page_order);
+		else
+			memcg_kmem_uncharge_page(page, page_order);
+		cond_resched();
+	}
+}
+EXPORT_SYMBOL(vcharge);
+
 /**
  * vunmap - release virtual mapping obtained by vmap()
  * @addr:   memory base address
-- 
1.8.3.1



* [PATCH RFC 5/9] mm: add method to charge percpu address
From: Yafang Shao @ 2022-03-08 13:10 UTC
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka,
	hannes, mhocko, vdavydov.dev, guro
  Cc: linux-mm, netdev, bpf, Yafang Shao

This patch adds a method to charge or uncharge a percpu address.
It is similar to free_percpu, except that it doesn't touch the related
pages and only does the accounting.
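
Note that, as in the charge and uncharge paths below, the accounted size
is size * num_possible_cpus(); e.g. a 64-byte percpu object on a machine
with 8 possible CPUs accounts 64 * 8 = 512 bytes to the memcg.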

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/percpu.h |  1 +
 mm/percpu.c            | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+)

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index f1ec5ad..1a65221 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -128,6 +128,7 @@ extern int __init pcpu_page_first_chunk(size_t reserved_size,
 extern void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp) __alloc_size(1);
 extern void __percpu *__alloc_percpu(size_t size, size_t align) __alloc_size(1);
 extern void free_percpu(void __percpu *__pdata);
+void charge_percpu(void __percpu *__pdata, bool charge);
 extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
 
 #define alloc_percpu_gfp(type, gfp)					\
diff --git a/mm/percpu.c b/mm/percpu.c
index ea28db2..22fc0ff 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -2309,6 +2309,56 @@ void free_percpu(void __percpu *ptr)
 }
 EXPORT_SYMBOL_GPL(free_percpu);
 
+void charge_percpu(void __percpu *ptr, bool charge)
+{
+	int bit_off, off, bits, size, end;
+	struct obj_cgroup *objcg;
+	struct pcpu_chunk *chunk;
+	unsigned long flags;
+	void *addr;
+
+	WARN_ON(!in_task());
+
+	if (!ptr)
+		return;
+
+	addr = __pcpu_ptr_to_addr(ptr);
+	spin_lock_irqsave(&pcpu_lock, flags);
+	chunk = pcpu_chunk_addr_search(addr);
+	off = addr - chunk->base_addr;
+	objcg = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
+	if (!objcg) {
+		spin_unlock_irqrestore(&pcpu_lock, flags);
+		return;
+	}
+
+	bit_off = off / PCPU_MIN_ALLOC_SIZE;
+	/* find end index */
+	end = find_next_bit(chunk->bound_map, pcpu_chunk_map_bits(chunk),
+			bit_off + 1);
+	bits = end - bit_off;
+	size = bits * PCPU_MIN_ALLOC_SIZE;
+
+	if (charge) {
+		obj_cgroup_get(objcg);
+		obj_cgroup_charge(objcg, GFP_KERNEL, size * num_possible_cpus());
+		rcu_read_lock();
+		mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
+			(size * num_possible_cpus()));
+		rcu_read_unlock();
+	} else {
+		obj_cgroup_uncharge(objcg, size * num_possible_cpus());
+		rcu_read_lock();
+		mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
+			-(size * num_possible_cpus()));
+		rcu_read_unlock();
+		obj_cgroup_put(objcg);
+	}
+
+	spin_unlock_irqrestore(&pcpu_lock, flags);
+}
+EXPORT_SYMBOL(charge_percpu);
+
 bool __is_kernel_percpu_address(unsigned long addr, unsigned long *can_addr)
 {
 #ifdef CONFIG_SMP
-- 
1.8.3.1



* [PATCH RFC 6/9] bpf: add a helper to find map by id
From: Yafang Shao @ 2022-03-08 13:10 UTC
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka,
	hannes, mhocko, vdavydov.dev, guro
  Cc: linux-mm, netdev, bpf, Yafang Shao

A new helper bpf_map_idr_find() is introduced for later use.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/syscall.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 3b50fcb..68fea3b 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -3474,6 +3474,21 @@ static int bpf_prog_get_fd_by_id(const union bpf_attr *attr)
 	return fd;
 }
 
+static struct bpf_map *bpf_map_idr_find(unsigned long id)
+{
+	void *map;
+
+	spin_lock_bh(&map_idr_lock);
+	map = idr_find(&map_idr, id);
+	if (map)
+		map = __bpf_map_inc_not_zero(map, true);
+	else
+		map = ERR_PTR(-ENOENT);
+	spin_unlock_bh(&map_idr_lock);
+
+	return map;
+}
+
 #define BPF_MAP_GET_FD_BY_ID_LAST_FIELD open_flags
 
 static int bpf_map_get_fd_by_id(const union bpf_attr *attr)
@@ -3494,14 +3509,7 @@ static int bpf_map_get_fd_by_id(const union bpf_attr *attr)
 	if (f_flags < 0)
 		return f_flags;
 
-	spin_lock_bh(&map_idr_lock);
-	map = idr_find(&map_idr, id);
-	if (map)
-		map = __bpf_map_inc_not_zero(map, true);
-	else
-		map = ERR_PTR(-ENOENT);
-	spin_unlock_bh(&map_idr_lock);
-
+	map = bpf_map_idr_find(id);
 	if (IS_ERR(map))
 		return PTR_ERR(map);
 
-- 
1.8.3.1



* [PATCH RFC 7/9] bpf: add BPF_MAP_RECHARGE syscall
From: Yafang Shao @ 2022-03-08 13:10 UTC
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka,
	hannes, mhocko, vdavydov.dev, guro
  Cc: linux-mm, netdev, bpf, Yafang Shao

This patch adds a new bpf syscall command, BPF_MAP_RECHARGE, which
recharges the allocated memory of a bpf map from an offline memcg to
the current memcg.

The recharge method for each map type will be implemented in follow-up
patches.
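
A minimal userspace example that invokes BPF_MAP_RECHARGE on a given
map id can be found in the cover letter.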

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/bpf.h      |  2 ++
 include/uapi/linux/bpf.h |  9 +++++++++
 kernel/bpf/syscall.c     | 19 ++++++++++++++++++-
 3 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 88449fb..fca274e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -147,6 +147,8 @@ struct bpf_map_ops {
 				     bpf_callback_t callback_fn,
 				     void *callback_ctx, u64 flags);
 
+	bool (*map_recharge_memcg)(struct bpf_map *map);
+
 	/* BTF name and id of struct allocated by map_alloc */
 	const char * const map_btf_name;
 	int *map_btf_id;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index a448b06..290ea67 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -821,6 +821,14 @@ struct bpf_cgroup_storage_key {
  *		Returns zero on success. On error, -1 is returned and *errno*
  *		is set appropriately.
  *
+ * BPF_MAP_RECHARGE
+ *  Description
+ *		Recharge bpf memory from an offline memcg
+ *
+ *	Return
+ *		Returns zero on success. On error, -1 is returned and *errno*
+ *		is set appropriately.
+ *
  * NOTES
  *	eBPF objects (maps and programs) can be shared between processes.
  *
@@ -875,6 +883,7 @@ enum bpf_cmd {
 	BPF_ITER_CREATE,
 	BPF_LINK_DETACH,
 	BPF_PROG_BIND_MAP,
+	BPF_MAP_RECHARGE,
 };
 
 enum bpf_map_type {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 68fea3b..85456f1 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1128,7 +1128,6 @@ static int map_lookup_elem(union bpf_attr *attr)
 	return err;
 }
 
-
 #define BPF_MAP_UPDATE_ELEM_LAST_FIELD flags
 
 static int map_update_elem(union bpf_attr *attr, bpfptr_t uattr)
@@ -4621,6 +4620,21 @@ static int bpf_prog_bind_map(union bpf_attr *attr)
 	return ret;
 }
 
+static int map_recharge_elem(union bpf_attr *attr)
+{
+	int id = attr->map_id;
+	struct bpf_map *map;
+
+	map = bpf_map_idr_find(id);
+	if (IS_ERR(map))
+		return PTR_ERR(map);
+
+	if (map->ops->map_recharge_memcg)
+		map->ops->map_recharge_memcg(map);
+
+	return 0;
+}
+
 static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size)
 {
 	union bpf_attr attr;
@@ -4757,6 +4771,9 @@ static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size)
 	case BPF_PROG_BIND_MAP:
 		err = bpf_prog_bind_map(&attr);
 		break;
+	case BPF_MAP_RECHARGE:
+		err = map_recharge_elem(&attr);
+		break;
 	default:
 		err = -EINVAL;
 		break;
-- 
1.8.3.1



* [PATCH RFC 8/9] bpf: make bpf_map_{save, release}_memcg public
From: Yafang Shao @ 2022-03-08 13:10 UTC
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka,
	hannes, mhocko, vdavydov.dev, guro
  Cc: linux-mm, netdev, bpf, Yafang Shao

These two helpers will be used in map-specific files later.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/bpf.h  | 21 +++++++++++++++++++++
 kernel/bpf/syscall.c | 19 -------------------
 2 files changed, 21 insertions(+), 19 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index fca274e..2f3f092 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -23,6 +23,7 @@
 #include <linux/slab.h>
 #include <linux/percpu-refcount.h>
 #include <linux/bpfptr.h>
+#include <linux/memcontrol.h>
 
 struct bpf_verifier_env;
 struct bpf_verifier_log;
@@ -209,6 +210,26 @@ struct bpf_map {
 	} owner;
 };
 
+#ifdef CONFIG_MEMCG_KMEM
+static inline void bpf_map_save_memcg(struct bpf_map *map)
+{
+	map->memcg = get_mem_cgroup_from_mm(current->mm);
+}
+
+static inline void bpf_map_release_memcg(struct bpf_map *map)
+{
+	mem_cgroup_put(map->memcg);
+}
+#else
+static inline void bpf_map_save_memcg(struct bpf_map *map)
+{
+}
+
+static inline void bpf_map_release_memcg(struct bpf_map *map)
+{
+}
+#endif
+
 static inline bool map_value_has_spin_lock(const struct bpf_map *map)
 {
 	return map->spin_lock_off >= 0;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 85456f1..7b4cbe7 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -414,16 +414,6 @@ void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock)
 }
 
 #ifdef CONFIG_MEMCG_KMEM
-static void bpf_map_save_memcg(struct bpf_map *map)
-{
-	map->memcg = get_mem_cgroup_from_mm(current->mm);
-}
-
-static void bpf_map_release_memcg(struct bpf_map *map)
-{
-	mem_cgroup_put(map->memcg);
-}
-
 void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
 			   int node)
 {
@@ -461,15 +451,6 @@ void __percpu *bpf_map_alloc_percpu(const struct bpf_map *map, size_t size,
 
 	return ptr;
 }
-
-#else
-static void bpf_map_save_memcg(struct bpf_map *map)
-{
-}
-
-static void bpf_map_release_memcg(struct bpf_map *map)
-{
-}
 #endif
 
 /* called from workqueue */
-- 
1.8.3.1



* [PATCH RFC 9/9] bpf: support recharge for hash map
From: Yafang Shao @ 2022-03-08 13:10 UTC
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka,
	hannes, mhocko, vdavydov.dev, guro
  Cc: linux-mm, netdev, bpf, Yafang Shao

This patch supports recharge for the hash map. As we already know how
the hash map is allocated and freed, we also know how to charge and
uncharge it. First we uncharge it from the old memcg, then we charge it
to the current memcg. The old memcg must be an offline memcg.
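
Concretely, the pieces moved are the htab struct itself (via kcharge()),
the buckets (via kvcharge()), and the extra_elems and map_locked percpu
arrays (via charge_percpu()), matching how they are allocated and freed.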

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/hashtab.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 6587796..4d103f1 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -10,6 +10,7 @@
 #include <linux/random.h>
 #include <uapi/linux/btf.h>
 #include <linux/rcupdate_trace.h>
+#include <linux/memcontrol.h>
 #include "percpu_freelist.h"
 #include "bpf_lru_list.h"
 #include "map_in_map.h"
@@ -1466,6 +1467,36 @@ static void htab_map_free(struct bpf_map *map)
 	kfree(htab);
 }
 
+static bool htab_map_recharge_memcg(struct bpf_map *map)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	struct mem_cgroup *old = map->memcg;
+	int i;
+
+	if (!old)
+		return false;
+
+	/* Only process offline memcg */
+	if (old == root_mem_cgroup || old->kmemcg_id >= 0)
+		return false;
+
+	bpf_map_release_memcg(map);
+	kcharge(htab, false);
+	kvcharge(htab->buckets, false);
+	charge_percpu(htab->extra_elems, false);
+	for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
+		charge_percpu(htab->map_locked[i], false);
+
+	kcharge(htab, true);
+	kvcharge(htab->buckets, true);
+	charge_percpu(htab->extra_elems, true);
+	for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
+		charge_percpu(htab->map_locked[i], true);
+	bpf_map_save_memcg(map);
+
+	return true;
+}
+
 static void htab_map_seq_show_elem(struct bpf_map *map, void *key,
 				   struct seq_file *m)
 {
@@ -2111,6 +2142,7 @@ static int bpf_for_each_hash_elem(struct bpf_map *map, bpf_callback_t callback_f
 	.map_alloc_check = htab_map_alloc_check,
 	.map_alloc = htab_map_alloc,
 	.map_free = htab_map_free,
+	.map_recharge_memcg = htab_map_recharge_memcg,
 	.map_get_next_key = htab_map_get_next_key,
 	.map_release_uref = htab_map_free_timers,
 	.map_lookup_elem = htab_map_lookup_elem,
@@ -2133,6 +2165,7 @@ static int bpf_for_each_hash_elem(struct bpf_map *map, bpf_callback_t callback_f
 	.map_alloc_check = htab_map_alloc_check,
 	.map_alloc = htab_map_alloc,
 	.map_free = htab_map_free,
+	.map_recharge_memcg = htab_map_recharge_memcg,
 	.map_get_next_key = htab_map_get_next_key,
 	.map_release_uref = htab_map_free_timers,
 	.map_lookup_elem = htab_lru_map_lookup_elem,
@@ -2258,6 +2291,7 @@ static void htab_percpu_map_seq_show_elem(struct bpf_map *map, void *key,
 	.map_alloc_check = htab_map_alloc_check,
 	.map_alloc = htab_map_alloc,
 	.map_free = htab_map_free,
+	.map_recharge_memcg = htab_map_recharge_memcg,
 	.map_get_next_key = htab_map_get_next_key,
 	.map_lookup_elem = htab_percpu_map_lookup_elem,
 	.map_lookup_and_delete_elem = htab_percpu_map_lookup_and_delete_elem,
@@ -2278,6 +2312,7 @@ static void htab_percpu_map_seq_show_elem(struct bpf_map *map, void *key,
 	.map_alloc_check = htab_map_alloc_check,
 	.map_alloc = htab_map_alloc,
 	.map_free = htab_map_free,
+	.map_recharge_memcg = htab_map_recharge_memcg,
 	.map_get_next_key = htab_map_get_next_key,
 	.map_lookup_elem = htab_lru_percpu_map_lookup_elem,
 	.map_lookup_and_delete_elem = htab_lru_percpu_map_lookup_and_delete_elem,
-- 
1.8.3.1



* Re: [PATCH RFC 0/9] bpf, mm: recharge bpf memory from offline memcg
From: Roman Gushchin @ 2022-03-09  1:09 UTC
  To: Yafang Shao
  Cc: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka,
	hannes, mhocko, vdavydov.dev, guro, linux-mm, netdev, bpf

On Tue, Mar 08, 2022 at 01:10:47PM +0000, Yafang Shao wrote:
> [...]

Hello Yafang!

It's an interesting topic, which goes well beyond bpf. In general, on cgroup
offlining we either do nothing or recharge pages to the parent cgroup
(the latter is preferred), which helps to release the pinned memcg structure.

Your approach raises some questions:
1) what if the new cgroup is not large enough to contain the bpf map?
2) does it mean that some userspace app will monitor the state of the cgroup
which was the original owner of the bpf map and recharge once it's deleted?
3) what if several cgroups are sharing the same map? who will be
the next owner?
4) because recharging is fully voluntary, why would any application want to do
it, if it can just use the memory for free? it doesn't really look like a
working resource control mechanism.

Will reparenting work for your case? If not, can you, please, describe the
problem you're trying to solve by recharging the memory?

Thanks!


* Re: [PATCH RFC 0/9] bpf, mm: recharge bpf memory from offline memcg
From: Yafang Shao @ 2022-03-09 13:28 UTC
  To: Roman Gushchin
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh, Andrew Morton,
	Christoph Lameter, penberg, David Rientjes, iamjoonsoo.kim,
	Vlastimil Babka, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Roman Gushchin, Linux MM, netdev, bpf

On Wed, Mar 9, 2022 at 9:09 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Tue, Mar 08, 2022 at 01:10:47PM +0000, Yafang Shao wrote:
> > [...]
>
> Hello Yafang!
>
> It's an interesting topic, which goes well beyond bpf. In general, on cgroup
> offlining we either do nothing or recharge pages to the parent cgroup
> (the latter is preferred), which helps to release the pinned memcg structure.
>

We have thought about recharging pages to the parent cgroup (the root
memcg in our case), but it can't resolve our issue. Releasing the
pinned memcg struct is the benefit of recharging pages to the parent,
but as there won't be too many memcgs pinned by bpf, it may not be
worth it.


> Your approach raises some questions:

Nice questions.

> 1) what if the new cgroup is not large enough to contain the bpf map?

The recharge is supposed to be triggered at container start time.
After the container is started, the agent which loads the bpf programs
will do it as follows (steps 3 and 4 are sketched in C below):
1. Check if the bpf program has already been loaded;
    if not, goto 5.
2. Check if the bpf program will pin maps or progs;
    if not, goto 6.
3. Check if the pinned maps and progs are charged to an offline memcg;
    if not, goto 6.
4. Recharge the pinned maps or progs to the current memcg;
   goto 6.
5. Load the new bpf program, and pin maps and progs if desired.
6. End.
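
A rough sketch of steps 3 and 4, assuming the memcg_state field from
patch #2, a libbpf built with the updated uapi header, and a map fd
obtained from the pinned path with bpf_obj_get():

	#include <string.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/bpf.h>
	#include <bpf/bpf.h>

	static int recharge_if_offline(int map_fd)
	{
		struct bpf_map_info info;
		__u32 len = sizeof(info);
		union bpf_attr attr = {};

		memset(&info, 0, sizeof(info));
		if (bpf_obj_get_info_by_fd(map_fd, &info, &len))
			return -1;

		/* -1 means the charged memcg is offline. */
		if (info.memcg_state != -1)
			return 0;

		attr.map_id = info.id;
		return syscall(SYS_bpf, BPF_MAP_RECHARGE, &attr,
			       sizeof(attr));
	}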

If the recharge fails, it means that the memcg limit is too low, and we
should reconsider the limit of the container.

Regarding other cases where the recharge may happen at runtime, I think
a failure there is a common OOM case; it means the usage in this
container is out of memory, and we should kill something.


> 2) does it mean that some userspace app will monitor the state of the cgroup
> which was the original owner of the bpf map and recharge once it's deleted?

In our use case, we don't need to monitor that behavior. The agent
which loads the bpf programs has the responsibility to do the recharge.
As all the agents are controlled by ourselves, it is easy to handle it
like that.

For more generic use cases, the bpf maintenance can be done in a sidecar
container in a containerized environment. The admin can provide such a
sidecar to bpf owners. The admin can also introduce an agent on the host
to check whether there are maps or progs charged to an offline memcg and
then take action. It is easy to find which one owns the pinned maps or
progs, as the pinned path is unique.

> 3) what if several cgroups are sharing the same map? who will be
> the next owner?

I think we can follow the same rule we currently use for pages shared
across memcgs: whoever loads it first owns the map. Then after the
first owner exits, the next owner is whoever does the recharge first.

> 4) because recharging is fully voluntary, why would any application want to do
> it, if it can just use the memory for free? it doesn't really look like a
> working resource control mechanism.
>

As I explained in 2), all the agents are under our control, so we can
easily handle it like that. For generic use cases, an agent running on
the host or a sidecar (or SDK) provided to bpf users can also handle
it.

> Will reparenting work for your case? If not, can you, please, describe the
> problem you're trying to solve by recharging the memory?
>

Reparenting doesn't work for us. The problem is memory resource
control: the limitation on the bpf containers will be useless if the
lifecycles of the bpf progs and the containers are not the same. The
containers are always upgraded - IOW, restarted - more frequently than
the bpf progs and maps; that is also one of the reasons why we chose to
pin them on the host.

-- 
Thanks
Yafang


* Re: [PATCH RFC 0/9] bpf, mm: recharge bpf memory from offline memcg
From: Roman Gushchin @ 2022-03-09 23:35 UTC
  To: Yafang Shao
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh, Andrew Morton,
	Christoph Lameter, penberg, David Rientjes, iamjoonsoo.kim,
	Vlastimil Babka, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Roman Gushchin, Linux MM, netdev, bpf

On Wed, Mar 09, 2022 at 09:28:58PM +0800, Yafang Shao wrote:
> On Wed, Mar 9, 2022 at 9:09 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > On Tue, Mar 08, 2022 at 01:10:47PM +0000, Yafang Shao wrote:
> > > [...]
> >
> > Hello Yafang!
> >
> > It's an interesting topic, which goes well beyond bpf. In general, on cgroup
> > offlining we either do nothing or recharge pages to the parent cgroup
> > (the latter is preferred), which helps to release the pinned memcg structure.
> >
> 
> We have thought about recharging pages to the parent cgroup (the root
> memcg in our case), but it can't resolve our issue. Releasing the
> pinned memcg struct is the benefit of recharging pages to the parent,
> but as there won't be too many memcgs pinned by bpf, it may not be
> worth it.

I agree, that was my thinking too.

> 
> 
> > Your approach raises some questions:
> 
> Nice questions.
> 
> > 1) what if the new cgroup is not large enough to contain the bpf map?
> 
> The recharge is supposed to be triggered at container start time.
> After the container is started, the agent which loads the bpf programs
> will do it as follows:
> 1. Check if the bpf program has already been loaded;
>     if not, goto 5.
> 2. Check if the bpf program will pin maps or progs;
>     if not, goto 6.
> 3. Check if the pinned maps and progs are charged to an offline memcg;
>     if not, goto 6.
> 4. Recharge the pinned maps or progs to the current memcg;
>    goto 6.
> 5. Load the new bpf program, and pin maps and progs if desired.
> 6. End.
> 
> If the recharge fails, it means that the memcg limit is too low, and we
> should reconsider the limit of the container.
> 
> Regarding other cases where the recharge may happen at runtime, I think
> a failure there is a common OOM case; it means the usage in this
> container is out of memory, and we should kill something.

The problem is that even invoking the oom killer might not help here,
if the size of the bpf map is larger than memory.max.

Also because recharging of a large object might take time and it's happening
simultaneously with other processes in the system (e.g. memory allocations,
cgroup limit changes, etc), potentially we might end up in the situation
when the new cgroup is not large enough to include the transferred object,
but also the original cgroup is not large enough (due to the limit set on one
of its ancestors), so we'll need to break memory.max of either cgroup,
which is not great. We might solve this by pre-charging the target cgroup
and keeping the double-charge during the process, but it might not work
well for really large objects on small machines. Another approach is to transfer
in small chunks (e.g. pages), but then we might end up with a partially transferred
object, which is also a questionable result.

<...>

> > Will reparenting work for your case? If not, can you, please, describe the
> > problem you're trying to solve by recharging the memory?
> >
> 
> Reparenting doesn't work for us. The problem is memory resource
> control: the limitation on the bpf containers will be useless if the
> lifecycles of the bpf progs and the containers are not the same. The
> containers are always upgraded - IOW, restarted - more frequently than
> the bpf progs and maps; that is also one of the reasons why we chose to
> pin them on the host.

In general, I think I understand why this feature is useful for your case,
however I do have some serious concerns about adding such a feature to
the upstream kernel:
1) The interface and the proposed feature are bpf-specific, however the problem
isn't. The same issue (an under-reported memory consumption) can be caused by
other types of memory: pagecache, various kernel objects e.g. the vfs cache, etc.
If we introduce such a feature, we'd better be consistent across various
types of objects (and how to do that is a good question).
2) Moving charges has proven to be tricky and to cause various problems in the
past. If we're going back in this direction, we should come up with a really
solid plan for how to avoid past issues.
3) It would be great to understand who will use this feature, and how, in a more
generic environment. E.g. is it useful for systemd? Is it common to use bpf maps
over multiple cgroups? What for (given that these are not system-wide programs,
otherwise why would we charge their memory to some specific container)?

Btw, aren't you able to run a new container in the same cgroup? Or associate
the bpf map with the persistent parent cgroup?

Thanks!


* Re: [PATCH RFC 0/9] bpf, mm: recharge bpf memory from offline memcg
From: Yafang Shao @ 2022-03-10 13:20 UTC
  To: Roman Gushchin
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh, Andrew Morton,
	Christoph Lameter, penberg, David Rientjes, iamjoonsoo.kim,
	Vlastimil Babka, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Roman Gushchin, Linux MM, netdev, bpf

On Thu, Mar 10, 2022 at 7:35 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Wed, Mar 09, 2022 at 09:28:58PM +0800, Yafang Shao wrote:
> > On Wed, Mar 9, 2022 at 9:09 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > >
> > > On Tue, Mar 08, 2022 at 01:10:47PM +0000, Yafang Shao wrote:
> > > > [...]
> > >
> > > Hello Yafang!
> > >
> > > It's an interesting topic, which goes well beyond bpf. In general, on cgroup
> > > offlining we either do nothing or recharge pages to the parent cgroup
> > > (the latter is preferred), which helps to release the pinned memcg structure.
> > >
> >
> > We have thought about recharging pages to the parent cgroup (the root
> > memcg in our case), but it can't resolve our issue. Releasing the
> > pinned memcg struct is the benefit of recharging pages to the parent,
> > but as there won't be too many memcgs pinned by bpf, it may not be
> > worth it.
>
> I agree, that was my thinking too.
>
> >
> >
> > > Your approach raises some questions:
> >
> > Nice questions.
> >
> > > 1) what if the new cgroup is not large enough to contain the bpf map?
> >
> > The recharge is supposed to be triggered at the container start time.
> > After the container is started, the agent which will load the bpf
> > programs will do it as follows,
> > 1. Check if the bpf program has already been loaded,
> >     if not,  goto 5.
> > 2. Check if the bpf program will pin maps or progs,
> >     if not, goto 6.
> > 3. Check if the pinned maps and progs are charged to an offline memcg,
> >     if not, goto 6.
> > 4. Recharge the pinned maps or progs to the current memcg.
> >    goto 6.
> > 5. load new bpf program, and also pinned maps and progs if desired.
> > 6. End.
> >
> > If the recharge fails, it means that the memcg limit is too low, and we
> > should reconsider the limit of the container.
> >
> > Regarding other cases where the recharge may happen at runtime, I think
> > such a failure is a common OOM case: the usage in this container is out
> > of memory, and we should kill something.
>
> The problem is that even invoking the oom killer might not help here,
> if the size of the bpf map is larger than memory.max.
>

Then we should introduce a fallback.

> Also, because recharging of a large object might take time and it's happening
> simultaneously with other processes in the system (e.g. memory allocations,
> cgroup limit changes, etc.), potentially we might end up in the situation
> when the new cgroup is not large enough to include the transferred object,
> but also the original cgroup is not large enough (due to the limit set on one
> of its ancestors), so we'll need to break memory.max of either cgroup,
> which is not great. We might solve this by pre-charging the target cgroup
> and keeping the double-charge during the process, but it might not work
> well for really large objects on small machines. Another approach is to transfer
> in small chunks (e.g. pages), but then we might end up with a partially
> transferred object, which is also a questionable result.
>

For this case it is not difficult to do the fallback, because the
original memcg is restricted to being an offline memcg, which means
there are no activities in it. So recharging these pages back to the
original memcg will always succeed.
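
A minimal sketch of what such a fallback could look like (the helper
names are hypothetical, not the patchset code; this briefly
double-charges, as you note above):

====
/*
 * Hypothetical sketch: pre-charge the new memcg first, so that on
 * failure nothing has changed; uncharging from the offline original
 * memcg cannot fail, so no rollback is needed.
 */
static int map_recharge_with_fallback(struct bpf_map *map,
				      struct mem_cgroup *new_memcg)
{
	struct mem_cgroup *old_memcg = map->memcg;	/* offline */
	int err;

	/* hypothetical helper: charge all of the map's pages */
	err = bpf_map_charge_memcg(map, new_memcg);
	if (err)
		return err;	/* the map stays charged to old_memcg */

	/* hypothetical helper: drop the charge from the old memcg */
	bpf_map_uncharge_memcg(map, old_memcg);
	map->memcg = new_memcg;
	return 0;
}
====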

> <...>
>
> > > Will reparenting work for your case? If not, can you, please, describe the
> > > problem you're trying to solve by recharging the memory?
> > >
> >
> > Reparenting doesn't work for us.
> > The problem is memory resource control: the limitation on the bpf
> > containers will be useless if the lifecycles of the bpf progs and the
> > containers are not the same.
> > The containers are always upgraded - IOW restarted - more frequently
> > than the bpf progs and maps; that is also one of the reasons why we
> > chose to pin them on the host.
>
> In general, I think I understand why this feature is useful for your case,
> however I do have some serious concerns about adding such a feature to
> the upstream kernel:
> 1) The interface and the proposed feature are bpf-specific, however the problem
> isn't. The same issue (under-reported memory consumption) can be caused by
> other types of memory: pagecache, various kernel objects, e.g. vfs cache, etc.
> If we introduce such a feature, we'd better be consistent across various
> types of objects (how to do that is a good question).

That is really a good question, which drives me to think and
investigate more.

Per my understanding, the under-reported pages can be divided into
several cases:
1) The pages aren't charged correctly when they are allocated.
   In this case, we should fix the charging at allocation time.
2) The pages should be recharged back to the original memcg.
   The pages are charged correctly, but then we lose track of them.
   In this case the kernel must introduce some way to keep track of
   them and recharge them back under the proper circumstances.
3) Undistributed estate.
   The original owner died and left behind some persistent memory.
   Should the new one who uses this memory take charge of it?

So case #3 is what we should discuss here.

Before answering the question, I will explain another option we
thought about to fix our issue.
Instead of recharging the bpf memory in the bpf syscall, the other
option is to only set the target memcg in the syscall and then wake up
a kworker to do the recharge. That means separating the recharge into
two steps: 1) assign the inheritor, 2) transfer the estate.
In the end we didn't choose it, because we want an immediate error if
the new owner doesn't have enough space.
But this option can partly answer your question here. One possible way
to do it more generically is to abstract two roles:
1) Who is the best inheritor        =>  assigner
2) How to charge the memory to it   =>  charger

Then, considering the option we chose again, we can find that it can be
easily extended to work in that way:

       assigner                      charger

    bpf_syscall
       wake up the charger           woken up
       wait for the result           do the recharge and return the result
       return the result

In other words, we don't have a clear idea what issues we may face in
the future, but we know we can extend this interface to fix newly
arising issues. I think that is the most important thing.
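
To make the split concrete, a rough sketch of the two-step variant
(every name below is hypothetical, invented only for illustration):

====
/* Hypothetical sketch: the syscall only assigns the inheritor
 * (assigner); a kworker transfers the estate later (charger). */
struct recharge_work {
	struct work_struct work;
	struct bpf_map *map;
	struct mem_cgroup *target;	/* the assigned inheritor */
};

static void recharge_workfn(struct work_struct *work)
{
	struct recharge_work *rw =
		container_of(work, struct recharge_work, work);

	/* charger: do the actual transfer and report the result */
	bpf_map_recharge_memcg(rw->map, rw->target);
	css_put(&rw->target->css);
	kfree(rw);
}
====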

> 2) Moving charges has proven to be tricky and caused various problems in the
> past. If we're going back in this direction, we should come up with a really
> solid plan for how to avoid the past issues.

I know the reason why move_charge_at_immigrate was disabled in cgroup2,
but I don't know whether I'm aware of all of the past issues.
I'd appreciate it if you could share the past issues you know of, and
I will check whether they apply to this case as well.

In order to avoid possible risks, I have restricted the recharge to
happen only under very strict conditions:
1. The original memcg must be an offline memcg.
2. The target memcg must be the memcg of the process that calls the
   bpf syscall.
   That means an outsider doesn't have a way to do the recharge.
3. Only kmem is supported now. (This may be extended to other types of
   memory in the future.)
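
A sketch of how conditions 1 and 2 could be enforced in the syscall
path; mem_cgroup_online() and get_mem_cgroup_from_mm() are existing
kernel helpers, the rest is made up for illustration:

====
static int map_recharge_precheck(struct bpf_map *map,
				 struct mem_cgroup **targetp)
{
	struct mem_cgroup *old_memcg = map->memcg;

	/* 1. the original memcg must be offline */
	if (!old_memcg || mem_cgroup_online(old_memcg))
		return -EBUSY;

	/* 2. the target is always the caller's own memcg, so an
	 * outsider cannot recharge the map into a foreign memcg */
	*targetp = get_mem_cgroup_from_mm(current->mm);
	return 0;
}
====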

> 3) It would be great to understand who will use this feature in a more generic
> environment, and how. E.g. is it useful for systemd? Is it common to use bpf
> maps across multiple cgroups? What for (given that these are not system-wide
> programs, otherwise why would we charge their memory to some specific
> container)?
>

It is useful for containerized environments.
The container which pins bpf objects can use it.
In our case we may use it in two ways, as I explained in the previous
mail:
1) The one who loads the bpf program does the recharge.
2) A sidecar maintains the bpf lifecycle.

For systemd, some extensions may be needed; the bpf services should
describe,
1) whether the bpf service needs the recharge (a service limited by
memcg should be forced to do the recharge)
2) the pinned progs and maps to check
3) the service identifier (with which we can get the target memcg)

We don't have the case where a bpf map is shared by multiple cgroups;
that should be a rare case.
I think that case is similar to sharing page caches across multiple
cgroups, which are used by many cgroups but only charged to one
specific memcg.

> Btw, aren't you able to run a new container in the same cgroup? Or associate
> the bpf map with the persistent parent cgroup?
>

We have discussed whether we can keep the parent cgroup alive, but
unfortunately it can't be guaranteed.
It may be hard and inflexible to run a new container in the same
cgroup, as that requires not rmdir-ing the cgroup so that it can be
used again next time.

-- 
Thanks
Yafang

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 0/9] bpf, mm: recharge bpf memory from offline memcg
  2022-03-10 13:20       ` Yafang Shao
@ 2022-03-10 18:00         ` Roman Gushchin
  2022-03-11 12:48           ` Yafang Shao
  0 siblings, 1 reply; 18+ messages in thread
From: Roman Gushchin @ 2022-03-10 18:00 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh, Andrew Morton,
	Christoph Lameter, penberg, David Rientjes, iamjoonsoo.kim,
	Vlastimil Babka, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Roman Gushchin, Linux MM, netdev, bpf

On Thu, Mar 10, 2022 at 09:20:54PM +0800, Yafang Shao wrote:
> On Thu, Mar 10, 2022 at 7:35 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > <...>
> >
> > The problem is that even invoking the oom killer might not help here,
> > if the size of the bpf map is larger than memory.max.
> >
> 
> Then we should introduce a fallback.

Can you, please, elaborate a bit more?

> 
> > <...>
> 
> For this case it is not difficult to do the fallback, because the
> original memcg is restricted to being an offline memcg, which means
> there are no activities in it. So recharging these pages back to the
> original memcg will always succeed.

The problem is that the original cgroup might not be a top-level cgroup.
So even if it's offline, it doesn't really change anything: its parent cgroup
can be online and experience concurrent limit changes, allocations, etc.

> 
> > <...>
> >
> > In general, I think I understand why this feature is useful for your case,
> > however I do have some serious concerns about adding such a feature to
> > the upstream kernel:
> > 1) The interface and the proposed feature are bpf-specific, however the problem
> > isn't. The same issue (under-reported memory consumption) can be caused by
> > other types of memory: pagecache, various kernel objects, e.g. vfs cache, etc.
> > If we introduce such a feature, we'd better be consistent across various
> > types of objects (how to do that is a good question).
> 
> That is really a good question, which drives me to think and
> investigate more.
>
> Per my understanding, the under-reported pages can be divided into
> several cases:
> 1) The pages aren't charged correctly when they are allocated.
>    In this case, we should fix the charging at allocation time.
> 2) The pages should be recharged back to the original memcg.
>    The pages are charged correctly, but then we lose track of them.
>    In this case the kernel must introduce some way to keep track of
>    them and recharge them back under the proper circumstances.
> 3) Undistributed estate.
>    The original owner died and left behind some persistent memory.
>    Should the new one who uses this memory take charge of it?
>
> So case #3 is what we should discuss here.

Right, this is the case I'm focused on too.

A particular case is when there are multiple generations of the "same"
workload, each running in a new cgroup. Likely a lot of the pagecache
and vfs cache (and maybe bpf programs etc.) is re-used by the second and
newer generations, however it is accounted towards the first dead cgroup.
So the memory consumption of the second and newer generations is
systematically under-reported.

> 
> Before answering the question, I will explain another option we
> thought about to fix our issue.
> Instead of recharging the bpf memory in the bpf syscall, the other
> option is to only set the target memcg in the syscall and then wake up
> a kworker to do the recharge. That means separating the recharge into
> two steps: 1) assign the inheritor, 2) transfer the estate.
> In the end we didn't choose it, because we want an immediate error if
> the new owner doesn't have enough space.

The problem is that we often don't know this in advance. Imagine a cgroup
with memory.max set to 1Gb and current usage 0.8Gb. Can it fit a 0.5Gb bpf map?
The true answer is that it depends on whether we can reclaim an extra 0.3Gb.
And there is no way to say for sure without making a real attempt to reclaim.
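
Roughly, an honest pre-charge would have to look like what try_charge()
already does for regular allocations (illustrative sketch;
page_counter_try_charge() and try_to_free_mem_cgroup_pages() are
existing kernel functions, the wrapper itself is made up):

====
/* Illustrative sketch: attempt the charge and, on failure, make a
 * real reclaim attempt before retrying. */
static int precharge_target(struct mem_cgroup *memcg,
			    unsigned long nr_pages)
{
	struct page_counter *fail;
	int retries = 5;	/* as in MAX_RECLAIM_RETRIES */

	while (!page_counter_try_charge(&memcg->memory, nr_pages, &fail)) {
		if (!retries--)
			return -ENOMEM;	/* couldn't reclaim enough */
		try_to_free_mem_cgroup_pages(memcg, nr_pages,
					     GFP_KERNEL, true);
	}
	return 0;
}
====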

> <...>
> 
> > 2) Moving charges has proven to be tricky and caused various problems in the
> > past. If we're going back in this direction, we should come up with a really
> > solid plan for how to avoid the past issues.
> 
> I know the reason why move_charge_at_immigrate was disabled in cgroup2,
> but I don't know whether I'm aware of all of the past issues.
> I'd appreciate it if you could share the past issues you know of, and
> I will check whether they apply to this case as well.

As I mentioned above, recharging is a complex and potentially long process,
which can unexpectedly fail. And rolling it back is also tricky and not always
possible without breaking other things.
So there are difficulties with:
1) providing a reasonable interface,
2) implementing it in a way which doesn't bring significant performance overhead.

That said, I'm not saying it's not possible at all, but it's a serious open
problem.

> <...>
> 
> > 3) It would be great to understand who will use this feature in a more generic
> > environment, and how. E.g. is it useful for systemd? Is it common to use bpf
> > maps across multiple cgroups? What for (given that these are not system-wide
> > programs, otherwise why would we charge their memory to some specific
> > container)?
> >
> 
> It is useful for containerized environments.
> The container which pins bpf objects can use it.
> In our case we may use it in two ways, as I explained in the previous
> mail:
> 1) The one who loads the bpf program does the recharge.
> 2) A sidecar maintains the bpf lifecycle.
>
> For systemd, some extensions may be needed; the bpf services should
> describe,
> 1) whether the bpf service needs the recharge (a service limited by
> memcg should be forced to do the recharge)
> 2) the pinned progs and maps to check
> 3) the service identifier (with which we can get the target memcg)
>
> We don't have the case where a bpf map is shared by multiple cgroups;
> that should be a rare case.
> I think that case is similar to sharing page caches across multiple
> cgroups, which are used by many cgroups but only charged to one
> specific memcg.

I understand the case with the pagecache. E.g. if we're running essentially
the same workload in a new cgroup and it uses the same or a similar set of
files, it will actively use the pagecache created by the previous generation.
And this can be a memcg-specific pagecache, which nobody except these cgroups
is using.

But what kind of bpf data has the same property? Why does it have to be
persistent across multiple generations of the same workload?

In the end, if the data is not too big (and assuming it's not happening too
often), it's possible to re-create the map and copy the data.
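
For reference, copying a hash map entry by entry only needs the
existing map syscalls; a sketch using the libbpf wrappers (the 4-byte
key / 8-byte value sizes are assumptions):

====
#include <bpf/bpf.h>

/* Sketch: copy all entries from the old (pinned) map into a map
 * freshly created by the new generation, so the data ends up charged
 * to the new memcg. */
static void copy_map(int old_fd, int new_fd)
{
	__u32 key, next_key;
	__u64 value;
	__u32 *prev = NULL;	/* NULL fetches the first key */

	while (!bpf_map_get_next_key(old_fd, prev, &next_key)) {
		if (!bpf_map_lookup_elem(old_fd, &next_key, &value))
			bpf_map_update_elem(new_fd, &next_key, &value,
					    BPF_ANY);
		key = next_key;
		prev = &key;
	}
}
====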

Thanks!

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 0/9] bpf, mm: recharge bpf memory from offline memcg
  2022-03-10 18:00         ` Roman Gushchin
@ 2022-03-11 12:48           ` Yafang Shao
  2022-03-11 17:49             ` Roman Gushchin
  0 siblings, 1 reply; 18+ messages in thread
From: Yafang Shao @ 2022-03-11 12:48 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh, Andrew Morton,
	Christoph Lameter, penberg, David Rientjes, iamjoonsoo.kim,
	Vlastimil Babka, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Roman Gushchin, Linux MM, netdev, bpf

On Fri, Mar 11, 2022 at 2:00 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Thu, Mar 10, 2022 at 09:20:54PM +0800, Yafang Shao wrote:
> > <...>
> >
> > 3) Undistributed estate.
> >    The original owner died and left behind some persistent memory.
> >    Should the new one who uses this memory take charge of it?
> >
> > So case #3 is what we should discuss here.
>
> Right, this is the case I'm focused on too.
>
> A particular case is when there are multiple generations of the "same"
> workload, each running in a new cgroup. Likely a lot of the pagecache
> and vfs cache (and maybe bpf programs etc.) is re-used by the second and
> newer generations, however it is accounted towards the first dead cgroup.
> So the memory consumption of the second and newer generations is
> systematically under-reported.
>

Right, shared pagecache pages and vfs cache are more complicated.
The trouble is that we don't have a clear rule on which memcg they
should belong to. If we want to handle them, we must first settle two
questions:
1) Should we charge these pages to a specific memcg in the first place?
    If not, things will be very easy. If yes, things will be very
    complicated. Unfortunately we selected the complicated way.
2) Now that we have selected the complicated way, can we have a clear
    rule to manage them?
    Our current status is to let it be: it doesn't matter which memcg
    they belong to, as long as they have one.

>
> <...>
>
> I understand the case with the pagecache. E.g. if we're running essentially
> the same workload in a new cgroup and it uses the same or a similar set of
> files, it will actively use the pagecache created by the previous generation.
> And this can be a memcg-specific pagecache, which nobody except these cgroups
> is using.
>
> But what kind of bpf data has the same property? Why does it have to be
> persistent across multiple generations of the same workload?
>

Ah, it can be considered shared between the bpf memcg and the root
memcg, while it can only be written by the bpf memcg. For example, in
the root memcg some networking facilities, like the clsact qdisc, also
read these maps.

The key point is that the charging behavior must be consistent: either
always charged or always uncharged. That is good for memory resource
management. It is bad when it sometimes gets charged and sometimes not.

Another possible solution is to introduce a way to opt out of charging
these pages; IOW, they would be accounted to root only. If we go in
that direction, things will get simpler. What do you think?
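
For example, something like this from userspace; note that
BPF_F_NO_CHARGE is purely made up here, just to show the shape of the
interface:

====
/* Hypothetical: a map-creation flag that opts the map out of memcg
 * accounting, so its memory is always accounted to root. */
union bpf_attr attr = {
	.map_type    = BPF_MAP_TYPE_HASH,
	.key_size    = 4,
	.value_size  = 8,
	.max_entries = 10240,
	.map_flags   = BPF_F_NO_CHARGE,	/* does not exist upstream */
};
int fd = syscall(SYS_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
====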

> In the end, if the data is not too big (and assuming it's not happening too
> often), it's possible to re-create the map and copy the data.
>

For one of our bpf services, the total size of its maps is around 1GB,
which is not small.

-- 
Thanks
Yafang

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 0/9] bpf, mm: recharge bpf memory from offline memcg
  2022-03-11 12:48           ` Yafang Shao
@ 2022-03-11 17:49             ` Roman Gushchin
  2022-03-12  6:45               ` Yafang Shao
  0 siblings, 1 reply; 18+ messages in thread
From: Roman Gushchin @ 2022-03-11 17:49 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh, Andrew Morton,
	Christoph Lameter, penberg, David Rientjes, iamjoonsoo.kim,
	Vlastimil Babka, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Roman Gushchin, Linux MM, netdev, bpf

On Fri, Mar 11, 2022 at 08:48:27PM +0800, Yafang Shao wrote:
>
> <...>
>
> Ah, it can be considered shared between the bpf memcg and the root
> memcg, while it can only be written by the bpf memcg. For example, in
> the root memcg some networking facilities, like the clsact qdisc, also
> read these maps.
>
> The key point is that the charging behavior must be consistent: either
> always charged or always uncharged. That is good for memory resource
> management. It is bad when it sometimes gets charged and sometimes not.

I agree, consistency is very important. That is why I don't quite like the idea
of a voluntary recharging performed by userspace. It might work in your case,
but in general it's hard to expect that everybody will consistently recharge
their maps.

> 
> Another possible solution is to introduce a way to opt out of charging
> these pages; IOW, they would be accounted to root only. If we go in
> that direction, things will get simpler. What do you think?

Is your map pre-allocated or not? Pre-allocated maps can be created by a process
(temporarily) placed into the root memcg to disable accounting. We can also
think of a flag to do it explicitly (e.g. on creating a map).
But we must be careful here to not introduce security issues: e.g. a non-root
memcg shouldn't be able to allocate (unaccounted) memory by writing into a
bpf map belonging to the root memcg.
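
A sketch of that workaround from userspace (the cgroup paths and map
sizes are examples; it relies on the fact that a pre-allocated hash
map allocates all of its elements at creation time, charged to the
creator's memcg):

====
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Move the calling process into the cgroup behind 'procs', so that
 * subsequent kernel allocations are charged there. */
static int move_self(const char *procs)
{
	FILE *f = fopen(procs, "w");

	if (!f)
		return -1;
	fprintf(f, "%d\n", getpid());
	return fclose(f);
}

int main(void)
{
	union bpf_attr attr = {
		.map_type    = BPF_MAP_TYPE_HASH,
		.key_size    = 4,
		.value_size  = 8,
		.max_entries = 10240,	/* pre-allocated by default */
	};
	int fd;

	/* temporarily join the root cgroup to disable accounting */
	move_self("/sys/fs/cgroup/cgroup.procs");
	fd = syscall(SYS_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
	/* ... and move back into the container's cgroup */
	move_self("/sys/fs/cgroup/mycontainer/cgroup.procs");
	return fd < 0;
}
====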

Thanks!

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 0/9] bpf, mm: recharge bpf memory from offline memcg
  2022-03-11 17:49             ` Roman Gushchin
@ 2022-03-12  6:45               ` Yafang Shao
  0 siblings, 0 replies; 18+ messages in thread
From: Yafang Shao @ 2022-03-12  6:45 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh, Andrew Morton,
	Christoph Lameter, penberg, David Rientjes, iamjoonsoo.kim,
	Vlastimil Babka, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Roman Gushchin, Linux MM, netdev, bpf

On Sat, Mar 12, 2022 at 1:49 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Fri, Mar 11, 2022 at 08:48:27PM +0800, Yafang Shao wrote:
> > On Fri, Mar 11, 2022 at 2:00 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > >
> > > On Thu, Mar 10, 2022 at 09:20:54PM +0800, Yafang Shao wrote:
> > > > On Thu, Mar 10, 2022 at 7:35 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > > >
> > > > > On Wed, Mar 09, 2022 at 09:28:58PM +0800, Yafang Shao wrote:
> > > > > > On Wed, Mar 9, 2022 at 9:09 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > > > > >
> > > > > > > On Tue, Mar 08, 2022 at 01:10:47PM +0000, Yafang Shao wrote:
> > > > > > > > When we use memcg to limit the containers which load bpf progs and maps,
> > > > > > > > we find there is an issue that the lifecycle of container and bpf are not
> > > > > > > > always the same, because we may pin the maps and progs while update the
> > > > > > > > container only. So once the container which has alreay pinned progs and
> > > > > > > > maps is restarted, the pinned progs and maps are no longer charged to it
> > > > > > > > any more. In other words, this kind of container can steal memory from the
> > > > > > > > host, that is not expected by us. This patchset means to resolve this
> > > > > > > > issue.
> > > > > > > >
> > > > > > > > After the container is restarted, the old memcg charged by the pinned
> > > > > > > > progs and maps will be offline, but it won't be freed until all of the
> > > > > > > > related maps and progs are freed. If we want to charge this bpf memory
> > > > > > > > to the newly started memcg, we should uncharge it from the offline memcg
> > > > > > > > first and then charge it to the new one. As we already know how the bpf
> > > > > > > > memory is allocated and freed, we also know how to charge and uncharge
> > > > > > > > it. This patchset implements various charge and uncharge methods for
> > > > > > > > this memory.
> > > > > > > >
> > > > > > > > Regarding how to do the recharge, we decided to implement a new bpf
> > > > > > > > syscall command for it. With this new command, the agent running in the
> > > > > > > > container can do the recharge. As of now we only implement it for bpf
> > > > > > > > hash maps. Below is a simple example of how to do the recharge,
> > > > > > > >
> > > > > > > > ====
> > > > > > > > int main(int argc, char *argv[])
> > > > > > > > {
> > > > > > > >       union bpf_attr attr = {};
> > > > > > > >       int map_id;
> > > > > > > >       int pfd;
> > > > > > > >
> > > > > > > >       if (argc < 2) {
> > > > > > > >               printf("Pls. give a map id \n");
> > > > > > > >               exit(-1);
> > > > > > > >       }
> > > > > > > >
> > > > > > > >       map_id = atoi(argv[1]);
> > > > > > > >       attr.map_id = map_id;
> > > > > > > >       pfd = syscall(SYS_bpf, BPF_MAP_RECHARGE, &attr, sizeof(attr));
> > > > > > > >       if (pfd < 0)
> > > > > > > >               perror("BPF_MAP_RECHARGE");
> > > > > > > >
> > > > > > > >       return 0;
> > > > > > > > }
> > > > > > > >
> > > > > > > > ====
> > > > > > > >
> > > > > > > > Patches #1 and #2 are for observability, with which we can easily check
> > > > > > > > whether a bpf map is charged to a memcg and whether that memcg is offline.
> > > > > > > > Patches #3, #4 and #5 add the charge and uncharge methods for vmalloc-ed,
> > > > > > > > kmalloc-ed and percpu memory.
> > > > > > > > Patches #6~#9 implement the recharge of the bpf hash map, which is the
> > > > > > > > map type most used by our bpf services. The other map types haven't been
> > > > > > > > implemented yet, and neither have bpf progs.
> > > > > > > >
> > > > > > > > This patchset is still a POC, with limited testing. Any feedback is
> > > > > > > > welcome.
> > > > > > >
> > > > > > > Hello Yafang!
> > > > > > >
> > > > > > > It's an interesting topic, which goes well beyond bpf. In general, on cgroup
> > > > > > > offlining we either do nothing or recharge pages to the parent cgroup (the
> > > > > > > latter is preferred), which helps to release the pinned memcg structure.
> > > > > > >
> > > > > >
> > > > > > We have thought about recharging pages to the parent cgroup (the root
> > > > > > memcg in our case), but it can't resolve our issue.
> > > > > > Releasing the pinned memcg struct is the benefit of recharging pages
> > > > > > to the parent, but as there won't be too many memcgs pinned by bpf, it
> > > > > > may not be worth it.
> > > > >
> > > > > I agree, that was my thinking too.
> > > > >
> > > > > >
> > > > > >
> > > > > > > Your approach raises some questions:
> > > > > >
> > > > > > Nice questions.
> > > > > >
> > > > > > > 1) what if the new cgroup is not large enough to contain the bpf map?
> > > > > >
> > > > > > The recharge is supposed to be triggered at container start time.
> > > > > > After the container is started, the agent which loads the bpf
> > > > > > programs will do it as follows,
> > > > > > 1. Check if the bpf program has already been loaded;
> > > > > >    if not, goto 5.
> > > > > > 2. Check if the bpf program will pin maps or progs;
> > > > > >    if not, goto 6.
> > > > > > 3. Check if the pinned maps and progs are charged to an offline memcg;
> > > > > >    if not, goto 6.
> > > > > > 4. Recharge the pinned maps or progs to the current memcg;
> > > > > >    goto 6.
> > > > > > 5. Load the new bpf program, and also pin maps and progs if desired.
> > > > > > 6. End.
> > > > > >
> > > > > > If the recharge fails, it means that the memcg limit is too low and
> > > > > > we should reconsider the limit of the container. This flow is
> > > > > > sketched below.
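> > > > > >
> > > > > > A rough sketch of that flow in C; the stub helpers are invented
> > > > > > names for the checks above, and BPF_MAP_RECHARGE is the command
> > > > > > proposed in this series, not existing UAPI:
> > > > > >
> > > > > > #include <linux/bpf.h>
> > > > > > #include <sys/syscall.h>
> > > > > > #include <unistd.h>
> > > > > >
> > > > > > /* Hypothetical stubs for steps 1-3 and 5; a real agent would
> > > > > >  * inspect the pinned objects (patch #2 exposes the memcg info). */
> > > > > > static int prog_already_loaded(void)   { return 1; }
> > > > > > static int prog_pins_objects(void)     { return 1; }
> > > > > > static int pinned_memcg_offline(void)  { return 1; }
> > > > > > static int load_and_pin(void)          { return 0; }
> > > > > >
> > > > > > int agent_start(int map_id)
> > > > > > {
> > > > > > 	union bpf_attr attr = {};
> > > > > >
> > > > > > 	if (!prog_already_loaded())
> > > > > > 		return load_and_pin();			/* step 5 */
> > > > > > 	if (!prog_pins_objects() || !pinned_memcg_offline())
> > > > > > 		return 0;				/* step 6 */
> > > > > >
> > > > > > 	attr.map_id = map_id;				/* step 4 */
> > > > > > 	return syscall(SYS_bpf, BPF_MAP_RECHARGE, &attr, sizeof(attr));
> > > > > > }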
> > > > > >
> > > > > > Regarding other cases where the recharge may happen at runtime, I
> > > > > > think a failure there is a common OOM case; it means the usage in
> > > > > > this container is out of memory, and we should kill something.
> > > > >
> > > > > The problem is that even invoking the oom killer might not help here,
> > > > > if the size of the bpf map is larger than memory.max.
> > > > >
> > > >
> > > > Then we should introduce a fallback.
> > >
> > > Can you, please, elaborate a bit more?
> > >
> > > >
> > > > > Also because recharging of a large object might take time and it's happening
> > > > > simultaneously with other processes in the system (e.g. memory allocations,
> > > > > cgroup limit changes, etc), potentially we might end up in the situation
> > > > > when the new cgroup is not large enough to include the transferred object,
> > > > > but also the original cgroup is not large enough (due to the limit set on one
> > > > > of its ancestors), so we'll need to break memory.max of either cgroup,
> > > > > which is not great. We might solve this by pre-charging the target cgroup
> > > > > and keeping the double-charge during the process, but it might not work
> > > > > well for really large objects on small machines. Another approach is to
> > > > > transfer in small chunks (e.g. pages), but then we might end up with a
> > > > > partially transferred object, which is also a questionable result.
> > > > >
> > > >
> > > > For this case it is not difficult to do the fallback, because the
> > > > original memcg is restricted to an offline memcg only; that means
> > > > there is no activity in the original memcg, so recharging these pages
> > > > back to the original one will always succeed.
> > >
> > > The problem is that the original cgroup might not be a top-level cgroup.
> > > So even if it's offline, it doesn't really change anything: its parent
> > > cgroup can be online and experience concurrent limit changes, allocations
> > > etc.
> > >
> > > >
> > > > > <...>
> > > > >
> > > > > > > Will reparenting work for your case? If not, can you, please, describe the
> > > > > > > problem you're trying to solve by recharging the memory?
> > > > > > >
> > > > > >
> > > > > > Reparenting doesn't work for us.
> > > > > > The problem is memory resource control: the limits on the bpf
> > > > > > containers will be useless if the lifecycles of the bpf progs and
> > > > > > the containers are not the same.
> > > > > > The containers are upgraded - IOW restarted - more frequently than
> > > > > > the bpf progs and maps; that is also one of the reasons why we
> > > > > > choose to pin them on the host.
> > > > >
> > > > > In general, I think I understand why this feature is useful for your case,
> > > > > however I do have some serious concerns about adding such a feature to
> > > > > the upstream kernel:
> > > > > 1) The interface and the proposed feature are bpf-specific, however the
> > > > > problem isn't. The same issue (under-reported memory consumption) can be
> > > > > caused by other types of memory: pagecache, various kernel objects (e.g.
> > > > > the vfs cache), etc. If we introduce such a feature, we'd better be
> > > > > consistent across the various types of objects (how is a good question).
> > > >
> > > > That is really a good question, which drives me to think more and
> > > > investigate more.
> > > >
> > > > Per my understanding, the under-reported pages can be divided into
> > > > several cases,
> > > > 1) The pages aren't charged correctly when they are allocated.
> > > >    In this case, we should fix it at allocation time.
> > > > 2) The pages should be recharged back to the original memcg.
> > > >    The pages are charged correctly but then we lose track of them.
> > > >    In this case the kernel must introduce some way to keep track of
> > > > them and recharge them back under the proper circumstances.
> > > > 3) Unclaimed estate.
> > > >    The original owner is dead, leaving some persistent memory behind.
> > > >    Should the new user of this memory take charge of it?
> > > >
> > > > So case #3 is what we should discuss here.
> > >
> > > Right, this is the case I'm focused on too.
> > >
> > > A particular case is when there are multiple generations of the "same"
> > > workload, each running in a new cgroup. Likely a lot of the pagecache
> > > and vfs cache (and maybe bpf programs etc.) is re-used by the second and
> > > newer generations, however it is accounted towards the first, dead cgroup.
> > > So the memory consumption of the second and newer generations is
> > > systematically under-reported.
> > >
> >
> > Right, the shared pagecache pages and vfs cache are more complicated.
> > The trouble is that we don't have a clear rule on whom they should
> > belong to. If we want to handle them, we must make the rules first:
> > 1) Should we charge these pages to a specific memcg in the first place?
> >     If not, things will be very easy. If yes, things will be very
> > complicated. Unfortunately we chose the complicated way.
> > 2) Now that we chose the complicated way, can we have a clear rule to
> > manage them?
> >     Our current status is: let it be, and it doesn't matter which memcg
> > they belong to as long as they have one.
> >
> > > >
> > > > Before answering the question, I will explain another option we
> > > > considered for fixing our issue.
> > > > Instead of recharging the bpf memory in the bpf syscall, the other
> > > > option is to only set the target memcg in the syscall and then wake up
> > > > a kworker to do the recharge. That means separating the recharge into
> > > > two steps: 1) assign the inheritor, 2) transfer the estate.
> > > > In the end we didn't choose it because we want an immediate error if
> > > > the new owner doesn't have enough space.
> > >
> > > The problem is that we often don't know this in advance. Imagine a cgroup
> > > with memory.max set to 1Gb and a current usage of 0.8Gb. Can it fit a 0.5Gb
> > > bpf map? The true answer is: it depends on whether we can reclaim an extra
> > > 0.3Gb. And there is no way to say for sure without making a real attempt
> > > to reclaim.
> > >
> > > > But this option can partly answer your question here: one possible way
> > > > to make it more generic is to abstract two methods -
> > > > 1) Who is the best inheritor          =>  the assigner
> > > > 2) How to charge the memory to it     =>  the charger
> > > >
> > > > Then, considering the option we chose again, we can find that it can
> > > > be easily extended to work that way,
> > > >
> > > >     assigner                   charger
> > > >     --------                   -------
> > > >     bpf_syscall
> > > >       wake up the charger      woken up
> > > >       wait for the result      do the recharge and return the result
> > > >       return the result
> > > >
> > > > In other words, we don't have a clear idea what issues we may face in
> > > > the future, but we know we can extend this scheme to fix newly arising
> > > > issues. I think that is the most important thing.
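> > > >
> > > > As a sketch in kernel-style C (only the work/completion primitives
> > > > are real kernel API; the recharge_* names and the charge-transfer
> > > > step are invented here for illustration):
> > > >
> > > > #include <linux/bpf.h>
> > > > #include <linux/completion.h>
> > > > #include <linux/memcontrol.h>
> > > > #include <linux/workqueue.h>
> > > >
> > > > struct recharge_work {
> > > > 	struct work_struct work;
> > > > 	struct bpf_map *map;		/* the estate */
> > > > 	struct mem_cgroup *target;	/* the assigned inheritor */
> > > > 	struct completion done;
> > > > 	int ret;
> > > > };
> > > >
> > > > static void recharge_workfn(struct work_struct *work)
> > > > {
> > > > 	struct recharge_work *rw =
> > > > 		container_of(work, struct recharge_work, work);
> > > >
> > > > 	/* the "charger": transfer rw->map's charges to rw->target */
> > > > 	rw->ret = 0;			/* placeholder for the transfer */
> > > > 	complete(&rw->done);
> > > > }
> > > >
> > > > static int bpf_map_recharge_deferred(struct bpf_map *map,
> > > > 				     struct mem_cgroup *target)
> > > > {
> > > > 	struct recharge_work rw = { .map = map, .target = target, };
> > > >
> > > > 	/* the "assigner": hand off to the charger and wait, so the
> > > > 	 * caller still gets an immediate error code */
> > > > 	INIT_WORK_ONSTACK(&rw.work, recharge_workfn);
> > > > 	init_completion(&rw.done);
> > > > 	schedule_work(&rw.work);
> > > > 	wait_for_completion(&rw.done);
> > > > 	destroy_work_on_stack(&rw.work);
> > > > 	return rw.ret;
> > > > }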
> > > >
> > > > > 2) Moving charges has proven to be tricky and has caused various problems
> > > > > in the past. If we're going back in this direction, we should come up with
> > > > > a really solid plan for how to avoid the past issues.
> > > >
> > > > I know the reason why move_charge_at_immigrate was disabled in cgroup2,
> > > > but I don't know whether I know all of the past issues.
> > > > I would appreciate it if you could share the past issues you know of,
> > > > and I will check whether they apply to this case as well.
> > >
> > > As I mentioned above, recharging is a complex and potentially long process,
> > > which can unexpectedly fail. And rolling it back is also tricky and not always
> > > possible without breaking other things.
> > > So there are difficulties with:
> > > 1) providing a reasonable interface,
> > > 2) implementing it in a way which doesn't bring significant performance overhead.
> > >
> > > That said, I'm not saying it's not possible at all, but it's a serious open
> > > problem.
> > >
> > > > In order to avoid possible risks, I have restricted the recharge to
> > > > happen only under very strict conditions (sketched below),
> > > > 1. The original memcg must be an offline memcg.
> > > > 2. The target memcg must be the memcg of the one who calls the bpf
> > > > syscall. That means an outsider doesn't have a way to do the recharge.
> > > > 3. Only kmem is supported now. (This may be extended in the future
> > > > for other types of memory.)
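> > > >
> > > > Expressed as an illustrative gate (not the code from this series;
> > > > memcg reference handling is omitted):
> > > >
> > > > #include <linux/bpf.h>
> > > > #include <linux/memcontrol.h>
> > > > #include <linux/sched.h>
> > > >
> > > > static int map_recharge_allowed(struct bpf_map *map)
> > > > {
> > > > 	struct mem_cgroup *old = map->memcg;	/* saved at creation */
> > > > 	struct mem_cgroup *new = get_mem_cgroup_from_mm(current->mm);
> > > >
> > > > 	if (!old || mem_cgroup_online(old))
> > > > 		return -EBUSY;	/* 1: original memcg must be offline */
> > > > 	if (!new)
> > > > 		return -EINVAL;	/* 2: target is the caller's own memcg */
> > > > 	/* 3: only kmem-backed allocations are transferred for now */
> > > > 	return 0;
> > > > }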
> > > >
> > > > > 3) It would be great to understand who will use this feature and how, in a
> > > > > more generic environment. E.g. is it useful for systemd? Is it common to use
> > > > > bpf maps over multiple cgroups? What for (given that these are not system-wide
> > > > > programs, otherwise why would we charge their memory to some specific container)?
> > > > >
> > > >
> > > > It is useful for containerized environments.
> > > > A container which pins bpf objects can use it.
> > > > In our case we may use it in two ways, as I explained in the previous
> > > > mail:
> > > > 1) The one who loads the bpf does the recharge.
> > > > 2) A sidecar maintains the bpf lifecycle.
> > > >
> > > > For systemd, some extension may be needed: the bpf services should
> > > > describe,
> > > > 1) whether the bpf service needs the recharge (one limited by a memcg
> > > > should be forced to do the recharge)
> > > > 2) the pinned progs and maps to check
> > > > 3) the service identifier (with which we can get the target memcg)
> > > >
> > > > We don't have the case that a bpf map is shared by multiple cgroups;
> > > > that should be a rare case.
> > > > I think that case is similar to sharing page caches across multiple
> > > > cgroups, which are used by many cgroups but only charged to one
> > > > specific memcg.
> > >
> > > I understand the case with the pagecache. E.g. we're running essentially the
> > > same workload in a new cgroup, and as it likely uses the same or a similar
> > > set of files, it will actively use the pagecache created by the previous
> > > generation. And this can be memcg-specific pagecache, which nobody except
> > > these cgroups is using.
> > >
> > > But what kind of bpf data has the same property? Why does it have to be
> > > persistent across multiple generations of the same workload?
> > >
> >
> > Ah, it can be considered shared between the bpf memcg and the root
> > memcg, while it can only be written by the bpf memcg. For example, in
> > the root memcg, some networking facilities like the clsact qdisc also
> > read these maps.
> >
> > The key point is that the charging behavior must be consistent: either
> > always charged or always uncharged. That is good for memory resource
> > management. It is bad if the memory is sometimes charged and sometimes
> > not.
>
> I agree, consistency is very important. That is why I don't quite like the idea
> of voluntary recharging performed by userspace. It might work in your case,
> but in general it's hard to expect that everybody will consistently recharge
> their maps.
>
> >
> > Another possible solution is to introduce a way to not charge these
> > pages at all; IOW, these pages would be accounted to the root memcg
> > only. If we go in that direction, things will get simpler. What do you
> > think?
>
> Is your map pre-allocated or not? Pre-allocated maps can be created by a process
> (temporarily) placed into the root memcg to disable accounting.

It is not pre-allocated. It may be updated dynamically: the old entry
is removed and then a new one is added.
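
For reference, the dynamic pattern is the usual delete-then-add on the
pinned map (standard libbpf calls; only the pin path is made up here).
Each update may allocate a new element, which is why the charge target
of a long-lived pinned map matters:

#include <bpf/bpf.h>
#include <unistd.h>

int refresh_entry(int key, long new_val)
{
	int fd = bpf_obj_get("/sys/fs/bpf/example_map");
	int err;

	if (fd < 0)
		return fd;
	bpf_map_delete_elem(fd, &key);		/* remove the old entry */
	err = bpf_map_update_elem(fd, &key, &new_val, BPF_ANY);
	close(fd);
	return err;
}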

> We can also
> think of a flag to do it explicitly (e.g. on creating a map).

That is a workable solution.

> But we must be careful here to not introduce security issues: e.g. a non-root
> memcg shouldn't be able to allocate (unaccounted) memory by writing into a
> bpf map belonging to the root memcg.
>

Thanks for your suggestion.

Finally, many thanks for the enlightening discussion.

-- 
Thanks
Yafang

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2022-03-12  6:46 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-08 13:10 [PATCH RFC 0/9] bpf, mm: recharge bpf memory from offline memcg Yafang Shao
2022-03-08 13:10 ` [PATCH RFC 1/9] bpftool: fix print error when show bpf man Yafang Shao
2022-03-08 13:10 ` [PATCH RFC 2/9] bpftool: show memcg info of bpf map Yafang Shao
2022-03-08 13:10 ` [PATCH RFC 3/9] mm: add methord to charge kmalloc-ed address Yafang Shao
2022-03-08 13:10 ` [PATCH RFC 4/9] mm: add methord to charge vmalloc-ed address Yafang Shao
2022-03-08 13:10 ` [PATCH RFC 5/9] mm: add methord to charge percpu address Yafang Shao
2022-03-08 13:10 ` [PATCH RFC 6/9] bpf: add a helper to find map by id Yafang Shao
2022-03-08 13:10 ` [PATCH RFC 7/9] bpf: add BPF_MAP_RECHARGE syscall Yafang Shao
2022-03-08 13:10 ` [PATCH RFC 8/9] bpf: make bpf_map_{save, release}_memcg public Yafang Shao
2022-03-08 13:10 ` [PATCH RFC 9/9] bpf: support recharge for hash map Yafang Shao
2022-03-09  1:09 ` [PATCH RFC 0/9] bpf, mm: recharge bpf memory from offline memcg Roman Gushchin
2022-03-09 13:28   ` Yafang Shao
2022-03-09 23:35     ` Roman Gushchin
2022-03-10 13:20       ` Yafang Shao
2022-03-10 18:00         ` Roman Gushchin
2022-03-11 12:48           ` Yafang Shao
2022-03-11 17:49             ` Roman Gushchin
2022-03-12  6:45               ` Yafang Shao

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.